This notebook pulls historical temperature data from the DWD server and formats it for use in other projects. The data is delivered at an hourly frequency, in one .zip file per available weather station. To use the data, we need everything in a single .csv file, with all stations side by side. We also need the daily average.
To reduce computing time, we also crop all data earlier than 2007.
Files should be executed in the following pipeline:
Now we need to import everything that we have extracted. This operation takes some time (approx. 20 minutes). If you want to save time, you can delete a few of the .csv files in the 'import' folder; the script works just as well with only a few files.
The files are imported into a single df, stripped of unnecessary columns, and filtered by date. Then we set a DateTimeIndex and concatenate them into the main_df. Because the loop takes a long time, we output some status messages to confirm the process is still running.
Then we display some information about the main_df so we can verify that there are no errors, mainly that all data types are recognized correctly. We also drop duplicate entries, in case some of the .csv files were copied.
For the final step, we unstack the main_df and save it to a .csv and a .pkl file for the next step. We also display some output to get a grasp of what is going on.
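The daily average mentioned above is not computed in this notebook itself. A minimal sketch of how it could be derived later with pandas resampling, assuming an hourly DateTimeIndex and a station-per-column layout like the unstacked frame built below (the toy data here is illustrative, not from the DWD files):

```python
import numpy as np
import pandas as pd

# Toy hourly series for two stations, mimicking the unstacked layout
idx = pd.date_range("2007-01-01", periods=48, freq="h", name="MESS_DATUM")
hourly = pd.DataFrame({3: np.arange(48.0), 78: np.arange(48.0) * 0.5}, index=idx)

# Resample to daily means; NaN hours are simply ignored per day
daily = hourly.resample("D").mean()
print(daily)
```

`resample("D").mean()` keeps one row per calendar day and averages whatever hourly values exist for that day, which is exactly the aggregation we need before exporting.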
import numpy as np
import pandas as pd
from IPython.display import clear_output
from pathlib import Path
import glob
import_files = glob.glob('import/*')
out_file = Path.cwd() / "export_uncleaned" / "to_clean"
#msum_file= Path.cwd() / "export" / "monatssumme.csv"
obsolete_columns = [
'QN_9',
'RF_TU',
'eor'
]
main_df = pd.DataFrame()
i = 1
for file in import_files:
    # Read in the next file
    df = pd.read_csv(file, delimiter=";")
    # Prepare the df before merging (drop obsolete columns, convert to datetime, filter by date, set index)
    df.drop(columns=obsolete_columns, inplace=True)
    df["MESS_DATUM"] = pd.to_datetime(df["MESS_DATUM"], format="%Y%m%d%H")
    df = df[df['MESS_DATUM'] >= "2007-01-01"]
    df.set_index(['MESS_DATUM', 'STATIONS_ID'], inplace=True)
    # Merge into the main_df
    main_df = pd.concat([main_df, df])
    # Display some status messages
    clear_output(wait=True)
    display('Finished file: {}'.format(file), 'This is file {}'.format(i))
    display('Shape of the main_df is: {}'.format(main_df.shape))
    i += 1
# Check if all types are correct
display(main_df['TT_TU'].apply(lambda x: type(x).__name__).value_counts())
# Make sure that no files or observations are duplicated, i.e. scan the index for duplicate entries.
# The ~ is the bitwise NOT operator; here it inverts the boolean mask returned by duplicated().
main_df = main_df[~main_df.index.duplicated(keep='last')]
# Unstack the main_df
main_df = main_df.unstack('STATIONS_ID')
display('Shape of the main_df is: {}'.format(main_df.shape))
# Save main_df to a .csv file and a pickle to continue working in the next cell.
main_df.to_pickle(Path(out_file).with_suffix('.pkl'))
main_df.to_csv(Path(out_file).with_suffix('.csv'), sep=";")
display(main_df.head())
display(main_df.describe())
'Finished file: import/produkt_tu_stunde_20041101_20191231_00078.txt'
'This is file 10'
'Shape of the main_df is: (771356, 1)'
float    771356
Name: TT_TU, dtype: int64
'Shape of the main_df is: (113952, 9)'
TT_TU per STATIONS_ID (main_df.head()):

| MESS_DATUM | 3 | 44 | 71 | 73 | 78 | 91 | 96 | 102 | 125 |
|---|---|---|---|---|---|---|---|---|---|
| 2007-01-01 00:00:00 | 11.4 | NaN | NaN | NaN | 11.0 | 9.4 | NaN | 9.7 | NaN |
| 2007-01-01 01:00:00 | 12.0 | NaN | NaN | NaN | 11.4 | 9.6 | NaN | 10.4 | NaN |
| 2007-01-01 02:00:00 | 12.3 | NaN | NaN | NaN | 9.4 | 10.0 | NaN | 9.9 | NaN |
| 2007-01-01 03:00:00 | 11.5 | NaN | NaN | NaN | 9.3 | 9.7 | NaN | 9.5 | NaN |
| 2007-01-01 04:00:00 | 9.6 | NaN | NaN | NaN | 8.6 | 10.2 | NaN | 8.9 | NaN |
TT_TU per STATIONS_ID (main_df.describe()):

| STATIONS_ID | 3 | 44 | 71 | 73 | 78 | 91 | 96 | 102 | 125 |
|---|---|---|---|---|---|---|---|---|---|
| count | 37224.000000 | 111003.000000 | 88391.000000 | 111471.000000 | 113950.000000 | 113950.000000 | 6399.000000 | 106379.000000 | 82589.000000 |
| mean | 10.103922 | 9.933213 | 8.399764 | 7.501486 | 9.872268 | 9.199869 | 12.730255 | 10.149991 | 1.045942 |
| std | 7.200001 | 14.445973 | 8.779766 | 47.537112 | 7.281215 | 8.400713 | 23.189555 | 10.728030 | 86.520406 |
| min | -13.600000 | -999.000000 | -999.000000 | -999.000000 | -16.200000 | -999.000000 | -999.000000 | -999.000000 | -999.000000 |
| 25% | 5.000000 | 4.900000 | 2.200000 | 2.800000 | 4.700000 | 3.400000 | 7.250000 | 5.700000 | 1.800000 |
| 50% | 9.900000 | 10.000000 | 8.300000 | 9.300000 | 9.700000 | 8.900000 | 13.200000 | 10.200000 | 8.200000 |
| 75% | 15.300000 | 15.200000 | 14.200000 | 15.800000 | 15.000000 | 14.700000 | 18.500000 | 15.200000 | 14.500000 |
| max | 36.200000 | 37.000000 | 33.700000 | 36.700000 | 39.000000 | 36.900000 | 37.900000 | 33.400000 | 33.700000 |
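The describe output shows minima of -999.0 for several stations: DWD uses -999 as its missing-value sentinel, which badly skews mean and std here. Since this notebook exports to "export_uncleaned"/"to_clean", a later cleaning step will have to replace the sentinel; a minimal sketch of that replacement (the toy frame is illustrative, not DWD data):

```python
import numpy as np
import pandas as pd

# Toy frame with DWD's -999 missing-value sentinel mixed into real readings
df = pd.DataFrame({"TT_TU": [11.4, -999.0, 9.6, -999.0, 12.3]})

# Replace the sentinel with NaN so aggregations ignore those entries
cleaned = df.replace(-999.0, np.nan)

print(cleaned["TT_TU"].min())   # 9.6 instead of -999.0
print(cleaned["TT_TU"].mean())  # mean over the three valid readings only
```

After this substitution, describe() would report physically plausible minima and undistorted means for every station.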