This notebook pulls historical temperature data from the DWD server and formats it for use in other projects. The data is delivered at an hourly frequency, in one .zip file per available weather station. To use the data, we need everything in a single .csv file, with all stations side by side. We also need the daily average.
To reduce computing time, we also crop all data earlier than 2007.
Files should be executed in the following pipeline:
Now we need to import everything that we have extracted. This operation takes some time (approx. 20 minutes). If you want to save time, you can delete a few of the .csv files in the 'import' folder; the script works just as well with only a few files.
The files are imported into a single df, stripped of unnecessary columns, and filtered by date. Then we set a DateTimeIndex and concatenate them into the main_df. Because the loop takes a long time, we output some status messages to confirm the process is still running.
Then we display some information about the main_df so we can verify that there are no errors, mainly that all data types are recognized correctly. We also drop duplicate entries, in case some of the .csv files were copied.
For the final step, we unstack the main_df and save it to a .csv and a .pkl file for the next step. We also display some output to get a grasp of what is going on.
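The daily average mentioned above is not computed in this notebook itself. A minimal sketch of how it could be derived later with pandas resampling, assuming an hourly DateTimeIndex and a station-per-column layout like the unstacked frame built below (the toy data here is illustrative, not from the DWD files):

```python
import numpy as np
import pandas as pd

# Toy hourly series for two stations, mimicking the unstacked layout
idx = pd.date_range("2007-01-01", periods=48, freq="h", name="MESS_DATUM")
hourly = pd.DataFrame({3: np.arange(48.0), 78: np.arange(48.0) * 0.5}, index=idx)

# Resample to daily means; NaN hours are simply ignored per day
daily = hourly.resample("D").mean()
print(daily)
```

`resample("D").mean()` keeps one row per calendar day and averages whatever hourly values exist for that day, which is exactly the aggregation we need before exporting.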
import numpy as np
import pandas as pd
from IPython.display import clear_output
from pathlib import Path
import glob
import_files = glob.glob('import/*')
out_file = Path.cwd() / "export_uncleaned" / "to_clean"
#msum_file= Path.cwd() / "export" / "monatssumme.csv"
obsolete_columns = [
'QN_9',
'RF_TU',
'eor'
]
main_df = pd.DataFrame()
i = 1
for file in import_files:
    # Read in the next file
    df = pd.read_csv(file, delimiter=";")
    # Prepare the df before merging (drop obsolete columns, convert to datetime, filter by date, set index)
    df.drop(columns=obsolete_columns, inplace=True)
    df["MESS_DATUM"] = pd.to_datetime(df["MESS_DATUM"], format="%Y%m%d%H")
    df = df[df['MESS_DATUM'] >= "2007-01-01"]
    df.set_index(['MESS_DATUM', 'STATIONS_ID'], inplace=True)
    # Merge into the main_df
    main_df = pd.concat([main_df, df])
    # Display some status messages
    clear_output(wait=True)
    display('Finished file: {}'.format(file), 'This is file {}'.format(i))
    display('Shape of the main_df is: {}'.format(main_df.shape))
    i += 1
# Check if all types are correct
display(main_df['TT_TU'].apply(lambda x: type(x).__name__).value_counts())
# Make sure that no files or observations are duplicated, i.e. scan the index for duplicate entries.
# The ~ is the bitwise NOT operator; here it inverts the boolean mask returned by duplicated().
main_df = main_df[~main_df.index.duplicated(keep='last')]
# Unstack the main_df
main_df = main_df.unstack('STATIONS_ID')
display('Shape of the main_df is: {}'.format(main_df.shape))
# Save main_df to a .csv file and a pickle to continue working in the next cell.
main_df.to_pickle(Path(out_file).with_suffix('.pkl'))
main_df.to_csv(Path(out_file).with_suffix('.csv'), sep=";")
display(main_df.head())
display(main_df.describe())
'Finished file: import/produkt_tu_stunde_20041101_20191231_00078.txt'
'This is file 10'
'Shape of the main_df is: (771356, 1)'
float    771356
Name: TT_TU, dtype: int64
'Shape of the main_df is: (113952, 9)'
TT_TU per STATIONS_ID (main_df.head()):

| MESS_DATUM | 3 | 44 | 71 | 73 | 78 | 91 | 96 | 102 | 125 |
|---|---|---|---|---|---|---|---|---|---|
| 2007-01-01 00:00:00 | 11.4 | NaN | NaN | NaN | 11.0 | 9.4 | NaN | 9.7 | NaN |
| 2007-01-01 01:00:00 | 12.0 | NaN | NaN | NaN | 11.4 | 9.6 | NaN | 10.4 | NaN |
| 2007-01-01 02:00:00 | 12.3 | NaN | NaN | NaN | 9.4 | 10.0 | NaN | 9.9 | NaN |
| 2007-01-01 03:00:00 | 11.5 | NaN | NaN | NaN | 9.3 | 9.7 | NaN | 9.5 | NaN |
| 2007-01-01 04:00:00 | 9.6 | NaN | NaN | NaN | 8.6 | 10.2 | NaN | 8.9 | NaN |
TT_TU per STATIONS_ID (main_df.describe()):

| STATIONS_ID | 3 | 44 | 71 | 73 | 78 | 91 | 96 | 102 | 125 |
|---|---|---|---|---|---|---|---|---|---|
| count | 37224.000000 | 111003.000000 | 88391.000000 | 111471.000000 | 113950.000000 | 113950.000000 | 6399.000000 | 106379.000000 | 82589.000000 |
| mean | 10.103922 | 9.933213 | 8.399764 | 7.501486 | 9.872268 | 9.199869 | 12.730255 | 10.149991 | 1.045942 |
| std | 7.200001 | 14.445973 | 8.779766 | 47.537112 | 7.281215 | 8.400713 | 23.189555 | 10.728030 | 86.520406 |
| min | -13.600000 | -999.000000 | -999.000000 | -999.000000 | -16.200000 | -999.000000 | -999.000000 | -999.000000 | -999.000000 |
| 25% | 5.000000 | 4.900000 | 2.200000 | 2.800000 | 4.700000 | 3.400000 | 7.250000 | 5.700000 | 1.800000 |
| 50% | 9.900000 | 10.000000 | 8.300000 | 9.300000 | 9.700000 | 8.900000 | 13.200000 | 10.200000 | 8.200000 |
| 75% | 15.300000 | 15.200000 | 14.200000 | 15.800000 | 15.000000 | 14.700000 | 18.500000 | 15.200000 | 14.500000 |
| max | 36.200000 | 37.000000 | 33.700000 | 36.700000 | 39.000000 | 36.900000 | 37.900000 | 33.400000 | 33.700000 |
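The describe output shows minima of -999.0 for several stations: DWD uses -999 as its missing-value sentinel, which badly skews mean and std here. Since this notebook exports to "export_uncleaned"/"to_clean", a later cleaning step will have to replace the sentinel; a minimal sketch of that replacement (the toy frame is illustrative, not DWD data):

```python
import numpy as np
import pandas as pd

# Toy frame with DWD's -999 missing-value sentinel mixed into real readings
df = pd.DataFrame({"TT_TU": [11.4, -999.0, 9.6, -999.0, 12.3]})

# Replace the sentinel with NaN so aggregations ignore those entries
cleaned = df.replace(-999.0, np.nan)

print(cleaned["TT_TU"].min())   # 9.6 instead of -999.0
print(cleaned["TT_TU"].mean())  # mean over the three valid readings only
```

After this substitution, describe() would report physically plausible minima and undistorted means for every station.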