Time series: Processing Notebook
This Notebook is part of the Time series Data Package of Open Power System Data.

Introductory Notes

This Notebook handles missing data, performs calculations and aggragations and creates the output files.

Settings

Set version number and recent changes

Executing this script till the end will create a new version of the data package. The Version number specifies the local directory for the data
We include a note on what has been changed.

In [ ]:
version = '2018-03-13'
changes = '''Include data for 2017 where available from primary sources and
include hourly load from ENTSO-E Power Statistics'''

Import Python libraries

This section: load libraries and set up a log.

Note that the download module makes use of the pycountry library that is not part of Anaconda. Install it with with pip install pycountry from the command line.

In [ ]:
# Python modules
from datetime import datetime, date, timedelta, time
import pandas as pd
import numpy as np
import logging
import json
import sqlite3
import yaml
import itertools
import os
import pytz
from shutil import copyfile

# Skripts from time-series repository
from timeseries_scripts.read import read
from timeseries_scripts.download import download
from timeseries_scripts.imputation import find_nan
from timeseries_scripts.imputation import resample_markers, glue_markers, mark_own_calc
from timeseries_scripts.make_json import make_json, get_sha_hash

# Reload modules with execution of any code, to avoid having to restart
# the kernel after editing timeseries_scripts
%load_ext autoreload
%autoreload 2

# speed up tab completion in Jupyter Notebook
%config Completer.use_jedi = False

Display options

In [ ]:
# Allow pretty-display of multiple variables
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Adjust the way pandas DataFrames a re displayed to fit more columns
pd.reset_option('display.max_colwidth')
pd.options.display.max_columns=60
#pd.options.display.max_colwidth=5

Set directories

In [ ]:
# make sure the working directory is this file's directory
try:
    os.chdir(home_path)
except NameError:
    home_path = os.path.realpath('.')
    
# optionally, set a different directory to store outputs and raw data,
# which will take up around 15 GB of disk space
use_external_dir = True
if use_external_dir:
    save_path = os.path.join('C:', os.sep, 'time series')
else:
    save_path = home_path

sources_yaml_path = os.path.join(home_path, 'input', 'sources.yml')
areas_csv_path = os.path.join(home_path, 'input', 'areas.csv')
data_path = os.path.join(save_path, version, 'original_data')
out_path = os.path.join(save_path, version)
temp_path = os.path.join(save_path, 'temp')
os.makedirs(out_path, exist_ok=True)
os.makedirs(temp_path, exist_ok=True)

# change to temp directory
os.chdir(temp_path)
os.getcwd()

Set up a log

In [ ]:
today = datetime.today().strftime('%Y-%m-%d')

FORMAT = '%(asctime)s %(levelname)s %(message)s'
DATEFORMAT = '%Y-%m-%d %H:%M:%S'
formatter = logging.Formatter(fmt=FORMAT, datefmt=DATEFORMAT)
logging.basicConfig(level=logging.INFO,
                    format=FORMAT,
                    datefmt=DATEFORMAT)

logfile = logging.FileHandler(today + '.log')
logfile.setFormatter(formatter)
logfile.setLevel(logging.INFO)
logger = logging.getLogger(__name__)
logger.addHandler(logfile)

logger.setLevel(logging.INFO)
# For more detailed logging messages, replace 'INFO' with 'DEBUG'
# (May slow down computation).

Select timerange

This section: select the time range and the data sources for download and read. Default: all data sources implemented, full time range available.

Source parameters are specified in input/sources.yml, which describes, for each source, the datasets (such as wind and solar generation) alongside all the parameters necessary to execute the downloads.

The option to perform downloading and reading of subsets is for testing only. To be able to run the script succesfully until the end, all sources have to be included, or otherwise the script will run into errors (i.e. the step where aggregate German timeseries are caculated requires data from all four German TSOs to be loaded).

In order to do this, specify the beginning and end of the interval for which to attempt the download.

Type None to download all available data.

In [ ]:
start_from_user = date(2006, 1, 1)
end_from_user = date(2017, 12, 31)

Select download source

Instead of downloading from the sources, the complete raw data can be downloaded as a zip file from the OPSD Server. Advantages are:

  • much faster download
  • back up of raw data in case it is deleted from the server at the original source

In order to do this, specify an archive version to use the raw data from that version that has been cached on the OPSD server as input. All data from that version will be downloaded - timerange and subset will be ignored.

Type None to download directly from the original sources.

In [ ]:
archive_version = None  # i.e. '2016-07-14'

Select subset

Optionally, specify a subset to download/read.
The next cell prints the available sources and datasets.

In [ ]:
with open(sources_yaml_path, 'r') as f:
    sources = yaml.load(f.read())
for k, v in sources.items():
    print(yaml.dump({k: list(v.keys())}, default_flow_style=False))

Copy from its output and paste to following cell to get the right format.
Type subset = None to include all data.

In [ ]:
subset = yaml.load('''
insert_source_here:
- insert_dataset1_from_that_source_here
- insert_dataset2_here
more_sources:
- more_data_sets
''')
subset = None  # to include all sources

# need to exclude Elia data due to unclear copyright situation
exclude = yaml.load('''
- Elia
''')

Now eliminate sources and variables not in subset.

In [ ]:
with open(sources_yaml_path, 'r') as f:
    sources = yaml.load(f.read())
if subset:  # eliminate sources and variables not in subset
    sources = {source_name: {k: v
                             for k, v in sources[source_name].items()
                             if k in variable_list}
               for source_name, variable_list in subset.items()}
if exclude:  # eliminate sources and variables in exclude
    sources = {source_name: variable_dict
               for source_name, variable_dict in sources.items()
               if not source_name in exclude}

Download

This section: download data. Takes about 1 hour to run for the complete data set (subset=None).

First, a data directory is created on your local computer. Then, download parameters for each data source are defined, including the URL. These parameters are then turned into a YAML-string. Finally, the download is executed file by file.

Each file is saved under it's original filename. Note that the original file names are often not self-explanatory (called "data" or "January"). The files content is revealed by its place in the directory structure.

Automatic download (for most sources)

In [ ]:
download(sources, data_path,
         archive_version=archive_version,
         start_from_user=start_from_user,
         end_from_user=end_from_user,
         testmode=False)
In [ ]:
download(sources, data_path,
         archive_version=archive_version,
         start_from_user=start_from_user,
         end_from_user=end_from_user,
         testmode=False)

Manual download

Energinet.dk

Go to http://osp.energinet.dk/_layouts/Markedsdata/framework/integrations/markedsdatatemplate.aspx.

Check The Boxes as specified below:

  • Periode
    • Hent udtræk fra perioden: 01-01-2005 Til: 01-01-2018
    • Select all months
  • Datakolonner
    • Elspot Pris, Valutakode/MWh: Select all
    • Produktion og forbrug, MWh/h: Select all
  • Udtræksformat
    • Valutakode: EUR
    • Decimalformat: Engelsk talformat (punktum som decimaltegn
    • Datoformat: Andet datoformat (ÅÅÅÅ-MM-DD)
    • Hent Udtræk: Til Excel

Click Hent Udtræk

You will receive a file Markedsata.xls of about 50 MB. Open the file in Excel. There will be a warning from Excel saying that file extension and content are in conflict. Select "open anyways" and and save the file as .xlsx.

In order to be found by the read-function, place the downloaded file in the following subdirectory:
{{data_path}}{{os.sep}}Energinet.dk{{os.sep}}prices_wind_solar{{os.sep}}2005-01-01_2017-12-31

CEPS

Go to http://www.ceps.cz/en/all-data#GenerationRES

check boxes as specified below:

DISPLAY DATA FOR: Generation RES
TURN ON FILTER checked
FILTER SETTINGS:

  • Set the date range
    • interval
    • from: 2012 to: 2018
  • Agregation and data version
    • Aggregation: Hour
    • Agregation function: average (AVG)
    • Data version: real data
  • Filter
    • Type of power plant: ALL
  • Click USE FILTER
  • DOWNLOAD DATA: DATA V TXT

You will receive a file data.txt of about 1.5 MB.

In order to be found by the read-function, place the downloaded file in the following subdirectory:
{{data_path}}{{os.sep}}CEPS{{os.sep}}wind_pv{{os.sep}}2012-01-01_2018-01-31

ENTSO-E Power Statistics

Go to https://www.entsoe.eu/data/statistics/Pages/monthly_hourly_load.aspx

check boxes as specified below:

  • Date From: 01-01-2016 Date To: 30-04-2016
  • Country: (Select All)
  • Scale values to 100% using coverage ratio: NO
  • View Report
  • Click the Save symbol and select Excel

You will receive a file 1.01 Monthly%5FHourly Load%5FValues%5FStatistical.xlsx of about 1 MB.

In order to be found by the read-function, place the downloaded file in the following subdirectory:
{{os.sep}}original_data{{os.sep}}ENTSO-E Power Statistics{{os.sep}}load{{os.sep}}2016-01-01_2016-04-30

The data covers the period from 01-01-2016 up to the present, but 4 months of data seems to be the maximum that interface supports for a single download request, so you have to repeat the download procedure for 4-Month periods to cover the whole period until the present.

Read

This section: Read each downloaded file into a pandas-DataFrame and merge data from different sources if it has the same time resolution. Takes ~15 minutes to run.

Preparations

Set the title of the rows at the top of the data used to store metadata internally. The order of this list determines the order of the levels in the resulting output.

In [ ]:
headers = ['region', 'variable', 'attribute', 'source', 'web', 'unit']

Read a prepared table containing meta data on the geographical areas

In [ ]:
areas = pd.read_csv(areas_csv_path)

View the areas table

In [ ]:
areas.loc[areas['area ID'].notnull(),:'EIC'].fillna('')

Reading loop

Loop through sources and variables to do the reading. First read the originla CSV, Excel etc. files into pandas DataFrames.

In [ ]:
areas = pd.read_csv(areas_csv_path)

# For each source in the source dictionary
for source_name, source_dict in sources.items():
    # For each variable from source_name
    for variable_name, param_dict in source_dict.items():
#        variable_dir = os.path.join(data_path, source_name, variable_name)
        res_list = param_dict['resolution']
        url = param_dict['web']
        for res_key in res_list:
            df = read(data_path, areas, source_name, variable_name,
                      url, res_key, headers,
                      start_from_user=start_from_user,
                      end_from_user=end_from_user)

            os.makedirs(res_key, exist_ok=True)
            filename = '_'.join([source_name, variable_name]) + '.pickle'
            df.to_pickle(os.path.join(res_key, filename))
In [ ]:
df

Then combine the DataFrames that have the same temporal resolution

In [ ]:
# Create a dictionary of empty DataFrames to be populated with data
data_sets = {'15min': pd.DataFrame(), '30min': pd.DataFrame(), '60min': pd.DataFrame()}
entso_e = {'15min': pd.DataFrame(), '30min': pd.DataFrame(), '60min': pd.DataFrame()}
for res_key in data_sets.keys():
    for filename in os.listdir(res_key):
        source_name = filename.split('_')[0]
        if subset and not source_name in subset.keys():
            continue
        logger.info(filename)
        df_portion = pd.read_pickle(os.path.join(res_key, filename))

        if source_name == 'ENTSO-E Transparency FTP':
            dfs = entso_e
        else:
            dfs = data_sets

        if dfs[res_key].empty:
            dfs[res_key] = df_portion
        elif not df_portion.empty:
            dfs[res_key] = dfs[res_key].combine_first(df_portion)
        else:
            logger.warning(filename + ' WAS EMPTY')
In [ ]:
for res_key, df in data_sets.items():
    logger.info(res_key + ': %s', df.shape)
for res_key, df in entso_e.items():
    logger.info('ENTSO-E ' + res_key + ': %s', df.shape)

Display some rows of the dataframes to get a first impression of the data.

In [ ]:
data_sets['15min'].head()

Save raw data

Save the DataFrames created by the read function to disk. This way you have the raw data to fall back to if something goes wrong in the ramainder of this notebook without having to repeat the previos steps.

In [ ]:
data_sets['15min'].to_pickle('raw_data_15.pickle')
data_sets['30min'].to_pickle('raw_data_30.pickle')
data_sets['60min'].to_pickle('raw_data_60.pickle')
entso_e['15min'].to_pickle('raw_entso_e_15.pickle')
entso_e['30min'].to_pickle('raw_entso_e_30.pickle')
entso_e['60min'].to_pickle('raw_entso_e_60.pickle')

Load the DataFrames saved above

In [ ]:
data_sets = {}
data_sets['15min'] = pd.read_pickle('raw_data_15.pickle')
data_sets['30min'] = pd.read_pickle('raw_data_30.pickle')
data_sets['60min'] = pd.read_pickle('raw_data_60.pickle')
entso_e = {}
entso_e['15min'] = pd.read_pickle('raw_entso_e_15.pickle')
entso_e['30min'] = pd.read_pickle('raw_entso_e_30.pickle')
entso_e['60min'] = pd.read_pickle('raw_entso_e_60.pickle')

Processing

This section: missing data handling, aggregation of sub-national to national data, aggregate 15'-data to 60'-resolution. Takes 30 minutes to run.

Missing data handling

Interpolation

Patch missing data. At this stage, only small gaps (up to 2 hours) are filled by linear interpolation. This catched most of the missing data due to daylight savings time transitions, while leaving bigger gaps untouched

The exact locations of missing data are stored in the nan_table DataFrames.

Where data has been interpolated, it is marked in a new column comment. For eaxample the comment solar_DE-transnetbw_generation; means that in the original data, there is a gap in the solar generation timeseries from TransnetBW in the time period where the marker appears.

Patch the datasets and display the location of missing Data in the original data. Takes ~5 minutes to run.

In [ ]:
nan_tables = {}
for res_key, df in data_sets.items():
    data_sets[res_key], nan_tables[res_key] = find_nan(
        df, res_key, headers, patch=True)
In [ ]:
for res_key, df in entso_e.items():
    entso_e[res_key], nan_tables[res_key + ' ENTSO-E'] = find_nan(
        df, res_key, headers, patch=True)

Execute this to see an example of where the data has been patched.

In [ ]:
data_sets['15min'][data_sets['15min']['interpolated_values'].notnull()].tail()

Display the table of regions of missing values

In [ ]:
nan_tables['60min']

You can export the NaN-tables to Excel in order to inspect where there are NaNs

In [ ]:
#writer = pd.ExcelWriter('NaN_table.xlsx') 
for res_key, df in nan_tables.items():
    # df.to_excel(writer, res_key)
    # For some reason, output to Excel doesn't work currently, we resort to csv
    df.to_csv('NaN_table' + res_key + '.csv')
#writer.save()

Save/Load the patched data sets

In [ ]:
data_sets['15min'].to_pickle('patched_15.pickle')
data_sets['30min'].to_pickle('patched_30.pickle')
data_sets['60min'].to_pickle('patched_60.pickle')
entso_e['15min'].to_pickle('patched_entso_e_15.pickle')
entso_e['30min'].to_pickle('patched_entso_e_30.pickle')
entso_e['60min'].to_pickle('patched_entso_e_60.pickle')
In [ ]:
data_sets = {}
data_sets['15min'] = pd.read_pickle('patched_15.pickle')
data_sets['30min'] = pd.read_pickle('patched_30.pickle')
data_sets['60min'] = pd.read_pickle('patched_60.pickle')
entso_e = {}
entso_e['15min'] = pd.read_pickle('patched_entso_e_15.pickle')
entso_e['30min'] = pd.read_pickle('patched_entso_e_30.pickle')
entso_e['60min'] = pd.read_pickle('patched_entso_e_60.pickle')

Country specific calculations

Calculate onshore wind generation for German TSOs

For 50 Hertz, it is already in the data. For TenneT, it is calculated by substracting offshore from total generation. For Amprion and TransnetBW, onshore wind generation is just total wind generation. Takes <1 second to run.

In [ ]:
# Some of the following operations require the Dataframes to be lexsorted in
# the columns
for res_key, df in data_sets.items():
    df.sort_index(axis=1, inplace=True)
In [ ]:
for area, source, url in zip(
    ['DE_amprion', 'DE_tennet', 'DE_transnetbw'],
    ['Amprion', 'TenneT', 'TransnetBW'],
    ['http://www.amprion.net/en/wind-feed-in',
     'http://www.tennettso.de/site/en/Transparency/publications/network-figures/actual-and-forecast-wind-energy-feed-in',
     'https://www.transnetbw.com/en/transparency/market-data/key-figures']):
    
    new_col_header = {
        'variable': 'wind_onshore',
        'region': '{area}',
        'attribute': 'generation_actual',
        'source': '{source}',
        'web': '{url}',
        'unit': 'MW'
    }

    if area == 'DE_tennet':
        colname = ('DE_tennet', 'wind_offshore', 'generation_actual', 'TenneT')
        offshore = data_sets['15min'].loc[:, colname]
    else:
        offshore = 0

    data_sets['15min'][
        tuple(new_col_header[level].format(area=area, source=source, url=url)
        for level in headers)
    ] = (data_sets['15min'][(area, 'wind', 'generation_actual', source)] - offshore)

    # Sort again
    data_sets['15min'].sort_index(axis=1, inplace=True)

Calculate aggregate wind capacity for Germany (on + offshore)

Apart from being interesting on it's own, this is also required to calculate an aggregated wind-profile for Germany

In [ ]:
new_col_header = {
    'variable': 'wind',
    'region': 'DE',
    'attribute': 'capacity',
    'source': 'own calculation based on BNetzA and netztransparenz.de',
    'web': 'http://data.open-power-system-data.org/renewable_power_plants',
    'unit': 'MW'
}
new_col_header = tuple(new_col_header[level] for level in headers)

data_sets['15min'][new_col_header] = (
    data_sets['15min']
    .loc[:, ('DE', ['wind_onshore', 'wind_offshore'], 'capacity')]
    .sum(axis=1, skipna=False))

# Sort again
data_sets['15min'].sort_index(axis=1, inplace=True)

Aggregate German data from individual TSOs and calculate availabilities/profiles

The wind and solar in-feed data for the 4 German balancing areas is summed up and stored in in new columns, which are then used to calculate profiles, that is, the share of wind/solar capacity producing at a given time. The column headers are created in the fashion introduced in the read script. Takes 5 seconds to run.

In [ ]:
for tech in ['solar', 'wind', 'wind_onshore', 'wind_offshore']:
    for attribute in ['generation_actual']:  # we could also include 'generation_forecast'
        sum_col = data_sets['15min'].loc(axis=1)[(
            ['DE_50hertz', 'DE_amprion', 'DE_tennet', 'DE_transnetbw'],
            tech, attribute)].sum(axis=1, skipna=False).to_frame()

        # Calculate aggregate German generation
        sum_col.columns = pd.MultiIndex.from_tuples(tuples, names=headers)

        if attribute == 'generation_actual':
            # Calculate the profile column
            profile_col = (sum_col.values /
                           data_sets['15min']['DE', tech, 'capacity']).round(4)

            # Create a new MultiIndex and append profile to the dataset
            new_col_header = {
                'variable': '{tech}',
                'region': 'DE',
                'attribute': 'profile',
                'source': 'own calculation based on German TSOs, BNetzA and netztranzparenz.de',
                'web': '',
                'unit': 'fraction'
            }
            tuples = [tuple(new_col_header[level].format(tech=tech)
                            for level in headers)]
            profile_col.columns = pd.MultiIndex.from_tuples(tuples, names=headers)
            data_sets['15min'] = data_sets['15min'].combine_first(profile_col)
        
        else:
            pass

        # Create a new MultiIndex 
        new_col_header = {
            'variable': '{tech}',
            'region': 'DE',
            'attribute': '{attribute}',
            'source': 'own calculation based on German TSOs',
            'web': '',
            'unit': 'MW'
        }
        tuples = [tuple(new_col_header[level].format(
            tech=tech, attribute=attribute) for level in headers)]
        
        # append aggregate German generation to the dataset after rounding
        data_sets['15min'] = data_sets['15min'].combine_first(sum_col.round(0))

Another savepoint

In [ ]:
data_sets['15min'].to_pickle('calc_15.pickle')
data_sets['30min'].to_pickle('calc_30.pickle')
data_sets['60min'].to_pickle('calc_60.pickle')
In [ ]:
data_sets = {}
data_sets['15min'] = pd.read_pickle('calc_15.pickle')
data_sets['30min'] = pd.read_pickle('calc_30.pickle')
data_sets['60min'] = pd.read_pickle('calc_60.pickle')
entso_e = {}
entso_e['15min'] = pd.read_pickle('patched_entso_e_15.pickle')
entso_e['30min'] = pd.read_pickle('patched_entso_e_30.pickle')
entso_e['60min'] = pd.read_pickle('patched_entso_e_60.pickle')

Fill columns not retrieved directly from TSO webites with ENTSO-E Transparency data

In [ ]:
for res_key, df in entso_e.items():
    # Combine with TSO data
    
    # Copy entire 30min data from ENTSO-E
    if data_sets[res_key].empty:
        data_sets[res_key] = df

    else:
        # Keep only region, variable, attribute in MultiIndex for comparison  
        data_cols = data_sets[res_key].columns.droplevel(['source', 'web', 'unit'])
        # Compare columns from ENTSO-E against ours, keep which we don't have yet
        tuples = [col for col in df.columns if not col[:3] in data_cols]
        add_cols = pd.MultiIndex.from_tuples(tuples, names=headers)
        data_sets[res_key] = data_sets[res_key].combine_first(df[add_cols])

        # Add the ENTSO-E markers (but only for the columns actually copied)
        add_cols = ['_'.join(col[:3]) for col in tuples]
        # Spread marker column out over a DataFrame for easiser comparison
        # Filter out everey secon colum, which contains the delimiter "|"
        # fromthe marker
        marker_table = (df['interpolated_values'].str.split(' | ', expand=True)
                        .filter(regex='^\d*[02468]$', axis='columns'))
        # Replace cells with markers marking columns not copied with NaNs
        marker_table[~marker_table.isin(add_cols)] = np.nan
        
        for col_name, col in marker_table.iteritems():
            if col_name == 0:
                marker_entso_e = col
            else:
                marker_entso_e = glue_markers(marker_entso_e, col)
        
        # Glue ENTSO-E marker onto our old marker
        marker = data_sets[res_key]['interpolated_values']
        data_sets[res_key].loc[:, 'interpolated_values'] = glue_markers(
            marker, df['interpolated_values'].reindex(marker.index))

Resample higher frequencies to 60'

Some data comes in 15 or 30-minute intervals (i.e. German or British renewable generation), other in 60-minutes (i.e. load data from ENTSO-E and Prices). We resample the 15 and 30-minute data to hourly resolution and append it to the 60-minutes dataset.

The marker column is resampled separately in such a way that all information on where data has been interpolated is preserved.

The .resample('H').mean() methods calculates the means from the values for 4 quarter hours [:00, :15, :30, :45] of an hour values, inserts that for :00 and drops the other 3 entries. Takes 15 seconds to run.

In [ ]:
#marker_60 = data_sets['60min']['interpolated_values']
for res_key, df in data_sets.items():
    if res_key == '60min':
        break
    # Resample first the marker column 
    marker_resampled = df['interpolated_values'].groupby(
        pd.Grouper(freq='60Min', closed='left', label='left')
        ).agg(resample_markers, drop_region='DE_AT_LU')
    marker_resampled = marker_resampled.reindex(data_sets['60min'].index)

    # Glue condensed 15 min marker onto 60 min marker
    data_sets['60min'].loc[:, 'interpolated_values'] = glue_markers(
        data_sets['60min']['interpolated_values'],
        marker_resampled.reindex(data_sets['60min'].index))

    # Drop DE_AT_LU bidding zone data from the 15 minute resolution data to
    # be resampled since it is already provided in 60 min resolution by 
    # ENTSO-E Transparency
    df = df.drop('DE_AT_LU', axis=1, errors='ignore')
    
    # Do the resampling
    resampled = df.resample('H').mean()
    resampled.columns = resampled.columns.map(mark_own_calc)
    resampled.columns.names = headers
    
    # Round the resampled columns
    for col in resampled.columns:
        if col[2] == 'profile':        
            resampled.loc[:,col] = resampled.loc[:,col].round(4)
        else:
            resampled.loc[:,col] = resampled.loc[:,col].round(0)
    
    data_sets['60min'] = data_sets['60min'].combine_first(resampled)

Insert a column with Central European (Summer-)time

The index column of th data sets defines the start of the timeperiod represented by each row of that data set in UTC time. We include an additional column for the CE(S)T Central European (Summer-) Time, as this might help aligning the output data with other data sources.

In [ ]:
info_cols = {'utc': 'utc_timestamp',
             'cet': 'cet_cest_timestamp',
             'marker': 'interpolated_values'}
In [ ]:
for res_key, df in data_sets.items():
    if df.empty:
        continue
    df.index.rename(info_cols['utc'], inplace=True)
    df.insert(0, info_cols['cet'],
              df.index.tz_localize('UTC').tz_convert('Europe/Brussels'))

Create a final savepoint

In [ ]:
data_sets['15min'].to_pickle('final_15.pickle')
data_sets['30min'].to_pickle('final_30.pickle')
data_sets['60min'].to_pickle('final_60.pickle')
In [ ]:
os.chdir(temp_path)
data_sets = {}
data_sets['15min'] = pd.read_pickle('final_15.pickle')
data_sets['30min'] = pd.read_pickle('final_30.pickle')
data_sets['60min'] = pd.read_pickle('final_60.pickle')

Show the colum names contained inthe final DataFrame in a table

In [ ]:
col_info = pd.DataFrame()
df = data_sets['60min']
for level in df.columns.names:
    col_info[level] = df.columns.get_level_values(level)
    
col_info

Write data to disk

This section: Save as Data Package (data in CSV, metadata in JSON file). All files are saved in the directory of this notebook. Alternative file formats (SQL, XLSX) are also exported. Takes about 1 hour to run.

Limit time range

Cut off the data outside of [start_from_user:end_from_user]

In [ ]:
for res_key, df in data_sets.items():
    # In order to make sure that the respective time period is covered in both
    # UTC and CE(S)T, we set the start in CE(S)T, but the end in UTC
    if start_from_user:
        start_from_user = (
            pytz.timezone('Europe/Brussels')
            .localize(datetime.combine(start_from_user, time()))
            .astimezone(pytz.timezone('UTC')))
    if end_from_user:
        end_from_user = (
            pytz.timezone('UTC')
            .localize(datetime.combine(end_from_user, time()))
            # Appropriate offset to inlude the end of period
            + timedelta(days=1, minutes=-int(res_key[:2])))
    # Then cut off the data_set
    data_sets[res_key] = df.loc[start_from_user:end_from_user, :]

Different shapes

Data are provided in three different "shapes":

  • SingleIndex (easy to read for humans, compatible with datapackage standard, small file size)
    • Fileformat: CSV, SQLite
  • MultiIndex (easy to read into GAMS, not compatible with datapackage standard, small file size)
    • Fileformat: CSV, Excel
  • Stacked (compatible with data package standard, large file size, many rows, too many for Excel)
    • Fileformat: CSV

The different shapes need to be created internally befor they can be saved to files. Takes about 1 minute to run.

In [ ]:
data_sets_singleindex = {}
data_sets_multiindex = {}
data_sets_stacked = {}
for res_key, df in data_sets.items():
    if df.empty:
        continue

#    # Round floating point numbers to 2 digits
#    for col_name, col in df.iteritems():
#        if col_name[0] in info_cols.values():
#            pass
#        elif col_name[2] == 'profile':
#            df[col_name] = col.round(4)
#        else:
#            df[col_name] = col.round(3)
            
    # MultIndex
    data_sets_multiindex[res_key + '_multiindex'] = df
    
    # SingleIndex
    df_singleindex = df.copy()
    # use first 3 levels of multiindex to create singleindex
    df_singleindex.columns = [
        col_name[0] if col_name[0] in info_cols.values()
        else '_'.join([level for level in col_name[0:3] if not level == ''])
        for col_name in df.columns.values]

    data_sets_singleindex[res_key + '_singleindex'] = df_singleindex
    
    # Stacked
    stacked = df.copy().drop(columns=info_cols['cet'], level=0)
    stacked.columns = stacked.columns.droplevel(['source', 'web', 'unit'])
    # Concatrenate all columns below each other (="stack").
    # df.transpose().stack() is faster than stacking all column levels 
    # seperately
    stacked = stacked.transpose().stack(dropna=True).to_frame(name='data')
    data_sets_stacked[res_key + '_stacked'] = stacked

Write to SQL-database

This file format is required for the filtering function on the OPSD website. This takes ~3 minutes to complete.

In [ ]:
for res_key, df in data_sets_singleindex.items():
    table = 'time_series_' + res_key
    df = df.copy()
    df.index = df.index.strftime('%Y-%m-%dT%H:%M:%SZ')
    cet_col_name = info_cols['cet']
    df[cet_col_name] = (df[cet_col_name].dt.strftime('%Y-%m-%dT%H:%M:%S%z'))
    df.to_sql(table, sqlite3.connect('time_series.sqlite'),
              if_exists='replace', index_label=info_cols['utc'])

Write to Excel

Writing the full tables to Excel takes extremely long. As a workaround, only the first 5 rows are exported. The rest of the data can than be inserted manually from the _multindex.csv files.

In [ ]:
writer = pd.ExcelWriter('time_series1.xlsx')
for res_key, df in data_sets_multiindex.items():
    df.head().to_excel(writer, res_key.split('_')[0], float_format='%.2f',
                       merge_cells=True)
    # merge_cells=False doesn't work properly with multiindex
writer.save()

Write to CSV

This takes about 10 minutes to complete.

In [ ]:
# itertoools.chain() allows iterating over multiple dicts at once
for res_stacking_key, df in itertools.chain(
        data_sets_singleindex.items(),
        data_sets_multiindex.items(),
        data_sets_stacked.items()
    ):
    
    if not res_stacking_key.startswith('60min'):
        continue
        
    df = df.copy()
    
    # convert the format of the cet_cest-timestamp to ISO-8601
    if not (res_stacking_key.split('_')[1] == 'stacked':
#            or type(df.iloc[0,0]) == str):
        df.iloc[:,0] = df.iloc[:,0].dt.strftime('%Y-%m-%dT%H:%M:%S%z')
    filename = 'time_series_' + res_stacking_key + '.csv'
    df.to_csv(filename, float_format='%.4f',
              date_format='%Y-%m-%dT%H:%M:%SZ')

Create metadata

This section: create the metadata, both general and column-specific. All metadata we be stored as a JSON file. Takes 10s to run.

In [ ]:
# Change to out_path directory
os.chdir(out_path)
os.getcwd()
In [ ]:
make_json(data_sets, info_cols, version, changes, headers, areas)

Write checksums.txt

We publish SHA-checksums for the outputfiles on GitHub to allow verifying the integrity of outputfiles on the OPSD server.

In [ ]:
os.chdir(out_path)
files = os.listdir(out_path)

# Create checksums.txt in the output directory
with open('checksums.txt', 'w') as f:
    for file_name in files:
        if file_name.split('.')[-1] in ['csv', 'sqlite', 'xlsx']:
            file_hash = get_sha_hash(file_name)
            f.write('{},{}\n'.format(file_name, file_hash))

# Copy the file to root directory from where it will be pushed to GitHub,
# leaving a copy in the version directory for reference
copyfile('checksums.txt', os.path.join(home_path, 'checksums.txt'))