National generation capacity: Processing notebook
This Notebook is part of the National Generation Capacity Datapackage of Open Power System Data.

1. Introductory notes

The script processes the compiled nationally aggregated generation capacity for European countries. Due to varying formats and data specifications of references for national generation capacity, the script firstly focuses on rearranging the manually compiled data. Thus, the script itself does not collect, select, download or manage data from original sources. Secondly, international data sources, such as EUROSTAT and ENTSO-E, are directly downloaded from original web sources and complement the initial data set.

2. Script setup

In [ ]:
import json
import logging
import os.path
import shutil
import sqlite3
import hashlib
import pandas as pd
import yaml  # http://pyyaml.org/, pip install pyyaml, conda install pyyaml
import os
import shutil
import numpy as np
import urllib.parse
import urllib.request
import posixpath
import datetime
import zipfile
import openpyxl
from openpyxl.styles import PatternFill, colors, Font, Alignment
from openpyxl.utils import get_column_letter

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    datefmt='%d %b %Y %H:%M:%S'
)

logger = logging.getLogger()

# create download and output folder if they do not exist
os.makedirs('download', exist_ok=True)
os.makedirs('output', exist_ok=True)
os.makedirs(os.path.join('output'), exist_ok=True)
os.makedirs(os.path.join('output', 'original_data'), exist_ok=True)
# logging.info(format)
In [ ]:
def get_sha_hash(path, blocksize=65536):
    sha_hasher = hashlib.sha256()
    with open(path, 'rb') as f:
        buffer = f.read(blocksize)
        while len(buffer) > 0:
            sha_hasher.update(buffer)
            buffer = f.read(blocksize)
        return sha_hasher.hexdigest()
In [ ]:
def downloadandcache(url, filename):
    """
    Download a file into a folder called "downloads".
    Returns the local filepath.

    Parameters
    ----------
    url : str
        Url of a file to be downloaded
    filename : str
        Name of the downloaded file
    """

    path = urllib.parse.urlsplit(url).path
    now = datetime.datetime.now()
    datestring = str(now.year) + "-" + str(now.month) + "-" + str(now.day)
    filepath = os.path.join('download', datestring + "-" + filename)
    filepath_original_data = os.path.join('output',
                                          'original_data',
                                          filename)

    # check if file exists, otherwise download it
    if not os.path.exists(filepath):
        logger.info('Downloading file %s', filename)
        urllib.request.urlretrieve(url, filepath)
        urllib.request.urlretrieve(url, filepath_original_data)
    else:
        logger.info('Using local file from %s', filepath)

    return filepath

3. Data download and import

We compile data from different national and international sources. Firstly, national data sources are manually compiled due to varying data formats and specifications. Secondly, international sources are compiled directly and appended to the compiled data set. The international data sources comprise:

In the following section, the data sets are downloaded and uploaded to Python.

3.1 Manually compiled dataset

The manually compiled dataset is imported and rearranged to a DataFrame for further processing. The dataset comprises for each European country and specified generation technology different data entries, which are based on different sources. As these sources differ by country and year, information on the corresponding reference are directly given with the data entry.

In [ ]:
data_file = 'National_Generation_Capacities.xlsx'
filepath = os.path.join('input', data_file)

# Read data into pandas
data_raw = pd.read_excel(filepath,
                         sheetname='Summary',
                         header=None,
                         na_values=['-'],
                         skiprows=0)


# Deal with merged cells from Excel: fill first three rows with information
data_raw.iloc[0:2] = data_raw.iloc[0:2].fillna(method='ffill', axis=1)

# Set index for rows
data_raw = data_raw.set_index([0])
data_raw.index.name = 'technology'

# Extract energylevels from raw data
energylevels_raw = data_raw.ix[:, 0:5]

# Delete definition of energy levels from raw data
data_raw = data_raw.drop(data_raw.columns[[0, 1, 2, 3, 4]], axis=1)

# Set multiindex column names
data_raw.columns = pd.MultiIndex.from_arrays(data_raw[:6].values,
                                             names=['country', 'type', 'year',
                                                    'source', 'source_type',
                                                    'capacity_definition'])

# Remove 3 rows which are already used as column names
data_raw = data_raw[pd.notnull(data_raw.index)]

# Extract the ordering of technologies
technology_order = data_raw.index.str.replace('- ', '').values.tolist()

# data_raw

3.2 EUROSTAT data

EUROSTAT publishes annual structural data on national electricity generation capacities for European countries. The dataset is available in the EUROSTAT database within the category 'Environment and Energy' (nrg_113a).

In [ ]:
# Data source for EUROSTAT data
# http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Fnrg_113a.tsv.gz

url_eurostat = ('http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/'
                'BulkDownloadListing?sort=1&downfile=data%2Fnrg_113a.tsv.gz')

filepath_eurostat = downloadandcache(url_eurostat, 'nrg_113a.tsv.gz')

print(filepath_eurostat)
In [ ]:
data_eurostat = pd.read_csv(filepath_eurostat,
                            compression='gzip',
                            sep='\t|,',
                            engine='python')

data_eurostat['source'] = 'EUROSTAT'
data_eurostat['source_type'] = 'Statistical Office'
data_eurostat['type'] = 'Installed capacity in MW'

# data_eurostat.head()

3.3 ENTSO-E data

The ENTSO-E publishes annual data on national generation capacites in different specifications and formats. We use two relevant data sources from the ENTSOE-E, which comprises firstly statistical data within the Data Portal (up to 2015) or ENTSO-E Transparency Platform, and secondly datasets compiled within the ENTSO-E System Outlook & Adequacy Forecast (SO&AF). The ENTSO-E Transparency Platform is currently not implemented as a data source for national generation capacities.

The advantage of the ENTSO-E SO&AF is the higher granularity of the data with respect to the main fuel or technology. However, as the SO&AF provides a forecast on future system conditions in particular peak hours, the dataset also accounts for expected capacity changes throughout the years. Therefore, we only consider years which are closest to the publication year of the respective SO&AF.

3.3.1 ENTSO-E statistical data

In the following, we use the statistical data available in the Data Portal (up to 2015).

In [ ]:
# Data source for ENTSO-E capacity data
# https://www.entsoe.eu/fileadmin/template/other/statistical_database/excel.php?pid=151&show_repr=2&opt_period=1&opt_Year=2015&send=send&opt_Response=99&dataindx=0

data_entsoe = pd.DataFrame()
for year in range(2010, 2016):
    url_entsoe = ('https://www.entsoe.eu/fileadmin/template/other/'
                  'statistical_database/excel.php?pid=151&show_repr=2&'
                  'opt_period=1&opt_Year=' + str(year) + '&send=send&opt_Response=99&dataindx=0')
    filepath_entsoe = downloadandcache(url_entsoe, 'Statistics_' + str(year) + '.xls')
    rawdata_entsoe = pd.read_excel(filepath_entsoe, sheet='Statistics', skiprows=5, header=1)
    data_entsoe = data_entsoe.append(rawdata_entsoe)

# data_entsoe.head()

3.3.2 ENTSO-E SO&AF data

The following datasets cover the ENTSO-E System Outlook & Adequacy Forecast (SO&AF) for the years 2011-2015. For each SO&AF dataset, we process only the capacity values of the first year, except for the SO&AF 2014, which provides two relevant years.

In [ ]:
url_entsoe_soaf2015 = (
    'https://www.entsoe.eu/Documents/Publications/SDC/data/SO_AF_2015_dataset.zip')
url_entsoe_soaf2014 = (
    'https://www.entsoe.eu/Documents/SDC%20documents/SOAF/140602_SOAF%202014_dataset.zip')
url_entsoe_soaf2013 = (
    'https://www.entsoe.eu/fileadmin/user_upload/_library/publications/entsoe/So_AF_2013-2030/130403_SOAF_2013-2030_dataset.zip')
url_entsoe_soaf2012 = (
    'https://www.entsoe.eu/fileadmin/user_upload/_library/SDC/SOAF/120705_SOAF_2012_Dataset.zip')
url_entsoe_soaf2011 = (
    'https://www.entsoe.eu/fileadmin/user_upload/_library/SDC/SOAF/SO_AF_2011_-_2025_.zip')

data_soaf = pd.DataFrame()
data_soaf_2011 = pd.DataFrame()
data_soaf_2012 = pd.DataFrame()
data_soaf_2013 = pd.DataFrame()
data_soaf_2014 = pd.DataFrame()
data_soaf_2015 = pd.DataFrame()

# 2011
filepath_entsoe_soaf = downloadandcache(url_entsoe_soaf2011, 'SO_AF_2011_-_2025_.zip')
with zipfile.ZipFile(filepath_entsoe_soaf, "r") as O:
    O.extractall('download')

xlsx = pd.ExcelFile(os.path.join('download', 'SO&AF 2011 - 2025 Scenario B.xls'))
list_sheets = xlsx.sheet_names

for sheet in list_sheets:
    data_to_append = pd.read_excel(os.path.join('download', 'SO&AF 2011 - 2025 Scenario B.xls'),
                                   sheetname=str(sheet), skiprows=11, header=[0], index_col=0)
    data_to_append = data_to_append[[2011]]
    data_to_append['country'] = str(sheet[0:2])
    data_to_append = data_to_append.set_index(['country'], append=True)
    data_soaf_2011 = data_soaf_2011.append(data_to_append)

# 2012
filepath_entsoe_soaf = downloadandcache(url_entsoe_soaf2012, '120705_SOAF_2012_Dataset.zip')
with zipfile.ZipFile(filepath_entsoe_soaf, "r") as O:
    O.extractall('download')

xlsx = pd.ExcelFile(os.path.join('download', 'SOAF 2012 Scenario B.xls'))
list_sheets = xlsx.sheet_names

for sheet in list_sheets:
    data_to_append = pd.read_excel(os.path.join('download', 'SOAF 2012 Scenario B.xls'),
                                   sheetname=str(sheet), skiprows=11, header=[0], index_col=0)
    data_to_append = data_to_append[[2012]]
    data_to_append['country'] = str(sheet[0:2])
    data_to_append = data_to_append.set_index(['country'], append=True)
    data_soaf_2012 = data_soaf_2012.append(data_to_append)

# 2013
filepath_entsoe_soaf = downloadandcache(url_entsoe_soaf2013, '130403_SOAF_2013-2030_dataset.zip')
with zipfile.ZipFile(filepath_entsoe_soaf, "r") as O:
    O.extractall('download')

xlsx = pd.ExcelFile(os.path.join('download', 'ScB.xls'))
list_sheets = xlsx.sheet_names

for sheet in list_sheets:
    data_to_append = pd.read_excel(os.path.join('download', 'ScB.xls'),
                                   sheetname=str(sheet), skiprows=11, header=[0], index_col=0)
    data_to_append = data_to_append[[2013]]
    data_to_append['country'] = str(sheet[0:2])
    data_to_append = data_to_append.set_index(['country'], append=True)
    data_soaf_2013 = data_soaf_2013.append(data_to_append)

# 2014
filepath_entsoe_soaf = downloadandcache(url_entsoe_soaf2014, '140602_SOAF%202014_dataset.zip')
with zipfile.ZipFile(filepath_entsoe_soaf, "r") as O:
    O.extractall('download')

xlsx = pd.ExcelFile(os.path.join('download', 'ScB.xlsx'))
list_sheets = xlsx.sheet_names

for sheet in list_sheets:
    data_to_append = pd.read_excel(os.path.join('download', 'ScB.xlsx'),
                                   sheetname=str(sheet), skiprows=11, header=[0], index_col=0)
    data_to_append = data_to_append[[2014, 2015]]
    data_to_append['country'] = str(sheet[0:2])
    data_to_append = data_to_append.set_index(['country'], append=True)
    data_soaf_2014 = data_soaf_2014.append(data_to_append)

# 2015
filepath_entsoe_soaf = downloadandcache(url_entsoe_soaf2015, 'SO_AF_2015_dataset.zip')
with zipfile.ZipFile(filepath_entsoe_soaf, "r") as O:
    O.extractall('download')

xlsx = pd.ExcelFile(os.path.join('download', 'SO&AF 2015 dataset', 'ScB_publication.xlsx'))
list_sheets = xlsx.sheet_names

for sheet in list_sheets:
    data_to_append = pd.read_excel(os.path.join('download', 'SO&AF 2015 dataset', 'ScB_publication.xlsx'),
                                   sheetname=str(sheet), skiprows=11, header=[0], index_col=0)
    data_to_append = data_to_append[[2016]]
    data_to_append['country'] = str(sheet[0:2])
    data_to_append = data_to_append.set_index(['country'], append=True)
    data_soaf_2015 = data_soaf_2015.append(data_to_append)
In [ ]:
data_soaf = pd.concat([data_soaf_2011,data_soaf_2012,data_soaf_2013,data_soaf_2014,data_soaf_2015], axis=1)

4. Data processing

In the following section, the different data sources are standardized and combined to a single data set.

4.1 Manually compiled dataset

The initial raw data is given as a cross-table format. To ensure the compatibility of standard data formats, we convert the initial crosstab format of the input data to a list.

In [ ]:
# Reshape dataframe to list
data_opsd = pd.DataFrame(data_raw.stack(level=['source', 'source_type', 'year',
                                               'type', 'country',
                                               'capacity_definition']))

# Reset index for dataframe
data_opsd = data_opsd.reset_index()
data_opsd['technology'] = data_opsd['technology'].str.replace('- ', '')
data_opsd = data_opsd.set_index('technology')

# Delete entries with missing source
data_opsd = data_opsd[data_opsd['source'].isnull() == False]
data_opsd = data_opsd[data_opsd['source'] != 0]

# Delete entries from EUROSTAT and entsoe as they will be directly used from original sources
data_opsd = data_opsd[data_opsd['source'] != 'EUROSTAT']
data_opsd = data_opsd[data_opsd['source'] != 'entsoe']


data_opsd = data_opsd.rename(columns={0: 'capacity'})

data_opsd['capacity'] = pd.to_numeric(data_opsd['capacity'], errors='coerce')

data_opsd = data_opsd.reset_index()

# For some source, permission to publish data
data_opsd.loc[(data_opsd['source'] == 'ELIA'),
              'comment'] = 'data available, but cannot be provided'
data_opsd.loc[(data_opsd['source'] == 'BMWi'),
              'comment'] = 'data available, but cannot be provided'
data_opsd.loc[(data_opsd['source'] == 'Mavir'),
              'comment'] = 'data available, but cannot be provided'

# data_opsd.head()

4.2 EUROSTAT data

4.2.1 Convert cross-table format to list

In [ ]:
data_eurostat = pd.melt(data_eurostat, id_vars=['unit', 'product',
                                                'indic_nrg', 'geo\\time',
                                                'source', 'source_type', 'type'],
                        var_name='year', value_name='value')
# data_eurostat.head()

4.2.2 Read definition table for energy sources and merge with dataset

The classification of generation capacities in the EUROSTAT dataset is specified in Regulation (EC) No 1099/2008 (Annex B, 3.3). The available EUROSTAT dataset nrg_113a covers the following indicators:

indic_nrg Description Technology in OPSD
12_1176011 Electrical capacity, main activity producers - Combustible Fuels Fossil fuels & bioenergy
12_1176012 Electrical capacity, autoproducers - Combustible Fuels Fossil fuels & bioenergy
12_1176061 Electrical capacity, main activity producers - Mixed plants
12_1176101 Electrical capacity, main activity producers - Other Sources
12_1176102 Electrical capacity, autoproducers - Other Sources
12_1176111 Electrical capacity, main activity producers - Steam
12_1176112 Electrical capacity, autoproducers - Steam
12_1176121 Electrical capacity, main activity producers - Gas Turbine
12_1176122 Electrical capacity, autoproducers - Gas Turbine
12_1176131 Electrical capacity, main activity producers - Combined Cycle
12_1176132 Electrical capacity, autoproducers - Combined Cycle
12_1176141 Electrical capacity, main activity producers - Internal Combustion
12_1176142 Electrical capacity, autoproducers - Internal Combustion
12_1176401 Electrical capacity, main activity producers - Other Type of Generation
12_1176402 Electrical capacity, autoproducers - Other Type of Generation
12_1176253 Net maximum capacity - Municipal Wastes Non-renewable waste
12_1176263 Net maximum capacity - Wood/Wood Wastes/Other Solid Wastes Other bioenergy and renewable waste
12_1176273 Net maximum capacity - Biogases Biomass and biogas
12_1176283 Net maximum capacity - Industrial Wastes (non-renewable) Non-renewable waste
12_1176343 Net maximum capacity - Liquid Biofuels Biomass and biogas
12_1176031 Electrical capacity, main activity producers - Nuclear Nuclear
12_1176032 Electrical capacity, autoproducers - Nuclear Nuclear
12_1176051 Electrical capacity, main activity producers - Hydro Hydro
12_1176052 Electrical capacity, autoproducers - Hydro Hydro
12_1176071 Net electrical capacity, main activity producers - Pure Pumped Hydro Pumped storage
12_1176072 Net electrical capacity, autoproducers - Pure Pumped Hydro Pumped storage
12_117615 Net maximum capacity - Hydro <1 MW
12_117616 Net maximum capacity - Hydro >= 1 MW and <= 10 MW
12_117617 Net maximum capacity - Hydro 10 MW and over
12_1176301 Electrical capacity, main activity producers - Tide, wave and ocean Marine
12_1176302 Electrical capacity, autoproducers - Tide, wave and ocean Marine
12_1176303 Net maximum capacity - Tide, Wave, Ocean
12_1176081 Electrical capacity, main activity producers - Geothermal Geothermal
12_1176082 Electrical capacity, autoproducers - Geothermal Geothermal
12_1176083 Net maximum capacity - Geothermal
12_1176091 Electrical capacity, main activity producers - Wind Wind
12_1176092 Electrical capacity, autoproducers - Wind Wind
12_1176233 Net maximum capacity - Solar Photovoltaic Photovoltaics
12_1176243 Net maximum capacity - Solar Thermal Electric Concentrated solar power

Bold rows indicate top level classes within the EUROSTAT classification, whereas normal and italic rows cover different kinds of subclassifications. Especially within the top level 'Combustible fuels' different kinds of subcategorizations based on fuel or technology are available. Simarily, 'Hydro' is differentiated by type (e.g. pumped-hydro storage) or capacity classes. Italic rows are not further considered within the OPSD dataset due to the mismatch with existing technology classes.

In [ ]:
data_definition = pd.read_csv(os.path.join('input', 'definition_EUROSTAT_indic.txt'),
                              header=None,
                              names=['indic', 'description',
                                     'energy source'],
                              sep='\t')

data_eurostat = data_eurostat.merge(data_definition,
                                    how='left',
                                    left_on='indic_nrg',
                                    right_on='indic')
In [ ]:
data_eurostat = data_eurostat[data_eurostat['energy source'].isnull() == False]

data_eurostat['value'] = data_eurostat['value'].astype(str)
data_eurostat['value'], data_eurostat['comment'] = data_eurostat.value.str.split(' ', 1).str

data_eurostat['value'] = data_eurostat['value'].replace(':', np.nan)
data_eurostat.loc[data_eurostat['value'].isnull() == True,
                  'comment'] = 'not available'

data_eurostat['year'] = data_eurostat['year'].astype(int)
data_eurostat['value'] = data_eurostat['value'].astype(float)

# data_eurostat.head()

4.2.3 Adjust EUROSTAT data to OPSD data format

In the following, we adjust the EUROSTAT dataset to the OPSD data format. The necessary steps are:

  1. Rename, delete columns aligned with OPSD dataset,
  2. Deselect aggregated countries not contained in the OPSD dataset,
  3. Aggregate entries based on energy source, year and country.
In [ ]:
data_eurostat = data_eurostat.drop(['unit', 'product', 'indic_nrg',
                                    'indic', 'description'], axis=1)

data_eurostat = data_eurostat.rename(columns={'geo\\time': 'country',
                                              'energy source': 'technology',
                                              'value': 'capacity'})
# data_eurostat.head()
In [ ]:
data_eurostat['country'] = data_eurostat['country'].replace('UK', 'GB')
data_eurostat['country'] = data_eurostat['country'].replace('EL', 'GR')

# Limit countries to OPSD countries (if required)
#countries = data_opsd.country.unique()
#data_eurostat = data_eurostat[data_eurostat['country'].isin(countries)]

# Consider only countries and no aggregates, like EU28 or EA19
data_eurostat = data_eurostat[data_eurostat['country'].apply(len) == 2]


# data_eurostat.head()
In [ ]:
data_eurostat = pd.DataFrame(data_eurostat.groupby(['technology', 'source',
                                                    'source_type', 'year',
                                                    'type', 'country'])
                             ['capacity'].sum())

data_eurostat.loc[data_eurostat['capacity'].isnull() == True,
                  'comment'] = 'not available'
data_eurostat['comment'] = data_eurostat['comment'].fillna('').astype(str)
data_eurostat = data_eurostat.reset_index()
# data_eurostat.head()

4.2.4 Determine aggregated values

The categorization of technologies within the OPSD data format requires the specification of aggregates and/or subcategories to ensure consistency of the dataset on each technology level.

4.2.4.1 Technology and fuel level

In [ ]:
country_list = data_eurostat.country.unique()
year_list = data_eurostat.year.unique()

table_to_append = pd.DataFrame(columns=['country', 'year', 'technology', 'capacity'])
table_to_append['year'] = table_to_append['year'].astype(int)
table_to_append['capacity'] = table_to_append['capacity'].astype(float)
for country in country_list:
    for year in year_list:

        col_list = ['country', 'year', 'technology', 'capacity']
        data_selected = data_eurostat.loc[(data_eurostat['country'] == country)
                                          & (data_eurostat['year'] == int(year))].copy()
        data_selected.loc[data_selected['capacity'].isnull() == True, 'capacity'] = 0

        # Solar
        val_1 = data_selected.loc[data_selected['technology'] == 'Photovoltaics',
                                  'capacity'].values[0]
        val_2 = data_selected.loc[data_selected['technology'] == 'Concentrated solar power',
                                  'capacity'].values[0]
        val = val_1 + val_2
        row_to_append = pd.DataFrame([[country,
                                       int(year),
                                       'Differently categorized solar',
                                       0]], columns=col_list)
        table_to_append = table_to_append.append(row_to_append)
        row_to_append = pd.DataFrame([[country,
                                       int(year),
                                       'Solar',
                                       val]], columns=col_list)
        table_to_append = table_to_append.append(row_to_append)

        # Wind
        val_1 = data_selected.loc[data_selected['technology'] == 'Wind',
                                  'capacity'].values[0]
        val = val_1
        row_to_append = pd.DataFrame([[country,
                                       int(year),
                                       'Differently categorized wind',
                                       val]], columns=col_list)
        table_to_append = table_to_append.append(row_to_append)

        # Hydro
        val_1 = data_selected.loc[data_selected['technology'] == 'Hydro',
                                  'capacity'].values[0]
        val_2 = data_selected.loc[data_selected['technology'] == 'Pumped storage',
                                  'capacity'].values[0]
        val = val_1 - val_2
        row_to_append = pd.DataFrame([[country,
                                       int(year),
                                       'Differently categorized hydro',
                                       val]], columns=col_list)
        table_to_append = table_to_append.append(row_to_append)

        # Bioenergy and renewable waste ### Sum of 'Biomass and biogas' and 'Other bioenergy[...]'. No 'Sewage and landfill'
        val_1 = data_selected.loc[data_selected['technology'] == 'Biomass and biogas',
                                  'capacity'].values[0]
        val_2 = data_selected.loc[data_selected['technology'] == 'Other bioenergy and renewable waste',
                                  'capacity'].values[0]
        val = val_1 + val_2
        row_to_append = pd.DataFrame([[country,
                                       int(year),
                                       'Bioenergy and renewable waste',
                                       val]], columns=col_list)
        table_to_append = table_to_append.append(row_to_append)

        # Fossil fuels
        # The EUROSTAT definition of combustible fuels includes also biomass, which needs to be
        # separated within the OPSD categories.

        val_1 = data_selected.loc[data_selected['technology'] == 'Fossil fuels',
                                  'capacity'].values[0]
        val_2 = data_selected.loc[data_selected['technology'] == 'Biomass and biogas',
                                  'capacity'].values[0]
        val_3 = data_selected.loc[data_selected['technology'] == 'Other bioenergy and renewable waste',
                                  'capacity'].values[0]
        val = val_1 - val_2 - val_3
        row_to_append = pd.DataFrame([[country,
                                       int(year),
                                       'Fossil fuels',
                                       val]], columns=col_list)
        table_to_append = table_to_append.append(row_to_append)

        val_4 = data_selected.loc[data_selected['technology'] == 'Non-renewable waste',
                                  'capacity'].values[0]
        val = val_1 - val_2 - val_3 - val_4
        row_to_append = pd.DataFrame([[country,
                                       int(year),
                                       'Differently categorized fossil fuels',
                                       val]], columns=col_list)
        table_to_append = table_to_append.append(row_to_append)


# Drop 'Fossil Fuels' which still include 'Biomass and biogas'
data_eurostat = data_eurostat[data_eurostat.technology != 'Fossil fuels']

# Append OPSD-conform data to original frame with corrected 'Fossil fuels' value
data_eurostat = data_eurostat.append(table_to_append)

4.2.4.2 Fuel type level

In [ ]:
table_to_append = pd.DataFrame(columns=['country', 'year', 'technology', 'capacity'])
table_to_append['year'] = table_to_append['year'].astype(int)
table_to_append['capacity'] = table_to_append['capacity'].astype(float)
for country in country_list:
    for year in year_list:

        col_list = ['country', 'year', 'technology', 'capacity']
        data_selected = data_eurostat.loc[(data_eurostat['country'] == country)
                                          & (data_eurostat['year'] == int(year))].copy()
        data_selected.loc[data_selected['capacity'].isnull() == True, 'capacity'] = 0

        # Sum Renewable energy sources
        val_1 = data_selected.loc[data_selected['technology'] == 'Hydro',
                                  'capacity'].values[0]
        val_2 = data_selected.loc[data_selected['technology'] == 'Wind',
                                  'capacity'].values[0]
        val_3 = data_selected.loc[data_selected['technology'] == 'Solar',
                                  'capacity'].values[0]
        val_4 = data_selected.loc[data_selected['technology'] == 'Geothermal',
                                  'capacity'].values[0]
        val_5 = data_selected.loc[data_selected['technology'] == 'Marine',
                                  'capacity'].values[0]
        val_6 = data_selected.loc[data_selected['technology'] == 'Bioenergy and renewable waste',
                                  'capacity'].values[0]
        val = val_1 + val_2 + val_3 + val_4 + val_5 + val_6
        row_to_append = pd.DataFrame([[country,
                                       int(year),
                                       'Renewable energy sources',
                                       val]], columns=col_list)
        table_to_append = table_to_append.append(row_to_append)

# Append to data frame
data_eurostat = data_eurostat.append(table_to_append)

4.2.4.3 Total level

In [ ]:
table_to_append = pd.DataFrame(columns=['country', 'year', 'technology', 'capacity'])
table_to_append['year'] = table_to_append['year'].astype(int)
table_to_append['capacity'] = table_to_append['capacity'].astype(float)
for country in country_list:
    for year in year_list:

        col_list = ['country', 'year', 'technology', 'capacity']
        data_selected = data_eurostat.loc[(data_eurostat['country'] == country)
                                          & (data_eurostat['year'] == int(year))].copy()
        data_selected.loc[data_selected['capacity'].isnull() == True, 'capacity'] = 0

        # Sum Renewable energy sources
        val_1 = data_selected.loc[data_selected['technology'] == 'Fossil fuels',
                                  'capacity'].values[0]
        val_2 = data_selected.loc[data_selected['technology'] == 'Nuclear',
                                  'capacity'].values[0]
        val_3 = data_selected.loc[data_selected['technology'] == 'Renewable energy sources',
                                  'capacity'].values[0]
        val = val_1 + val_2 + val_3
        row_to_append = pd.DataFrame([[country,
                                       int(year),
                                       'Total',
                                       val]], columns=col_list)
        table_to_append = table_to_append.append(row_to_append)

data_eurostat = data_eurostat.append(table_to_append)

4.2.5 Add additional information

In [ ]:
data_eurostat['source'] = 'EUROSTAT'
data_eurostat['source_type'] = 'Statistical Office'
data_eurostat['capacity_definition'] = 'Unknown'
data_eurostat['type'] = 'Installed capacity in MW'

#data_eurostat.head()

4.3 ENTSO-E statistical data

4.3.1 Align table structure to OPSD

In [ ]:
data_entsoe = pd.melt(data_entsoe,
                      id_vars=['country', 'year', 'repr'],
                      var_name='energy_source',
                      value_name='value')

data_entsoe = data_entsoe.rename(columns={'energy_source': 'technology',
                                          'value': 'capacity',
                                          'repr': 'comment'})
#data_entsoe['comment'] = 'Representativity of total national dataset: ' + data_entsoe['comment'].astype(str)

data_entsoe['country'] = data_entsoe['country'].replace('NI', 'GB')


# Limit countries to OPSD countries (if required)
#countries = data_opsd.country.unique()
#data_entsoe = data_entsoe[data_entsoe['country'].isin(countries)]

data_entsoe = pd.DataFrame(data_entsoe.groupby(['country', 'year',
                                                'technology'])
                           ['capacity'].sum())
data_entsoe = data_entsoe.reset_index()

data_entsoe['source'] = 'entsoe Statistics'
data_entsoe['source_type'] = 'Other association'
data_entsoe['capacity_definition'] = 'Net capacity'
data_entsoe['type'] = 'Installed capacity in MW'

# data_entsoe.head()

4.3.2 Standardize definition of energy sources/technologies

In [ ]:
dict_energy_source = {'hydro': 'Hydro',
                      'of which storage': 'Reservoir',
                      'of which run of river': 'Run-of-river',
                      'of which pumped storage': 'Pumped storage',
                      'nuclear': 'Nuclear',
                      'renewable': 'renewable',  # auxiliary definition, will be deleted at a later stage
                      'of which wind': 'Wind',
                      'of which solar': 'Solar',
                      'of which biomass': 'Biomass and biogas',
                      'fossil_fuels': 'Fossil fuels',
                      'others': 'Other or unspecified energy sources',
                      ' ': 'NaN'}
data_entsoe["technology"].replace(dict_energy_source, inplace=True)
data_entsoe["technology"].replace('NaN', np.nan, inplace=True)

data_entsoe = data_entsoe[data_entsoe['technology'].isnull() == False]

# data_entsoe.head()

4.3.3 Determine aggregated values

4.3.3.1 Technology and fuel level

In [ ]:
country_list = data_entsoe.country.unique()
year_list = data_entsoe.year.unique()

table_to_append = pd.DataFrame(columns=['country', 'year', 'technology', 'capacity'])
table_to_append['year'] = table_to_append['year'].astype(int)
table_to_append['capacity'] = table_to_append['capacity'].astype(float)
for country in country_list:
    for year in year_list:

        col_list = ['country', 'year', 'technology', 'capacity']
        data_selected = data_entsoe.loc[(data_entsoe['country'] == country)
                                        & (data_entsoe['year'] == int(year))].copy()
        data_selected.loc[data_selected['capacity'].isnull() == True, 'capacity'] = 0

        # Solar
        val_1 = data_selected.loc[data_selected['technology'] == 'Solar',
                                  'capacity'].values[0]
        val = val_1
        row_to_append = pd.DataFrame([[country,
                                       int(year),
                                       'Differently categorized solar',
                                       val]], columns=col_list)
        table_to_append = table_to_append.append(row_to_append)

        # Wind
        val_1 = data_selected.loc[data_selected['technology'] == 'Wind',
                                  'capacity'].values[0]
        val = val_1
        row_to_append = pd.DataFrame([[country,
                                       int(year),
                                       'Differently categorized wind',
                                       val]], columns=col_list)
        table_to_append = table_to_append.append(row_to_append)

        # Hydro
        val_1 = data_selected.loc[data_selected['technology'] == 'Hydro',
                                  'capacity'].values[0]
        val_2 = data_selected.loc[data_selected['technology'] == 'Run-of-river',
                                  'capacity'].values[0]
        val_3 = data_selected.loc[data_selected['technology'] == 'Reservoir',
                                  'capacity'].values[0]
        val_4 = data_selected.loc[data_selected['technology'] == 'Pumped storage',
                                  'capacity'].values[0]
        val = val_1 - val_2 - val_3 - val_4
        row_to_append = pd.DataFrame([[country,
                                       int(year),
                                       'Differently categorized hydro',
                                       val]], columns=col_list)
        table_to_append = table_to_append.append(row_to_append)

        # Biomass
        val_1 = data_selected.loc[data_selected['technology'] == 'Biomass and biogas',
                                  'capacity'].values[0]
        val = val_1
        row_to_append = pd.DataFrame([[country,
                                       int(year),
                                       'Bioenergy and renewable waste',
                                       val]], columns=col_list)
        table_to_append = table_to_append.append(row_to_append)

        # Differently categorized renewable energy sources
        val_1 = data_selected.loc[data_selected['technology'] == 'renewable',
                                  'capacity'].values[0]
        val_2 = data_selected.loc[data_selected['technology'] == 'Wind',
                                  'capacity'].values[0]
        val_3 = data_selected.loc[data_selected['technology'] == 'Solar',
                                  'capacity'].values[0]
        val_4 = data_selected.loc[data_selected['technology'] == 'Biomass and biogas',
                                  'capacity'].values[0]
        val = val_1 - val_2 - val_3 - val_4
        row_to_append = pd.DataFrame([[country,
                                       int(year),
                                       'Differently categorized renewable energy sources',
                                       val]], columns=col_list)
        table_to_append = table_to_append.append(row_to_append)

        # Differently categorized fossil fuels
        val_1 = data_selected.loc[data_selected['technology'] == 'Fossil fuels',
                                  'capacity'].values[0]
        val = val_1
        row_to_append = pd.DataFrame([[country,
                                       int(year),
                                       'Differently categorized fossil fuels',
                                       val]], columns=col_list)
        table_to_append = table_to_append.append(row_to_append)

data_entsoe = data_entsoe.append(table_to_append)

# Skip auxiliary technology class
data_entsoe = data_entsoe[data_entsoe['technology'] != 'renewable']

4.3.3.2 Fuel type level

In [ ]:
table_to_append = pd.DataFrame(columns=['country', 'year', 'technology', 'capacity'])
table_to_append['year'] = table_to_append['year'].astype(int)
table_to_append['capacity'] = table_to_append['capacity'].astype(float)
for country in country_list:
    for year in year_list:

        col_list = ['country', 'year', 'technology', 'capacity']
        data_selected = data_entsoe.loc[(data_entsoe['country'] == country)
                                        & (data_entsoe['year'] == int(year))].copy()
        data_selected.loc[data_selected['capacity'].isnull() == True, 'capacity'] = 0

        # Renewable energy sources
        val_1 = data_selected.loc[data_selected['technology'] == 'Solar',
                                  'capacity'].values[0]
        val_2 = data_selected.loc[data_selected['technology'] == 'Wind',
                                  'capacity'].values[0]
        val_3 = data_selected.loc[data_selected['technology'] == 'Hydro',
                                  'capacity'].values[0]
        val_4 = data_selected.loc[data_selected['technology'] == 'Bioenergy and renewable waste',
                                  'capacity'].values[0]
        val_5 = data_selected.loc[data_selected['technology'] == 'Differently categorized renewable energy sources',
                                  'capacity'].values[0]
        val = val_1 + val_2 + val_3 + val_4 + val_5
        row_to_append = pd.DataFrame([[country,
                                       int(year),
                                       'Renewable energy sources',
                                       val]], columns=col_list)
        table_to_append = table_to_append.append(row_to_append)

data_entsoe = data_entsoe.append(table_to_append)

4.3.3.3 Total level

In [ ]:
table_to_append = pd.DataFrame(columns=['country', 'year', 'technology', 'capacity'])
table_to_append['year'] = table_to_append['year'].astype(int)
table_to_append['capacity'] = table_to_append['capacity'].astype(float)
for country in country_list:
    for year in year_list:

        col_list = ['country', 'year', 'technology', 'capacity']
        data_selected = data_entsoe.loc[(data_entsoe['country'] == country)
                                        & (data_entsoe['year'] == int(year))].copy()
        data_selected.loc[data_selected['capacity'].isnull() == True, 'capacity'] = 0

        # Total capacity
        val_1 = data_selected.loc[data_selected['technology'] == 'Renewable energy sources',
                                  'capacity'].values[0]
        val_2 = data_selected.loc[data_selected['technology'] == 'Nuclear',
                                  'capacity'].values[0]
        val_3 = data_selected.loc[data_selected['technology'] == 'Fossil fuels',
                                  'capacity'].values[0]
        val_4 = data_selected.loc[data_selected['technology'] == 'Other or unspecified energy sources',
                                  'capacity'].values[0]
        val = val_1 + val_2 + val_3 + val_4
        row_to_append = pd.DataFrame([[country,
                                       int(year),
                                       'Total',
                                       val]], columns=col_list)
        table_to_append = table_to_append.append(row_to_append)

data_entsoe = data_entsoe.append(table_to_append)

4.3.4 Add additional information

In [ ]:
data_entsoe['source'] = 'entsoe Statistics'
data_entsoe['source_type'] = 'Other association'
data_entsoe['capacity_definition'] = 'Net capacity'
data_entsoe['type'] = 'Installed capacity in MW'

#data_entsoe.head()

4.4 ENTSO-E SO&AF data

4.4.1 Align table structure to OPSD

In [ ]:
data_soaf = data_soaf.reset_index()
data_soaf = data_soaf.rename(columns={'level_0': 'technology',
                                      'level_1': 'country'})

data_soaf = pd.melt(data_soaf,
                    id_vars=['technology', 'country'],
                    var_name='year',
                    value_name='capacity')

data_soaf['country'] = data_soaf['country'].replace('NI', 'GB')

# Limit countries to OPSD countries (if required)
#countries = data_opsd.country.unique()
#data_entsoe = data_entsoe[data_entsoe['country'].isin(countries)]

data_soaf = pd.DataFrame(data_soaf.groupby(['country', 'year',
                                            'technology'])
                         ['capacity'].sum())
data_soaf = data_soaf.reset_index()

data_soaf.loc[:, 'capacity'] *= 1000
data_soaf['source'] = 'entsoe SOAF'
data_soaf['source_type'] = 'Other association'
data_soaf['capacity_definition'] = 'Net capacity'
data_soaf['type'] = 'Installed capacity in MW'

# data_soaf.head()

4.4.2 Standardize definition of energy sources/technologies

In [ ]:
dict_technologies = {'Nuclear Power': 'Nuclear',
                     'Fossil Fuels': 'Fossil fuels',
                     'Hard Coal': 'Hard coal',
                     'Lignite': 'Lignite',
                     'Gas': 'Natural gas',
                     'Oil': 'Oil',
                     'Mixed Fuels': 'Mixed fossil fuels',
                     'Hydro power (total)': 'Hydro',
                     'of which renewable hydro generation': 'NaN',
                     'of which run-of-river (pre-dominantly)': 'Run-of-river',
                     'of which storage and pumped storage (total)': 'Reservoir including pumped storage', # auxiliary class definition
                     'Renewable Energy Sources (other than hydro)': 'renewable',
                     'Solar': 'Solar',
                     'Wind': 'Wind',
                     'of which offshore': 'Offshore',
                     'of which onshore': 'Onshore',
                     'Biomass': 'Biomass and biogas',
                     'Not Clearly Identifiable Energy Sources': 'Other or unspecified energy sources',
                     'Net generating Capacity': 'NaN',
                     'Import Capacity': 'NaN',
                     'Export Capacity': 'NaN',
                     'Load': 'NaN',
                     'Load Management': 'NaN',
                     'Maintenance and Overhauls': 'NaN',
                     'Margin Against Seasonal Peak Load': 'NaN',
                     'Adequacy Reference Margin': 'NaN',
                     'National Power Data': 'NaN',
                     'Non-Usable Capacity': 'NaN',
                     'Outages': 'NaN',
                     'Reliable Available Capacity': 'NaN',
                     'Remaining Capacity': 'NaN',
                     'Spare Capacity': 'NaN',
                     'System Service Reserve': 'NaN',
                     'Unavailable Capacity': 'NaN',
                     'Simultaneous Exportable Capacity for Adequacy': 'NaN',
                     'Simultaneous Importable Capacity for Adequacy': 'NaN',
                     '“The values of Simultaneous Importable/Exportable Capacity for Adequacy do not include the border with Austria as there is a common market between Germany and Austria for which no NTC exists.”': 'NaN'}

data_soaf['technology'].replace(dict_technologies, inplace=True)
data_soaf['technology'].replace('NaN', np.nan, inplace=True)
data_soaf = data_soaf[data_soaf['technology'].isnull() == False]

# data_soaf.head()

4.4.3 Determine aggregated values

4.4.3.1 Technology and fuel level

In [ ]:
country_list = data_soaf.country.unique()
year_list = data_soaf.year.unique()

table_to_append = pd.DataFrame(columns=['country', 'year', 'technology', 'capacity'])
table_to_append['year'] = table_to_append['year'].astype(int)
table_to_append['capacity'] = table_to_append['capacity'].astype(float)
for country in country_list:
    for year in year_list:

        col_list = ['country', 'year', 'technology', 'capacity']
        data_selected = data_soaf.loc[(data_soaf['country'] == country)
                                      & (data_soaf['year'] == int(year))].copy()
        data_selected.loc[data_selected['capacity'].isnull() == True, 'capacity'] = 0

        # Solar
        val_1 = data_selected.loc[data_selected['technology'] == 'Solar',
                                  'capacity'].values[0]
        val = val_1
        row_to_append = pd.DataFrame([[country,
                                       int(year),
                                       'Differently categorized solar',
                                       val]], columns=col_list)
        table_to_append = table_to_append.append(row_to_append)

        # Wind
        val_1 = data_selected.loc[data_selected['technology'] == 'Wind',
                                  'capacity'].values[0]
        val_2 = data_selected.loc[data_selected['technology'] == 'Offshore',
                                  'capacity'].values[0]
        val_3 = data_selected.loc[data_selected['technology'] == 'Onshore',
                                  'capacity'].values[0]
        val = val_1 - val_2 - val_3
        row_to_append = pd.DataFrame([[country,
                                       int(year),
                                       'Differently categorized wind',
                                       val]], columns=col_list)
        table_to_append = table_to_append.append(row_to_append)

        # Hydro
        val_1 = data_selected.loc[data_selected['technology'] == 'Hydro',
                                  'capacity'].values[0]
        val_2 = data_selected.loc[data_selected['technology'] == 'Run-of-river',
                                  'capacity'].values[0]
        val_3 = data_selected.loc[data_selected['technology'] == 'Reservoir including pumped storage',
                                  'capacity'].values[0]
        val = val_1 - val_2 - val_3
        row_to_append = pd.DataFrame([[country,
                                       int(year),
                                       'Differently categorized hydro',
                                       val]], columns=col_list)
        table_to_append = table_to_append.append(row_to_append)

        # Biomass
        val_1 = data_selected.loc[data_selected['technology'] == 'Biomass and biogas',
                                  'capacity'].values[0]
        val = val_1
        row_to_append = pd.DataFrame([[country,
                                       int(year),
                                       'Bioenergy and renewable waste',
                                       val]], columns=col_list)
        table_to_append = table_to_append.append(row_to_append)

        # Differently categorized renewable energy sources
        val_1 = data_selected.loc[data_selected['technology'] == 'renewable',
                                  'capacity'].values[0]
        val_2 = data_selected.loc[data_selected['technology'] == 'Wind',
                                  'capacity'].values[0]
        val_3 = data_selected.loc[data_selected['technology'] == 'Solar',
                                  'capacity'].values[0]
        val_4 = data_selected.loc[data_selected['technology'] == 'Biomass and biogas',
                                  'capacity'].values[0]
        val = val_1 - val_2 - val_3 - val_4
        row_to_append = pd.DataFrame([[country,
                                       int(year),
                                       'Differently categorized renewable energy sources',
                                       val]], columns=col_list)
        table_to_append = table_to_append.append(row_to_append)

        # Fossil fuels
        val_1 = data_selected.loc[data_selected['technology'] == 'Fossil fuels',
                                  'capacity'].values[0]

        val_2 = data_selected.loc[data_selected['technology'] == 'Lignite',
                                  'capacity'].values[0]

        val_3 = data_selected.loc[data_selected['technology'] == 'Hard coal',
                                  'capacity'].values[0]

        val_4 = data_selected.loc[data_selected['technology'] == 'Oil',
                                  'capacity'].values[0]

        val_5 = data_selected.loc[data_selected['technology'] == 'Natural gas',
                                  'capacity'].values[0]

        val_6 = data_selected.loc[data_selected['technology'] == 'Mixed fossil fuels',
                                  'capacity'].values[0]

        val = val_1 - val_2 - val_3 - val_4 - val_5 - val_6

        row_to_append = pd.DataFrame([[country,
                                       int(year),
                                       'Differently categorized fossil fuels',
                                       val]], columns=col_list)
        table_to_append = table_to_append.append(row_to_append)

data_soaf = data_soaf.append(table_to_append)

# Skip auxiliary technology class
data_soaf = data_soaf[data_soaf['technology'] != 'renewable']

4.4.3.2 Fuel type and total level

In [ ]:
table_to_append = pd.DataFrame(columns=['country', 'year', 'technology', 'capacity'])
table_to_append['year'] = table_to_append['year'].astype(int)
table_to_append['capacity'] = table_to_append['capacity'].astype(float)
for country in country_list:
    for year in year_list:

        col_list = ['country', 'year', 'technology', 'capacity']
        data_selected = data_soaf.loc[(data_soaf['country'] == country)
                                      & (data_soaf['year'] == int(year))].copy()
        data_selected.loc[data_selected['capacity'].isnull() == True, 'capacity'] = 0

        # Renewable energy sources
        val_1 = data_selected.loc[data_selected['technology'] == 'Solar',
                                  'capacity'].values[0]
        val_2 = data_selected.loc[data_selected['technology'] == 'Wind',
                                  'capacity'].values[0]
        val_3 = data_selected.loc[data_selected['technology'] == 'Hydro',
                                  'capacity'].values[0]
        val_4 = data_selected.loc[data_selected['technology'] == 'Bioenergy and renewable waste',
                                  'capacity'].values[0]
        val_5 = data_selected.loc[data_selected['technology'] == 'Differently categorized renewable energy sources',
                                  'capacity'].values[0]
        val = val_1 + val_2 + val_3 + val_4 + val_5
        row_to_append = pd.DataFrame([[country,
                                       int(year),
                                       'Renewable energy sources',
                                       val]], columns=col_list)
        table_to_append = table_to_append.append(row_to_append)

        # Total capacity
        val_1 = val  # renewable energy sources
        val_2 = data_selected.loc[data_selected['technology'] == 'Fossil fuels',
                                  'capacity'].values[0]
        val_3 = data_selected.loc[data_selected['technology'] == 'Nuclear',
                                  'capacity'].values[0]
        val_4 = data_selected.loc[data_selected['technology'] == 'Other or unspecified energy sources',
                                  'capacity'].values[0]
        val = val_1 + val_2 + val_3 + val_4
        row_to_append = pd.DataFrame([[country,
                                       int(year),
                                       'Total',
                                       val]], columns=col_list)
        table_to_append = table_to_append.append(row_to_append)


data_soaf = data_soaf.append(table_to_append)

4.4.4 Add additional information

In [ ]:
data_soaf['source'] = 'entsoe SO&AF'
data_soaf['source_type'] = 'Other association'
data_soaf['capacity_definition'] = 'Net capacity'
data_soaf['type'] = 'Installed capacity in MW'

#data_soaf.head()

4.5 Merge data sources

In [ ]:
data = pd.concat([data_opsd, data_eurostat, data_soaf, data_entsoe])
data = data.reset_index(drop=True)

# Define data types
data['comment'] = data['comment'].fillna('').astype(str)
data = data.astype(str)
data['capacity'] = data['capacity'].astype(float)
data['year'] = data['year'].astype(int)

# Sort columns
data = data[['technology', 'source', 'source_type',
             'year', 'type', 'country',
             'capacity_definition', 'capacity', 'comment']]

# data.head()

5. Implementation of energy source levels

Due to varying categorizations in the country-specific raw data, a revised categorization of energy sources and conversion technologies is applied with the aim to reflect the individual categorization of the different national references at the best. We specify the following four different energy source levels, which can be seen as general levels of the classification:

  • Energy source level 0 - Total generation capacity
  • Energy source level 1 - Generation capacity by energy type (fossil, nuclear, renewable, other)
  • Energy source level 2 - Generation capacity by fuel or energy source (e.g. coal, lignite, hard coal, natural gas, wind)
  • Energy source level 3 - Generation capacity by fuel or energy source and detailed fuel for bioenergy
  • Technology level - Generation capacity by fuel and technology (e.g. combined cycle natural gas, gas turbine, onshore wind, offshore wind)
In [ ]:
# Drop first rows
energylevels = energylevels_raw[5:]

# Define multiindex for DataFrame
energylevels.columns = pd.MultiIndex.from_arrays(energylevels[:1].values,
                                                 names=['level'])

# Delete double rows due to multiindex defintion
energylevels = energylevels[pd.notnull(energylevels.index)]

# Reset index for Dataframe
energylevels = energylevels.reset_index()
energylevels['technology'] = energylevels['technology'].str.replace('- ', '')
energylevels = energylevels.set_index('technology')

# Rename column headings
energylevels.rename(columns={'Level 0': 'energy_source_level_0',
                             'Level 1': 'energy_source_level_1',
                             'Level 2': 'energy_source_level_2',
                             'Level 3': 'energy_source_level_3',
                             'Technology level': 'technology_level'},
                    inplace=True)

# Convert 0/1 assignment to boolean
energylevels = energylevels.replace({0: False, 1: True})

# Apply technology levels to data by merging both DataFrames
data = data.merge(energylevels,
                  left_on='technology',
                  right_index=True,
                  how='left')

data = data.reset_index(drop=True)

# data.head()

6. Convert stacked data to crosstable format

In [ ]:
data_crosstable = pd.pivot_table(data[['technology', 'source',
                                       'source_type', 'year',
                                       'type', 'country',
                                       'capacity_definition',
                                       'capacity']],
                                 index=['technology'],
                                 columns=['country', 'type', 'year',
                                          'source', 'source_type',
                                          'capacity_definition'],
                                 values='capacity')

# Apply initial ordering of technologies
data_crosstable = data_crosstable.reindex(technology_order)

# Delete index naming
data_crosstable.index.name = None
data_crosstable.columns.names = ('Country (ISO code)',
                                 'Type of data', 'Year',
                                 'Source', 'Type of source',
                                 'Capacity definition (net, gross, unknown)')

# Select reduced dataset since 2013
data_crosstable_since2013 = data_crosstable.T.select(
    lambda x: x[2] in [2013, 2014, 2015, 2016, 2017]).T

# data_crosstable.head()
# data_crosstable_since2013.head()
In [ ]:
energylevels_table = energylevels_raw

energylevels_table.columns = pd.MultiIndex.from_arrays(energylevels_raw[:6].values,
                                                       names=['country', 'type', 'year',
                                                              'source', 'source_type',
                                                              'capacity_definition'
                                                              ])

# Remove 3 rows which are already used as column names
energylevels_table = energylevels_table[pd.notnull(energylevels_table.index)]

energylevels_table = energylevels_table.reset_index()
energylevels_table['technology'] = energylevels_table['technology'].str.replace('- ', '')
energylevels_table = energylevels_table.set_index('technology')

# Delete index naming
energylevels_table.index.name = None
energylevels_table.columns.names = ('Country (ISO code)',
                                    'Description', None,
                                    None, None,
                                    'Level')

# energylevels_table.head()

7. Documentation of the data package

We document the data packages meta data in the specific format JSON as proposed by the Open Knowledge Foundation. See the Frictionless Data project by OKFN (http://data.okfn.org/) and the Data Package specifications (http://dataprotocols.org/data-packages/) for more details.

In order to keep the notebook more readable, we first formulate the metadata in the human-readable YAML format using a multi-line string. We then parse the string into a Python dictionary and save that to disk as a JSON file.

In [ ]:
# Here we define meta data of the resulting data package.
# The meta data follows the specification at:
# http://dataprotocols.org/data-packages/

metadata = """

name: opsd-national_generation_capacity
title: National generation capacity
description: Aggregated generation capacity by technology and country
long_description: This data package comprises technology-specific aggregated generation capacities for European countries. The generation capacities are consistently categorized based on fuel and technology. For each European country, various references are used ranging from international (e.g. ENTSOE or EUROSTAT) to national sources from e.g. regulatory authorities. The input data is processed in the script linked below. 
version: "2017-07-07"
keywords: 
    - generation capacity
    - conventional generation
    - renewable generation
    - Europe
    - Open Power System Data
homepage: http://data.open-power-system-data.org/national_generation_capacity/
documentation: https://github.com/Open-Power-System-Data/national_generation_capacity/blob/2017-07-07/main.ipynb
geographical-scope: 28 European countries
last_changes: Revised technology classification, restructured input file format, adjusted input data to final version

resources:
    - path: national_generation_capacity.xlsx
      format: xlsx
      mediatype: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
    - path: national_generation_capacity_since2013.xlsx
      format: xlsx
      mediatype: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
    - path: national_generation_capacity_stacked.csv
      format: csv
      mediatype: text/csv
      schema:    
        fields:
            - name: id
              description: ID for data entries 
              type: integer
            - name: technology
              description: Generation technology defined by fuel and conversion technology
              type: string
              opsd-contentfilter: "true"
            - name: source
              description: Source of data entry
              type: string
              opsd-contentfilter: "true"
            - name: source_type
              description: Type of data source
              type: string
              opsd-contentfilter: "true"
            - name: year
              description: Year of data entry
              type: integer
              format: YYYY
              opsd-contentfilter: "true"
            - name: type
              description: Type of capacity (e.g. installed capacity)
              type: string
            - name: country
              description: Country ISO code
              type: string
              opsd-contentfilter: "true"
            - name: capacity_definition
              description: Capacity definition used in the relevant source (net, gross, or unknown)
              type: string
            - name: comment
              description: Comments on data entry
              type: string
            - name: capacity
              description: Installed capacity in MW
              type: float
            - name: energy_source_level_0
              description: Energy source level 0 (total aggregated capacity)
              type: boolean
            - name: energy_source_level_1
              description: Energy source level 1 (aggregation or classification by type of fuel)
              type: boolean
            - name: energy_source_level_2
              description: Energy source level 2 (aggregation or classification by fuel)
              type: boolean
            - name: energy_source_level_3
              description: Energy source level 3 (aggregation or classification by fuel refined for bioenergy)
              type: boolean
            - name: technology_level
              description: Technology (aggregation or classification by fuel and technology)
              type: boolean
    - path: national_generation_capacity_stacked.xlsx
      format: xlsx
      mediatype: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
      schema:    
        fields:
            - name: id
              description: ID for data entries 
              type: integer
            - name: technology
              description: Generation technology defined by fuel and conversion technology
              type: string
            - name: source
              description: Source of data entry
              type: string
            - name: source_type
              description: Type of data source
              type: string
            - name: year
              description: Year of data entry
              type: integer
              format: YYYY
            - name: type
              description: Type of capacity (e.g. installed capacity)
              type: string
            - name: country
              description: Country ISO code
              type: string
            - name: capacity_definition
              description: Capacity definition used in the relevant source (net, gross, or unknown)
              type: string
            - name: comment
              description: Comments on data entry
              type: string
            - name: capacity
              description: Installed capacity in MW
              type: float
            - name: energy_source_level_0
              description: Energy source level 0 (total aggregated capacity)
              type: boolean
            - name: energy_source_level_1
              description: Energy source level 1 (aggregation or classification by type of fuel)
              type: boolean
            - name: energy_source_level_2
              description: Energy source level 2 (aggregation or classification by fuel)
              type: boolean
            - name: energy_source_level_3
              description: Energy source level 3 (aggregation or classification by fuel refined for bioenergy)
              type: boolean
            - name: technology_level
              description: Technology (aggregation or classification by fuel and technology)
              type: boolean
              
sources:
    - name: ENTSOE SO&AF
      web: https://www.entsoe.eu/outlooks/maf/Pages/default.aspx
    - name: ENTSOE Statistics
      web: https://www.entsoe.eu/db-query/miscellaneous/net-generating-capacity
    - name: EUROSTAT
      web: http://ec.europa.eu/eurostat/product?code=nrg_113a&mode=view
    - name: e-control
      web: http://www.e-control.at/statistik/strom/bestandsstatistik
    - name: ELIA
      web: http://www.elia.be/en/grid-data/power-generation/generating-facilities
    - name: UN Statistical Office
      web: http://data.un.org/Data.aspx?d=EDATA&f=cmID%3AEC
    - name: BFE
      web: http://www.bfe.admin.ch/themen/00526/00541/00542/00630/index.html?dossier_id=00765
    - name: ERU
      web: http://www.eru.cz/en/elektrina/statistika-a-sledovani-kvality/rocni-zpravy-o-provozu
    - name: BMWi
      web: http://www.bmwi.de/BMWi/Redaktion/Binaer/Energiedaten/energietraeger10-stromerzeugungskapazitaeten-bruttostromerzeugung,property=blob,bereich=bmwi2012,sprache=de,rwb=true.xls
    - name: DEA
      web:  http://www.ens.dk/en/info/facts-figures/energy-statistics-indicators-energy-efficiency/annual-energy-statistics
    - name: REE
      web: http://www.ree.es/en/statistical-data-of-spanish-electrical-system/national-indicators/national-indicators
    - name: RTE 2014
      web: http://www.rte-france.com/en/document/overview-electrical-energy-france-march-2014
    - name: RTE 2015
      web:  http://clients.rte-france.com/lang/an/visiteurs/vie/prod/parc_reference.jsp
    - name: RTE 2017
      web: http://clients.rte-france.com/servlets/ParcRefServletV3?type=&year=2017      
    - name: Terna 2013
      web: http://download.terna.it/terna/0000/0216/17.XLSX
    - name: Terna 2014
      web: http://download.terna.it/terna/0000/0216/16.XLSX 
    - name: ILR 
      web: http://www.ilr.public.lu/electricite/statistiques/evolution_centrales-2008-2014.pdf        
    - name: Tennet NL
      web: http://energieinfo.tennet.org/dataexport/exporteerdatacountry.aspx?id=InstalledCapacity
    - name: CIRE
      web: http://www.rynek-energii-elektrycznej.cire.pl/st,33,207,tr,75,0,0,0,0,0,podstawowe-dane.html
    - name: TSO Bulgaria 
      web: http://www.tso.bg/uploads/file/Profile/en/ESO_Annual_Report_2012_en.pdf
    - name: Statistics Estonia
      web: http://pub.stat.ee/px-web.2001/Dialog/varval.asp?ma=FE032&ti=CAPACITY+AND+PRODUCTION+OF+POWER+PLANTS&path=../I_Databas/Economy/07Energy/02Energy_consumption_and_production/01Annual_statistics/&lang=1
    - name: Statistics Finland
      web: http://pxnet2.stat.fi/PXWeb/pxweb/en/StatFin/StatFin__ene__ehk/240_ehk_tau_112_en.px/table/tableViewLayout1/?rxid=31077c25-37e4-480e-81e6-49a66cbe4dc2
    - name: Department of Energy & Climate Change UK
      web: https://www.gov.uk/government/statistics/electricity-chapter-5-digest-of-united-kingdom-energy-statistics-dukes 
    - name: Regulatory Authority for Energy Greece 
      web:  http://www.rae.gr/site/file/system/docs/ActionReports/national_2012
    - name: Croation Transmission System operator (HOPS) 2013
      web: https://www.hops.hr/wps/wcm/connect/fbb3e297-dbfc-437a-bd36-458e02b9e7e4/Temeljni+podaci+2013.pdf?MOD=AJPERES
    - name: Croation Transmission System operator (HOPS) 2015
      web: http://www.hops.hr/wps/wcm/connect/5843dd4a-3e05-4ef4-ab14-f8b4934f71c1/Temeljni+podaci+2015.pdf?MOD=AJPERES
    - name: Mavir 2014
      web: http://www.mavir.hu/documents/10262/188569160/BT_terv_2014/9946a7a2-38ec-4794-9d7f-96a7a927d1b9 
    - name: Mavir 2013
      web: http://www.mavir.hu/documents/10262/188569160/BT_terv_2013_11_12_EN/ea873e22-bf88-4ee4-8a00-db09030bbb34
    - name: Eirgrid
      web: http://www.soni.ltd.uk/media/documents/Operations/CapacityStatements/All%20Island%20Generation%20Capacity%20Statement%202015.%20-%202024..pdf
    - name: Litgrid
      web: http://www.litgrid.eu/index.php/power-system/power-system-information/generation-capacity/546 
    - name: Central Statistical Bureau of Latvia
      web: http://data.csb.gov.lv/pxweb/en/vide/vide__ikgad__energetika/EN0130.px/table/tableViewLayout1/?rxid=a79839fe-11ba-4ecd-8cc3-4035692c5fc8
    - name: Energy Ministry NO 2013
      web:  https://www.regjeringen.no/globalassets/upload/oed/faktaheftet/facts_energy_water.pdf
    - name: Energy Ministry NO 2015 
      web: https://www.regjeringen.no/contentassets/fd89d9e2c39a4ac2b9c9a95bf156089a/facts_2015_energy_and_water_web.pdf 
    - name: REN
      web: http://www.ren.pt/files/2015-05/2015-05-04145306_f7664ca7-3a1a-4b25-9f46-2056eef44c33$$72f445d4-8e31-416a-bd01-d7b980134d0f$$ee3c56e5-6d14-4aa0-ac1f-ca5006917e03$$storage_image$$pt$$1.pdf
    - name: Anre 2014
      web: http://www.anre.ro/download.php?f=ga%2BCig%3D%3D&t=vdeyut7dlcecrLbbvbY%3D
    - name: Anre 2015
      web: http://www.anre.ro/download.php?f=gq58hQ%3D%3D&t=vdeyut7dlcecrLbbvbY%3D      
    - name: Svensk Energi 2014
      web: http://www.svenskenergi.se/Global/Statistik/El%C3%A5ret/El%C3%A5ret%202014_slututg%C3%A5va.pdf
    - name: Svensk Energi 2015
      web: http://www.svenskenergi.se/Global/Statistik/El%C3%A5ret/el%C3%A5ret2015_160429_web2.pdf      
    - name: Agencija za energijo 2013
      web: http://www.agen-rs.si/documents/10926/0/Agencija-za-energijo---Energetika-SLO-za-2013-3.pdf/b63d191d-ecbc-4efe-8b91-1e0f80d3272b
    - name: Agencija za energijo 2014
      web: http://www.agen-rs.si/documents/10926/38704/Poro%C4%8Dilo/54b1b378-1e76-4d40-8e0d-c30339baa248
    - name: Agencija za energijo 2015
      web: http://www.energetika-portal.si/fileadmin/dokumenti/publikacije/agen_e/porae_2015.pdf            
    - name: Statistical Office of Slovakia 2013
      web: https://slovak.statistics.sk/PortalTraffic/fileServlet?Dokument=bcc9ac82-9eb4-4320-b460-1f5c726db355
    - name: Statistical Office of Slovakia 2014
      web: https://slovak.statistics.sk/PortalTraffic/fileServlet?Dokument=6d8bdb1f-528c-41b3-9564-0ff365c98bb8
    - name: Statistical Office of Slovakia 2015
      web: https://slovak.statistics.sk/PortalTraffic/fileServlet?Dokument=00248658-7e49-418d-a7f5-2b8c243c2135      

       
contributors:
    - name: Friedrich Kunz
      email: [email protected]
      web: http://open-power-system-data.org/

"""

metadata = yaml.load(metadata)

datapackage_json = json.dumps(metadata, indent=4, separators=(',', ': '))

8. Write results to file

In [ ]:
output_path = 'output'

# Copy original source file to folder original_data
shutil.copy2(os.path.join('input', data_file),
             os.path.join('output', 'original_data', data_file))

# Write the result to file
data.to_csv(os.path.join(output_path, 'national_generation_capacity_stacked.csv'),
            encoding='utf-8', index_label='ID')

# Write the results to excel file
data.to_excel(os.path.join(output_path, 'national_generation_capacity_stacked.xlsx'),
              sheet_name='output', index_label='ID')

writer = pd.ExcelWriter(os.path.join(output_path, 'national_generation_capacity.xlsx'))
data_crosstable.to_excel(writer, sheet_name='output')
energylevels_table.to_excel(writer, sheet_name='technology levels')
writer.save()

writer = pd.ExcelWriter(os.path.join(output_path, 'national_generation_capacity_since2013.xlsx'))
data_crosstable_since2013.to_excel(writer, sheet_name='output')
energylevels_table.to_excel(writer, sheet_name='technology levels')
writer.save()

# Write the results to sql database
data.to_sql('national_generation_capacity_stacked',
            sqlite3.connect(os.path.join(output_path,
                                         'national_generation_capacity.sqlite')),
            if_exists="replace", index_label='ID')

# Write the information of the metadata
with open(os.path.join(output_path, 'datapackage.json'), 'w') as f:
    f.write(datapackage_json)

9. Formatting of Excel tables

In [ ]:
# Formatting of the crosstable output file to improve readability
wb1 = openpyxl.load_workbook(os.path.join(output_path,
                                          'national_generation_capacity.xlsx'))
wb2 = openpyxl.load_workbook(os.path.join(output_path,
                                          'national_generation_capacity_since2013.xlsx'))
wb3 = openpyxl.load_workbook(os.path.join('input',
                                          'National_Generation_Capacities.xlsx'))
for file in [wb1, wb2]:
    ws1 = file.get_sheet_by_name('output')
    ws2 = file.get_sheet_by_name('technology levels')

    # Adjust colum width
    ws1.column_dimensions['A'].width = 50
    ws2.column_dimensions['A'].width = 50

    # Place additional notes based on input file
    ws3 = wb3.get_sheet_by_name('Additional notes')
    for col in range(1, 3):
        for row in range(1, 10):
            ws1.cell(row=row + 50,
                     column=col).value = ws3.cell(row=row, column=col).value
            ws1.cell(row=51, column=1).font = Font(color=colors.BLACK,
                                                   italic=False,
                                                   bold=True)
            ws1.cell(row=row + 51, column=1).font = Font(color=colors.BLACK,
                                                         italic=True, bold=False)

    # Coloring and style of rows, alignment, as well as column width
    for col in range(1, 1500):
        ws1.column_dimensions[get_column_letter(col + 1)].width = 16
        for row in range(2, 7):
            ws1.cell(row=row, column=col).font = Font(color=colors.BLACK,
                                                      italic=False,
                                                      bold=False)
        for row in range(8, 47):
            rgb = [242, 242, 242]
            colour = "{0:02X}{1:02X}{2:02X}".format(*rgb)
            ws1.cell(row=row, column=col).fill = PatternFill(fgColor=colour,
                                                             bgColor=colour,
                                                             patternType="solid")
            ws1.cell(row=row, column=1).font = Font(color=colors.BLACK,
                                                    italic=True,
                                                    bold=False)
            ws1.cell(row=row, column=1).alignment = Alignment(horizontal='left',
                                                              indent=2)
        for row in [46]:
            rgb = [166, 166, 166]
            colour = "{0:02X}{1:02X}{2:02X}".format(*rgb)
            ws1.cell(row=row, column=col).fill = PatternFill(fgColor=colour,
                                                             bgColor=colour,
                                                             patternType="solid")
            ws1.cell(row=row, column=col).font = Font(color=colors.BLACK,
                                                      italic=False,
                                                      bold=True)
            ws1.cell(row=row, column=1).font = Font(color=colors.BLACK,
                                                    italic=False,
                                                    bold=True)
            ws1.cell(row=row, column=1).alignment = Alignment(horizontal='left',
                                                              indent=0)
        for row in [8, 21, 22, 45]:
            rgb = [191, 191, 191]
            colour = "{0:02X}{1:02X}{2:02X}".format(*rgb)
            ws1.cell(row=row, column=col).fill = PatternFill(fgColor=colour,
                                                             bgColor=colour,
                                                             patternType="solid")
            ws1.cell(row=row, column=col).font = Font(color=colors.BLACK,
                                                      italic=False,
                                                      bold=True)
            ws1.cell(row=row, column=1).font = Font(color=colors.BLACK,
                                                    italic=False,
                                                    bold=True)
            ws1.cell(row=row, column=1).alignment = Alignment(horizontal='left',
                                                              indent=0)
        for row in [9, 10, 11, 12, 17, 18, 19, 20, 23, 30, 34, 38, 39, 40, 44]:
            rgb = [217, 217, 217]
            colour = "{0:02X}{1:02X}{2:02X}".format(*rgb)
            ws1.cell(row=row, column=col).fill = PatternFill(fgColor=colour,
                                                             bgColor=colour,
                                                             patternType="solid")
            ws1.cell(row=row, column=1).font = Font(color=colors.BLACK,
                                                    italic=False,
                                                    bold=False)
            ws1.cell(row=row, column=1).alignment = Alignment(horizontal='left',
                                                              indent=1)
    for col in range(1, 7):
        ws2.column_dimensions[get_column_letter(col + 1)].width = 16
        for row in range(2, 7):
            ws2.cell(row=row, column=col).font = Font(color=colors.BLACK,
                                                      italic=False,
                                                      bold=False)
        for row in range(8, 47):
            rgb = [242, 242, 242]
            colour = "{0:02X}{1:02X}{2:02X}".format(*rgb)
            ws2.cell(row=row, column=col).fill = PatternFill(fgColor=colour,
                                                             bgColor=colour,
                                                             patternType="solid")
            ws2.cell(row=row, column=1).font = Font(color=colors.BLACK,
                                                    italic=True,
                                                    bold=False)
            ws2.cell(row=row, column=1).alignment = Alignment(horizontal='left',
                                                              indent=2)
        for row in [46]:
            rgb = [166, 166, 166]
            colour = "{0:02X}{1:02X}{2:02X}".format(*rgb)
            ws2.cell(row=row, column=col).fill = PatternFill(fgColor=colour,
                                                             bgColor=colour,
                                                             patternType="solid")
            ws2.cell(row=row, column=col).font = Font(color=colors.BLACK,
                                                      italic=False,
                                                      bold=True)
            ws2.cell(row=row, column=1).font = Font(color=colors.BLACK,
                                                    italic=False,
                                                    bold=True)
            ws2.cell(row=row, column=1).alignment = Alignment(horizontal='left',
                                                              indent=0)
        for row in [8, 21, 22, 45]:
            rgb = [191, 191, 191]
            colour = "{0:02X}{1:02X}{2:02X}".format(*rgb)
            ws2.cell(row=row, column=col).fill = PatternFill(fgColor=colour,
                                                             bgColor=colour,
                                                             patternType="solid")
            ws2.cell(row=row, column=col).font = Font(color=colors.BLACK,
                                                      italic=False,
                                                      bold=True)
            ws2.cell(row=row, column=1).font = Font(color=colors.BLACK,
                                                    italic=False,
                                                    bold=True)
            ws2.cell(row=row, column=1).alignment = Alignment(horizontal='left',
                                                              indent=0)
        for row in [9, 10, 11, 12, 17, 18, 19, 20, 23, 30, 34, 38, 39, 40, 44]:
            rgb = [217, 217, 217]
            colour = "{0:02X}{1:02X}{2:02X}".format(*rgb)
            ws2.cell(row=row, column=col).fill = PatternFill(fgColor=colour,
                                                             bgColor=colour,
                                                             patternType="solid")
            ws2.cell(row=row, column=1).font = Font(color=colors.BLACK,
                                                    italic=False,
                                                    bold=False)
            ws2.cell(row=row, column=1).alignment = Alignment(horizontal='left',
                                                              indent=1)

wb1.save(os.path.join(output_path, 'national_generation_capacity.xlsx'))
wb2.save(os.path.join(output_path, 'national_generation_capacity_since2013.xlsx'))

10. Write checksums

In [ ]:
files = [
    'national_generation_capacity.xlsx', 'national_generation_capacity_since2013.xlsx',
    'national_generation_capacity_stacked.csv', 'national_generation_capacity_stacked.xlsx',
    'national_generation_capacity.sqlite'
]

with open('checksums.txt', 'w') as f:
    for file_name in files:
        file_hash = get_sha_hash(os.path.join(output_path, file_name))
        f.write('{},{}\n'.format(file_name, file_hash))