Notebook

Open Power System Data: Renewable Energy Power Plant List¶

Part 2: Validation and Output¶

Part 1 of the script has downloaded and merged the original data from the sources.

Part 2 subsequently checks, validates the list of renewable power plants and creates CSV/XLSX/SQLite files. It also generates a daily timeseries of cumulated installed capacities by generation types.

(Before running this script make sure you ran Part 1.)

Table of contents¶

Script Setup
Load Part 1-Data
4 Data Validation
- 4.1 Criteria for Suspect Data
- 4.2 Overview over Suspect Entries
5. Cleaning Data Frame
- 5.1 Correction of Data
- 5.2 Delete Suspect Data
6. Time Series, Comparison and Plots
7. Documentation and data package

Script Setup¶

In [ ]:

# importing all necessary Python libraries for this Script
%matplotlib inline

import json
import yaml  
import posixpath
import os
import numpy as np
import pandas as pd
import datetime  
import sqlite3 
import utm
import logging
from bokeh.charts import Scatter, Line,Bar, show, output_file
from bokeh.io import output_notebook
output_notebook()

# Set up a log
logger = logging.getLogger('notebook')
logger.setLevel('INFO')
nb_root_logger = logging.getLogger()
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s'\
                              '- %(message)s',datefmt='%d %b %Y %H:%M:%S')
nb_root_logger.handlers[0].setFormatter(formatter)

# Create input and output folders if they don't exist
os.makedirs('input/original_data', exist_ok=True)
os.makedirs('output', exist_ok=True)
os.makedirs('output/datapackage_renewables', exist_ok=True)

Load Part 1-Data¶

In [ ]:

# Read data from script Part 1
renewables = pd.read_sql('SELECT* FROM raw_data_output',
                          sqlite3.connect('raw_data.sqlite')
                        )
# Correction of date format (necessary due to SQLite-format)
renewables['start_up_date'] = renewables['start_up_date'
                                        ].astype('datetime64[ns]')
renewables['decommission_date'] = renewables['decommission_date'
                                            ].astype('datetime64[ns]')

renewables.info()

4 Data Validation¶

-- Work in progress --

Due to the limitation of information in the data set of Netztransparenz and the BNetzA-PV data set a validation of duplicates is almost impossible until 2014. Reasons are e.g. the shortened EEG remuneration number in the published data (due to data privacy) from 33 to 15 characters and the reduction of the post code on from 5 to 3 characters. Thus there is no unique identifier possible for each individual power plant that would allow the check for duplicates etc.

This section checks the data frame for a set of pre-defined criteria and adds markers to the entries. At the end an overview of the marked data is given.

4.1 Criteria for Suspect Data¶

Marker	Validation criteria
(A)	The data frame overlaps in date of the record until 2014-12-31 because of merging the different data sources. All entries from the BNetzA get a marker because of a very high probability of duplication.
(B)	The start_up_date column has no entry (NA/Null)
(C)	All power plant entries which do not have "start up (Inbetriebnahme)" listed as reason for their entry will get a marker. Explanation: The BNetzA power plant list entries include a reason for their entry, e.g. "start up", "retrofit" etc. If another reason than "start up" is given it means that the power plant is already existing and thus most likely already included in our data set. Due to the missing unique identifier and no possiblity to check for duplicates it is likely that those entries are duplicates.
(D)	Start-up date of PV system is older than 1975.
(E)	Generation type contains '#NV' value.
(F)	Generation type is not specified.
(G)	Electrical capacity is not specified.
...	more to come

In [ ]:

# Reorder data frame by start-up date
renewables = renewables.ix[pd.to_datetime(renewables.start_up_date
                                         ).sort_values().index]

Add marker to data according to criteria (see above)

In [ ]:

# Create empty marker column
renewables['comment'] = ""

# Validation criteria (A) for source BNetzA
idx_date = renewables[(renewables['start_up_date'] <= '2014-12-31') 
                      & (renewables['source'] == 'BNetzA')].index

renewables['comment'].loc[idx_date] = (renewables['comment'] 
                            + "A, ")

# Validation criteria (A) for source BNetzA_PV
idx_date_pv = renewables[(renewables['start_up_date'] <= '2014-12-31') 
                    & (renewables['source'] == 'BNetzA_PV')].index

renewables['comment'].loc[idx_date_pv] = (renewables['comment'] 
                            + "A, ")

# Validation criteria (B)
idx_date_null = renewables[(renewables['start_up_date'].isnull())].index

renewables['comment'].loc[idx_date_null] = (renewables['comment'] 
                            + "B, ")

# Validation criteria (C)
idx_not_inst = renewables[(renewables['notification_reason']!= 'Inbetriebnahme')
                     & (renewables['source'] == 'BNetzA')].index

renewables['comment'].loc[idx_not_inst] = (renewables['comment'] 
                            + "C, ")

# Validation criteria (D)
idx_pv_date = renewables[(renewables['start_up_date'] < '1975-01-01') 
                   & (renewables['generation_type'] == 'solar')].index

renewables['comment'].loc[idx_pv_date] = (renewables['comment'] 
                            + "D, ")
# Validation criteria (E)
idx_nv = renewables[renewables['generation_type'] == '#NV'].index

renewables['comment'].loc[idx_nv] = (renewables['comment'] 
                            + "E, ")

# Validation criteria (F)
idx_gen = renewables[renewables.generation_type.isnull()].index

renewables['comment'].loc[idx_gen] = (renewables['comment'] 
                            + "F, ")

# Validation criteria (G)
idx_capacity = renewables[renewables.electrical_capacity.isnull()].index

renewables['comment'].loc[idx_capacity] = (renewables['comment'] 
                            + "G, ")

4.2 Overview over Suspect Entries¶

In [ ]:

# Count entries
renewables.groupby(['comment','source'])['comment'].count()

In [ ]:

# Locate suspect entires
idx_suspect = renewables[renewables.comment.str.len() >1].index

# Show suspect entries
renewables.loc[idx_suspect].groupby(['comment','source'])['comment'].count()

In [ ]:

# Summarize capacity and generation type of suspect data
renewables.loc[idx_suspect].groupby(['generation_type'])[
                                   'electrical_capacity'].sum()/1000

5 Cleaning Data Frame¶

5.1 Correction of Data¶

If hard evidence for systematic data errors exist some data errors will be corrected automatically.

Will follow later

5.2 Delete Suspect Data¶

This section will clean the data frame from all marked and suspect entries.

In [ ]:

# create new data frame without suspect entries
renewables_clean = renewables.drop(idx_suspect)

# define column of data frame
df_columns = ['start_up_date','electrical_capacity','generation_type',
              'generation_subtype','thermal_capacity','postcode','city', 
              'address','tso','lon','lat','eeg_id','power_plant_id',
              'voltage_level','decommission_date','comment','source']

Create the final data frame

In [ ]:

# create final data frame
renewables_final = renewables_clean.loc[:, df_columns]

renewables_final.reset_index(drop=True)

logger.info('Clean final dataframe from not needed columns')

Overview of data frame

In [ ]:

# Show structure of data frame
renewables_final.info()

In [ ]:

# Group and summarize data frame by generation type ans installed capacity
renewables_final.groupby(['generation_type'])['electrical_capacity'].sum()/1000

In [ ]:

# Group data frame by remaining comments/markers
renewables_final.groupby(['comment'])['comment'].count()

6 Time Series, Comparison and Plots¶

This section

creates a (daily) time series of the cumulated installed capacity by generation types
validates the data by comparing it to a timeseries by the German Federal Ministry for Economic Affairs and Energy (BMWi, Bundesministeriums für Wirtschaft und Energie) Time series for the development of renewable energies in Germany
plots the data.

6.1 Creating the Time Series¶

Downloading BMWi time series

In [ ]:

# Defining URL
url_bmwi_stat  ='http://www.erneuerbare-energien.de/EE/Redaktion/DE/Downloads/zeitreihen-zur-entwicklung-der' \
             '-erneuerbaren-energien-in-deutschland-1990-2014-excel.xlsx;?__blob=publicationFile&v=5'

# Reading BMWi data
bmwi_stat = pd.ExcelFile(url_bmwi_stat)   
bmwi_stat = bmwi_stat.parse('4', skiprows=7, skip_footer=8)

# Transform data frame and set column names
stat = bmwi_stat.T
stat.columns = ['bmwi_hydro', 'bmwi_wind_onshore','bmwi_wind_offshore',
                'bmwi_solar','bmwi_biomass','bmwi_biomass_liquid',
                'bmwi_biomass_gas','bmwi_sewage_gas', 'bmwi_landfill_gas',
                'bmwi_geothermal','bmwi_total']

# Drop Null column and set index as year
stat = stat.drop(stat.index[[0]])
stat.index = pd.to_datetime(stat.index,format="%Y").year

Create daily time series from existing data frame

In [ ]:

# Set generation types of interest
generation_types = ['biomass','wind','solar','gas','geothermal','hydro']

# Set date range of time series
idx_stat = pd.date_range(start='1990-01-01', end='2015-01-01', freq='A')
idx_ts = pd.date_range(start='2005-01-01', end='2016-01-31', freq='D')

# Set range of time series as index
data = pd.DataFrame(index=idx_ts)
data_stat = pd.DataFrame(index=idx_stat)

# Create cumulated time series per generation type for both time series
for gtype in generation_types:
    
    temp = renewables_clean[['start_up_date','electrical_capacity'
                           ]].loc[renewables_clean['generation_type'].isin(
                                   [gtype])]
    
    temp_ts = temp.set_index('start_up_date')
    
    # Create cumulated time series per generation type and year
    data_stat['capacity_{0}_de'.format(gtype)]  = (
    temp_ts.resample('A', how='sum').cumsum().fillna(method='ffill')/1000)
    
    # Create cumulated time series per generation type and day
    data['capacity_{0}_de'.format(gtype)] = temp_ts.resample('D',
                                   how='sum').cumsum().fillna(method='ffill')/1000
    # Set index name
    data.index.name = 'timestamp'
    
data_stat.index = pd.to_datetime(data_stat.index,format="%Y").year

6.2 Comparison Plots with BMWi Data¶

** Calculate the absolute deviation between both timeseries**

In [ ]:

valuation = pd.concat([data_stat, stat], axis=1)
valuation = valuation.fillna(0)

# Calculate absolute deviation for each year and generation type

valuation['absolute_wind'] =(valuation['capacity_wind_de']
                                 -(valuation['bmwi_wind_onshore'] 
                                 +valuation['bmwi_wind_offshore'])).fillna(0)

valuation['absolute_solar'] =(valuation['capacity_solar_de']
                                  -valuation['bmwi_solar']).fillna(0)

valuation['absolute_hydro'] =(valuation['capacity_hydro_de']
                                  -valuation['bmwi_hydro']).fillna(0)

valuation['absolute_geothermal'] =(valuation['capacity_geothermal_de']
                                  -valuation['bmwi_geothermal']).fillna(0)

valuation['absolute_biomass'] =(valuation['capacity_biomass_de']
                                 -(valuation['bmwi_biomass'] 
                                  +valuation['bmwi_biomass_liquid']
                                  +valuation['bmwi_biomass_gas'])).fillna(0)

valuation['absolute_gas'] =(valuation['capacity_gas_de']
                                 -(valuation['bmwi_sewage_gas'] 
                                  +valuation['bmwi_landfill_gas'])).fillna(0)

valuation['absolute_total'] =((valuation['capacity_biomass_de']
                              +valuation['capacity_wind_de']
                              +valuation['capacity_solar_de']
                             +valuation['capacity_gas_de']
                             +valuation['capacity_geothermal_de']
                             +valuation['capacity_hydro_de']
                           ) -(valuation['bmwi_total'] )).fillna(0)

Plotting the absolute deviation

The graph shows the cumulative absolute deviation of the electrical capacity in MW between the OPSD data set of renewable power plants and the BMWI statistic from 1990 until 2014.
A positive deviation means the OPSD data set has more capacity than the BMWI statistics.

In [ ]:

#Plot settings for absolute deviation
deviation_columns = ['absolute_wind','absolute_solar','absolute_hydro',  
                  'absolute_biomass','absolute_gas', 'absolute_total',
                  'absolute_geothermal']

dataplot = valuation[deviation_columns]

deviation = Line(dataplot, 
                 y = deviation_columns,
                 dash = deviation_columns,
                 color = deviation_columns,
            title="Deviation between data set and BMWI statistic", 
            ylabel='Deviation in MW', 
            xlabel='From 1990 till 2014',
            legend=True)

In [ ]:

# Show Plot for absolute deviation
show(deviation)

** Calculate the relative deviation between both timeseries**

In [ ]:

# Relative deviation
valuation['relative_wind'] =(valuation['absolute_wind']
                            /(valuation['bmwi_wind_onshore'] 
                             +valuation['bmwi_wind_offshore'])).fillna(0)

valuation['relative_solar'] =(valuation['absolute_solar']
                            /(valuation['bmwi_solar'] )).fillna(0)

valuation['relative_hydro'] =(valuation['absolute_hydro']
                            /(valuation['bmwi_hydro'] )).fillna(0)

valuation['relative_geothermal'] =(valuation['absolute_geothermal']
                            /(valuation['bmwi_geothermal'])).fillna(0)

valuation['relative_biomass'] =(valuation['absolute_biomass']
                            /(valuation['bmwi_biomass'] )).fillna(0)

valuation['relative_gas'] =(valuation['absolute_gas']
                            /(valuation['bmwi_sewage_gas'] 
                             +valuation['bmwi_landfill_gas'])).fillna(0)

valuation['relative_total'] =(valuation['absolute_total']
                            /(valuation['bmwi_total'] )).fillna(0)

In [ ]:

# Plot settings relative deviation
relative_column = ['relative_wind','relative_solar','relative_hydro',
            'relative_biomass','relative_gas','relative_total']

dataplot2 = valuation[relative_column]

relative = Line(dataplot2*100, 
            y = relative_column,
            dash = relative_column,
            color = relative_column,
            title="Deviation between data set and BMWI statistic", 
            ylabel='Relative difference in percent', 
            xlabel='From 1990 till 2014',
            legend=True)

Plotting the relative deviation

The graph shows the relative difference of the electrical capacity between the OPSD data set of renewable power plants and the BMWI statistic from 1990 until 2014.
A positive value means the OPSD data set has more capacity than the BMWI statistics.

In [ ]:

# Show Plot for relative deviation
show(relative)

Saving deviation results in XLSX-file

In [ ]:

# write results as Excel file
valuation.to_excel('validation_report.xlsx', sheet_name='Capacities_1990_2014'
                   )

6.3 Geoplot of Power Plant Location¶

This section plots the geolocation of each power plant in the list.

Note: This section uses the Basemap-feature. Basemap is not working on a Windows system by default. An installation guideline can be found at http://matplotlib.org/basemap/users/installing.html and https://sourceforge.net/projects/matplotlib/files/matplotlib-toolkits/basemap-1.0.7/

In [ ]:

# import libraries
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt

In [ ]:

def draw_germany():
    """Returns a basemap centered on Germany"""
    plt.figure(figsize=(14, 10))
    return Basemap(projection='lcc', resolution='l',
                   llcrnrlon= 5.87, urcrnrlon=16.04,
                   llcrnrlat= 47, urcrnrlat= 55.1,
                   lat_1=47.26, lat_2=56, lon_0=6,
                   area_thresh=10000)

m = draw_germany()

# Draw map background
m.fillcontinents(color='white', lake_color='#eeeeee')
m.drawstates(color='lightgray')
m.drawcoastlines(color='lightgray')
m.drawcountries(color='lightgray')
m.drawmapboundary(fill_color='#eeeeee')

# Get coordinates
lon = np.array(renewables_clean.lon)
lat = np.array(renewables_clean.lat)

# Plot settings
style = dict(s=5, marker='o', alpha=0.5, zorder=2)
m.scatter(lon,lat, latlon=True,
          label='Location of renewable power plants',
          color='#00592D', **style)

plt.legend(loc='lower left', frameon=False)

7. Documentation and data package¶

This section finally writes the data package:

CSV + XLSX + SQLite of the power plant list
CSV of the time series
Meta data (JSON)

7.1 Write Power Plant List¶

** Write CSV**

In [ ]:

path_package = 'output/datapackage_renewables'

os.makedirs(path_package, exist_ok=True)

# Wirte the results as csv
renewables_final.to_csv(path_package+'/renewable_power_plants_germany.csv',
                         sep=',' , 
                         decimal='.', 
                         date_format='%Y-%m-%d',
                         encoding='utf-8',
                         index = False,
                         if_exists="replace")

** Write XLSX**

In [ ]:

# Write the results as xlsx file
writer = pd.ExcelWriter(path_package+'/renewable_power_plants_germany.xlsx', 
                        engine='xlsxwriter')

# Because of the large number of entries we need to splite the data into two sheets
# (they don't fit on one single Excel sheet)
renewables_final[:1000000].to_excel(writer, 
                                     index = False,
                                    sheet_name='part-1')

renewables_final[1000000:].to_excel(writer, 
                                    index = False,
                                    sheet_name='part-2')

# Close the Pandas Excel writer and output the Excel file.
writer.save()    

Write SQLite

In [ ]:

# Write the results to sqlite database
renewables_final.to_sql('renewable_power_plants_germany', 
                         sqlite3.connect(path_package+
                                 '/renewable_power_plants_germany.sqlite'),
                         if_exists="replace") 

7.2 Write Time Series¶

In [ ]:

# Write daily cumulated time series as csv
data.to_csv(path_package+'/renewable_capacity_germany_timeseries.csv',
                         sep=',', decimal='.', 
                         date_format='%Y-%m-%dT%H:%M:%S%z',
                         encoding='utf-8',
                         if_exists="replace")

7.3 Writing Meta Data¶

The data packages meta data are created in the specific JSON format as proposed by the Open Knowledge Foundation. Please see the Frictionless Data project by OKFN (http://data.okfn.org/) and the Data Package specifications (http://dataprotocols.org/data-packages/) for more details.

In order to keep the notebook more readable the metadata is written in the human-readable YAML format using a multi-line string and then parse the string into a Python dictionary and save is as a JSON file.

In [ ]:

# The meta data follows the specification at:
# http://dataprotocols.org/data-packages/

metadata = """
name: opsd-renewable-energy-power-plants
title: List of renewable energy power plants in Germany
description: >-
    This data package contains a list of all renewable energy power plants in Germany 
    that are eligible under the renewable support scheme. For each plant, 
    commissioning data, technical characteristics, and geolocations are provided. 
    It also contains a time series of cumulated installed capacity by technology 
    in daily granularity. The data stem from two different sources: 
    Netztransparenz.de, a joint platform of the German transmission system operators, 
    and Bundesnetzagentur, the regulator. The data has been extracted, merged, 
    verified and cleaned. This processing is documented step-by-step in the script linked below.  
version: "2016-03-09"
keywords: [master data register,power plants,renewables,germany]
geographical-scope: Germany
resources:
    - path: renewable_power_plants_germany.csv
      format: csv
      mediatype: text/csv
      schema:         
          fields:
            - name: start_up_date
              description: Date of start up/installation date
              type: datetime
              format: YYYY-MM-DDThh:mm:ssZ
            - name: electrical_capacity
              description: Installed electrical capacity in kW
              type: number
              format: float
              unit: kW
            - name: generation_type
              description: Type of generation / energy source
              type: text
            - name: generation_subtype
              description: Subtype of generation / energy source
              type: text
            - name: thermal_capacity
              description: Installed thermal capacity in kW
              type: number
              format: float
              unit: kW
            - name: city
              description: Name of location
              type: text
            - name: tso
              description: Name of TSO  
              type: text    
            - name: lon
              description: Longitude coordinates
              type: geopoint
              format: lon
            - name: lat
              description: Latitude coordinates 
              type: geopoint
              format: lat
            - name: eeg_id
              description: EEG (German feed-in tariff law) remuneration number
              type: text
            - name: power_plant_id
              description: Power plant identification number by BNetzA
              type: text
            - name: voltage_level
              description: Voltage level of grid connection
              type: text 
            - name: decommission_date
              description: Date of decommission
              type: datetime
              format: YYYY-MM-DDThh:mm:ssZ  
            - name: comment
              description: Validation comments
              type: text 
            - name: source
              description: Source of database entry
              type: text
              source: TransnetBW, TenneT, Amprion, 50Hertz, BNetzA_PV, BNetzA
    - path: renewable_capacity_germany_timeseries.csv
      format: csv
      mediatype: text/csv
      schema:         
          fields:
            - name: timestamp
              description: Start time of the day
              type: datetime
              format: YYYY-MM-DDThh:mm:ssZ
            - name: capacity_biomass_de
              description: Cumulated biomass electrical capacity
              type: number
            - name: capacity_wind_de
              description: Cumulated wind capacity
              type: number                 
            - name: capacity_solar_de
              description: Cumulated solar capacity
              type: number                
            - name: capacity_gas_de
              description: Cumulated gas electrical capacity
              type: number  
            - name: capacity_geothermal_de
              description: Cumulated geothermal electrical capacity
              type: number 
            - name: capacity_hydro_de
              description: Cumulated hydro capacity
              type: number  
    - path: renewable_power_plants_germany.xlsx
      format: xlsx
      mediatype: xlsx
      schema:         
          fields:
            - name: start_up_date
              description: Date of start up/installation date
              type: datetime
              format: YYYY-MM-DDThh:mm:ssZ
            - name: electrical_capacity
              description: Installed electrical capacity in kW
              type: number
              format: float
              unit: kW
            - name: generation_type
              description: Type of generation / energy source
              type: text
            - name: generation_subtype
              description: Subtype of generation / energy source
              type: text
            - name: thermal_capacity
              description: Installed thermal capacity in kW
              type: number
              format: float
              unit: kW
            - name: city
              description: Name of location
              type: text
            - name: tso
              description: Name of TSO  
              type: text    
            - name: lon
              description: Longitude coordinates
              type: geopoint
              format: lon
            - name: lat
              description: Latitude coordinates 
              type: geopoint
              format: lat
            - name: eeg_id
              description: EEG (German feed-in tariff law) remuneration number
              type: text
            - name: power_plant_id
              description: Power plant identification number by BNetzA
              type: text
            - name: voltage_level
              description: Voltage level of grid connection
              type: text 
            - name: decommission_date
              description: Date of decommission
              type: datetime
              format: YYYY-MM-DDThh:mm:ssZ  
            - name: comment
              description: Validation comments
              type: text 
            - name: source
              description: Source of database entry
              type: text
              source: TransnetBW, TenneT, Amprion, 50Hertz, BNetzA_PV, BNetzA
licenses:
    - url: http://example.com/license/url/here
      name: License Name Here
      version: 1.0
      id: license-id-from-open
sources:
    - name: Bundesnetzagentur - register of renewable power plants (excl. PV)
      web: http://www.bundesnetzagentur.de/cln_1422/DE/Sachgebiete/ElektrizitaetundGas/Unternehmen_Institutionen/ErneuerbareEnergien/Anlagenregister/Anlagenregister_Veroeffentlichung/Anlagenregister_Veroeffentlichungen_node.html
      source: BNetzA
    - name: Bundesnetzagentur - register of PV power plants
      web: http://www.bundesnetzagentur.de/cln_1431/DE/Sachgebiete/ElektrizitaetundGas/Unternehmen_Institutionen/ErneuerbareEnergien/Photovoltaik/DatenMeldgn_EEG-VergSaetze/DatenMeldgn_EEG-VergSaetze_node.html    
      source: BNetzA_PV
    - name: Netztransparenz.de - information platform of German TSOs (register of renewable power plants in their control area)
      web: https://www.netztransparenz.de/de/Anlagenstammdaten.htm
      source: TransnetBW, TenneT, Amprion, 50Hertz
maintainers:
    - name: Wolf-Dieter Bunke
      email: wolf-dieter.bunke@uni-flensburg.de
      web: http://open-power-system-data.org/
views: True
openpowersystemdata-enable-listing: True
opsd-jupyter-notebook-url: https://github.com/Open-Power-System-Data/datapackage_renewable_power_plants/blob/master/main.ipynb
opsd-changes-to-last-version: Update of output data 
"""

metadata = yaml.load(metadata)

datapackage_json = json.dumps(metadata, indent=4, separators=(',', ': '))

# Write the information of the metadata
with open(os.path.join(path_package, 'datapackage.json'), 'w') as f:
    f.write(datapackage_json)