Renewable power plants: Validation and output Notebook
This Notebook is part of the Renewable power plants Data Package of Open Power System Data.
Part 1 of the script (Download and process Notebook) has downloaded and merged the original data. This Notebook checks and validates the list of renewable power plants and creates CSV/XLSX/SQLite files. It also generates a daily time series of cumulated installed capacities by energy source.
(Before running this script, make sure you have run Part 1 so that the renewables.pickle file for each country exists in the same folder as the scripts.)
import json
import logging
import os
import urllib.parse
import re
import zipfile
from bokeh.charts import Scatter, Line, Bar, show, output_file
from bokeh.io import output_notebook
import pandas as pd
import requests
import sqlalchemy
import yaml
import hashlib
output_notebook()
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
datefmt='%d %b %Y %H:%M:%S'
)
logger = logging.getLogger()
# Create input and output folders if they don't exist
os.makedirs('input/original_data', exist_ok=True)
os.makedirs('output', exist_ok=True)
os.makedirs('output/renewable_power_plants', exist_ok=True)
# Read data from script Part 1 download_and_process
DE_re = pd.read_pickle('DE_renewables.pickle')
DK_re = pd.read_pickle('DK_renewables.pickle')
FR_re = pd.read_pickle('FR_renewables.pickle')
PL_re = pd.read_pickle('PL_renewables.pickle')
CH_re = pd.read_pickle('CH_renewables.pickle')
-- Work in progress - all information on suspect data is welcome! --
This section checks the DataFrame against a set of pre-defined criteria and adds markers to the matching entries in an additional column. The marked data will still be included in the output files, but can be easily filtered out. For creating the validation plots and the time series, suspect data is skipped.
# Read csv of validation marker descriptions and show them
validation = pd.read_csv('input/validation_marker.csv',
sep=',', header=0)
validation
Main issue: Due to the shortened zip code (5 to 3 digits) and eeg_id (33 to 15 characters) in the data from Netztransparenz (until the end of 2015), we cannot identify which of those power plants also appear in the BNetzA data. Thus the duplicate entries from August 2014 to the end of 2015 are not filtered out; instead, all BNetzA entries in this timespan are marked.
Add markers to the data according to the criteria (see validation_marker above)
# Create empty marker column named comment
DE_re['comment'] = ""
# Validation criteria (R_1) for data source BNetzA
idx_date = DE_re[(DE_re['commissioning_date'] <= '2015-12-31') & (
DE_re['data_source'] == 'BNetzA')].index
# This could equivalently be written as DE_re.loc[idx_date, 'comment'] += "R_1;".
# The same goes for the expressions below.
DE_re.loc[idx_date, 'comment'] = DE_re.loc[idx_date, 'comment'] + "R_1;"
# Validation criteria (R_1) for source BNetzA_PV
idx_date_pv = DE_re[(DE_re['commissioning_date'] <= '2015-12-31') & (
DE_re['data_source'] == 'BNetzA_PV')].index
DE_re.loc[idx_date_pv, 'comment'] += "R_1;"
# Validation criteria (R_2)
idx_not_inst = DE_re[(DE_re['notification_reason'] != 'Inbetriebnahme') & (
DE_re['data_source'] == 'BNetzA')].index
DE_re.loc[idx_not_inst, 'comment'] += "R_2;"
# Validation criteria (R_3)
idx_date_null = DE_re[(DE_re['commissioning_date'].isnull())].index
DE_re.loc[idx_date_null, 'comment'] += "R_3;"
# Validation criteria (R_4)
idx_capacity = DE_re[DE_re.electrical_capacity <= 0.0].index
DE_re.loc[idx_capacity, 'comment'] += "R_4;"
# Validation criteria (R_5)
# Find all power plants that have been decommissioned from the grid.
# These have probably been commissioned to another grid and are thus duplicated.
# Only the non-duplicate entry should be kept, so the other one is marked.
idx_grid_decomm = DE_re[DE_re['grid_decommissioning_date'].notnull()].index
DE_re.loc[idx_grid_decomm, 'comment'] += "R_5;"
# Validation criteria (R_6)
idx_decomm = DE_re[DE_re['decommissioning_date'].notnull()].index
DE_re.loc[idx_decomm, 'comment'] += "R_6"
Create cleaned DataFrame
All marked entries are deleted for the cleaned version of the DataFrame, which is used for creating the installation time series and the validation plots.
# Locate suspect entries
idx_suspect = DE_re[DE_re.comment.str.len() > 1].index
# create new DataFrame without suspect entries
DE_re_clean = DE_re.drop(idx_suspect)
Overview of suspect entries and cleaned data
# Count marked entries
DE_re.groupby(['comment','data_source'])['comment'].count()
# Summarize electrical capacity per energy source of suspect data
DE_re.groupby(['comment', 'energy_source_level_2'])['electrical_capacity'].sum()
# Summarize electrical capacity per energy source level 2 of cleaned data (MW)
DE_re_clean.groupby(['energy_source_level_2'])['electrical_capacity'].sum()
# Create empty marker column
FR_re['comment'] = ""
# Validation criteria (R_7)
idx_not_Europe = FR_re[(FR_re['lat'] < 41) | (
FR_re['lon'] < -6) | (
FR_re['lon'] > 10)].index
FR_re.loc[idx_not_Europe, 'comment'] += "R_7"
# Show entries in the French list not located on the European continent
FR_re.loc[idx_not_Europe]
This section creates a daily and yearly time series of the cumulated installed capacity by energy source. This data will be part of the output and will be compared in a plot for validation in the next section.
# Additional column for choosing energy sources for the time series
DE_re_clean['temp_energy_source'] = DE_re_clean['energy_source_level_2']
# Time series for onshore and offshore wind should be separated; for hydro the
# subtype is used because all plants are run-of-river
idx_subtype = DE_re_clean[(DE_re_clean.energy_source_level_2 == 'Wind') |
(DE_re_clean.energy_source_level_2 == 'Hydro')].index
DE_re_clean.loc[idx_subtype, 'temp_energy_source'] = DE_re_clean.loc[
idx_subtype, 'technology']
# Set energy sources for which time series should be generated
energy_sources = ['Solar', 'Onshore', 'Offshore', 'Bioenergy',
'Geothermal', 'Run-of-river']
# Set date range for which the time series should be generated
range_yearly = pd.date_range(start='1990-01-01', end='2016-01-01', freq='A')
range_daily = pd.date_range(start='2005-01-01', end='2016-12-31', freq='D')
# Set range of time series as index
timeseries_yearly = pd.DataFrame(index=range_yearly)
timeseries_daily = pd.DataFrame(index=range_daily)
# Create cumulated time series per energy source for both yearly and daily time series
for gtype in energy_sources:
temp = (DE_re_clean[['commissioning_date', 'electrical_capacity']]
.loc[DE_re_clean['temp_energy_source'].isin([gtype])])
temp_timeseries = temp.set_index('commissioning_date')
# Create cumulated time series per energy_source and year
timeseries_yearly['{0}'.format(gtype)] = temp_timeseries.resample(
'A').sum().cumsum().fillna(method='ffill')
# Create cumulated time series per energy_source and day
timeseries_daily['{0}'.format(gtype)] = temp_timeseries.resample(
'D').sum().cumsum().fillna(method='ffill')
# Fill the empty cells in the geothermal column since this did not work in the loop
timeseries_daily.Geothermal = timeseries_daily.Geothermal.fillna(method='ffill')
# Shorten timestamp to year for the yearly time series
timeseries_yearly.index = pd.to_datetime(timeseries_yearly.index, format="%Y").year
# Show yearly timeseries of installed capacity in MW per energy source level 2
timeseries_yearly
Compare the cumulated capacity per country and energy source level 2 of the power plant list to data from the International Renewable Energy Agency (IRENA), ENTSO-E and Eurostat, already compiled in the OPSD Data Package national generation capacity.
def download_and_cache_Irena_Europe(url, session=None):
    """This function downloads a file into the input folder
    and returns the local filepath."""
    filename = 'Europe.csv'
    filepath = 'input/' + filename
    # Check if the file exists; if not, download it
    if not os.path.exists(filepath):
        if not session:
            session = requests.session()
        print("Downloading file: ", filename)
        r = session.get(url, stream=True)
        chunksize = 1024
        with open(filepath, 'wb') as file:
            for chunk in r.iter_content(chunksize):
                file.write(chunk)
    else:
        print("Using local file from", filepath)
    return filepath
url = 'http://public.tableau.com/views/ExportIRENAData/Europe.csv?:embed=y&:showVizHome=no&:display_count=y&:display_static_image=y&:bootstrapWhenNotified=true'
download_and_cache_Irena_Europe(url)
irena_df = pd.read_csv('input/Europe.csv',
sep = ",",
decimal = ".",
thousands = ",",
encoding = 'UTF8',
header = 0)
# Select just the required values for capacity
idx_irena = irena_df[(irena_df['Flow'] == 'Total capacity (MWe)')].index
irena_df = irena_df.loc[idx_irena,('Country / Area','Sub Technology','Year','Value')]
# Read column and value translation list
columnnames = pd.read_csv('input/column_translation_list.csv')
valuenames = pd.read_csv('input/value_translation_list.csv')
# Create dictionary for column translation
idx_Europe = columnnames[columnnames['country'] == 'Europe'].index
column_dict = columnnames.loc[idx_Europe].set_index('original_name')[
'opsd_name'].to_dict()
# Create dictionary in order to adapt energy_source_subtype names
idx_Europe = valuenames[valuenames['data_source'] == 'IRENA'].index
value_dict = valuenames.loc[idx_Europe].set_index('original_name')[
'opsd_name'].to_dict()
# Create dictionary in order to assign each energy_source subtype to its energy source
energy_source_dict = valuenames.loc[idx_Europe].set_index('opsd_name')[
'energy_source_level_2'].to_dict()
energy_source_dict
# Translate columns by list
irena_df.rename(columns = column_dict, inplace = True)
# Replace energy_source names
irena_df.replace(value_dict, inplace=True)
# Create new column for energy_source
irena_df['energy_source_level_2'] = irena_df['technology']
# Fill this with the energy source instead of subtype information
irena_df.energy_source_level_2.replace(energy_source_dict, inplace=True)
# Group and summarize DataFrame by generation type and installed capacity
irena_sum_df = irena_df.groupby(['country','energy_source_level_2','year'],
as_index = False)['electrical_capacity'].sum()
# Add column to keep track of the source
irena_sum_df['data_source'] = 'IRENA'
opsd_df = pd.read_csv('input/aggregated_capacity.csv',
sep = ",",
decimal = ".",
encoding = 'UTF8',
header = 0)
# Translate columns by list
opsd_df.rename(columns = column_dict, inplace = True)
# Choose the required translation terms for this data source
idx_opsd = valuenames[valuenames['data_source'] == 'OPSD'].index
# Create dictionary in order to change install_type names
value_dict = valuenames.loc[idx_opsd].set_index('original_name')['opsd_name'].to_dict()
# Replace energy_source names
opsd_df.replace(value_dict, inplace = True)
# Group and summarize DataFrame by generation type and installed capacity
e_source = pd.Series(['Bioenergy','Wind','Solar','Geothermal','Hydro','Marine'])
recent_years = (2014,2015)
idx_opsd = opsd_df[(opsd_df['year'].isin(recent_years))
& (opsd_df['energy_source_level_2'].isin(e_source))].index
# only required columns are chosen, and only rows of recent years
# Unnamed:0 gives an error
# opsd_df = opsd_df.loc[idx_opsd].drop(['Unnamed: 0','source_type',
# 'type','technology_level_0','technology_level_1',
# 'technology_level_2','technology_level_3'], axis = 1)
opsd_df = opsd_df.loc[idx_opsd].drop(['source_type',
'type','technology_level_0','technology_level_1',
'technology_level_2','technology_level_3'], axis = 1)
# index is not required any more
opsd_df = opsd_df.reset_index(drop = True)
Installed capacity is summed per country and energy source to compare it to the other sources of cumulated capacity. This gives an impression of how complete the country power plant lists are.
Germany DE
# Summarize per energy_source until the end of 2015, without marked entries
DE_re_sum = DE_re_clean[(DE_re_clean['commissioning_date'] <= '2015-12-31')].groupby([
'energy_source_level_2'], as_index = False)['electrical_capacity'].sum()
# Add information until which year (including) the capacity is summarized
DE_re_sum['year'] = 2015
DE_re_sum['country'] = 'DE'
DE_re_sum['data_source'] = 'German TSOs'
DE_re_sum
Denmark DK
# Summarize per energy_source until end of 2015
DK_re_sum = DK_re[(DK_re['commissioning_date'] <= '2015-12-31')].groupby([
'energy_source_level_2','data_source'], as_index = False)[
'electrical_capacity'].sum()
# Add information until which year (including) the capacity is summarized
DK_re_sum['year'] = 2015
DK_re_sum['country'] = 'DK'
DK_re_sum
France FR
# Summarize per energy_source. Data until 31.12.2014 is provided by the data source
FR_re_sum = FR_re.groupby(['energy_source_level_2','data_source'], as_index = False)[
'electrical_capacity'].sum()
# Add information until which year (including) the capacity is summarized
FR_re_sum['year'] = 2014
FR_re_sum['country'] = 'FR'
FR_re_sum
Poland PL
# Summarize per energy_source. Data until 31.12.2015 is provided by the data source
PL_re_sum = PL_re.groupby(['energy_source_level_2','data_source'], as_index = False)[
'electrical_capacity'].sum()
# Add information until which year (including) the capacity is summarized
PL_re_sum['year'] = 2015
PL_re_sum['country'] = 'PL'
PL_re_sum
Switzerland CH
# Summarize per energy_source. Data until 31.12.2016 is provided by the data source
CH_re_sum = CH_re.groupby(['energy_source_level_2','data_source'], as_index = False)[
'electrical_capacity'].sum()
# Add information until which year (including) the capacity is summarized
CH_re_sum['year'] = 2016
CH_re_sum['country'] = 'CH'
CH_re_sum
# Merge DataFrames
capacities = pd.concat([irena_sum_df, opsd_df,
DE_re_sum, DK_re_sum, FR_re_sum, PL_re_sum, CH_re_sum])
# Get rid of all indices
capacities = capacities.reset_index(drop = True)
One comparison plot for each country: Cumulated capacity in MW per energy source from different sources.
plot_country = 'DE'
# Choose capacities of the respective country and recent years
idx_country = capacities[(capacities['country'] == plot_country)
& (capacities['year'].isin(recent_years))].index
# prepare DataFrame for plotting
cap_country = capacities.loc[idx_country].drop(['country',
'capacity_definition'],
axis = 1)
cap_country = cap_country.reset_index(drop = True)
cap_country
# plot
p = Bar(cap_country, label='energy_source_level_2', values='electrical_capacity',
group=['data_source','year'], legend='top_left',
title=plot_country)
show(p)
plot_country = 'DK'
# Choose capacities of the respective country and recent years
idx_country = capacities[(capacities['country'] == plot_country)
& (capacities['year'].isin(recent_years))].index
# prepare DataFrame for plotting
cap_country = capacities.loc[idx_country].drop(['country',
'capacity_definition'],
axis = 1)
cap_country = cap_country.reset_index(drop = True)
cap_country
# plot
p = Bar(cap_country, label='energy_source_level_2', values='electrical_capacity',
group=['data_source','year'], legend='top_left',
title=plot_country)
show(p)
plot_country = 'FR'
# Choose capacities of the respective country and recent years
idx_country = capacities[(capacities['country'] == plot_country)
& (capacities['year'].isin(recent_years))].index
# prepare DataFrame for plotting
cap_country = capacities.loc[idx_country].drop(['country',
'capacity_definition'],
axis = 1)
cap_country = cap_country.reset_index(drop = True)
cap_country
# plot
p = Bar(cap_country, label='energy_source_level_2', values='electrical_capacity',
group=['data_source','year'], legend='top_left',
title=plot_country)
show(p)
plot_country = 'PL'
# Choose capacities of the respective country and recent years
idx_country = capacities[(capacities['country'] == plot_country)
& (capacities['year'].isin(recent_years))].index
# prepare DataFrame for plotting
cap_country = capacities.loc[idx_country].drop(['country',
'capacity_definition'],
axis = 1)
cap_country = cap_country.reset_index(drop = True)
# plot
p = Bar(cap_country, label='energy_source_level_2', values='electrical_capacity',
group=['data_source','year'], legend='top_left',
title=plot_country)
show(p)
plot_country = 'CH'
# Choose capacities of the respective country and recent years
idx_country = capacities[(capacities['country'] == plot_country)
& (capacities['year'].isin(recent_years))].index
# prepare DataFrame for plotting
cap_country = capacities.loc[idx_country].drop(['country',
'capacity_definition'],
axis = 1)
cap_country = cap_country.reset_index(drop = True)
cap_country
# plot
p = Bar(cap_country, label='energy_source_level_2', values='electrical_capacity',
group=['data_source','year'], legend='top_left',
title=plot_country)
show(p)
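The five nearly identical plotting cells above differ only in the country code, so the data-preparation step could be factored into a small helper. A minimal sketch, assuming the `capacities` DataFrame layout used above; the helper name and the toy data are illustrative only:

```python
import pandas as pd

def capacity_slice(capacities, country, years):
    """Select the capacity rows of one country for the given years and
    prepare them for plotting (drop the country column, reset the index)."""
    mask = (capacities['country'] == country) & (capacities['year'].isin(years))
    return capacities.loc[mask].drop(['country'], axis=1).reset_index(drop=True)

# Toy data mirroring the columns used in the plots above
capacities = pd.DataFrame({
    'country': ['DE', 'DE', 'DK'],
    'year': [2014, 2015, 2015],
    'energy_source_level_2': ['Solar', 'Wind', 'Wind'],
    'electrical_capacity': [38000.0, 41000.0, 5000.0],
})
cap_de = capacity_slice(capacities, 'DE', (2014, 2015))
```

Each per-country cell would then reduce to one `capacity_slice(...)` call followed by the `Bar` plot.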
The yearly capacity time series by energy source derived from the power plant list is validated by plotting it against a time series published by the German Federal Ministry for Economic Affairs and Energy (BMWi, Bundesministerium für Wirtschaft und Energie): Time series for the development of renewable energies in Germany.
Download the BMWi time series (if an error message is shown, check whether url_bmwi_stat is still correct).
# Defining URL
url_bmwi_stat = 'http://www.erneuerbare-energien.de/EE/Redaktion/DE/'\
'Downloads/zeitreihen-zur-entwicklung-der-erneuerbaren-'\
'energien-in-deutschland-1990-2016-excel.xlsx;jsessionid='\
'FFE958ADA709DCBFDD437C8A8FF7D90B?__blob=publicationFile&v=6'
# Reading BMWi data
bmwi_stat = pd.read_excel(url_bmwi_stat,
sheetname='4',
header=7,
skip_footer=8,
index_col=0)
# Drop the last column of the BMWi data as it contains no data for 2016 yet; to be changed as soon as data is available
bmwi_stat.drop(bmwi_stat.columns[len(bmwi_stat.columns)-1], axis=1, inplace=True)
# Transpose DataFrame and set column names
bmwi_df = bmwi_stat.T
bmwi_df.columns = ['bmwi_hydro', 'bmwi_wind_onshore', 'bmwi_wind_offshore',
'bmwi_solar', 'bmwi_biomass', 'bmwi_biomass_liquid',
'bmwi_biomass_gas', 'bmwi_sewage_gas', 'bmwi_landfill_gas',
'bmwi_geothermal', 'bmwi_total']
# Set year as index
bmwi_df.index = pd.to_datetime(bmwi_df.index.astype(str).str.slice(0,4), format="%Y").year
**Calculate the absolute deviation between the BMWi and OPSD time series**
# Merge BMWi and OPSD DataFrames
valuation = pd.concat([bmwi_df, timeseries_yearly], axis=1)
valuation = valuation.fillna(0)
# Calculate absolute deviation for each year and energy source
valuation['absolute_wind_onshore'] = (valuation['Onshore']
- valuation['bmwi_wind_onshore']).fillna(0)
valuation['absolute_wind_offshore'] = (valuation['Offshore']
- valuation['bmwi_wind_offshore']).fillna(0)
valuation['absolute_solar'] = (valuation['Solar']
- valuation['bmwi_solar']).fillna(0)
valuation['absolute_hydro'] = (valuation['Run-of-river']
- valuation['bmwi_hydro']).fillna(0)
valuation['absolute_geothermal'] = (valuation['Geothermal']
- valuation['bmwi_geothermal']).fillna(0)
valuation['absolute_biomass'] = (valuation['Bioenergy']
- (valuation['bmwi_biomass']
+ valuation['bmwi_biomass_liquid']
+ valuation['bmwi_biomass_gas']
+ valuation['bmwi_sewage_gas']
+ valuation['bmwi_landfill_gas'])).fillna(0)
valuation['absolute_total'] = ((valuation['Bioenergy']
+ valuation['Onshore']
+ valuation['Offshore']
+ valuation['Solar']
+ valuation['Geothermal']
+ valuation['Run-of-river']
) - (valuation['bmwi_total'])).fillna(0)
Plotting the absolute deviation
The graph shows the cumulative absolute deviation of the electrical capacity in MW between the OPSD data set of renewable power plants and the BMWi statistics from 1990 until 2015.
A positive deviation means the OPSD data set has more capacity than the BMWi statistics.
#Plot settings for absolute deviation
deviation_columns = ['absolute_wind_onshore','absolute_wind_offshore',
'absolute_solar','absolute_hydro','absolute_biomass',
'absolute_geothermal','absolute_total']
dataplot = valuation[deviation_columns]
deviation = Line(dataplot,
y=deviation_columns,
dash=deviation_columns,
color=deviation_columns,
title="Deviation between data set and BMWI statistic (negative => BMWI value higher)",
ylabel='Deviation in MW',
xlabel='From 1990 till 2015',
legend=True)
# Show Plot for absolute deviation
show(deviation)
**Calculate the relative deviation between both time series**
# Relative deviation
valuation['relative_wind_onshore'] = (valuation['absolute_wind_onshore'] / valuation['bmwi_wind_onshore']).fillna(0)
valuation['relative_wind_offshore'] = (valuation['absolute_wind_offshore'] / valuation['bmwi_wind_offshore']).fillna(0)
valuation['relative_solar'] = (valuation['absolute_solar'] / (valuation['bmwi_solar'])).fillna(0)
valuation['relative_hydro'] = (valuation['absolute_hydro'] / (valuation['bmwi_hydro'])).fillna(0)
valuation['relative_geothermal'] = (valuation['absolute_geothermal'] / (valuation['bmwi_geothermal'])).fillna(0)
valuation['relative_biomass'] = (valuation['absolute_biomass'] / (valuation['bmwi_biomass'])).fillna(0)
valuation['relative_total'] = (valuation['absolute_total'] / (valuation['bmwi_total'])).fillna(0)
# Plot settings relative deviation
relative_column = ['relative_wind_onshore','relative_wind_offshore',
'relative_solar','relative_hydro','relative_biomass',
'relative_total']
dataplot2 = valuation[relative_column]
relative = Line(dataplot2 * 100,
y=relative_column,
dash=relative_column,
color=relative_column,
title="Deviation between data set and BMWI statistic (negative => BMWI value higher)",
ylabel='Relative difference in percent',
xlabel='From 1990 till 2015',
legend=True)
Plotting the relative deviation
The graph shows the relative difference of the electrical capacity between the OPSD data set of renewable power plants and the BMWi statistics from 1990 until 2015.
A positive value means the OPSD data set has more capacity than the BMWi statistics.
# Show Plot for relative deviation
show(relative)
Save the deviation results as an XLSX file
# write results as Excel file
valuation.to_excel('validation_report.xlsx', sheet_name='Capacities_1990_2015', float_format= '%20.5f')
The same column order is used for all country lists.
DE_re = DE_re.loc[:,('commissioning_date','decommissioning_date',
'energy_source_level_1','energy_source_level_2',
'energy_source_level_3','technology',
'electrical_capacity','thermal_capacity',
'voltage_level','tso','dso','dso_id', 'eeg_id','bnetza_id',
'federal_state','postcode','municipality_code','municipality',
'address','address_number',
'utm_zone','utm_east','utm_north','lat','lon',
'data_source','comment')]
DK_re = DK_re.loc[:,('commissioning_date',
'energy_source_level_1','energy_source_level_2',
'technology',
'electrical_capacity',
'dso','gsrn_id',
'postcode','municipality_code','municipality',
'address','address_number',
'utm_east','utm_north','lat','lon',
'hub_height','rotor_diameter','manufacturer','model',
'data_source')]
FR_re = FR_re.loc[:,('municipality_code','municipality',
'energy_source_level_1','energy_source_level_2',
'energy_source_level_3','technology',
'electrical_capacity','number_of_installations',
'lat','lon',
'data_source','comment')]
PL_re = PL_re.loc[:,('district',
'energy_source_level_1','energy_source_level_2',
'energy_source_level_3','technology',
'electrical_capacity','number_of_installations',
'lat','lon',
'data_source')]
CH_re = CH_re.loc[:,('commissioning_date','municipality',
'energy_source_level_1','energy_source_level_2',
'energy_source_level_3','technology',
'electrical_capacity','municipality_code',
'project_name','production',
'tariff','notification_date',
'contract_period_end','street',
'canton','company',
'lat','lon',
'data_source')]
Electrical capacity in MW is rounded to five digits after the decimal separator.
DE_re['electrical_capacity'] = DE_re['electrical_capacity'
].map(lambda x: round(x,5))
DE_re['thermal_capacity'] = DE_re['thermal_capacity'
].map(lambda x: round(x,5))
DK_re['electrical_capacity'] = DK_re['electrical_capacity'
].map(lambda x: round(x,5))
FR_re['electrical_capacity'] = FR_re['electrical_capacity'
].map(lambda x: round(x,5))
PL_re['electrical_capacity'] = PL_re['electrical_capacity'
].map(lambda x: round(x,5))
CH_re['electrical_capacity'] = CH_re['electrical_capacity'
].map(lambda x: round(x,5))
CH_re.head(5)
#DE_re.info()
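The repeated `map(lambda x: round(x, 5))` calls above could equivalently use `DataFrame.round`, which rounds several columns in one call. A minimal sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({'electrical_capacity': [0.123456789, 1.0000049],
                   'thermal_capacity': [2.71828182, 0.1]})
# Round both capacity columns to five decimal places in one call
df = df.round({'electrical_capacity': 5, 'thermal_capacity': 5})
```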
Coordinates are rounded to five digits after the decimal separator.
DE_re['lat'] = DE_re['lat'].map(lambda x: round(x,5))
DE_re['lon'] = DE_re['lon'].map(lambda x: round(x,5))
DE_re['utm_east'] = DE_re['utm_east'].map(lambda x: round(x,5))
DE_re['utm_north'] = DE_re['utm_north'].map(lambda x: round(x,5))
# The zone value has to be an integer, thus no digits after the decimal point
DE_re['utm_zone'] = pd.to_numeric(DE_re['utm_zone'], errors='coerce')
DE_re['utm_zone'] = DE_re['utm_zone'].map(lambda x: '%.0f' % x)
DK_re['lat'] = DK_re['lat'].map(lambda x: round(x,5))
DK_re['lon'] = DK_re['lon'].map(lambda x: round(x,5))
DK_re['utm_east'] = DK_re['utm_east'].map(lambda x: round(x,5))
DK_re['utm_north'] = DK_re['utm_north'].map(lambda x: round(x,5))
FR_re['lon'] = FR_re['lon'].map(lambda x: round(x,5))
FR_re['lat'] = FR_re['lat'].map(lambda x: round(x,5))
CH_re['lon'] = CH_re['lon'].map(lambda x: round(x,5))
CH_re['lat'] = CH_re['lat'].map(lambda x: round(x,5))
Dates in the output should be without a timestamp.
DE_re['commissioning_date'] = DE_re['commissioning_date'].apply(lambda x: x.date())
DE_re['decommissioning_date'] = DE_re['decommissioning_date'].apply(lambda x: x.date())
DK_re['commissioning_date'] = DK_re['commissioning_date'].apply(lambda x: x.date())
CH_re['commissioning_date'] = CH_re['commissioning_date'].apply(lambda x: x.date())
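The `apply(lambda x: x.date())` pattern above can also be written with the vectorized `.dt.date` accessor, assuming the column holds proper datetime values; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'commissioning_date': pd.to_datetime(['2015-03-01',
                                                         '2016-07-15'])})
# Strip the (midnight) time component, leaving plain datetime.date objects
df['commissioning_date'] = df['commissioning_date'].dt.date
```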
Reset index of timeseries.
# Time index is not required any more
timeseries_yearly = timeseries_yearly.reset_index()
timeseries_daily = timeseries_daily.reset_index()
# Set index name
timeseries_yearly.rename(columns={'index': 'year'}, inplace=True)
timeseries_daily.rename(columns={'index': 'day'}, inplace=True)
# Sort German DataFrame by commissioning date
DE_re = DE_re.sort_values(by='commissioning_date')
# Sort Danish DataFrame by commissioning date
DK_re = DK_re.sort_values(by='commissioning_date')
# Sort French DataFrame by municipality_code
FR_re = FR_re.sort_values(by='municipality_code')
# Sort Polish DataFrame by district
PL_re = PL_re.sort_values(by='district')
# Sort Swiss DataFrame by commissioning date
CH_re = CH_re.sort_values(by='commissioning_date')
# fillna replaces NaN (Not-a-Number) cells with empty strings; replace removes literal 'nan' strings
#DE_re.fillna('', inplace=True)
#DE_re.replace('nan','', inplace=True)
DK_re.fillna('', inplace=True)
DK_re.replace('nan','', inplace=True)
FR_re.fillna('', inplace=True)
FR_re.replace('nan','', inplace=True)
PL_re.fillna('', inplace=True)
PL_re.replace('nan','', inplace=True)
CH_re.fillna('', inplace=True)
CH_re.replace('nan','', inplace=True)
This section finally writes the Data Package:
path_package = 'output/renewable_power_plants'
os.makedirs(path_package, exist_ok=True)
**Write CSV files**
This process will take some time depending on your hardware.
One CSV file for each country.
# Germany
DE_re.to_csv(path_package + '/renewable_power_plants_DE.csv',
sep=',',
decimal='.',
date_format='%Y-%m-%d',
encoding='utf-8',
index=False)
# Denmark
DK_re.to_csv(path_package + '/renewable_power_plants_DK.csv',
sep=',',
decimal='.',
date_format='%Y-%m-%d',
encoding='utf-8',
index=False)
# France
FR_re.to_csv(path_package + '/renewable_power_plants_FR.csv',
sep=',',
decimal='.',
date_format='%Y-%m-%d',
encoding='utf-8',
index=False)
# Poland
PL_re.to_csv(path_package + '/renewable_power_plants_PL.csv',
sep=',',
decimal='.',
date_format='%Y-%m-%d',
encoding='utf-8',
index=False)
# Switzerland
CH_re.to_csv(path_package + '/renewable_power_plants_CH.csv',
sep=',',
decimal='.',
date_format='%Y-%m-%d',
encoding='utf-8',
index=False)
**Write XLSX file**
This process will take some time depending on your hardware.
All country power plant lists are written to one XLSX file, each in a separate sheet. As the German power plant list has too many entries for one sheet, it is split in two. An additional sheet includes the explanations of the validation markers.
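The hard-coded split at 1,000,000 rows in the cells below stays safely under the XLSX limit of 1,048,576 rows per sheet. The chunking could be generalized with a small helper; a minimal sketch (the helper name is illustrative):

```python
import pandas as pd

EXCEL_MAX_ROWS = 1000000  # stay below the xlsx limit of 1,048,576 rows per sheet

def sheet_chunks(df, chunk_rows=EXCEL_MAX_ROWS):
    """Yield (suffix, chunk) pairs so an oversized DataFrame can be written
    to several sheets, e.g. sheet names 'DE part-1', 'DE part-2', ..."""
    for i, start in enumerate(range(0, len(df), chunk_rows), start=1):
        yield 'part-{}'.format(i), df.iloc[start:start + chunk_rows]

# Toy example: 25 rows split into chunks of at most 10
df = pd.DataFrame({'x': range(25)})
chunks = list(sheet_chunks(df, chunk_rows=10))
```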
# In case there is a memory error, the output data can be saved in pickle files,
# the kernel stopped and all variables deleted. Then just the necessary data can be
# read back in, and the output should succeed without a memory error.
# DE_re.to_pickle('DE_re_before_output.pickle')
# DK_re.to_pickle('DK_re_before_output.pickle')
# FR_re.to_pickle('FR_re_before_output.pickle')
# PL_re.to_pickle('PL_re_before_output.pickle')
# In case the data has to be read in again, the first cell of the notebook has to
# be executed; then the data for the Excel output can be read in again.
# DE_re = pd.read_pickle('DE_re_before_output.pickle')
# DK_re = pd.read_pickle('DK_re_before_output.pickle')
# FR_re = pd.read_pickle('FR_re_before_output.pickle')
# PL_re = pd.read_pickle('PL_re_before_output.pickle')
# path_package = 'output/renewable_power_plants'
# Read csv of Marker Explanations
validation = pd.read_csv('input/validation_marker.csv',
sep=',', header=0)
# Write the results as xlsx file
%time writer = pd.ExcelWriter(path_package + '/renewable_power_plants.xlsx',engine='xlsxwriter',date_format='yyyy-mm-dd')
print('Writing DK')
%time DK_re.to_excel(writer, index=False, sheet_name='DK')
print('Writing FR')
%time FR_re.to_excel(writer, index=False, sheet_name='FR')
print('Writing PL')
%time PL_re.to_excel(writer, index=False, sheet_name='PL')
print('Writing CH')
%time CH_re.to_excel(writer, index=False, sheet_name='CH')
print('Writing DE part 1')
%time DE_re[:1000000].to_excel(writer, index=False, sheet_name='DE part-1')
print('Writing DE part 2')
%time DE_re[1000000:].to_excel(writer, index=False, sheet_name='DE part-2')
print('Writing validation')
# The explanation of validation markers is added as a sheet
%time validation.to_excel(writer, index=False, sheet_name='validation_marker')
print('Saving...')
# Close the Pandas Excel writer and output the Excel file.
%time writer.save()
print('...done!')
Write SQLite
DK_re.info()
%%time
# The decommissioning_date column gives the engine some trouble, which is
# why it is cast to the str type.
DE_re.decommissioning_date = DE_re.decommissioning_date.astype(str)
DE_re.commissioning_date = DE_re.commissioning_date.astype(str)
# Write the results to sqlite database. Using the chunksize parameter makes
# this cell not use so much memory. If the parameter is not set, the to_sql
# function will try to write all rows at the same time. This uses too much
# memory. If you have a lot of memory, you can remove the parameter or increase
# it to speed this process up. If you have memory problems, try decreasing the
# chunksize.
engine = sqlalchemy.create_engine(
'sqlite:///'+ path_package +'/renewable_power_plants.sqlite')
DE_re.to_sql('renewable_power_plants_DE',
engine,
if_exists="replace",
chunksize=100000,
index=False
)
DK_re.to_sql('renewable_power_plants_DK',
engine,
if_exists='replace',
chunksize=100000,
index=False
)
FR_re.to_sql('renewable_power_plants_FR',
engine,
if_exists="replace",
chunksize=100000,
index=False
)
PL_re.to_sql('renewable_power_plants_PL',
engine,
if_exists="replace",
chunksize=100000,
index=False
)
CH_re.to_sql('renewable_power_plants_CH',
engine,
if_exists="replace",
chunksize=100000,
index=False
)
validation.to_sql('validation_marker',
engine,
if_exists="replace",
chunksize=100000,
index=False
)
timeseries_daily.to_sql('renewable_capacity_timeseries_DE',
engine,
if_exists="replace",
chunksize=100000,
index=False
)
# Write daily cumulated time series as csv
timeseries_daily.to_csv(path_package + '/renewable_capacity_timeseries_DE.csv',
sep=',',
float_format='%.3f',
decimal='.',
date_format='%Y-%m-%d',
encoding='utf-8',
if_exists="replace")
The Data Package metadata is created in the specific JSON format proposed by the Open Knowledge Foundation. Please see the Frictionless Data project by OKFN (http://data.okfn.org/) and the Data Package specifications (http://dataprotocols.org/data-packages/) for more details.
To keep the Jupyter Notebook readable, the metadata is written in the human-readable YAML format as a multi-line string, which is then parsed into a Python dictionary and saved as a JSON file.
metadata = """
hide: yes
name: opsd-renewable-power-plants
title: Renewable power plants
description: List of renewable energy power stations
long_description: >-
This Data Package contains lists of renewable energy power plants of
Germany, Denmark, France, Poland and Switzerland.
Germany: More than 1.7 million renewable power plant entries, eligible under the
renewable support scheme (EEG).
Denmark: Wind and photovoltaic power plants with a high level of detail.
France: Aggregated capacity and number of installations per energy source per
municipality (Commune).
Poland: Summed capacity and number of installations per energy source
per municipality (Powiat).
Switzerland: Renewable power plants eligible under the Swiss feed in tariff KEV
(Kostendeckende Einspeisevergütung)
Due to different data availability, the power plant lists are of different
accuracy and partly provide different power plant parameters. The lists are
therefore provided as separate csv-files per country and as separate sheets in the
excel file. Suspect data or entries with a high probability of duplication are marked
in the column 'comment'. These validation markers are explained in the file
validation_marker.csv. Filtering out all entries with comments results in the recommended
data set.
Additionally, the Data Package includes a daily time series of cumulated
installed capacity per energy source type for Germany. All data processing is
conducted in Python and pandas and has been documented in the Jupyter Notebooks
linked below.
version: "2017-06-26"
keywords: [master data register,power plants,renewables,germany,denmark,france,poland,switzerland,open power system data]
geographical-scope: Germany, Denmark, France, Poland, Switzerland
resources:
- path: renewable_power_plants_DE.csv
format: csv
encoding: UTF-8
missingValue: ""
schema:
fields:
- name: commissioning_date
type: date
format: YYYY-MM-DD
description: Date of commissioning of specific unit
- name: decommissioning_date
type: date
format: YYYY-MM-DD
description: Date of decommissioning of specific unit
- name: energy_source_level_1
description: Type of energy source (e.g. Renewable energy)
type: string
- name: energy_source_level_2
description: Type of energy source (e.g. Wind, Solar)
type: string
opsd-contentfilter: "true"
- name: energy_source_level_3
description: Subtype of energy source (e.g. Biomass and biogas)
type: string
- name: technology
description: Technology to harvest energy source (e.g. Onshore, Photovoltaics)
type: string
- name: electrical_capacity
description: Installed electrical capacity in MW
type: number
format: float
unit: MW
- name: thermal_capacity
description: Installed thermal capacity in MW
type: number
format: float
unit: MW
- name: voltage_level
description: Voltage level of grid connection
type: string
- name: tso
description: Name of transmission system operator of the area the plant is located
type: string
- name: dso
description: Name of distribution system operator of the region the plant is located in
type: string
- name: dso_id
description: Company number of German distribution grid operator
type: string
- name: eeg_id
description: Power plant EEG (German feed-in tariff law) remuneration number
type: string
- name: bnetza_id
description: Power plant identification number by BNetzA
type: string
- name: federal_state
description: Name of German administrative level 'Bundesland'
type: string
- name: postcode
description: German zip-code
type: string
- name: municipality_code
description: German Gemeindenummer (municipalitiy number)
type: string
- name: municipality
description: Name of German Gemeinde (municipality)
type: string
- name: address
description: Street name or name of land parcel
type: string
- name: address_number
description: House number or number of land parcel
type: string
- name: utm_zone
description: Universal Transverse Mercator zone value
type: string
- name: utm_east
description: Coordinate in Universal Transverse Mercator (east)
type: numeric
format: float
- name: utm_north
description: Coordinate in Universal Transverse Mercator (north)
type: numeric
format: float
- name: lat
description: Latitude coordinates
type: geopoint
format: lat
- name: lon
description: Longitude coordinates
type: geopoint
format: lon
- name: data_source
description: Source of database entry
type: string
- name: comment
description: Shortcodes for comments related to this entry, explanation can be looked up in validation_marker.csv
type: string
- path: renewable_power_plants_DK.csv
format: csv
encoding: UTF-8
missingValue: ""
schema:
fields:
- name: commissioning_date
type: date
format: YYYY-MM-DD
- name: energy_source_level_1
description: Type of energy source (e.g. Renewable energy)
type: string
- name: energy_source_level_2
description: Type of energy source (e.g. Wind, Solar)
type: string
opsd-contentfilter: "true"
- name: technology
description: Technology to harvest energy source (e.g. Onshore, Photovoltaics)
type: string
- name: electrical_capacity
description: Installed electrical capacity in MW
type: number
format: float
- name: dso
description: Name of distribution system operator of the region the plant is located in
type: string
- name: gsrn_id
description: Danish wind turbine identifier number (GSRN)
type: number
format: integer
- name: postcode
description: Danish zip-code
type: string
- name: municipality_code
description: Danish 3-digit Kommune-Nr
type: string
- name: municipality
description: Name of Danish Kommune
type: string
- name: address
description: Street name or name of land parcel
type: string
- name: address_number
description: House number or number of land parcel
type: string
- name: utm_east
description: Coordinate in Universal Transverse Mercator (east)
type: numeric
format: float
- name: utm_north
description: Coordinate in Universal Transverse Mercator (north)
type: numeric
format: float
- name: lat
description: Latitude coordinates
type: geopoint
format: lat
- name: lon
description: Longitude coordinates
type: geopoint
format: lon
- name: hub_height
description: Wind turbine hub height in m
type: numeric
format: float
- name: rotor_diameter
description: Wind turbine rotor diameter in m
type: numeric
format: float
- name: manufacturer
description: Company that has built the wind turbine
type: string
- name: model
description: Wind turbine model type
type: string
- name: data_source
description: Source of database entry
type: string
- path: renewable_power_plants_FR.csv
format: csv
encoding: UTF-8
missingValue: ""
schema:
fields:
- name: municipality_code
description: French 5-digit INSEE code for Communes
type: string
- name: municipality
description: Name of French Commune
type: string
- name: energy_source_level_1
description: Type of energy source (e.g. Renewable energy)
type: string
- name: energy_source_level_2
description: Type of energy source (e.g. Wind, Solar)
type: string
opsd-contentfilter: "true"
- name: energy_source_level_3
description: Subtype of energy source (e.g. Biomass and biogas)
type: string
- name: technology
description: Technology to harvest energy source (e.g. Onshore, Photovoltaics)
type: string
- name: electrical_capacity
description: Installed electrical capacity in MW
type: number
format: float
- name: number_of_installations
description: Number of installations of the energy source subtype in the municipality
type: number
format: integer
- name: lat
description: Latitude coordinates
type: geopoint
format: lat
- name: lon
description: Longitude coordinates
type: geopoint
format: lon
- name: data_source
description: Source of database entry
type: string
- path: renewable_power_plants_PL.csv
format: csv
encoding: UTF-8
missingValue: ""
schema:
fields:
- name: district
description: Name of the Polish powiat
type: string
- name: energy_source_level_1
description: Type of energy source (e.g. Renewable energy)
type: string
- name: energy_source_level_2
description: Type of energy source (e.g. Wind, Solar)
type: string
opsd-contentfilter: "true"
- name: energy_source_level_3
description: Subtype of energy source (e.g. Biomass and biogas)
type: string
- name: technology
description: Technology to harvest energy source (e.g. Onshore, Photovoltaics)
type: string
- name: electrical_capacity
description: Installed electrical capacity in MW
type: number
format: float
- name: number_of_installations
description: Number of installations of the energy source subtype in the district
type: number
format: integer
- name: data_source
description: Source of database entry
type: string
- path: renewable_power_plants_CH.csv
format: csv
encoding: UTF-8
missingValue: ""
schema:
fields:
- name: commissioning_date
type: date
format: YYYY-MM-DD
- name: municipality
type: string
- name: energy_source_level_1
description: Type of energy source (e.g. Renewable energy)
type: string
- name: energy_source_level_2
description: Type of energy source (e.g. Wind, Solar)
type: string
opsd-contentfilter: "true"
- name: technology
description: Technology to harvest energy source (e.g. Onshore, Photovoltaics)
type: string
- name: electrical_capacity
description: Installed electrical capacity in MW
type: number
format: float
- name: municipality_code
type: number
format: integer
- name: project_name
description: Name of the project
type: string
- name: production
description: Yearly production in MWh
type: numeric
format: float
- name: tariff
description: Tariff in CHF for 2016
type: numeric
format: float
- name: notification_date
description: Date of data entry at BFE
type: date
format: YYYY-MM-DD HH:MM:SS.SSSSS
- name: street
description: Street name
type: string
- name: canton
description: Name of the canton (member state of the Swiss Confederation)
type: string
- name: company
description: Name of the company
type: string
- name: lat
description: Latitude coordinates
type: geopoint
format: lat
- name: lon
description: Longitude coordinates
type: geopoint
format: lon
- name: data_source
description: Source of database entry
type: string
- path: renewable_power_plants.xlsx
format: xlsx
- path: validation_marker.csv
format: csv
encoding: UTF-8
mediatype: text/csv
missingValue: ""
schema:
fields:
- name: Validation_Marker
description: Name of validation marker utilized in column comment in the renewable_power_plant_germany.csv
type: string
- name: Explanation
description: Comment explaining meaning of validation marker
type: string
- path: renewable_capacity_timeseries_DE.csv
format: csv
encoding: UTF-8
mediatype: text/csv
missingValue: ""
schema:
fields:
- name:
description: Day
type: datetime
format: YYYY-MM-DD
- name: Solar
description: Cumulated electrical solar capacity in MW
type: number
format: float
unit: MW
- name: Onshore
description: Cumulated electrical wind onshore capacity in MW
type: number
format: float
unit: MW
- name: Offshore
description: Cumulated electrical wind offshore capacity in MW
type: number
format: float
unit: MW
- name: Bioenergy and renewable waste
description: Cumulated electrical bioenergy and renewable waste capacity in MW
type: number
format: float
unit: MW
- name: Geothermal
description: Cumulated electrical geothermal capacity in MW
type: number
format: float
unit: MW
- name: Run-of-river
description: Cumulated electrical run-of-river capacity in MW
type: number
format: float
unit: MW
licenses:
- type: MIT license
url: http://www.opensource.org/licenses/MIT
sources:
- name: BNetzA
web: http://www.bundesnetzagentur.de/cln_1422/DE/Sachgebiete/ElektrizitaetundGas/Unternehmen_Institutionen/ErneuerbareEnergien/Anlagenregister/Anlagenregister_Veroeffentlichung/Anlagenregister_Veroeffentlichungen_node.html
source: Bundesnetzagentur register of renewable power plants (excl. PV)
- name: BNetzA_PV
web: http://www.bundesnetzagentur.de/cln_1431/DE/Sachgebiete/ElektrizitaetundGas/Unternehmen_Institutionen/ErneuerbareEnergien/Photovoltaik/DatenMeldgn_EEG-VergSaetze/DatenMeldgn_EEG-VergSaetze_node.html
source: Bundesnetzagentur register of PV power plants
- name: TransnetBW, TenneT, Amprion, 50Hertz, Netztransparenz.de
web: https://www.netztransparenz.de/de/Anlagenstammdaten.htm
source: Netztransparenz.de - information platform of German TSOs (register of renewable power plants in their control area)
- name: Postleitzahlen Deutschland
web: http://www.suche-postleitzahl.org/downloads
source: Zip codes of Germany linked to geo-information
- name: Energinet.dk
web: http://www.energinet.dk/SiteCollectionDocuments/Danske%20dokumenter/El/SolcelleGraf.xlsx
source: register of Danish wind power plants
- name: Energistyrelsen
web: http://www.ens.dk/sites/ens.dk/files/info/tal-kort/statistik-noegletal/oversigt-energisektoren/stamdataregister-vindmoeller/anlaegprodtilnettet.xls
source: ens.dk - register of Danish PV power plants
- name: GeoNames
web: http://download.geonames.org/export/zip/
source: geonames.org
- name: French Ministry of the Environment, Energy and the Sea
web: http://www.statistiques.developpement-durable.gouv.fr/energie-climat/r/energies-renouvelables.html?tx_ttnews[tt_news]=20647
source: register of renewable power plants by generation type and municipality
- name: OpenDataSoft
web: http://public.opendatasoft.com/explore/dataset/correspondance-code-insee-code-postal/download/?format=csv&refine.statut=Commune%20simple&timezone=Europe/Berlin&use_labels_for_header=true
source: Code Postal - Code INSEE
- name: Urzad Regulacji Energetyki (URE)
web: http://www.ure.gov.pl/uremapoze/mapa.html
source: Energy Regulatory Office of Poland
contributors:
- name: Kristian Reincke
email: kristian.reincke@uni-flensburg.de
web: http://open-power-system-data.org/
views: True
openpowersystemdata-enable-listing: True
documentation: https://github.com/Open-Power-System-Data/renewable_power_plants/blob/2017-06-26/main.ipynb
last_changes: Included Swiss data, Updated all sources to newest available data and bug fixes
"""
# safe_load avoids the PyYAML warning about calling yaml.load without a Loader
metadata = yaml.safe_load(metadata)
datapackage_json = json.dumps(metadata, indent=4, separators=(',', ': '))
# Write the information of the metadata
with open(os.path.join(path_package, 'datapackage.json'), 'w') as f:
f.write(datapackage_json)
Write checksums.txt
def get_sha_hash(path, blocksize=65536):
sha_hasher = hashlib.sha256()
with open(path, 'rb') as f:
buffer = f.read(blocksize)
while len(buffer) > 0:
sha_hasher.update(buffer)
buffer = f.read(blocksize)
return sha_hasher.hexdigest()
output_path = 'output/renewable_power_plants'
files = [
'renewable_power_plants_DE.csv', 'renewable_power_plants_DK.csv',
'renewable_power_plants_PL.csv', 'renewable_power_plants_FR.csv',
'renewable_power_plants_CH.csv',
'renewable_power_plants.sqlite', 'renewable_power_plants.xlsx',
'renewable_capacity_timeseries_DE.csv'
]
with open(os.path.join(output_path, 'checksums.txt'), 'w') as f:
for file_name in files:
file_hash = get_sha_hash(os.path.join(output_path, file_name))
f.write('{},{}\n'.format(file_name, file_hash))
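Downstream users can verify downloaded files against checksums.txt by recomputing each hash. A self-contained sketch, using hypothetical demo file names rather than the package's actual output files:

```python
import hashlib

def get_sha_hash(path, blocksize=65536):
    """SHA-256 of a file, read in blocks so large files fit in memory."""
    sha_hasher = hashlib.sha256()
    with open(path, 'rb') as f:
        for buffer in iter(lambda: f.read(blocksize), b''):
            sha_hasher.update(buffer)
    return sha_hasher.hexdigest()

# Create a small demo file and a matching checksums line
with open('demo.txt', 'wb') as f:
    f.write(b'opsd')
with open('checksums_demo.txt', 'w') as f:
    f.write('{},{}\n'.format('demo.txt', get_sha_hash('demo.txt')))

# Verification pass: recompute every hash listed in the checksums file
ok = True
with open('checksums_demo.txt') as f:
    for line in f:
        name, expected = line.strip().split(',')
        ok = ok and get_sha_hash(name) == expected
```

After the loop, `ok` is True only if every file matches its recorded hash, which is exactly the check a consumer of the Data Package would run.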