Weather data: Main notebook
This notebook is part of the Weather Data Data Package of Open Power System Data.
This script contains code to download, subset, and process MERRA-2 datasets (provided by the NASA Goddard Space Flight Center) and export them as CSV.
Weather data differ significantly from the other data types used or provided by OPSD: the sheer size of the datasets greatly exceeds OPSD's capacity to host them in the same way as feed-in time series, power plant data, etc. While the other data packages offer a complete one-click download of all relevant data, this is impossible for weather datasets like MERRA-2 because of their size (variety of variables, very long timespan, huge geographical coverage, etc.). Mirroring the data from the NASA servers would therefore make no sense.
Instead, we choose to provide only a documented methodological script. The method describes one way to obtain the desired weather data automatically from the MERRA-2 database, unifying the alternative manual methods of obtaining the data in a single script.
More detailed background information on weather data can be found in the Main notebook and the OPSD wiki on GitHub.
The script is still in development but already working. The next step will be adding support for more datasets and more weather variables.
To download MERRA-2 data, you have to register at the NASA Earthdata portal.
import pandas as pd
import xarray as xr
import numpy as np
import requests
import logging
import yaml
import json
import os
from calendar import monthrange
from opendap_download.multi_processing_download import DownloadManager
import math
from functools import partial
import re
import getpass
from datetime import datetime, timedelta
import dateutil.parser
# Set up a log
logging.basicConfig(level=logging.INFO)
log = logging.getLogger('notebook')
This part defines the input parameters according to the user's choices and creates a URL that downloads the desired MERRA-2 data via the OPeNDAP interface.
These are the possible parameters (i.e. weather data types).
If you want to select more than one parameter, separate them with commas, for example: wind, solar radiation, temperature
# Getting user input
# This version only supports wind so far. This line does not do anything at
# this point.
possible_params = ['wind', 'solar radiation', 'temperature']
Definition of the desired timespan the data is needed for (currently only full years are possible).
# User input of timespan
download_year = 2014
# Create the start date 2014-01-01
download_start_date = str(download_year) + '-01-01'
Definition of the desired coordinates. The user has to input two corner coordinates of a rectangular area (format WGS84, decimal system).
The area/coordinates will be converted from lat/lon to MERRA-2 grid coordinates. Since the resolution of the MERRA-2 grid is 0.5 x 0.625°, the given coordinates will be matched as closely as possible.
# User input of coordinates
# ------
# Example: Schleswig-Holstein (lat/lon)
# Northeastern point: 55.036823°N, 11.349297°E
# Southwestern point: 53.366266°N, 7.887088°E
# Southwestern coordinate
lat_1, lon_1 = 53.366266, 7.887088
# Northeastern coordinate
lat_2, lon_2 = 55.036823, 11.349297
def translate_lat_to_geos5_native(latitude):
"""
The source for this formula is in the MERRA2
Variable Details - File specifications for GEOS pdf file.
latitude: float Needs +/- instead of N/S
"""
return 1 + ((latitude + 90) / 0.5)
def translate_lon_to_geos5_native(longitude):
"""See function above"""
return (1 + ((longitude) + 180) / 0.625)
def find_closest_coordinate(calc_coord, coord_array):
"""
Since the resolution of the grid is 0.5 x 0.625, the 'real world'
coordinates will not be matched 100% correctly. This function matches
the coordinates as close as possible.
"""
# np.argmin() finds the smallest value in an array and returns its
# index. np.abs() returns the absolute value of each item of an array.
# To summarize, the function finds the difference closest to 0 and returns
# its index.
index = np.abs(coord_array-calc_coord).argmin()
return coord_array[index]
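As a quick sanity check, the helpers above can be combined to map a real-world coordinate onto the grid. The following self-contained sketch restates the notebook's functions and uses the southwestern example point (53.366266°N, 7.887088°E):

```python
import numpy as np

def translate_lat_to_geos5_native(latitude):
    # Native GEOS-5 grid index for a latitude (0.5 deg resolution).
    return 1 + ((latitude + 90) / 0.5)

def translate_lon_to_geos5_native(longitude):
    # Native GEOS-5 grid index for a longitude (0.625 deg resolution).
    return 1 + ((longitude + 180) / 0.625)

def find_closest_coordinate(calc_coord, coord_array):
    # Grid coordinate with the smallest absolute difference.
    return coord_array[np.abs(coord_array - calc_coord).argmin()]

lat_coords = np.arange(0, 360, dtype=int)
lon_coords = np.arange(0, 575, dtype=int)

# Southwestern example point from above: 53.366266 N, 7.887088 E
lat_idx = translate_lat_to_geos5_native(53.366266)   # ~287.73
lon_idx = translate_lon_to_geos5_native(7.887088)    # ~301.62
print(find_closest_coordinate(lat_idx, lat_coords))  # 288
print(find_closest_coordinate(lon_idx, lon_coords))  # 302
```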
# The arrays contain the coordinates of the grid used by the API.
lat_coords = np.arange(0, 360, dtype=int)
lon_coords = np.arange(0, 575, dtype=int)
# Translate the coordinates that define your area to grid coordinates.
lat_coord_1 = translate_lat_to_geos5_native(lat_1)
lon_coord_1 = translate_lon_to_geos5_native(lon_1)
lat_coord_2 = translate_lat_to_geos5_native(lat_2)
lon_coord_2 = translate_lon_to_geos5_native(lon_2)
# Find the closest coordinate in the grid.
lat_co_1_closest = find_closest_coordinate(lat_coord_1, lat_coords)
lon_co_1_closest = find_closest_coordinate(lon_coord_1, lon_coords)
lat_co_2_closest = find_closest_coordinate(lat_coord_2, lat_coords)
lon_co_2_closest = find_closest_coordinate(lon_coord_2, lon_coords)
# Check the precision of the grid coordinates. These coordinates are not lat/lon.
# They are coordinates on the MERRA-2 grid.
log.info('Calculated coordinates for point 1: ' + str((lat_coord_1, lon_coord_1)))
log.info('Closest coordinates for point 1: ' + str((lat_co_1_closest, lon_co_1_closest)))
log.info('Calculated coordinates for point 2: ' + str((lat_coord_2, lon_coord_2)))
log.info('Closest coordinates for point 2: ' + str((lat_co_2_closest, lon_co_2_closest)))
Combining the parameter choices above and translating them according to the OPeNDAP guidelines into a URL appendix.
def translate_year_to_file_number(year):
"""
The file names consist of a number and a meta data string.
The number changes over the years. 1980 until 1991 it is 100,
1992 until 2000 it is 200, 2001 until 2010 it is 300
and from 2011 until now it is 400.
"""
file_number = ''
if year >= 1980 and year < 1992:
file_number = '100'
elif year >= 1992 and year < 2001:
file_number = '200'
elif year >= 2001 and year < 2011:
file_number = '300'
elif year >= 2011:
file_number = '400'
else:
raise Exception('The specified year is out of range.')
return file_number
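As a quick check, the mapping above can be exercised with one year from each range (a self-contained restatement of the function):

```python
def translate_year_to_file_number(year):
    # MERRA-2 file-name number ("stream"), which depends on the year.
    if 1980 <= year < 1992:
        return '100'
    elif 1992 <= year < 2001:
        return '200'
    elif 2001 <= year < 2011:
        return '300'
    elif year >= 2011:
        return '400'
    raise Exception('The specified year is out of range.')

for y in (1985, 1995, 2005, 2014):
    print(y, translate_year_to_file_number(y))
# 1985 100, 1995 200, 2005 300, 2014 400
```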
def generate_url_params(parameter, time_para, lat_para, lon_para):
"""Creates a string containing all the parameters in query form"""
parameter = map(lambda x: x + time_para, parameter)
parameter = map(lambda x: x + lat_para, parameter)
parameter = map(lambda x: x + lon_para, parameter)
return ','.join(parameter)
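For illustration, here is the query string that generate_url_params produces for two variables; the index ranges are hypothetical placeholders, not the ones computed above:

```python
def generate_url_params(parameter, time_para, lat_para, lon_para):
    # Append the time, lat and lon subset ranges to every variable name,
    # then join the variables with commas.
    parameter = map(lambda x: x + time_para, parameter)
    parameter = map(lambda x: x + lat_para, parameter)
    parameter = map(lambda x: x + lon_para, parameter)
    return ','.join(parameter)

params = generate_url_params(['U2M', 'V2M'], '[0:1:23]',
                             '[287:1:291]', '[300:1:306]')
print(params)
# U2M[0:1:23][287:1:291][300:1:306],V2M[0:1:23][287:1:291][300:1:306]
```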
def generate_download_links(download_years, base_url, dataset_name, url_params):
"""
Generates the links for the download.
download_years: The years you want to download as array.
dataset_name: The name of the data set. For example tavg1_2d_slv_Nx
"""
urls = []
    for y in download_years:
        # Build the file number for this year (not for download_year,
        # which would be wrong when several years are requested).
        y_str = str(y)
        file_num = translate_year_to_file_number(y)
for m in range(1,13):
# build the month string: for the month 1 - 9 it starts with a leading 0.
# zfill solves that problem
m_str = str(m).zfill(2)
# monthrange returns the first weekday and the number of days in a
# month. Also works for leap years.
_, nr_of_days = monthrange(y, m)
for d in range(1,nr_of_days+1):
d_str = str(d).zfill(2)
# Create the file name string
file_name = 'MERRA2_{num}.{name}.{y}{m}{d}.nc4'.format(
num=file_num, name=dataset_name,
y=y_str, m=m_str, d=d_str)
# Create the query
query = '{base}{y}/{m}/{name}.nc4?{params}'.format(
base=base_url, y=y_str, m=m_str,
name=file_name, params=url_params)
urls.append(query)
return urls
requested_params = ['U2M', 'U10M', 'U50M', 'V2M', 'V10M', 'V50M', 'DISPH']
requested_time = '[0:1:23]'
# Creates a string that looks like [start:1:end]. start and end are the lat or
# lon coordinates that define your area.
requested_lat = '[{lat_1}:1:{lat_2}]'.format(
lat_1=lat_co_1_closest, lat_2=lat_co_2_closest)
requested_lon = '[{lon_1}:1:{lon_2}]'.format(
lon_1=lon_co_1_closest, lon_2=lon_co_2_closest)
parameter = generate_url_params(requested_params, requested_time,
requested_lat, requested_lon)
BASE_URL = 'http://goldsmr4.sci.gsfc.nasa.gov:80/opendap/MERRA2/M2T1NXSLV.5.12.4/'
generated_URL = generate_download_links([download_year], BASE_URL,
'tavg1_2d_slv_Nx',
parameter)
# See what a query to the MERRA-2 portal looks like.
log.info(generated_URL[0])
This part subsequently downloads the subsetted raw data from the MERRA-2 datasets. The download process is outsourced from the notebook because it is a standard, repetitive process. If you are interested in the code, see the opendap_download module.
from the dataset tavg1_2d_slv_Nx (M2T1NXSLV)
# Download data (one file per day and dataset) with links to local directory.
# Username and password for MERRA-2 (NASA earthdata portal)
username = input('Username: ')
password = getpass.getpass('Password:')
# The DownloadManager is able to download files. If you have a fast internet
# connection, setting this to 2 is enough. If you have slow wifi, try setting
# it to 4 or 5. If you download too fast, the data portal might ban you for a
# day.
NUMBER_OF_CONNECTIONS = 2
# The DownloadManager class is defined in the opendap_download module.
download_manager = DownloadManager()
download_manager.set_username_and_password(username, password)
download_manager.download_path = 'download'
download_manager.download_urls = generated_URL
# If you want to see the download progress, check the download folder you
# specified
%time download_manager.start_download(NUMBER_OF_CONNECTIONS)
from the dataset tavg1_2d_flx_Nx (M2T1NXFLX)
# Roughness data is in a different data set. The parameter is called Z0M.
roughness_para = generate_url_params(['Z0M'], requested_time,
requested_lat, requested_lon)
ROUGHNESS_BASE_URL = 'http://goldsmr4.sci.gsfc.nasa.gov/opendap/MERRA2/M2T1NXFLX.5.12.4/'
roughness_links = generate_download_links([download_year], ROUGHNESS_BASE_URL,
'tavg1_2d_flx_Nx', roughness_para)
download_manager.download_path = 'roughness_download'
download_manager.download_urls = roughness_links
# If you want to see the download progress, check the download folder you
# specified.
%time download_manager.start_download(NUMBER_OF_CONNECTIONS)
For now, the dataset only has MERRA-2 grid coordinates. To translate the points back to "real world" coordinates, the data portal offers a dimension scale file.
# The dimensions map the MERRA-2 grid coordinates to lat/lon. The coordinates
# to request run from 0 to 360, whereas the grid coordinates above run from 1 to 361.
requested_lat_dim = '[{lat_1}:1:{lat_2}]'.format(
lat_1=lat_co_1_closest - 1, lat_2=lat_co_2_closest - 1)
requested_lon_dim = '[{lon_1}:1:{lon_2}]'.format(
lon_1=lon_co_1_closest - 1, lon_2=lon_co_2_closest - 1)
lat_lon_dimension_para = 'lat' + requested_lat_dim + ',lon' + requested_lon_dim
# Creating the download url.
dimension_url = 'http://goldsmr4.sci.gsfc.nasa.gov:80/opendap/MERRA2/M2T1NXSLV.5.12.4/2014/01/MERRA2_400.tavg1_2d_slv_Nx.20140101.nc4.nc4?'
dimension_url = dimension_url + lat_lon_dimension_para
download_manager.download_path = 'dimension_scale'
download_manager.download_urls = [dimension_url]
# Since the dimension scale is only one file, we only need one connection.
%time download_manager.start_download(1)
Due to the back-and-forth conversion between "real world" coordinates and MERRA-2 grid points, this part helps you check whether the conversion was precise enough.
file_path = os.path.join('dimension_scale', DownloadManager.get_filename(
dimension_url))
with xr.open_dataset(file_path) as ds_dim:
df_dim = ds_dim.to_dataframe()
lat_array = ds_dim['lat'].data.tolist()
lon_array = ds_dim['lon'].data.tolist()
# The log output helps evaluating the precision of the received data.
log.info('Requested lat: ' + str((lat_1, lat_2)))
log.info('Received lat: ' + str(lat_array))
log.info('Requested lon: ' + str((lon_1, lon_2)))
log.info('Received lon: ' + str(lon_array))
This part sets up a DataFrame and reads the raw data into it.
def extract_date(data_set):
"""
Extracts the date from the filename before merging the datasets.
"""
try:
# The attribute name changed during the development of this script
# from HDF5_Global.Filename to Filename.
if 'HDF5_GLOBAL.Filename' in data_set.attrs:
f_name = data_set.attrs['HDF5_GLOBAL.Filename']
elif 'Filename' in data_set.attrs:
f_name = data_set.attrs['Filename']
else:
raise AttributeError('The attribute name has changed again!')
        # Find the date substring between "." and ".nc4" that contains no ".".
exp = r'(?<=\.)[^\.]*(?=\.nc4)'
res = re.search(exp, f_name).group(0)
# Extract the date.
y, m, d = res[0:4], res[4:6], res[6:8]
date_str = ('%s-%s-%s' % (y, m, d))
data_set = data_set.assign(date=date_str)
return data_set
except KeyError:
# The last dataset is the one all the other sets will be merged into.
# Therefore, no date can be extracted.
return data_set
file_path = os.path.join('download', '*.nc4')
try:
with xr.open_mfdataset(file_path, concat_dim='date',
preprocess=extract_date) as ds_wind:
print(ds_wind)
df_wind = ds_wind.to_dataframe()
except xr.MergeError as e:
print(e)
df_wind.reset_index(inplace=True)
start_date = datetime.strptime(download_start_date, '%Y-%m-%d')
def calculate_datetime(d_frame):
"""
Calculates the accumulated hour based on the date.
"""
cur_date = datetime.strptime(d_frame['date'], '%Y-%m-%d')
hour = int(d_frame['time'])
delta = abs(cur_date - start_date).days
date_time_value = (delta * 24) + (hour)
return date_time_value
df_wind['date_time_hours'] = df_wind.apply(calculate_datetime, axis=1)
df_wind
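The accumulated-hour logic of calculate_datetime can be checked in isolation. This sketch reimplements the calculation for plain values instead of DataFrame rows, assuming a start date of 2014-01-01:

```python
from datetime import datetime

start_date = datetime(2014, 1, 1)

def hours_since_start(date_str, hour):
    # Whole days since the start date, times 24, plus the hour of day.
    delta = abs(datetime.strptime(date_str, '%Y-%m-%d') - start_date).days
    return delta * 24 + hour

print(hours_since_start('2014-01-01', 0))   # 0
print(hours_since_start('2014-01-02', 5))   # 29
print(hours_since_start('2014-02-01', 0))   # 744 (31 days * 24)
```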
def converting_timeformat_to_ISO8601(row):
"""Generates datetime according to ISO 8601 (UTC)"""
date = dateutil.parser.parse(row['date'])
hour = int(row['time'])
# timedelta from the datetime module enables the programmer
# to add time to a date.
date_time = date + timedelta(hours = hour)
return str(date_time.isoformat()) + 'Z' # MERRA2 datasets have UTC time zone.
df_wind['date_utc'] = df_wind.apply(converting_timeformat_to_ISO8601, axis=1)
df_wind['date_utc']
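The timestamp conversion above amounts to adding the hour of day to the date and appending 'Z' for UTC. A minimal sketch using only the standard library:

```python
from datetime import datetime, timedelta

def to_iso8601_utc(date_str, hour):
    # MERRA-2 timestamps are UTC, hence the trailing 'Z'.
    date = datetime.strptime(date_str, '%Y-%m-%d')
    return (date + timedelta(hours=hour)).isoformat() + 'Z'

print(to_iso8601_utc('2014-01-01', 5))   # 2014-01-01T05:00:00Z
```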
def calculate_windspeed(d_frame, idx_u, idx_v):
    """
    Calculates the wind speed from the u and v components.
    The returned unit is m/s.
    """
    # Use float, not int: truncating the components to integers would
    # discard most of the precision.
    um = float(d_frame[idx_u])
    vm = float(d_frame[idx_v])
    speed = math.sqrt((um ** 2) + (vm ** 2))
    return round(speed, 2)
# partial is used to create a function with pre-set arguments.
calc_windspeed_2m = partial(calculate_windspeed, idx_u='U2M', idx_v='V2M')
calc_windspeed_10m = partial(calculate_windspeed, idx_u='U10M', idx_v='V10M')
calc_windspeed_50m = partial(calculate_windspeed, idx_u='U50M', idx_v='V50M')
df_wind['v_2m'] = df_wind.apply(calc_windspeed_2m, axis=1)
df_wind['v_10m']= df_wind.apply(calc_windspeed_10m, axis=1)
df_wind['v_50m'] = df_wind.apply(calc_windspeed_50m, axis=1)
df_wind
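calculate_windspeed above is simply the magnitude of the horizontal (u, v) wind vector; for example, u = 3 m/s and v = 4 m/s give 5 m/s. A standalone sketch:

```python
import math

def windspeed(u, v):
    # Magnitude of the horizontal wind vector, rounded to 2 decimals.
    return round(math.sqrt(u ** 2 + v ** 2), 2)

print(windspeed(3.0, 4.0))   # 5.0
print(windspeed(1.5, 2.0))   # 2.5
```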
file_path = os.path.join('roughness_download', '*.nc4')
with xr.open_mfdataset(file_path, concat_dim='date',
preprocess=extract_date) as ds_rough:
df_rough = ds_rough.to_dataframe()
df_rough.reset_index(inplace=True)
df_rough
df = pd.merge(df_wind, df_rough, on=['date', 'lat', 'lon', 'time'])
df.info()
# Calculate height for v_2m and v_10m (2 + DISPH or 10 + DISPH).
df['h_v1'] = df.apply((lambda x:int(x['DISPH']) + 2), axis=1)
df['h_v2'] = df.apply((lambda x:int(x['DISPH']) + 10), axis=1)
df['v_100m'] = np.nan
df.drop('DISPH', axis=1, inplace=True)
df.drop(['time', 'date'], axis=1, inplace=True)
df.drop(['U2M', 'U10M', 'U50M', 'V2M', 'V10M', 'V50M'], axis=1, inplace=True)
df['lat'] = df['lat'].apply(lambda x: lat_array[int(x)])
df['lon'] = df['lon'].apply(lambda x: lon_array[int(x)])
df
rename_map = {'date_time_hours': 'cumulated hours',
'date_utc': 'timestamp',
'v_2m': 'v1',
'v_10m': 'v2',
'Z0M': 'z0'
}
df.rename(columns=rename_map, inplace=True)
# Change order of the columns
columns = ['timestamp', 'cumulated hours', 'lat', 'lon',
'v1', 'v2', 'v_50m', 'v_100m',
'h_v1', 'h_v2', 'z0']
df = df[columns]
df.info()
Take a look at the resulting dataframe
df
Save the final dataframe locally
df.to_csv('weather_data_result.csv', index=False)
metadata = """
name: opsd-weather-data
title: Weather data
description: Script for the download of MERRA-2 weather data
long_description: >-
  Weather data differ significantly from the other data types used or
  provided by OPSD: the sheer size of the datasets greatly exceeds
  OPSD's capacity to host them in the same way as feed-in time series,
  power plant data, etc. While the other data packages offer a complete
  one-click download of all relevant data, this is impossible for
  weather datasets like MERRA-2 because of their size (variety of
  variables, very long timespan, huge geographical coverage, etc.).
  Mirroring the data from the NASA servers would therefore make no sense.
  Instead, we choose to provide a documented methodological script
  (as a kind of tutorial). The method describes one way to obtain the
  desired weather data automatically from the MERRA-2 database,
  unifying the alternative manual methods of obtaining the data in a
  single script.
version: "2016-10-21"
resources:
- path: weather_data_result.csv
format: csv
encoding: UTF-8
missingValue: ""
schema:
fields:
- name: timestamp
type: timestamp
format: YYYY-MM-DDThh:mm:ssZ
description: Timestamp
- name: cumulated hours
type: int
description: Cumulated hours over the defined time range
- name: lat
description: Latitude
type: float
- name: lon
description: Longitude
type: float
- name: v1
description: Windspeed at displacement height + 2m
type: float
unit: m/s
- name: v2
description: Windspeed at displacement height + 10m
type: float
unit: m/s
- name: v_50m
description: Windspeed at 50m height
type: float
unit: m/s
- name: v_100m
description: Windspeed at 100m height
type: float
unit: m/s
- name: h_v1
description: Height of v1
type: float
unit: m
- name: h_v2
description: Height of v2
type: float
unit: m
- name: z0
description: Roughness length
type: float
keywords: [Open Power System Data, MERRA-2, wind, solar]
geographical-scope: Worldwide
licenses:
- type: MIT license
url: http://www.opensource.org/licenses/MIT
sources:
- name: MERRA-2
web: https://gmao.gsfc.nasa.gov/reanalysis/MERRA-2/
source: National Aeronautics and Space Administration - Goddard Space Flight Center
contributors:
- name: Jan Urbansky
email: jan.urbansky@uni-flensburg.de
web:
views: True
openpowersystemdata-enable-listing: True
documentation: https://github.com/Open-Power-System-Data/weather_data/blob/2016-10-21/main.ipynb
last_changes: Published on the main repository
"""
metadata = yaml.safe_load(metadata)
datapackage_json = json.dumps(metadata, indent=4, separators=(',', ': '))
with open('datapackage.json', 'w') as f:
f.write(datapackage_json)