import pandas as pd
import numpy as np
import datetime
from sqlalchemy import create_engine

DATABASE_ENDPOINT = "aqueduct30v05.cgpnumwmfcqc.eu-central-1.rds.amazonaws.com"
DATABASE_NAME = "database01"
TABLE_NAME = "y2018m05d29_rh_total_demand_postgis_30spfaf06_v01_v01"

# Read the database password from a local file.
with open("/.password", "r") as f:
    password = f.read().splitlines()[0]

engine = create_engine("postgresql://rutgerhofste:{}@{}:5432/{}".format(password, DATABASE_ENDPOINT, DATABASE_NAME))
connection = engine.connect()

# Peek at the first 10 rows.
sql = "SELECT * FROM {} LIMIT 10".format(TABLE_NAME)
df = pd.read_sql(sql, connection)
The table is the result of running zonal statistics on a climate model. I am relatively new to PostgreSQL and haven't set any indexes yet. The database is on AWS RDS on an xlarge instance. None of the columns is unique. "pfafid_30spfaf06" is a zonal code for water basins; there are in total approximately 16,000 unique pfaf_ids. The other relevant columns are year [1960-2014], month [1-12] and temporal_resolution ["year", "month"].
Result of the following query (PgAdmin):
SELECT pg_size_pretty(pg_total_relation_size('"public"."y2018m05d29_rh_total_demand_postgis_30spfaf06_v01_v01"'));
Successfully run. Total query runtime: 425 msec. 1 rows affected:
11 GB
Result of the following query (PgAdmin):
SELECT count(*) FROM y2018m05d29_rh_total_demand_postgis_30spfaf06_v01_v01
Successfully run. Total query runtime: 52 secs. 1 rows affected.
11715275 i.e. 11,715,275 rows
df.head()
 | pfafid_30spfaf06 | area_count_30spfaf06 | area_m2_30spfaf06 | month | pdomwn_count_30spfaf06 | pdomwn_m_30spfaf06 | pdomww_count_30spfaf06 | pdomww_m_30spfaf06 | pindwn_count_30spfaf06 | pindwn_m_30spfaf06 | ... | plivww_count_30spfaf06 | plivww_m_30spfaf06 | riverdischarge_count_30spfaf06 | riverdischarge_m_30spfaf06 | temporal_resolution | year | input_file_name | datetime_stamp | ptotwn_m_30spfaf06 | ptotww_m_30spfaf06
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 482012 | 15572 | 1.027443e+10 | 12 | 15572 | 0.000007 | 15572 | 0.000007 | 15572 | 1.803221e-07 | ... | 15572 | 0.000004 | 15462.0 | 0.078879 | year | 1967 | global_historical_merged_year_m_30sPfaf06_1960... | 2018-05-31 16:42:33.339340 | 0.000089 | 0.000218 |
1 | 482013 | 2665 | 1.749258e+09 | 12 | 2665 | 0.000018 | 2665 | 0.000018 | 2665 | 0.000000e+00 | ... | 2665 | 0.000005 | 2665.0 | 0.463177 | year | 1967 | global_historical_merged_year_m_30sPfaf06_1960... | 2018-05-31 16:42:33.339340 | 0.000058 | 0.000115 |
2 | 482014 | 7927 | 5.247636e+09 | 12 | 7927 | 0.000020 | 7927 | 0.000020 | 7927 | 7.687544e-06 | ... | 7927 | 0.000016 | 7872.0 | 0.000123 | year | 1967 | global_historical_merged_year_m_30sPfaf06_1960... | 2018-05-31 16:42:33.339340 | 0.000115 | 0.000265 |
3 | 482015 | 6268 | 4.100086e+09 | 12 | 6268 | 0.000020 | 6268 | 0.000020 | 6268 | 0.000000e+00 | ... | 6268 | 0.000004 | 6268.0 | 0.197547 | year | 1967 | global_historical_merged_year_m_30sPfaf06_1960... | 2018-05-31 16:42:33.339340 | 0.000023 | 0.000023 |
4 | 482016 | 51248 | 3.393310e+10 | 12 | 51248 | 0.000033 | 51248 | 0.000033 | 51248 | 2.805116e-05 | ... | 51248 | 0.000027 | 51248.0 | 0.023862 | year | 1967 | global_historical_merged_year_m_30sPfaf06_1960... | 2018-05-31 16:42:33.339340 | 0.001410 | 0.003697 |
5 rows × 28 columns
I know that combining month and year into a single datetime column is considered best practice, but for my future use keeping them separate is easier.
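If a real date is ever needed it can still be derived on the fly; a minimal sketch (assuming PostgreSQL 9.4+, which provides make_date; the alias is just illustrative):

-- Build a date from the separate year/month columns, with the day fixed to 1.
SELECT make_date(year, month, 1) AS date_approx
FROM y2018m05d29_rh_total_demand_postgis_30spfaf06_v01_v01
LIMIT 10;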
The query I'd like to run calculates a 10-year moving average for three columns.
For example, the 10-year annual moving average for 1969 is the average of the 1960-1969 values. For the monthly moving average the window is filtered by month: the January 1969 value is the average of Jan 1960, Jan 1961, ..., Jan 1969.
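In window-function terms, I think this amounts to something like the sketch below (partitioning per basin, resolution and month, averaging over the current row plus the nine preceding years), though I am not sure whether it is correct, let alone fast:

-- Sketch only: 10-year moving average per basin.
-- PARTITION BY month restricts monthly rows to the same calendar month across years
-- (annual rows all share month = 12, so they form their own partition per basin).
SELECT
    pfafid_30spfaf06,
    temporal_resolution,
    year,
    month,
    AVG(ptotww_m_30spfaf06) OVER (
        PARTITION BY pfafid_30spfaf06, temporal_resolution, month
        ORDER BY year
        ROWS BETWEEN 9 PRECEDING AND CURRENT ROW
    ) AS ptotwwma_m_30spfaf06
FROM y2018m05d29_rh_total_demand_postgis_30spfaf06_v01_v01;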
The query I have so far:
SELECT
    year,
    ptotww_m_30spfaf06,
    temporal_resolution,
    SUM(ptotww_m_30spfaf06) OVER (ORDER BY year ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS ptotwwma_m_30spfaf06
FROM y2018m05d29_rh_total_demand_postgis_30spfaf06_v01_v01
WHERE temporal_resolution = 'year'
LIMIT 200
Successfully run. Total query runtime: 52 secs. 200 rows affected.
However, this is quite slow and the query does not work for the monthly scores. Do I need to set an index? If so, on which columns (year, month)?
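For the indexing part I was thinking of a composite index along these lines (the index name is made up and the column order is a guess, not something I have tested):

-- Guess: composite index matching the filter (temporal_resolution) and the
-- window columns (basin, month, year).
CREATE INDEX idx_total_demand_resolution_basin_month_year
    ON y2018m05d29_rh_total_demand_postgis_30spfaf06_v01_v01
    (temporal_resolution, pfafid_30spfaf06, month, year);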