import pandas as pd
import numpy as np
import datetime
from sqlalchemy import create_engine

DATABASE_ENDPOINT = "aqueduct30v05.cgpnumwmfcqc.eu-central-1.rds.amazonaws.com"
DATABASE_NAME = "database01"
TABLE_NAME = "y2018m05d29_rh_total_demand_postgis_30spfaf06_v01_v01"

# Read the database password from a local file.
with open("/.password", "r") as f:
    password = f.read().splitlines()[0]

engine = create_engine("postgresql://rutgerhofste:{}@{}:5432/{}".format(password, DATABASE_ENDPOINT, DATABASE_NAME))
connection = engine.connect()

# Peek at the first 10 rows.
sql = "SELECT * FROM {} LIMIT 10".format(TABLE_NAME)
df = pd.read_sql(sql, connection)
The table is the result of running zonal statistics on a climate model. I am relatively new to PostgreSQL and haven't set any indexes yet. The database is on AWS RDS on an xlarge instance. None of the columns is unique. "pfafid_30spfaf06" is a zonal code for water basins; there are in total approximately 16,000 unique pfaf_ids. The other relevant columns are year [1960-2014], month [1-12] and temporal_resolution ["year", "month"].
Result of the following query (PgAdmin):
SELECT pg_size_pretty(pg_total_relation_size('"public"."y2018m05d29_rh_total_demand_postgis_30spfaf06_v01_v01"'));
Successfully run. Total query runtime: 425 msec. 1 rows affected:
11 GB
Result of the following query (PgAdmin):
SELECT count(*) FROM y2018m05d29_rh_total_demand_postgis_30spfaf06_v01_v01
Successfully run. Total query runtime: 52 secs. 1 rows affected.
11715275 i.e. 11,715,275 rows
df.head()
 | pfafid_30spfaf06 | area_count_30spfaf06 | area_m2_30spfaf06 | month | pdomwn_count_30spfaf06 | pdomwn_m_30spfaf06 | pdomww_count_30spfaf06 | pdomww_m_30spfaf06 | pindwn_count_30spfaf06 | pindwn_m_30spfaf06 | ... | plivww_count_30spfaf06 | plivww_m_30spfaf06 | riverdischarge_count_30spfaf06 | riverdischarge_m_30spfaf06 | temporal_resolution | year | input_file_name | datetime_stamp | ptotwn_m_30spfaf06 | ptotww_m_30spfaf06
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 482012 | 15572 | 1.027443e+10 | 12 | 15572 | 0.000007 | 15572 | 0.000007 | 15572 | 1.803221e-07 | ... | 15572 | 0.000004 | 15462.0 | 0.078879 | year | 1967 | global_historical_merged_year_m_30sPfaf06_1960... | 2018-05-31 16:42:33.339340 | 0.000089 | 0.000218 |
1 | 482013 | 2665 | 1.749258e+09 | 12 | 2665 | 0.000018 | 2665 | 0.000018 | 2665 | 0.000000e+00 | ... | 2665 | 0.000005 | 2665.0 | 0.463177 | year | 1967 | global_historical_merged_year_m_30sPfaf06_1960... | 2018-05-31 16:42:33.339340 | 0.000058 | 0.000115 |
2 | 482014 | 7927 | 5.247636e+09 | 12 | 7927 | 0.000020 | 7927 | 0.000020 | 7927 | 7.687544e-06 | ... | 7927 | 0.000016 | 7872.0 | 0.000123 | year | 1967 | global_historical_merged_year_m_30sPfaf06_1960... | 2018-05-31 16:42:33.339340 | 0.000115 | 0.000265 |
3 | 482015 | 6268 | 4.100086e+09 | 12 | 6268 | 0.000020 | 6268 | 0.000020 | 6268 | 0.000000e+00 | ... | 6268 | 0.000004 | 6268.0 | 0.197547 | year | 1967 | global_historical_merged_year_m_30sPfaf06_1960... | 2018-05-31 16:42:33.339340 | 0.000023 | 0.000023 |
4 | 482016 | 51248 | 3.393310e+10 | 12 | 51248 | 0.000033 | 51248 | 0.000033 | 51248 | 2.805116e-05 | ... | 51248 | 0.000027 | 51248.0 | 0.023862 | year | 1967 | global_historical_merged_year_m_30sPfaf06_1960... | 2018-05-31 16:42:33.339340 | 0.001410 | 0.003697 |
5 rows × 28 columns
I know that combining month and year into a single datetime column is considered best practice, but for my future use keeping them separate is easier.
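If a real date is ever needed it can still be derived on the fly; a minimal sketch (assuming PostgreSQL 9.4+, which provides make_date; the alias is just illustrative):

-- Build a date from the separate year/month columns, with the day fixed to 1.
SELECT make_date(year, month, 1) AS date_approx
FROM y2018m05d29_rh_total_demand_postgis_30spfaf06_v01_v01
LIMIT 10;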
The query I'd like to run calculates a 10-year moving average for three columns.
For example, the 10-year annual moving average for 1969 is the average of the 1960-1969 values. For the monthly moving average the window is filtered by month: the January 1969 value is the average of Jan 1960, Jan 1961, ..., Jan 1969.
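In window-function terms, I think this amounts to something like the sketch below (partitioning per basin, resolution and month, averaging over the current row plus the nine preceding years), though I am not sure whether it is correct, let alone fast:

-- Sketch only: 10-year moving average per basin.
-- PARTITION BY month restricts monthly rows to the same calendar month across years
-- (annual rows all share month = 12, so they form their own partition per basin).
SELECT
    pfafid_30spfaf06,
    temporal_resolution,
    year,
    month,
    AVG(ptotww_m_30spfaf06) OVER (
        PARTITION BY pfafid_30spfaf06, temporal_resolution, month
        ORDER BY year
        ROWS BETWEEN 9 PRECEDING AND CURRENT ROW
    ) AS ptotwwma_m_30spfaf06
FROM y2018m05d29_rh_total_demand_postgis_30spfaf06_v01_v01;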
The query I have so far:
SELECT
    year,
    ptotww_m_30spfaf06,
    temporal_resolution,
    SUM(ptotww_m_30spfaf06) OVER (ORDER BY year ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS ptotwwma_m_30spfaf06
FROM y2018m05d29_rh_total_demand_postgis_30spfaf06_v01_v01
WHERE temporal_resolution = 'year'
LIMIT 200
Successfully run. Total query runtime: 52 secs. 200 rows affected.
However, this is quite slow and the query does not work for the monthly scores. Do I need to set an index? If so, on which columns (year, month)?
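For the indexing part I was thinking of a composite index along these lines (the index name is made up and the column order is a guess, not something I have tested):

-- Guess: composite index matching the filter (temporal_resolution) and the
-- window columns (basin, month, year).
CREATE INDEX idx_total_demand_resolution_basin_month_year
    ON y2018m05d29_rh_total_demand_postgis_30spfaf06_v01_v01
    (temporal_resolution, pfafid_30spfaf06, month, year);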