# Timeseries Trend Analysis¶

This notebook runs a simple trend analysis on the timeseries shown in the time series viewer: https://proba-v-mep.esa.int/applications/time-series-viewer/app/app.html It demonstrates how Spark can be used on the MEP Hadoop cluster, from within a notebook.

## Disclaimer¶

This notebook is not scientifically validated, only meant as a technical demonstration on a real world use case.

## Context¶

So some context first about the data we are going to manipulate.

• The dataset contains FAPAR data (Fraction of Absorbed Photosynthetically Active Radiation)
• The data was computed using raster data taken by a satelite
• Normally, there should be one measurement every 10 days for every zone and landcover.
• Due to external conditions (e.g. clouds), data can be missing.

## Goal¶

Our goal is pretty simple: we want to detect if there are any trends for each zone.

## General approach¶

We will proceed as follows:

1. Clean missing measurements from the original dataset
2. Group measurements by zone and landcover
3. For every zone:
1. Do some cleaning for every landcover
2. Combine the different landcovers data
3. Remove seasonality by taking yearly averages
4. Compute the trend
4. Plot the trends on a world map.

We will detail each step along the way.

In [1]:
import pyspark
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import datetime, time

%matplotlib inline


Let's read the parquet file containing all our data, as follows:

In [2]:
from pyspark.conf import SparkConf
conf = SparkConf()
conf.set('spark.executor.memory', '4g')

Out[2]:
<pyspark.conf.SparkConf at 0x30b96d0>

The SparkContext 'sc' is our entry point to the Hadoop cluster. The sparkConf defines some parameters.

In [3]:
sc = pyspark.SparkContext(conf=conf)

In [4]:
sqlCtx = pyspark.SQLContext(sc)

In [5]:
!hdfs dfs -ls /tapdata/tsviewer/combined-stats/stats.parquet

Found 169 items
-rw-r--r--   3 mep_tsviewer vito          0 2017-03-22 00:20 /tapdata/tsviewer/combined-stats/stats.parquet/_SUCCESS
-rw-r--r--   3 mep_tsviewer vito   25357414 2017-03-22 00:07 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00000.parquet
-rw-r--r--   3 mep_tsviewer vito   25341813 2017-03-22 00:07 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00001.parquet
-rw-r--r--   3 mep_tsviewer vito   25344785 2017-03-22 00:07 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00002.parquet
-rw-r--r--   3 mep_tsviewer vito   25325074 2017-03-22 00:07 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00003.parquet
-rw-r--r--   3 mep_tsviewer vito   25382644 2017-03-22 00:07 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00004.parquet
-rw-r--r--   3 mep_tsviewer vito   25368452 2017-03-22 00:07 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00005.parquet
-rw-r--r--   3 mep_tsviewer vito   25338382 2017-03-22 00:07 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00006.parquet
-rw-r--r--   3 mep_tsviewer vito   25364026 2017-03-22 00:07 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00007.parquet
-rw-r--r--   3 mep_tsviewer vito   25343301 2017-03-22 00:07 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00008.parquet
-rw-r--r--   3 mep_tsviewer vito   25337037 2017-03-22 00:07 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00009.parquet
-rw-r--r--   3 mep_tsviewer vito   25361096 2017-03-22 00:08 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00010.parquet
-rw-r--r--   3 mep_tsviewer vito   25355121 2017-03-22 00:08 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00011.parquet
-rw-r--r--   3 mep_tsviewer vito   25374168 2017-03-22 00:08 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00012.parquet
-rw-r--r--   3 mep_tsviewer vito   25373881 2017-03-22 00:08 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00013.parquet
-rw-r--r--   3 mep_tsviewer vito   25343740 2017-03-22 00:08 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00014.parquet
-rw-r--r--   3 mep_tsviewer vito   25375775 2017-03-22 00:08 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00015.parquet
-rw-r--r--   3 mep_tsviewer vito   25376607 2017-03-22 00:08 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00016.parquet
-rw-r--r--   3 mep_tsviewer vito   25332092 2017-03-22 00:08 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00017.parquet
-rw-r--r--   3 mep_tsviewer vito   25363406 2017-03-22 00:08 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00018.parquet
-rw-r--r--   3 mep_tsviewer vito   25364663 2017-03-22 00:08 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00019.parquet
-rw-r--r--   3 mep_tsviewer vito   25345982 2017-03-22 00:08 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00020.parquet
-rw-r--r--   3 mep_tsviewer vito   25364568 2017-03-22 00:08 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00021.parquet
-rw-r--r--   3 mep_tsviewer vito   25345751 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00022.parquet
-rw-r--r--   3 mep_tsviewer vito   25359454 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00023.parquet
-rw-r--r--   3 mep_tsviewer vito   25362565 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00024.parquet
-rw-r--r--   3 mep_tsviewer vito   25342732 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00025.parquet
-rw-r--r--   3 mep_tsviewer vito   25371436 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00026.parquet
-rw-r--r--   3 mep_tsviewer vito   25364058 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00027.parquet
-rw-r--r--   3 mep_tsviewer vito   25314740 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00028.parquet
-rw-r--r--   3 mep_tsviewer vito   25362923 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00029.parquet
-rw-r--r--   3 mep_tsviewer vito   25376056 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00030.parquet
-rw-r--r--   3 mep_tsviewer vito   25339711 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00031.parquet
-rw-r--r--   3 mep_tsviewer vito   25397722 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00032.parquet
-rw-r--r--   3 mep_tsviewer vito   25379909 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00033.parquet
-rw-r--r--   3 mep_tsviewer vito   25335473 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00034.parquet
-rw-r--r--   3 mep_tsviewer vito   25364742 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00035.parquet
-rw-r--r--   3 mep_tsviewer vito   25383452 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00036.parquet
-rw-r--r--   3 mep_tsviewer vito   25353719 2017-03-22 00:10 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00037.parquet
-rw-r--r--   3 mep_tsviewer vito   25371553 2017-03-22 00:10 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00038.parquet
-rw-r--r--   3 mep_tsviewer vito   25364836 2017-03-22 00:10 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00039.parquet
-rw-r--r--   3 mep_tsviewer vito   25354000 2017-03-22 00:10 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00040.parquet
-rw-r--r--   3 mep_tsviewer vito   25372447 2017-03-22 00:10 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00041.parquet
-rw-r--r--   3 mep_tsviewer vito   25352126 2017-03-22 00:10 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00042.parquet
-rw-r--r--   3 mep_tsviewer vito   25354362 2017-03-22 00:10 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00043.parquet
-rw-r--r--   3 mep_tsviewer vito   25362814 2017-03-22 00:10 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00044.parquet
-rw-r--r--   3 mep_tsviewer vito   25347834 2017-03-22 00:10 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00045.parquet
-rw-r--r--   3 mep_tsviewer vito   25337259 2017-03-22 00:10 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00046.parquet
-rw-r--r--   3 mep_tsviewer vito   25349579 2017-03-22 00:10 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00047.parquet
-rw-r--r--   3 mep_tsviewer vito   25338758 2017-03-22 00:11 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00048.parquet
-rw-r--r--   3 mep_tsviewer vito   25359736 2017-03-22 00:11 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00049.parquet
-rw-r--r--   3 mep_tsviewer vito   25340555 2017-03-22 00:11 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00050.parquet
-rw-r--r--   3 mep_tsviewer vito   25359822 2017-03-22 00:11 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00051.parquet
-rw-r--r--   3 mep_tsviewer vito   25377064 2017-03-22 00:11 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00052.parquet
-rw-r--r--   3 mep_tsviewer vito   25361970 2017-03-22 00:11 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00053.parquet
-rw-r--r--   3 mep_tsviewer vito   25338795 2017-03-22 00:11 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00054.parquet
-rw-r--r--   3 mep_tsviewer vito   25345861 2017-03-22 00:11 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00055.parquet
-rw-r--r--   3 mep_tsviewer vito   25361802 2017-03-22 00:11 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00056.parquet
-rw-r--r--   3 mep_tsviewer vito   25356596 2017-03-22 00:11 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00057.parquet
-rw-r--r--   3 mep_tsviewer vito   25364383 2017-03-22 00:11 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00058.parquet
-rw-r--r--   3 mep_tsviewer vito   25383945 2017-03-22 00:11 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00059.parquet
-rw-r--r--   3 mep_tsviewer vito   25357862 2017-03-22 00:11 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00060.parquet
-rw-r--r--   3 mep_tsviewer vito   25344097 2017-03-22 00:12 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00061.parquet
-rw-r--r--   3 mep_tsviewer vito   25394550 2017-03-22 00:12 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00062.parquet
-rw-r--r--   3 mep_tsviewer vito   25361621 2017-03-22 00:12 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00063.parquet
-rw-r--r--   3 mep_tsviewer vito   25328248 2017-03-22 00:12 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00064.parquet
-rw-r--r--   3 mep_tsviewer vito   25360018 2017-03-22 00:12 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00065.parquet
-rw-r--r--   3 mep_tsviewer vito   25353412 2017-03-22 00:12 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00066.parquet
-rw-r--r--   3 mep_tsviewer vito   25353366 2017-03-22 00:12 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00067.parquet
-rw-r--r--   3 mep_tsviewer vito   25366552 2017-03-22 00:12 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00068.parquet
-rw-r--r--   3 mep_tsviewer vito   25336160 2017-03-22 00:12 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00069.parquet
-rw-r--r--   3 mep_tsviewer vito   25352581 2017-03-22 00:12 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00070.parquet
-rw-r--r--   3 mep_tsviewer vito   25356921 2017-03-22 00:12 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00071.parquet
-rw-r--r--   3 mep_tsviewer vito   25337459 2017-03-22 00:13 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00072.parquet
-rw-r--r--   3 mep_tsviewer vito   25352658 2017-03-22 00:12 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00073.parquet
-rw-r--r--   3 mep_tsviewer vito   25362806 2017-03-22 00:12 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00074.parquet
-rw-r--r--   3 mep_tsviewer vito   25332933 2017-03-22 00:12 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00075.parquet
-rw-r--r--   3 mep_tsviewer vito   25337050 2017-03-22 00:13 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00076.parquet
-rw-r--r--   3 mep_tsviewer vito   25351615 2017-03-22 00:13 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00077.parquet
-rw-r--r--   3 mep_tsviewer vito   25372093 2017-03-22 00:13 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00078.parquet
-rw-r--r--   3 mep_tsviewer vito   25400785 2017-03-22 00:13 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00079.parquet
-rw-r--r--   3 mep_tsviewer vito   25355679 2017-03-22 00:13 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00080.parquet
-rw-r--r--   3 mep_tsviewer vito   25363894 2017-03-22 00:13 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00081.parquet
-rw-r--r--   3 mep_tsviewer vito   25374263 2017-03-22 00:13 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00082.parquet
-rw-r--r--   3 mep_tsviewer vito   25349140 2017-03-22 00:13 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00083.parquet
-rw-r--r--   3 mep_tsviewer vito   25386733 2017-03-22 00:13 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00084.parquet
-rw-r--r--   3 mep_tsviewer vito   25380380 2017-03-22 00:13 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00085.parquet
-rw-r--r--   3 mep_tsviewer vito   25349083 2017-03-22 00:13 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00086.parquet
-rw-r--r--   3 mep_tsviewer vito   25399829 2017-03-22 00:13 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00087.parquet
-rw-r--r--   3 mep_tsviewer vito   25378706 2017-03-22 00:14 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00088.parquet
-rw-r--r--   3 mep_tsviewer vito   25345343 2017-03-22 00:13 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00089.parquet
-rw-r--r--   3 mep_tsviewer vito   25358393 2017-03-22 00:14 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00090.parquet
-rw-r--r--   3 mep_tsviewer vito   25346919 2017-03-22 00:14 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00091.parquet
-rw-r--r--   3 mep_tsviewer vito   25332395 2017-03-22 00:14 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00092.parquet
-rw-r--r--   3 mep_tsviewer vito   25355408 2017-03-22 00:14 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00093.parquet
-rw-r--r--   3 mep_tsviewer vito   25350583 2017-03-22 00:14 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00094.parquet
-rw-r--r--   3 mep_tsviewer vito   25337139 2017-03-22 00:14 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00095.parquet
-rw-r--r--   3 mep_tsviewer vito   25349278 2017-03-22 00:14 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00096.parquet
-rw-r--r--   3 mep_tsviewer vito   25350053 2017-03-22 00:14 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00097.parquet
-rw-r--r--   3 mep_tsviewer vito   25361480 2017-03-22 00:14 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00098.parquet
-rw-r--r--   3 mep_tsviewer vito   25375630 2017-03-22 00:15 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00099.parquet
-rw-r--r--   3 mep_tsviewer vito   25375661 2017-03-22 00:14 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00100.parquet
-rw-r--r--   3 mep_tsviewer vito   25345687 2017-03-22 00:14 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00101.parquet
-rw-r--r--   3 mep_tsviewer vito   25342268 2017-03-22 00:15 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00102.parquet
-rw-r--r--   3 mep_tsviewer vito   25334411 2017-03-22 00:15 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00103.parquet
-rw-r--r--   3 mep_tsviewer vito   25357199 2017-03-22 00:15 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00104.parquet
-rw-r--r--   3 mep_tsviewer vito   25375453 2017-03-22 00:15 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00105.parquet
-rw-r--r--   3 mep_tsviewer vito   25365719 2017-03-22 00:15 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00106.parquet
-rw-r--r--   3 mep_tsviewer vito   25381858 2017-03-22 00:15 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00107.parquet
-rw-r--r--   3 mep_tsviewer vito   25353112 2017-03-22 00:15 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00108.parquet
-rw-r--r--   3 mep_tsviewer vito   25354219 2017-03-22 00:15 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00109.parquet
-rw-r--r--   3 mep_tsviewer vito   25376862 2017-03-22 00:15 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00110.parquet
-rw-r--r--   3 mep_tsviewer vito   25370664 2017-03-22 00:15 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00111.parquet
-rw-r--r--   3 mep_tsviewer vito   25357291 2017-03-22 00:15 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00112.parquet
-rw-r--r--   3 mep_tsviewer vito   25363815 2017-03-22 00:16 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00113.parquet
-rw-r--r--   3 mep_tsviewer vito   25379000 2017-03-22 00:15 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00114.parquet
-rw-r--r--   3 mep_tsviewer vito   25346606 2017-03-22 00:15 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00115.parquet
-rw-r--r--   3 mep_tsviewer vito   25338025 2017-03-22 00:16 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00116.parquet
-rw-r--r--   3 mep_tsviewer vito   25353108 2017-03-22 00:16 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00117.parquet
-rw-r--r--   3 mep_tsviewer vito   25351054 2017-03-22 00:16 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00118.parquet
-rw-r--r--   3 mep_tsviewer vito   25331492 2017-03-22 00:16 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00119.parquet
-rw-r--r--   3 mep_tsviewer vito   25352670 2017-03-22 00:16 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00120.parquet
-rw-r--r--   3 mep_tsviewer vito   25341291 2017-03-22 00:16 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00121.parquet
-rw-r--r--   3 mep_tsviewer vito   25325189 2017-03-22 00:16 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00122.parquet
-rw-r--r--   3 mep_tsviewer vito   25341454 2017-03-22 00:16 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00123.parquet
-rw-r--r--   3 mep_tsviewer vito   25334407 2017-03-22 00:16 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00124.parquet
-rw-r--r--   3 mep_tsviewer vito   25371710 2017-03-22 00:16 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00125.parquet
-rw-r--r--   3 mep_tsviewer vito   25400455 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00126.parquet
-rw-r--r--   3 mep_tsviewer vito   25334676 2017-03-22 00:16 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00127.parquet
-rw-r--r--   3 mep_tsviewer vito   25362616 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00128.parquet
-rw-r--r--   3 mep_tsviewer vito   25363886 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00129.parquet
-rw-r--r--   3 mep_tsviewer vito   25336678 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00130.parquet
-rw-r--r--   3 mep_tsviewer vito   25352216 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00131.parquet
-rw-r--r--   3 mep_tsviewer vito   25371067 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00132.parquet
-rw-r--r--   3 mep_tsviewer vito   25373034 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00133.parquet
-rw-r--r--   3 mep_tsviewer vito   25366880 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00134.parquet
-rw-r--r--   3 mep_tsviewer vito   25341113 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00135.parquet
-rw-r--r--   3 mep_tsviewer vito   25390356 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00136.parquet
-rw-r--r--   3 mep_tsviewer vito   25375039 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00137.parquet
-rw-r--r--   3 mep_tsviewer vito   25334315 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00138.parquet
-rw-r--r--   3 mep_tsviewer vito   25370581 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00139.parquet
-rw-r--r--   3 mep_tsviewer vito   25370598 2017-03-22 00:18 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00140.parquet
-rw-r--r--   3 mep_tsviewer vito   25318438 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00141.parquet
-rw-r--r--   3 mep_tsviewer vito   25367727 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00142.parquet
-rw-r--r--   3 mep_tsviewer vito   25361599 2017-03-22 00:18 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00143.parquet
-rw-r--r--   3 mep_tsviewer vito   25327646 2017-03-22 00:18 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00144.parquet
-rw-r--r--   3 mep_tsviewer vito   25366143 2017-03-22 00:18 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00145.parquet
-rw-r--r--   3 mep_tsviewer vito   25373734 2017-03-22 00:18 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00146.parquet
-rw-r--r--   3 mep_tsviewer vito   25350089 2017-03-22 00:18 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00147.parquet
-rw-r--r--   3 mep_tsviewer vito   25347975 2017-03-22 00:18 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00148.parquet
-rw-r--r--   3 mep_tsviewer vito   25336055 2017-03-22 00:18 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00149.parquet
-rw-r--r--   3 mep_tsviewer vito   25321908 2017-03-22 00:18 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00150.parquet
-rw-r--r--   3 mep_tsviewer vito   25364109 2017-03-22 00:18 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00151.parquet
-rw-r--r--   3 mep_tsviewer vito   25366125 2017-03-22 00:19 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00152.parquet
-rw-r--r--   3 mep_tsviewer vito   25376436 2017-03-22 00:18 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00153.parquet
-rw-r--r--   3 mep_tsviewer vito   25388034 2017-03-22 00:18 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00154.parquet
-rw-r--r--   3 mep_tsviewer vito   25346212 2017-03-22 00:19 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00155.parquet
-rw-r--r--   3 mep_tsviewer vito   25346212 2017-03-22 00:18 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00156.parquet
-rw-r--r--   3 mep_tsviewer vito   25364542 2017-03-22 00:19 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00157.parquet
-rw-r--r--   3 mep_tsviewer vito   25361136 2017-03-22 00:19 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00158.parquet
-rw-r--r--   3 mep_tsviewer vito   25357436 2017-03-22 00:19 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00159.parquet
-rw-r--r--   3 mep_tsviewer vito   25357944 2017-03-22 00:19 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00160.parquet
-rw-r--r--   3 mep_tsviewer vito   25367168 2017-03-22 00:19 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00161.parquet
-rw-r--r--   3 mep_tsviewer vito   25368636 2017-03-22 00:19 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00162.parquet
-rw-r--r--   3 mep_tsviewer vito   25342537 2017-03-22 00:19 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00163.parquet
-rw-r--r--   3 mep_tsviewer vito   25343544 2017-03-22 00:19 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00164.parquet
-rw-r--r--   3 mep_tsviewer vito   25355247 2017-03-22 00:19 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00165.parquet
-rw-r--r--   3 mep_tsviewer vito   25353852 2017-03-22 00:20 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00166.parquet
-rw-r--r--   3 mep_tsviewer vito   25326912 2017-03-22 00:20 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00167.parquet

In [6]:
data = sqlCtx.read.parquet("hdfs:/tapdata/tsviewer/combined-stats/stats.parquet")
data

Out[6]:
DataFrame[landcover: string, zone: int, date: string, BIOPAR_ALB_BHV_V1: double, BIOPAR_ALB_BHV_V1_area: bigint, BIOPAR_ALB_BHV_V1_validArea: bigint, BIOPAR_ALB_DHV_V1: double, BIOPAR_ALB_DHV_V1_area: bigint, BIOPAR_ALB_DHV_V1_validArea: bigint, BIOPAR_BA_V1: double, BIOPAR_BA_V1_area: bigint, BIOPAR_BA_V1_validArea: bigint, BIOPAR_DMP: double, BIOPAR_DMP_area: bigint, BIOPAR_DMP_validArea: bigint, BIOPAR_FAPAR_V1: double, BIOPAR_FAPAR_V1_area: bigint, BIOPAR_FAPAR_V1_validArea: bigint, BIOPAR_FCOVER_V1: double, BIOPAR_FCOVER_V1_area: bigint, BIOPAR_FCOVER_V1_validArea: bigint, BIOPAR_LAI_V1: double, BIOPAR_LAI_V1_area: bigint, BIOPAR_LAI_V1_validArea: bigint, BIOPAR_NDVI_V1: double, BIOPAR_NDVI_V1_area: bigint, BIOPAR_NDVI_V1_validArea: bigint, BIOPAR_NDVI_V2: double, BIOPAR_NDVI_V2_area: bigint, BIOPAR_NDVI_V2_validArea: bigint, BIOPAR_SWI10_V3: double, BIOPAR_SWI10_V3_area: bigint, BIOPAR_SWI10_V3_validArea: bigint, BIOPAR_TOCR_B0: double, BIOPAR_TOCR_B0_area: bigint, BIOPAR_TOCR_B0_validArea: bigint, BIOPAR_TOCR_B2: double, BIOPAR_TOCR_B2_area: bigint, BIOPAR_TOCR_B2_validArea: bigint, BIOPAR_TOCR_B3: double, BIOPAR_TOCR_B3_area: bigint, BIOPAR_TOCR_B3_validArea: bigint, BIOPAR_TOCR_MIR: double, BIOPAR_TOCR_MIR_area: bigint, BIOPAR_TOCR_MIR_validArea: bigint, BIOPAR_VCI: double, BIOPAR_VCI_area: bigint, BIOPAR_VCI_validArea: bigint, BIOPAR_VPI: double, BIOPAR_VPI_area: bigint, BIOPAR_VPI_validArea: bigint, BIOPAR_WB_V1: double, BIOPAR_WB_V1_area: bigint, BIOPAR_WB_V1_validArea: bigint, BIOPAR_WB_V2: double, BIOPAR_WB_V2_area: bigint, BIOPAR_WB_V2_validArea: bigint, PROBAV_L3_S10_TOC_RED_333M: double, PROBAV_L3_S10_TOC_RED_333M_area: bigint, PROBAV_L3_S10_TOC_RED_333M_validArea: bigint, PROBAV_L3_S10_TOC_NIR_333M: double, PROBAV_L3_S10_TOC_NIR_333M_area: bigint, PROBAV_L3_S10_TOC_NIR_333M_validArea: bigint, PROBAV_L3_S10_TOC_BLUE_333M: double, PROBAV_L3_S10_TOC_BLUE_333M_area: bigint, PROBAV_L3_S10_TOC_BLUE_333M_validArea: bigint, PROBAV_L3_S10_TOC_SWIR_333M: double, PROBAV_L3_S10_TOC_SWIR_333M_area: bigint, PROBAV_L3_S10_TOC_SWIR_333M_validArea: bigint, PROBAV_L3_S10_TOC_NDVI_333M: double, PROBAV_L3_S10_TOC_NDVI_333M_area: bigint, PROBAV_L3_S10_TOC_NDVI_333M_validArea: bigint, RAINFALL: double, RAINFALL_area: bigint, RAINFALL_validArea: bigint]

Next, we want to register this data in a temporary table so we can query it using Spark's SQLContext.

In [7]:
data.registerTempTable("data")


Now although we instructed Spark where the data was and to register it, it did not actually read anything. This is because Spark is lazy: it will postpone doing any task until it the last moment. Requesting data will force it to load it, as follows:

In [8]:
data.take(1)

Out[8]:
[Row(landcover=u'Water_bodies', zone=3219, date=u'2007-11-01', BIOPAR_ALB_BHV_V1=0.14052462320017708, BIOPAR_ALB_BHV_V1_area=11163, BIOPAR_ALB_BHV_V1_validArea=11076, BIOPAR_ALB_DHV_V1=0.14695226555861848, BIOPAR_ALB_DHV_V1_area=11163, BIOPAR_ALB_DHV_V1_validArea=11076, BIOPAR_BA_V1=0.0009853981904505958, BIOPAR_BA_V1_area=11163, BIOPAR_BA_V1_validArea=11, BIOPAR_DMP=None, BIOPAR_DMP_area=None, BIOPAR_DMP_validArea=None, BIOPAR_FAPAR_V1=0.21222769551344417, BIOPAR_FAPAR_V1_area=11163, BIOPAR_FAPAR_V1_validArea=10995, BIOPAR_FCOVER_V1=0.11557305349856734, BIOPAR_FCOVER_V1_area=11163, BIOPAR_FCOVER_V1_validArea=10995, BIOPAR_LAI_V1=0.28491411937776656, BIOPAR_LAI_V1_area=11163, BIOPAR_LAI_V1_validArea=10995, BIOPAR_NDVI_V1=0.27314899691044103, BIOPAR_NDVI_V1_area=11163, BIOPAR_NDVI_V1_validArea=10995, BIOPAR_NDVI_V2=None, BIOPAR_NDVI_V2_area=None, BIOPAR_NDVI_V2_validArea=None, BIOPAR_SWI10_V3=None, BIOPAR_SWI10_V3_area=None, BIOPAR_SWI10_V3_validArea=None, BIOPAR_TOCR_B0=None, BIOPAR_TOCR_B0_area=None, BIOPAR_TOCR_B0_validArea=None, BIOPAR_TOCR_B2=None, BIOPAR_TOCR_B2_area=None, BIOPAR_TOCR_B2_validArea=None, BIOPAR_TOCR_B3=None, BIOPAR_TOCR_B3_area=None, BIOPAR_TOCR_B3_validArea=None, BIOPAR_TOCR_MIR=None, BIOPAR_TOCR_MIR_area=None, BIOPAR_TOCR_MIR_validArea=None, BIOPAR_VCI=None, BIOPAR_VCI_area=None, BIOPAR_VCI_validArea=None, BIOPAR_VPI=None, BIOPAR_VPI_area=None, BIOPAR_VPI_validArea=None, BIOPAR_WB_V1=None, BIOPAR_WB_V1_area=None, BIOPAR_WB_V1_validArea=None, BIOPAR_WB_V2=None, BIOPAR_WB_V2_area=None, BIOPAR_WB_V2_validArea=None, PROBAV_L3_S10_TOC_RED_333M=None, PROBAV_L3_S10_TOC_RED_333M_area=None, PROBAV_L3_S10_TOC_RED_333M_validArea=None, PROBAV_L3_S10_TOC_NIR_333M=None, PROBAV_L3_S10_TOC_NIR_333M_area=None, PROBAV_L3_S10_TOC_NIR_333M_validArea=None, PROBAV_L3_S10_TOC_BLUE_333M=None, PROBAV_L3_S10_TOC_BLUE_333M_area=None, PROBAV_L3_S10_TOC_BLUE_333M_validArea=None, PROBAV_L3_S10_TOC_SWIR_333M=None, PROBAV_L3_S10_TOC_SWIR_333M_area=None, PROBAV_L3_S10_TOC_SWIR_333M_validArea=None, PROBAV_L3_S10_TOC_NDVI_333M=None, PROBAV_L3_S10_TOC_NDVI_333M_area=None, PROBAV_L3_S10_TOC_NDVI_333M_validArea=None, RAINFALL=2.1919554834854207, RAINFALL_area=1406, RAINFALL_validArea=1406)]

Using Spark's SQLContext, we can query our data with SQL-like expressions. Here we are interested in the zone, landcover, date, FAPAR and area columns.

We will also filter out unwanted rows by specifying that we do not want rows for which we do not have a FAPAR value.

In [11]:
data = sqlCtx.sql("SELECT zone, landcover, date date, BIOPAR_FAPAR_V1 fapar, BIOPAR_FAPAR_V1 " +
"FROM data " +
"WHERE BIOPAR_FAPAR_V1 is not null").cache()
data.take(1)

Out[11]:
[Row(zone=3219, landcover=u'Water_bodies', date=u'2006-08-11', fapar=0.29711740277250515, BIOPAR_FAPAR_V1=0.29711740277250515)]

As you can see, although null values have been filtered out, some FAPAR values still remain nan. We can filter them out using the filter method from Spark. For that, we can either use the DataFrame's method filter, or we can use the DataFrame as a regular RDD.

The difference is that the DataFrame contains Rows, for which we can refer to the columns with their names, whereas the RDD contains tuples, where we have to select columns based on their index. However, with a regular RDD, we can pass any method we want to filter, map, reduce, ..., whereas with a DataFrame, we have to use pre-defined functions.

In [12]:
def fapar_not_nan(row):
return row[3] is not None and not np.isnan(row[3])

withoutNaNs = data.rdd.filter(fapar_not_nan).cache()
withoutNaNs.take(1)

Out[12]:
[Row(zone=3219, landcover=u'Water_bodies', date=u'2006-06-21', fapar=0.24140815792410703, BIOPAR_FAPAR_V1=0.24140815792410703)]

For grouping, we could use spark's groupBy method, or we could first map our data to a (Key, Value) pair and use the groupByKey method. We will use the latter method as we have more fined control over what goes into Value.

In [13]:
def by_zone(row):
return ((row[0]), (row[1], row[2], row[3], row[4]))

by_zone = withoutNaNs.map(by_zone).groupByKey().cache()
by_zone.take(1)

Out[13]:
[(672, <pyspark.resultiterable.ResultIterable at 0x601efd0>)]

The result is a list of pairs (Zone, Values) where Values is a list containing all the values for the specific Zone we grouped by.

Spark will combine the different values into a ResultIterable object, not really suited for the analysis we will be doing. Instead, we are going to use Pandas and Numpy. So first thing we want to do is to convert our ResultIterable into an actual Pandas DataFrame.

In [14]:
def iterable_to_pd(row):
return (row[0], pd.DataFrame(list(row[1]), columns=["landcover", "date", "fapar", "area"]))

by_zone = by_zone.map(iterable_to_pd).cache()
by_zone.take(1)

Out[14]:
[(672,                                                landcover        date  \
4                Mosaic_herbaceous_cover__tree_and_shrub  2000-06-01
5                Mosaic_herbaceous_cover__tree_and_shrub  2004-06-21
6                Mosaic_herbaceous_cover__tree_and_shrub  2010-10-11
7                Mosaic_herbaceous_cover__tree_and_shrub  2011-01-01
8                Mosaic_herbaceous_cover__tree_and_shrub  2015-01-21
9          Sparse_vegetation_tree_shrub_herbaceous_cover  2000-04-21
10         Sparse_vegetation_tree_shrub_herbaceous_cover  2006-08-11
11         Sparse_vegetation_tree_shrub_herbaceous_cover  2012-12-01
12         Sparse_vegetation_tree_shrub_herbaceous_cover  2016-12-21
13                                             Shrubland  2002-10-11
14                                             Shrubland  2003-01-01
15                                             Shrubland  2007-01-21
16                                             Shrubland  2013-05-11
17                                      Cropland_rainfed  1999-05-01
18                                      Cropland_rainfed  2003-05-21
19                                      Cropland_rainfed  2009-09-11
20                             Mosaic_natural_vegetation  2008-04-11
21                             Mosaic_natural_vegetation  2014-08-01
22                                           Urban_areas  2002-04-01
23                                           Urban_areas  2006-04-21
24                                           Urban_areas  2012-08-11
28                                          Water_bodies  2000-05-11
29                                          Water_bodies  2006-09-01
...                                                  ...         ...
11685             Mosaic_tree_and_shrub_herbaceous_cover  2002-01-01
11686             Mosaic_tree_and_shrub_herbaceous_cover  2006-01-21
11687             Mosaic_tree_and_shrub_herbaceous_cover  2012-05-11
11688  Shrub_or_herbaceous_cover_flooded_fresh_saline...  2000-06-11
11689  Shrub_or_herbaceous_cover_flooded_fresh_saline...  2006-10-01
11690  Shrub_or_herbaceous_cover_flooded_fresh_saline...  2010-10-21
11691  Shrub_or_herbaceous_cover_flooded_fresh_saline...  2011-01-11
11696                                    Mosaic_cropland  1999-02-21
11697                                    Mosaic_cropland  2005-06-11
11698                                    Mosaic_cropland  2011-10-01
11699                                    Mosaic_cropland  2015-10-21
11700                                    Mosaic_cropland  2016-01-11
11706                                          Grassland  2004-12-11
11707                                          Grassland  2005-03-01
11708                                          Grassland  2009-03-21
11709                                          Grassland  2015-07-11
11710                    Tree_cover_flooded_saline_water  2000-04-11
11711                    Tree_cover_flooded_saline_water  2006-08-01
11712                    Tree_cover_flooded_saline_water  2010-08-21
11713                    Tree_cover_flooded_saline_water  2016-12-11
11714                    Tree_cover_flooded_saline_water  2017-03-01

fapar      area
0      0.781023  0.781023
1      0.739872  0.739872
2      0.714261  0.714261
3      0.795091  0.795091
4      0.682756  0.682756
5      0.617121  0.617121
6      0.384000  0.384000
7      0.568926  0.568926
8      0.498292  0.498292
9      0.512471  0.512471
10     0.376571  0.376571
11     0.415657  0.415657
12     0.412000  0.412000
13     0.553395  0.553395
14     0.598780  0.598780
15     0.606932  0.606932
16     0.673394  0.673394
17     0.674976  0.674976
18     0.636602  0.636602
19     0.527672  0.527672
20     0.731046  0.731046
21     0.628583  0.628583
22     0.549303  0.549303
23     0.660212  0.660212
24     0.509302  0.509302
25     0.627513  0.627513
26     0.692247  0.692247
27     0.549743  0.549743
28     0.559624  0.559624
29     0.394943  0.394943
...         ...       ...
11685  0.469723  0.469723
11686  0.665650  0.665650
11687  0.653948  0.653948
11688  0.638316  0.638316
11689  0.405949  0.405949
11690  0.503323  0.503323
11691  0.501129  0.501129
11692  0.602000  0.602000
11693  0.296000  0.296000
11694  0.546000  0.546000
11695  0.572000  0.572000
11696  0.630002  0.630002
11697  0.814023  0.814023
11698  0.605181  0.605181
11699  0.597991  0.597991
11700  0.645325  0.645325
11701  0.576503  0.576503
11702  0.611666  0.611666
11703  0.615693  0.615693
11704  0.598025  0.598025
11705  0.624073  0.624073
11706  0.503609  0.503609
11707  0.602599  0.602599
11708  0.483708  0.483708
11709  0.498409  0.498409
11710  0.722473  0.722473
11711  0.675400  0.675400
11712  0.662609  0.662609
11713  0.610634  0.610634
11714  0.619562  0.619562

[11715 rows x 4 columns])]

Now that we are ready to write our Combine task, let's detail the algorithm we are going to write to process each zone:

1. Group the dataset by landcover, and for each:
1. Resample the data monthly (instead of 10-daily) with padding in order to fill some void if there are any
2. Reweight the FAPAR values by the area the landcover takes over the total area
3. Regroup the dataset by date and take the (monthly) sum
4. Resample the data yearly and remove the 1st and last year of data (to only include full years)
In [15]:
def combine_fapar(zone):
series = zone[1]
series["date"] = pd.to_datetime(series["date"])
series = series.set_index("date")

# Group the dataset by landcover, and for each:
by_landcover = series.groupby("landcover")

# Resample the data monthly (instead of 10-daily) with padding in order to fill some void if there are any
series = by_landcover.reset_index()

# Reweight the FAPAR values by the area the landcover takes over the total area
total_area = by_landcover["area"].mean().sum()
series["fapar"] *= series["area"] / total_area
series = series.drop("area", axis=1)

# Regroup the dataset by date and take the (monthly) sum
series = series.groupby("date").sum()

# Resample the data yearly and remove the 1st and last year of data (to only include full years)
yearly_sum = pd.DataFrame(series["fapar"].resample("12m", "mean")[1:-1])

return (zone[0], series, yearly_sum)

combined = by_zone.map(combine_fapar).cache()
combined.take(1)

Out[15]:
[(672,                 fapar
date
1999-01-31   9.717959
1999-02-28   8.868936
1999-03-31  10.370095
1999-04-30  12.420800
1999-05-31  12.032194
1999-06-30   9.968254
1999-07-31   9.284237
1999-08-31   8.657068
1999-09-30   6.982681
1999-10-31   5.440120
1999-11-30   5.537638
1999-12-31   8.929892
2000-01-31  10.512340
2000-02-29  11.302637
2000-03-31  13.014589
2000-04-30  14.539168
2000-05-31  14.625471
2000-06-30  13.036348
2000-07-31  10.837346
2000-08-31   9.071030
2000-09-30   6.932734
2000-10-31   8.162503
2000-11-30   7.231002
2000-12-31  10.725195
2001-01-31  11.366307
2001-02-28  10.276144
2001-03-31  10.517193
2001-04-30  10.990723
2001-05-31  10.960789
2001-06-30  11.189507
...               ...
2014-10-31   8.205650
2014-11-30   8.801627
2014-12-31  11.202606
2015-01-31  10.501179
2015-02-28   9.387215
2015-03-31  10.382771
2015-04-30  11.646911
2015-05-31  11.264060
2015-06-30  12.828613
2015-07-31  11.889360
2015-08-31  10.821384
2015-09-30   8.733411
2015-10-31   7.264628
2015-11-30   6.536714
2015-12-31   7.512584
2016-01-31   7.997100
2016-02-29  10.642866
2016-03-31   9.514068
2016-04-30   9.566114
2016-05-31   8.566643
2016-06-30   7.837038
2016-07-31   8.002204
2016-08-31   6.676644
2016-09-30   5.125340
2016-10-31   5.245660
2016-11-30   6.372412
2016-12-31   9.893308
2017-01-31  10.128927
2017-02-28   9.886582
2017-03-31  10.078414

[219 rows x 1 columns],                 fapar
date
2000-01-31   9.083688
2001-01-31  10.903694
2002-01-31   9.291626
2003-01-31  10.591492
2004-01-31   9.401098
2005-01-31  11.984670
2006-01-31  12.621007
2007-01-31  10.920098
2008-01-31  10.022505
2009-01-31  10.516419
2010-01-31  11.414949
2011-01-31  11.175359
2012-01-31  11.270612
2013-01-31  11.411795
2014-01-31  11.138742
2015-01-31  10.768098
2016-01-31   9.688729
2017-01-31   8.130935)]

Finally, we can compute the trend using a linear regression (OLS).

In [16]:
def min_years(datas):
return len(datas[2]) >=5

def ols(datas):
yearly_series = datas[2]
x, y = yearly_series.index.astype(int) // (10**9 * 60 * 24 * 30), yearly_series["fapar"]
X = np.c_[np.ones(len(x)), x]
return (datas[0], datas[1], datas[2], np.linalg.pinv(X).dot(y))

trends = combined.filter(min_years).map(ols).cache()


Next we will plot the trends. We are interested in the highest and lowest trends in our data.

Let's define our plotting function and plot the trends by descending order.

In [21]:
#copy auxiliary data from hdfs
!hdfs dfs -copyToLocal /tapdata/GAUL2013/zonecodes.csv ./

17/03/22 13:25:52 WARN hdfs.DFSClient: DFSInputStream has been closed already

In [23]:
zonecodes = pd.read_csv("./zonecodes.csv")
def plotTopTrends(trendsCollected):
for zone, series, yearly_series, trend in trendsCollected:
if not np.isnan(trend[0]) and not np.isnan(trend[1]):
codes["Trend"] = trend[1]
print codes

pdf = series.copy()
pdf["ols"] = pdf.index.astype(int) / (10**9 * 60 * 24 * 30) * trend[1] + trend[0]
pdf[["fapar", "ols"]].plot(figsize=(16,4))

#sns.regplot(x="unix_date", y="fapar", data=series, scatter=False)
plt.show()

In [24]:
plotTopTrends(trends.sortBy(lambda x: x[3][1], ascending=False).take(10))

          ADM0_NAME                          ADM1_NAME     Trend
117  American Samoa  Administrative unit not available  0.001467

/opt/rh/python27/root/usr/lib64/python2.7/site-packages/matplotlib/font_manager.py:1297: UserWarning: findfont: Font family [u'sans-serif'] not found. Falling back to DejaVu Sans
(prop.get_family(), self.defaultFamily[fontext]))

    ADM0_NAME                  ADM1_NAME     Trend
793     Egypt  Al Bahr/al Ahmar (redsea)  0.000554

     ADM0_NAME                          ADM1_NAME     Trend
1311  Kiribati  Administrative unit not available  0.000551

       ADM0_NAME ADM1_NAME     Trend
1662  Montserrat  Plymouth  0.000495

     ADM0_NAME ADM1_NAME     Trend
1827  Pakistan    Punjab  0.000426

     ADM0_NAME  ADM1_NAME     Trend
1101     India  Rajasthan  0.000421

    ADM0_NAME                 ADM1_NAME     Trend
805     Egypt  As Ismailiyah (ismailia)  0.000421

    ADM0_NAME ADM1_NAME     Trend
104   Algeria     Saida  0.000418

    ADM0_NAME ADM1_NAME     Trend
112   Algeria    Tiaret  0.000405

     ADM0_NAME ADM1_NAME   Trend
2561    Turkey   Denizli  0.0004

In [25]:
plotTopTrends(trends.sortBy(lambda x: x[3][1], ascending=True).take(10))

            ADM0_NAME                          ADM1_NAME     Trend
574  Christmas Island  Administrative unit not available -0.001041

            ADM0_NAME                          ADM1_NAME     Trend
904  French Polynesia  Administrative unit not available -0.000726

       ADM0_NAME   ADM1_NAME     Trend
3314  Martinique  La Trinite -0.000519

    ADM0_NAME ADM1_NAME     Trend
731  Djibouti     Obock -0.000301

     ADM0_NAME ADM1_NAME     Trend
3102    Guinea   Conakry -0.000285

    ADM0_NAME     ADM1_NAME     Trend

       ADM0_NAME  ADM1_NAME     Trend

       ADM0_NAME ADM1_NAME     Trend