Timeseries Trend Analysis

This notebook runs a simple trend analysis on the timeseries shown in the time series viewer: https://proba-v-mep.esa.int/applications/time-series-viewer/app/app.html It demonstrates how Spark can be used on the MEP Hadoop cluster, from within a notebook.

Disclaimer

This notebook is not scientifically validated, only meant as a technical demonstration on a real world use case.

Context

So some context first about the data we are going to manipulate.

  • The dataset contains FAPAR data (Fraction of Absorbed Photosynthetically Active Radiation)
  • The data was computed using raster data taken by a satelite
  • Normally, there should be one measurement every 10 days for every zone and landcover.
  • Due to external conditions (e.g. clouds), data can be missing.

Goal

Our goal is pretty simple: we want to detect if there are any trends for each zone.

General approach

We will proceed as follows:

  1. Clean missing measurements from the original dataset
  2. Group measurements by zone and landcover
  3. For every zone:
    1. Do some cleaning for every landcover
    2. Combine the different landcovers data
    3. Remove seasonality by taking yearly averages
    4. Compute the trend
  4. Plot the trends on a world map.

We will detail each step along the way.

In [1]:
import pyspark
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import datetime, time

%matplotlib inline

Let's read the parquet file containing all our data, as follows:

In [2]:
from pyspark.conf import SparkConf
conf = SparkConf()
conf.set('spark.yarn.executor.memoryOverhead', 1024)
conf.set('spark.executor.memory', '4g')
Out[2]:
<pyspark.conf.SparkConf at 0x30b96d0>

The SparkContext 'sc' is our entry point to the Hadoop cluster. The sparkConf defines some parameters.

In [3]:
sc = pyspark.SparkContext(conf=conf)
In [4]:
sqlCtx = pyspark.SQLContext(sc)
In [5]:
!hdfs dfs -ls /tapdata/tsviewer/combined-stats/stats.parquet
Found 169 items
-rw-r--r--   3 mep_tsviewer vito          0 2017-03-22 00:20 /tapdata/tsviewer/combined-stats/stats.parquet/_SUCCESS
-rw-r--r--   3 mep_tsviewer vito   25357414 2017-03-22 00:07 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00000.parquet
-rw-r--r--   3 mep_tsviewer vito   25341813 2017-03-22 00:07 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00001.parquet
-rw-r--r--   3 mep_tsviewer vito   25344785 2017-03-22 00:07 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00002.parquet
-rw-r--r--   3 mep_tsviewer vito   25325074 2017-03-22 00:07 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00003.parquet
-rw-r--r--   3 mep_tsviewer vito   25382644 2017-03-22 00:07 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00004.parquet
-rw-r--r--   3 mep_tsviewer vito   25368452 2017-03-22 00:07 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00005.parquet
-rw-r--r--   3 mep_tsviewer vito   25338382 2017-03-22 00:07 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00006.parquet
-rw-r--r--   3 mep_tsviewer vito   25364026 2017-03-22 00:07 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00007.parquet
-rw-r--r--   3 mep_tsviewer vito   25343301 2017-03-22 00:07 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00008.parquet
-rw-r--r--   3 mep_tsviewer vito   25337037 2017-03-22 00:07 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00009.parquet
-rw-r--r--   3 mep_tsviewer vito   25361096 2017-03-22 00:08 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00010.parquet
-rw-r--r--   3 mep_tsviewer vito   25355121 2017-03-22 00:08 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00011.parquet
-rw-r--r--   3 mep_tsviewer vito   25374168 2017-03-22 00:08 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00012.parquet
-rw-r--r--   3 mep_tsviewer vito   25373881 2017-03-22 00:08 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00013.parquet
-rw-r--r--   3 mep_tsviewer vito   25343740 2017-03-22 00:08 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00014.parquet
-rw-r--r--   3 mep_tsviewer vito   25375775 2017-03-22 00:08 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00015.parquet
-rw-r--r--   3 mep_tsviewer vito   25376607 2017-03-22 00:08 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00016.parquet
-rw-r--r--   3 mep_tsviewer vito   25332092 2017-03-22 00:08 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00017.parquet
-rw-r--r--   3 mep_tsviewer vito   25363406 2017-03-22 00:08 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00018.parquet
-rw-r--r--   3 mep_tsviewer vito   25364663 2017-03-22 00:08 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00019.parquet
-rw-r--r--   3 mep_tsviewer vito   25345982 2017-03-22 00:08 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00020.parquet
-rw-r--r--   3 mep_tsviewer vito   25364568 2017-03-22 00:08 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00021.parquet
-rw-r--r--   3 mep_tsviewer vito   25345751 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00022.parquet
-rw-r--r--   3 mep_tsviewer vito   25359454 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00023.parquet
-rw-r--r--   3 mep_tsviewer vito   25362565 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00024.parquet
-rw-r--r--   3 mep_tsviewer vito   25342732 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00025.parquet
-rw-r--r--   3 mep_tsviewer vito   25371436 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00026.parquet
-rw-r--r--   3 mep_tsviewer vito   25364058 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00027.parquet
-rw-r--r--   3 mep_tsviewer vito   25314740 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00028.parquet
-rw-r--r--   3 mep_tsviewer vito   25362923 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00029.parquet
-rw-r--r--   3 mep_tsviewer vito   25376056 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00030.parquet
-rw-r--r--   3 mep_tsviewer vito   25339711 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00031.parquet
-rw-r--r--   3 mep_tsviewer vito   25397722 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00032.parquet
-rw-r--r--   3 mep_tsviewer vito   25379909 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00033.parquet
-rw-r--r--   3 mep_tsviewer vito   25335473 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00034.parquet
-rw-r--r--   3 mep_tsviewer vito   25364742 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00035.parquet
-rw-r--r--   3 mep_tsviewer vito   25383452 2017-03-22 00:09 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00036.parquet
-rw-r--r--   3 mep_tsviewer vito   25353719 2017-03-22 00:10 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00037.parquet
-rw-r--r--   3 mep_tsviewer vito   25371553 2017-03-22 00:10 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00038.parquet
-rw-r--r--   3 mep_tsviewer vito   25364836 2017-03-22 00:10 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00039.parquet
-rw-r--r--   3 mep_tsviewer vito   25354000 2017-03-22 00:10 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00040.parquet
-rw-r--r--   3 mep_tsviewer vito   25372447 2017-03-22 00:10 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00041.parquet
-rw-r--r--   3 mep_tsviewer vito   25352126 2017-03-22 00:10 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00042.parquet
-rw-r--r--   3 mep_tsviewer vito   25354362 2017-03-22 00:10 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00043.parquet
-rw-r--r--   3 mep_tsviewer vito   25362814 2017-03-22 00:10 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00044.parquet
-rw-r--r--   3 mep_tsviewer vito   25347834 2017-03-22 00:10 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00045.parquet
-rw-r--r--   3 mep_tsviewer vito   25337259 2017-03-22 00:10 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00046.parquet
-rw-r--r--   3 mep_tsviewer vito   25349579 2017-03-22 00:10 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00047.parquet
-rw-r--r--   3 mep_tsviewer vito   25338758 2017-03-22 00:11 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00048.parquet
-rw-r--r--   3 mep_tsviewer vito   25359736 2017-03-22 00:11 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00049.parquet
-rw-r--r--   3 mep_tsviewer vito   25340555 2017-03-22 00:11 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00050.parquet
-rw-r--r--   3 mep_tsviewer vito   25359822 2017-03-22 00:11 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00051.parquet
-rw-r--r--   3 mep_tsviewer vito   25377064 2017-03-22 00:11 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00052.parquet
-rw-r--r--   3 mep_tsviewer vito   25361970 2017-03-22 00:11 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00053.parquet
-rw-r--r--   3 mep_tsviewer vito   25338795 2017-03-22 00:11 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00054.parquet
-rw-r--r--   3 mep_tsviewer vito   25345861 2017-03-22 00:11 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00055.parquet
-rw-r--r--   3 mep_tsviewer vito   25361802 2017-03-22 00:11 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00056.parquet
-rw-r--r--   3 mep_tsviewer vito   25356596 2017-03-22 00:11 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00057.parquet
-rw-r--r--   3 mep_tsviewer vito   25364383 2017-03-22 00:11 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00058.parquet
-rw-r--r--   3 mep_tsviewer vito   25383945 2017-03-22 00:11 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00059.parquet
-rw-r--r--   3 mep_tsviewer vito   25357862 2017-03-22 00:11 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00060.parquet
-rw-r--r--   3 mep_tsviewer vito   25344097 2017-03-22 00:12 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00061.parquet
-rw-r--r--   3 mep_tsviewer vito   25394550 2017-03-22 00:12 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00062.parquet
-rw-r--r--   3 mep_tsviewer vito   25361621 2017-03-22 00:12 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00063.parquet
-rw-r--r--   3 mep_tsviewer vito   25328248 2017-03-22 00:12 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00064.parquet
-rw-r--r--   3 mep_tsviewer vito   25360018 2017-03-22 00:12 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00065.parquet
-rw-r--r--   3 mep_tsviewer vito   25353412 2017-03-22 00:12 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00066.parquet
-rw-r--r--   3 mep_tsviewer vito   25353366 2017-03-22 00:12 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00067.parquet
-rw-r--r--   3 mep_tsviewer vito   25366552 2017-03-22 00:12 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00068.parquet
-rw-r--r--   3 mep_tsviewer vito   25336160 2017-03-22 00:12 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00069.parquet
-rw-r--r--   3 mep_tsviewer vito   25352581 2017-03-22 00:12 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00070.parquet
-rw-r--r--   3 mep_tsviewer vito   25356921 2017-03-22 00:12 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00071.parquet
-rw-r--r--   3 mep_tsviewer vito   25337459 2017-03-22 00:13 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00072.parquet
-rw-r--r--   3 mep_tsviewer vito   25352658 2017-03-22 00:12 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00073.parquet
-rw-r--r--   3 mep_tsviewer vito   25362806 2017-03-22 00:12 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00074.parquet
-rw-r--r--   3 mep_tsviewer vito   25332933 2017-03-22 00:12 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00075.parquet
-rw-r--r--   3 mep_tsviewer vito   25337050 2017-03-22 00:13 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00076.parquet
-rw-r--r--   3 mep_tsviewer vito   25351615 2017-03-22 00:13 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00077.parquet
-rw-r--r--   3 mep_tsviewer vito   25372093 2017-03-22 00:13 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00078.parquet
-rw-r--r--   3 mep_tsviewer vito   25400785 2017-03-22 00:13 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00079.parquet
-rw-r--r--   3 mep_tsviewer vito   25355679 2017-03-22 00:13 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00080.parquet
-rw-r--r--   3 mep_tsviewer vito   25363894 2017-03-22 00:13 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00081.parquet
-rw-r--r--   3 mep_tsviewer vito   25374263 2017-03-22 00:13 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00082.parquet
-rw-r--r--   3 mep_tsviewer vito   25349140 2017-03-22 00:13 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00083.parquet
-rw-r--r--   3 mep_tsviewer vito   25386733 2017-03-22 00:13 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00084.parquet
-rw-r--r--   3 mep_tsviewer vito   25380380 2017-03-22 00:13 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00085.parquet
-rw-r--r--   3 mep_tsviewer vito   25349083 2017-03-22 00:13 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00086.parquet
-rw-r--r--   3 mep_tsviewer vito   25399829 2017-03-22 00:13 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00087.parquet
-rw-r--r--   3 mep_tsviewer vito   25378706 2017-03-22 00:14 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00088.parquet
-rw-r--r--   3 mep_tsviewer vito   25345343 2017-03-22 00:13 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00089.parquet
-rw-r--r--   3 mep_tsviewer vito   25358393 2017-03-22 00:14 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00090.parquet
-rw-r--r--   3 mep_tsviewer vito   25346919 2017-03-22 00:14 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00091.parquet
-rw-r--r--   3 mep_tsviewer vito   25332395 2017-03-22 00:14 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00092.parquet
-rw-r--r--   3 mep_tsviewer vito   25355408 2017-03-22 00:14 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00093.parquet
-rw-r--r--   3 mep_tsviewer vito   25350583 2017-03-22 00:14 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00094.parquet
-rw-r--r--   3 mep_tsviewer vito   25337139 2017-03-22 00:14 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00095.parquet
-rw-r--r--   3 mep_tsviewer vito   25349278 2017-03-22 00:14 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00096.parquet
-rw-r--r--   3 mep_tsviewer vito   25350053 2017-03-22 00:14 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00097.parquet
-rw-r--r--   3 mep_tsviewer vito   25361480 2017-03-22 00:14 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00098.parquet
-rw-r--r--   3 mep_tsviewer vito   25375630 2017-03-22 00:15 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00099.parquet
-rw-r--r--   3 mep_tsviewer vito   25375661 2017-03-22 00:14 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00100.parquet
-rw-r--r--   3 mep_tsviewer vito   25345687 2017-03-22 00:14 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00101.parquet
-rw-r--r--   3 mep_tsviewer vito   25342268 2017-03-22 00:15 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00102.parquet
-rw-r--r--   3 mep_tsviewer vito   25334411 2017-03-22 00:15 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00103.parquet
-rw-r--r--   3 mep_tsviewer vito   25357199 2017-03-22 00:15 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00104.parquet
-rw-r--r--   3 mep_tsviewer vito   25375453 2017-03-22 00:15 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00105.parquet
-rw-r--r--   3 mep_tsviewer vito   25365719 2017-03-22 00:15 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00106.parquet
-rw-r--r--   3 mep_tsviewer vito   25381858 2017-03-22 00:15 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00107.parquet
-rw-r--r--   3 mep_tsviewer vito   25353112 2017-03-22 00:15 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00108.parquet
-rw-r--r--   3 mep_tsviewer vito   25354219 2017-03-22 00:15 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00109.parquet
-rw-r--r--   3 mep_tsviewer vito   25376862 2017-03-22 00:15 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00110.parquet
-rw-r--r--   3 mep_tsviewer vito   25370664 2017-03-22 00:15 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00111.parquet
-rw-r--r--   3 mep_tsviewer vito   25357291 2017-03-22 00:15 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00112.parquet
-rw-r--r--   3 mep_tsviewer vito   25363815 2017-03-22 00:16 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00113.parquet
-rw-r--r--   3 mep_tsviewer vito   25379000 2017-03-22 00:15 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00114.parquet
-rw-r--r--   3 mep_tsviewer vito   25346606 2017-03-22 00:15 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00115.parquet
-rw-r--r--   3 mep_tsviewer vito   25338025 2017-03-22 00:16 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00116.parquet
-rw-r--r--   3 mep_tsviewer vito   25353108 2017-03-22 00:16 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00117.parquet
-rw-r--r--   3 mep_tsviewer vito   25351054 2017-03-22 00:16 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00118.parquet
-rw-r--r--   3 mep_tsviewer vito   25331492 2017-03-22 00:16 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00119.parquet
-rw-r--r--   3 mep_tsviewer vito   25352670 2017-03-22 00:16 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00120.parquet
-rw-r--r--   3 mep_tsviewer vito   25341291 2017-03-22 00:16 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00121.parquet
-rw-r--r--   3 mep_tsviewer vito   25325189 2017-03-22 00:16 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00122.parquet
-rw-r--r--   3 mep_tsviewer vito   25341454 2017-03-22 00:16 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00123.parquet
-rw-r--r--   3 mep_tsviewer vito   25334407 2017-03-22 00:16 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00124.parquet
-rw-r--r--   3 mep_tsviewer vito   25371710 2017-03-22 00:16 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00125.parquet
-rw-r--r--   3 mep_tsviewer vito   25400455 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00126.parquet
-rw-r--r--   3 mep_tsviewer vito   25334676 2017-03-22 00:16 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00127.parquet
-rw-r--r--   3 mep_tsviewer vito   25362616 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00128.parquet
-rw-r--r--   3 mep_tsviewer vito   25363886 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00129.parquet
-rw-r--r--   3 mep_tsviewer vito   25336678 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00130.parquet
-rw-r--r--   3 mep_tsviewer vito   25352216 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00131.parquet
-rw-r--r--   3 mep_tsviewer vito   25371067 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00132.parquet
-rw-r--r--   3 mep_tsviewer vito   25373034 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00133.parquet
-rw-r--r--   3 mep_tsviewer vito   25366880 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00134.parquet
-rw-r--r--   3 mep_tsviewer vito   25341113 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00135.parquet
-rw-r--r--   3 mep_tsviewer vito   25390356 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00136.parquet
-rw-r--r--   3 mep_tsviewer vito   25375039 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00137.parquet
-rw-r--r--   3 mep_tsviewer vito   25334315 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00138.parquet
-rw-r--r--   3 mep_tsviewer vito   25370581 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00139.parquet
-rw-r--r--   3 mep_tsviewer vito   25370598 2017-03-22 00:18 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00140.parquet
-rw-r--r--   3 mep_tsviewer vito   25318438 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00141.parquet
-rw-r--r--   3 mep_tsviewer vito   25367727 2017-03-22 00:17 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00142.parquet
-rw-r--r--   3 mep_tsviewer vito   25361599 2017-03-22 00:18 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00143.parquet
-rw-r--r--   3 mep_tsviewer vito   25327646 2017-03-22 00:18 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00144.parquet
-rw-r--r--   3 mep_tsviewer vito   25366143 2017-03-22 00:18 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00145.parquet
-rw-r--r--   3 mep_tsviewer vito   25373734 2017-03-22 00:18 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00146.parquet
-rw-r--r--   3 mep_tsviewer vito   25350089 2017-03-22 00:18 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00147.parquet
-rw-r--r--   3 mep_tsviewer vito   25347975 2017-03-22 00:18 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00148.parquet
-rw-r--r--   3 mep_tsviewer vito   25336055 2017-03-22 00:18 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00149.parquet
-rw-r--r--   3 mep_tsviewer vito   25321908 2017-03-22 00:18 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00150.parquet
-rw-r--r--   3 mep_tsviewer vito   25364109 2017-03-22 00:18 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00151.parquet
-rw-r--r--   3 mep_tsviewer vito   25366125 2017-03-22 00:19 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00152.parquet
-rw-r--r--   3 mep_tsviewer vito   25376436 2017-03-22 00:18 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00153.parquet
-rw-r--r--   3 mep_tsviewer vito   25388034 2017-03-22 00:18 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00154.parquet
-rw-r--r--   3 mep_tsviewer vito   25346212 2017-03-22 00:19 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00155.parquet
-rw-r--r--   3 mep_tsviewer vito   25346212 2017-03-22 00:18 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00156.parquet
-rw-r--r--   3 mep_tsviewer vito   25364542 2017-03-22 00:19 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00157.parquet
-rw-r--r--   3 mep_tsviewer vito   25361136 2017-03-22 00:19 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00158.parquet
-rw-r--r--   3 mep_tsviewer vito   25357436 2017-03-22 00:19 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00159.parquet
-rw-r--r--   3 mep_tsviewer vito   25357944 2017-03-22 00:19 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00160.parquet
-rw-r--r--   3 mep_tsviewer vito   25367168 2017-03-22 00:19 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00161.parquet
-rw-r--r--   3 mep_tsviewer vito   25368636 2017-03-22 00:19 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00162.parquet
-rw-r--r--   3 mep_tsviewer vito   25342537 2017-03-22 00:19 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00163.parquet
-rw-r--r--   3 mep_tsviewer vito   25343544 2017-03-22 00:19 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00164.parquet
-rw-r--r--   3 mep_tsviewer vito   25355247 2017-03-22 00:19 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00165.parquet
-rw-r--r--   3 mep_tsviewer vito   25353852 2017-03-22 00:20 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00166.parquet
-rw-r--r--   3 mep_tsviewer vito   25326912 2017-03-22 00:20 /tapdata/tsviewer/combined-stats/stats.parquet/part-r-00167.parquet
In [6]:
data = sqlCtx.read.parquet("hdfs:/tapdata/tsviewer/combined-stats/stats.parquet")
data
Out[6]:
DataFrame[landcover: string, zone: int, date: string, BIOPAR_ALB_BHV_V1: double, BIOPAR_ALB_BHV_V1_area: bigint, BIOPAR_ALB_BHV_V1_validArea: bigint, BIOPAR_ALB_DHV_V1: double, BIOPAR_ALB_DHV_V1_area: bigint, BIOPAR_ALB_DHV_V1_validArea: bigint, BIOPAR_BA_V1: double, BIOPAR_BA_V1_area: bigint, BIOPAR_BA_V1_validArea: bigint, BIOPAR_DMP: double, BIOPAR_DMP_area: bigint, BIOPAR_DMP_validArea: bigint, BIOPAR_FAPAR_V1: double, BIOPAR_FAPAR_V1_area: bigint, BIOPAR_FAPAR_V1_validArea: bigint, BIOPAR_FCOVER_V1: double, BIOPAR_FCOVER_V1_area: bigint, BIOPAR_FCOVER_V1_validArea: bigint, BIOPAR_LAI_V1: double, BIOPAR_LAI_V1_area: bigint, BIOPAR_LAI_V1_validArea: bigint, BIOPAR_NDVI_V1: double, BIOPAR_NDVI_V1_area: bigint, BIOPAR_NDVI_V1_validArea: bigint, BIOPAR_NDVI_V2: double, BIOPAR_NDVI_V2_area: bigint, BIOPAR_NDVI_V2_validArea: bigint, BIOPAR_SWI10_V3: double, BIOPAR_SWI10_V3_area: bigint, BIOPAR_SWI10_V3_validArea: bigint, BIOPAR_TOCR_B0: double, BIOPAR_TOCR_B0_area: bigint, BIOPAR_TOCR_B0_validArea: bigint, BIOPAR_TOCR_B2: double, BIOPAR_TOCR_B2_area: bigint, BIOPAR_TOCR_B2_validArea: bigint, BIOPAR_TOCR_B3: double, BIOPAR_TOCR_B3_area: bigint, BIOPAR_TOCR_B3_validArea: bigint, BIOPAR_TOCR_MIR: double, BIOPAR_TOCR_MIR_area: bigint, BIOPAR_TOCR_MIR_validArea: bigint, BIOPAR_VCI: double, BIOPAR_VCI_area: bigint, BIOPAR_VCI_validArea: bigint, BIOPAR_VPI: double, BIOPAR_VPI_area: bigint, BIOPAR_VPI_validArea: bigint, BIOPAR_WB_V1: double, BIOPAR_WB_V1_area: bigint, BIOPAR_WB_V1_validArea: bigint, BIOPAR_WB_V2: double, BIOPAR_WB_V2_area: bigint, BIOPAR_WB_V2_validArea: bigint, PROBAV_L3_S10_TOC_RED_333M: double, PROBAV_L3_S10_TOC_RED_333M_area: bigint, PROBAV_L3_S10_TOC_RED_333M_validArea: bigint, PROBAV_L3_S10_TOC_NIR_333M: double, PROBAV_L3_S10_TOC_NIR_333M_area: bigint, PROBAV_L3_S10_TOC_NIR_333M_validArea: bigint, PROBAV_L3_S10_TOC_BLUE_333M: double, PROBAV_L3_S10_TOC_BLUE_333M_area: bigint, PROBAV_L3_S10_TOC_BLUE_333M_validArea: bigint, PROBAV_L3_S10_TOC_SWIR_333M: double, PROBAV_L3_S10_TOC_SWIR_333M_area: bigint, PROBAV_L3_S10_TOC_SWIR_333M_validArea: bigint, PROBAV_L3_S10_TOC_NDVI_333M: double, PROBAV_L3_S10_TOC_NDVI_333M_area: bigint, PROBAV_L3_S10_TOC_NDVI_333M_validArea: bigint, RAINFALL: double, RAINFALL_area: bigint, RAINFALL_validArea: bigint]

Next, we want to register this data in a temporary table so we can query it using Spark's SQLContext.

In [7]:
data.registerTempTable("data")

Now although we instructed Spark where the data was and to register it, it did not actually read anything. This is because Spark is lazy: it will postpone doing any task until it the last moment. Requesting data will force it to load it, as follows:

In [8]:
data.take(1)
Out[8]:
[Row(landcover=u'Water_bodies', zone=3219, date=u'2007-11-01', BIOPAR_ALB_BHV_V1=0.14052462320017708, BIOPAR_ALB_BHV_V1_area=11163, BIOPAR_ALB_BHV_V1_validArea=11076, BIOPAR_ALB_DHV_V1=0.14695226555861848, BIOPAR_ALB_DHV_V1_area=11163, BIOPAR_ALB_DHV_V1_validArea=11076, BIOPAR_BA_V1=0.0009853981904505958, BIOPAR_BA_V1_area=11163, BIOPAR_BA_V1_validArea=11, BIOPAR_DMP=None, BIOPAR_DMP_area=None, BIOPAR_DMP_validArea=None, BIOPAR_FAPAR_V1=0.21222769551344417, BIOPAR_FAPAR_V1_area=11163, BIOPAR_FAPAR_V1_validArea=10995, BIOPAR_FCOVER_V1=0.11557305349856734, BIOPAR_FCOVER_V1_area=11163, BIOPAR_FCOVER_V1_validArea=10995, BIOPAR_LAI_V1=0.28491411937776656, BIOPAR_LAI_V1_area=11163, BIOPAR_LAI_V1_validArea=10995, BIOPAR_NDVI_V1=0.27314899691044103, BIOPAR_NDVI_V1_area=11163, BIOPAR_NDVI_V1_validArea=10995, BIOPAR_NDVI_V2=None, BIOPAR_NDVI_V2_area=None, BIOPAR_NDVI_V2_validArea=None, BIOPAR_SWI10_V3=None, BIOPAR_SWI10_V3_area=None, BIOPAR_SWI10_V3_validArea=None, BIOPAR_TOCR_B0=None, BIOPAR_TOCR_B0_area=None, BIOPAR_TOCR_B0_validArea=None, BIOPAR_TOCR_B2=None, BIOPAR_TOCR_B2_area=None, BIOPAR_TOCR_B2_validArea=None, BIOPAR_TOCR_B3=None, BIOPAR_TOCR_B3_area=None, BIOPAR_TOCR_B3_validArea=None, BIOPAR_TOCR_MIR=None, BIOPAR_TOCR_MIR_area=None, BIOPAR_TOCR_MIR_validArea=None, BIOPAR_VCI=None, BIOPAR_VCI_area=None, BIOPAR_VCI_validArea=None, BIOPAR_VPI=None, BIOPAR_VPI_area=None, BIOPAR_VPI_validArea=None, BIOPAR_WB_V1=None, BIOPAR_WB_V1_area=None, BIOPAR_WB_V1_validArea=None, BIOPAR_WB_V2=None, BIOPAR_WB_V2_area=None, BIOPAR_WB_V2_validArea=None, PROBAV_L3_S10_TOC_RED_333M=None, PROBAV_L3_S10_TOC_RED_333M_area=None, PROBAV_L3_S10_TOC_RED_333M_validArea=None, PROBAV_L3_S10_TOC_NIR_333M=None, PROBAV_L3_S10_TOC_NIR_333M_area=None, PROBAV_L3_S10_TOC_NIR_333M_validArea=None, PROBAV_L3_S10_TOC_BLUE_333M=None, PROBAV_L3_S10_TOC_BLUE_333M_area=None, PROBAV_L3_S10_TOC_BLUE_333M_validArea=None, PROBAV_L3_S10_TOC_SWIR_333M=None, PROBAV_L3_S10_TOC_SWIR_333M_area=None, PROBAV_L3_S10_TOC_SWIR_333M_validArea=None, PROBAV_L3_S10_TOC_NDVI_333M=None, PROBAV_L3_S10_TOC_NDVI_333M_area=None, PROBAV_L3_S10_TOC_NDVI_333M_validArea=None, RAINFALL=2.1919554834854207, RAINFALL_area=1406, RAINFALL_validArea=1406)]

Using Spark's SQLContext, we can query our data with SQL-like expressions. Here we are interested in the zone, landcover, date, FAPAR and area columns.

We will also filter out unwanted rows by specifying that we do not want rows for which we do not have a FAPAR value.

In [11]:
data = sqlCtx.sql("SELECT zone, landcover, date date, BIOPAR_FAPAR_V1 fapar, BIOPAR_FAPAR_V1 " + 
                  "FROM data " +
                  "WHERE BIOPAR_FAPAR_V1 is not null").cache()
data.take(1)
Out[11]:
[Row(zone=3219, landcover=u'Water_bodies', date=u'2006-08-11', fapar=0.29711740277250515, BIOPAR_FAPAR_V1=0.29711740277250515)]

As you can see, although null values have been filtered out, some FAPAR values still remain nan. We can filter them out using the filter method from Spark. For that, we can either use the DataFrame's method filter, or we can use the DataFrame as a regular RDD.

The difference is that the DataFrame contains Rows, for which we can refer to the columns with their names, whereas the RDD contains tuples, where we have to select columns based on their index. However, with a regular RDD, we can pass any method we want to filter, map, reduce, ..., whereas with a DataFrame, we have to use pre-defined functions.

In [12]:
def fapar_not_nan(row):
    return row[3] is not None and not np.isnan(row[3])

withoutNaNs = data.rdd.filter(fapar_not_nan).cache()
withoutNaNs.take(1)
Out[12]:
[Row(zone=3219, landcover=u'Water_bodies', date=u'2006-06-21', fapar=0.24140815792410703, BIOPAR_FAPAR_V1=0.24140815792410703)]

For grouping, we could use spark's groupBy method, or we could first map our data to a (Key, Value) pair and use the groupByKey method. We will use the latter method as we have more fined control over what goes into Value.

In [13]:
def by_zone(row):
    return ((row[0]), (row[1], row[2], row[3], row[4]))

by_zone = withoutNaNs.map(by_zone).groupByKey().cache()
by_zone.take(1)
Out[13]:
[(672, <pyspark.resultiterable.ResultIterable at 0x601efd0>)]

The result is a list of pairs (Zone, Values) where Values is a list containing all the values for the specific Zone we grouped by.

Spark will combine the different values into a ResultIterable object, not really suited for the analysis we will be doing. Instead, we are going to use Pandas and Numpy. So first thing we want to do is to convert our ResultIterable into an actual Pandas DataFrame.

In [14]:
def iterable_to_pd(row):
    return (row[0], pd.DataFrame(list(row[1]), columns=["landcover", "date", "fapar", "area"]))

by_zone = by_zone.map(iterable_to_pd).cache()
by_zone.take(1)
Out[14]:
[(672,                                                landcover        date  \
  0        Tree_cover_broadleaved_evergreen_closed_to_open  2000-07-11   
  1        Tree_cover_broadleaved_evergreen_closed_to_open  2006-11-01   
  2        Tree_cover_broadleaved_evergreen_closed_to_open  2010-11-21   
  3        Tree_cover_broadleaved_evergreen_closed_to_open  2011-02-11   
  4                Mosaic_herbaceous_cover__tree_and_shrub  2000-06-01   
  5                Mosaic_herbaceous_cover__tree_and_shrub  2004-06-21   
  6                Mosaic_herbaceous_cover__tree_and_shrub  2010-10-11   
  7                Mosaic_herbaceous_cover__tree_and_shrub  2011-01-01   
  8                Mosaic_herbaceous_cover__tree_and_shrub  2015-01-21   
  9          Sparse_vegetation_tree_shrub_herbaceous_cover  2000-04-21   
  10         Sparse_vegetation_tree_shrub_herbaceous_cover  2006-08-11   
  11         Sparse_vegetation_tree_shrub_herbaceous_cover  2012-12-01   
  12         Sparse_vegetation_tree_shrub_herbaceous_cover  2016-12-21   
  13                                             Shrubland  2002-10-11   
  14                                             Shrubland  2003-01-01   
  15                                             Shrubland  2007-01-21   
  16                                             Shrubland  2013-05-11   
  17                                      Cropland_rainfed  1999-05-01   
  18                                      Cropland_rainfed  2003-05-21   
  19                                      Cropland_rainfed  2009-09-11   
  20                             Mosaic_natural_vegetation  2008-04-11   
  21                             Mosaic_natural_vegetation  2014-08-01   
  22                                           Urban_areas  2002-04-01   
  23                                           Urban_areas  2006-04-21   
  24                                           Urban_areas  2012-08-11   
  25       Tree_cover_broadleaved_deciduous_closed_to_open  2001-02-21   
  26       Tree_cover_broadleaved_deciduous_closed_to_open  2007-06-11   
  27       Tree_cover_broadleaved_deciduous_closed_to_open  2013-10-01   
  28                                          Water_bodies  2000-05-11   
  29                                          Water_bodies  2006-09-01   
  ...                                                  ...         ...   
  11685             Mosaic_tree_and_shrub_herbaceous_cover  2002-01-01   
  11686             Mosaic_tree_and_shrub_herbaceous_cover  2006-01-21   
  11687             Mosaic_tree_and_shrub_herbaceous_cover  2012-05-11   
  11688  Shrub_or_herbaceous_cover_flooded_fresh_saline...  2000-06-11   
  11689  Shrub_or_herbaceous_cover_flooded_fresh_saline...  2006-10-01   
  11690  Shrub_or_herbaceous_cover_flooded_fresh_saline...  2010-10-21   
  11691  Shrub_or_herbaceous_cover_flooded_fresh_saline...  2011-01-11   
  11692              Tree_cover_broadleaved_deciduous_open  2002-06-11   
  11693              Tree_cover_broadleaved_deciduous_open  2008-10-01   
  11694              Tree_cover_broadleaved_deciduous_open  2012-10-21   
  11695              Tree_cover_broadleaved_deciduous_open  2013-01-11   
  11696                                    Mosaic_cropland  1999-02-21   
  11697                                    Mosaic_cropland  2005-06-11   
  11698                                    Mosaic_cropland  2011-10-01   
  11699                                    Mosaic_cropland  2015-10-21   
  11700                                    Mosaic_cropland  2016-01-11   
  11701            Tree_cover_broadleaved_deciduous_closed  2000-08-01   
  11702            Tree_cover_broadleaved_deciduous_closed  2004-08-21   
  11703            Tree_cover_broadleaved_deciduous_closed  2010-12-11   
  11704            Tree_cover_broadleaved_deciduous_closed  2011-03-01   
  11705            Tree_cover_broadleaved_deciduous_closed  2015-03-21   
  11706                                          Grassland  2004-12-11   
  11707                                          Grassland  2005-03-01   
  11708                                          Grassland  2009-03-21   
  11709                                          Grassland  2015-07-11   
  11710                    Tree_cover_flooded_saline_water  2000-04-11   
  11711                    Tree_cover_flooded_saline_water  2006-08-01   
  11712                    Tree_cover_flooded_saline_water  2010-08-21   
  11713                    Tree_cover_flooded_saline_water  2016-12-11   
  11714                    Tree_cover_flooded_saline_water  2017-03-01   
  
            fapar      area  
  0      0.781023  0.781023  
  1      0.739872  0.739872  
  2      0.714261  0.714261  
  3      0.795091  0.795091  
  4      0.682756  0.682756  
  5      0.617121  0.617121  
  6      0.384000  0.384000  
  7      0.568926  0.568926  
  8      0.498292  0.498292  
  9      0.512471  0.512471  
  10     0.376571  0.376571  
  11     0.415657  0.415657  
  12     0.412000  0.412000  
  13     0.553395  0.553395  
  14     0.598780  0.598780  
  15     0.606932  0.606932  
  16     0.673394  0.673394  
  17     0.674976  0.674976  
  18     0.636602  0.636602  
  19     0.527672  0.527672  
  20     0.731046  0.731046  
  21     0.628583  0.628583  
  22     0.549303  0.549303  
  23     0.660212  0.660212  
  24     0.509302  0.509302  
  25     0.627513  0.627513  
  26     0.692247  0.692247  
  27     0.549743  0.549743  
  28     0.559624  0.559624  
  29     0.394943  0.394943  
  ...         ...       ...  
  11685  0.469723  0.469723  
  11686  0.665650  0.665650  
  11687  0.653948  0.653948  
  11688  0.638316  0.638316  
  11689  0.405949  0.405949  
  11690  0.503323  0.503323  
  11691  0.501129  0.501129  
  11692  0.602000  0.602000  
  11693  0.296000  0.296000  
  11694  0.546000  0.546000  
  11695  0.572000  0.572000  
  11696  0.630002  0.630002  
  11697  0.814023  0.814023  
  11698  0.605181  0.605181  
  11699  0.597991  0.597991  
  11700  0.645325  0.645325  
  11701  0.576503  0.576503  
  11702  0.611666  0.611666  
  11703  0.615693  0.615693  
  11704  0.598025  0.598025  
  11705  0.624073  0.624073  
  11706  0.503609  0.503609  
  11707  0.602599  0.602599  
  11708  0.483708  0.483708  
  11709  0.498409  0.498409  
  11710  0.722473  0.722473  
  11711  0.675400  0.675400  
  11712  0.662609  0.662609  
  11713  0.610634  0.610634  
  11714  0.619562  0.619562  
  
  [11715 rows x 4 columns])]

Now that we are ready to write our Combine task, let's detail the algorithm we are going to write to process each zone:

  1. Group the dataset by landcover, and for each:
    1. Resample the data monthly (instead of 10-daily) with padding in order to fill some void if there are any
  2. Reweight the FAPAR values by the area the landcover takes over the total area
  3. Regroup the dataset by date and take the (monthly) sum
  4. Resample the data yearly and remove the 1st and last year of data (to only include full years)
In [15]:
def combine_fapar(zone):
    series = zone[1]
    series["date"] = pd.to_datetime(series["date"])
    series = series.set_index("date")
    
    # Group the dataset by landcover, and for each:
    by_landcover = series.groupby("landcover")
    
    # Resample the data monthly (instead of 10-daily) with padding in order to fill some void if there are any
    by_landcover = by_landcover.resample("1m", how="mean").fillna(method='pad')
    series = by_landcover.reset_index()
    
    # Reweight the FAPAR values by the area the landcover takes over the total area
    total_area = by_landcover["area"].mean().sum()
    series["fapar"] *= series["area"] / total_area
    series = series.drop("area", axis=1)
    
    # Regroup the dataset by date and take the (monthly) sum
    series = series.groupby("date").sum()
    
    # Resample the data yearly and remove the 1st and last year of data (to only include full years)
    yearly_sum = pd.DataFrame(series["fapar"].resample("12m", "mean")[1:-1])
    
    return (zone[0], series, yearly_sum)

combined = by_zone.map(combine_fapar).cache()
combined.take(1)
Out[15]:
[(672,                 fapar
  date                 
  1999-01-31   9.717959
  1999-02-28   8.868936
  1999-03-31  10.370095
  1999-04-30  12.420800
  1999-05-31  12.032194
  1999-06-30   9.968254
  1999-07-31   9.284237
  1999-08-31   8.657068
  1999-09-30   6.982681
  1999-10-31   5.440120
  1999-11-30   5.537638
  1999-12-31   8.929892
  2000-01-31  10.512340
  2000-02-29  11.302637
  2000-03-31  13.014589
  2000-04-30  14.539168
  2000-05-31  14.625471
  2000-06-30  13.036348
  2000-07-31  10.837346
  2000-08-31   9.071030
  2000-09-30   6.932734
  2000-10-31   8.162503
  2000-11-30   7.231002
  2000-12-31  10.725195
  2001-01-31  11.366307
  2001-02-28  10.276144
  2001-03-31  10.517193
  2001-04-30  10.990723
  2001-05-31  10.960789
  2001-06-30  11.189507
  ...               ...
  2014-10-31   8.205650
  2014-11-30   8.801627
  2014-12-31  11.202606
  2015-01-31  10.501179
  2015-02-28   9.387215
  2015-03-31  10.382771
  2015-04-30  11.646911
  2015-05-31  11.264060
  2015-06-30  12.828613
  2015-07-31  11.889360
  2015-08-31  10.821384
  2015-09-30   8.733411
  2015-10-31   7.264628
  2015-11-30   6.536714
  2015-12-31   7.512584
  2016-01-31   7.997100
  2016-02-29  10.642866
  2016-03-31   9.514068
  2016-04-30   9.566114
  2016-05-31   8.566643
  2016-06-30   7.837038
  2016-07-31   8.002204
  2016-08-31   6.676644
  2016-09-30   5.125340
  2016-10-31   5.245660
  2016-11-30   6.372412
  2016-12-31   9.893308
  2017-01-31  10.128927
  2017-02-28   9.886582
  2017-03-31  10.078414
  
  [219 rows x 1 columns],                 fapar
  date                 
  2000-01-31   9.083688
  2001-01-31  10.903694
  2002-01-31   9.291626
  2003-01-31  10.591492
  2004-01-31   9.401098
  2005-01-31  11.984670
  2006-01-31  12.621007
  2007-01-31  10.920098
  2008-01-31  10.022505
  2009-01-31  10.516419
  2010-01-31  11.414949
  2011-01-31  11.175359
  2012-01-31  11.270612
  2013-01-31  11.411795
  2014-01-31  11.138742
  2015-01-31  10.768098
  2016-01-31   9.688729
  2017-01-31   8.130935)]

Finally, we can compute the trend using a linear regression (OLS).

In [16]:
def min_years(datas):
    return len(datas[2]) >=5 

def ols(datas):    
    yearly_series = datas[2]
    x, y = yearly_series.index.astype(int) // (10**9 * 60 * 24 * 30), yearly_series["fapar"]
    X = np.c_[np.ones(len(x)), x]
    return (datas[0], datas[1], datas[2], np.linalg.pinv(X).dot(y))

trends = combined.filter(min_years).map(ols).cache()

Next we will plot the trends. We are interested in the highest and lowest trends in our data.

Let's define our plotting function and plot the trends by descending order.

In [21]:
#copy auxiliary data from hdfs
!hdfs dfs -copyToLocal /tapdata/GAUL2013/zonecodes.csv ./
17/03/22 13:25:52 WARN hdfs.DFSClient: DFSInputStream has been closed already
In [23]:
zonecodes = pd.read_csv("./zonecodes.csv")
def plotTopTrends(trendsCollected):
    for zone, series, yearly_series, trend in trendsCollected:
        if not np.isnan(trend[0]) and not np.isnan(trend[1]):
            codes = zonecodes[zonecodes['ADM1_CODE'] == zone][["ADM0_NAME", "ADM1_NAME"]]
            codes["Trend"] = trend[1]
            print codes
                
            pdf = series.copy()
            pdf["ols"] = pdf.index.astype(int) / (10**9 * 60 * 24 * 30) * trend[1] + trend[0]
            pdf[["fapar", "ols"]].plot(figsize=(16,4))

            #sns.regplot(x="unix_date", y="fapar", data=series, scatter=False)
            plt.show()
In [24]:
plotTopTrends(trends.sortBy(lambda x: x[3][1], ascending=False).take(10))
          ADM0_NAME                          ADM1_NAME     Trend
117  American Samoa  Administrative unit not available  0.001467
/opt/rh/python27/root/usr/lib64/python2.7/site-packages/matplotlib/font_manager.py:1297: UserWarning: findfont: Font family [u'sans-serif'] not found. Falling back to DejaVu Sans
  (prop.get_family(), self.defaultFamily[fontext]))
    ADM0_NAME                  ADM1_NAME     Trend
793     Egypt  Al Bahr/al Ahmar (redsea)  0.000554
     ADM0_NAME                          ADM1_NAME     Trend
1311  Kiribati  Administrative unit not available  0.000551
       ADM0_NAME ADM1_NAME     Trend
1662  Montserrat  Plymouth  0.000495
     ADM0_NAME ADM1_NAME     Trend
1827  Pakistan    Punjab  0.000426
     ADM0_NAME  ADM1_NAME     Trend
1101     India  Rajasthan  0.000421
    ADM0_NAME                 ADM1_NAME     Trend
805     Egypt  As Ismailiyah (ismailia)  0.000421
    ADM0_NAME ADM1_NAME     Trend
104   Algeria     Saida  0.000418
    ADM0_NAME ADM1_NAME     Trend
112   Algeria    Tiaret  0.000405
     ADM0_NAME ADM1_NAME   Trend
2561    Turkey   Denizli  0.0004
In [25]:
plotTopTrends(trends.sortBy(lambda x: x[3][1], ascending=True).take(10))
            ADM0_NAME                          ADM1_NAME     Trend
574  Christmas Island  Administrative unit not available -0.001041
            ADM0_NAME                          ADM1_NAME     Trend
904  French Polynesia  Administrative unit not available -0.000726
       ADM0_NAME   ADM1_NAME     Trend
3314  Martinique  La Trinite -0.000519
    ADM0_NAME ADM1_NAME     Trend
731  Djibouti     Obock -0.000301
     ADM0_NAME ADM1_NAME     Trend
3102    Guinea   Conakry -0.000285
    ADM0_NAME     ADM1_NAME     Trend
984   Grenada  St. George's -0.000231
       ADM0_NAME  ADM1_NAME     Trend
1674  Mozambique  Inhambane -0.000223
       ADM0_NAME ADM1_NAME     Trend
1676  Mozambique    Maputo -0.000222