mlcourse.ai – Open Machine Learning Course

Author: Mariya Mansurova, Analyst & developer in Yandex.Metrica team. Translated by Ivan Zakharov, ML enthusiast.
This material is subject to the terms and conditions of the Creative Commons CC BY-NC-SA 4.0 license. Free use is permitted for any non-commercial purpose.

Assignment #9 (demo). Solution

Time series analysis

Same assignment as a Kaggle Kernel + solution.

Fill cells marked with "Your code here" and submit your answers to the questions through the web form.

In [1]:
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import os

from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from plotly import graph_objs as go
import requests

print(__version__) # need 1.9.0 or greater
init_notebook_mode(connected = True)


def plotly_df(df, title = ''):
    data = []
    
    for column in df.columns:
        trace = go.Scatter(
            x = df.index,
            y = df[column],
            mode = 'lines',
            name = column
        )
        data.append(trace)
    
    layout = dict(title = title)
    fig = dict(data = data, layout = layout)
    iplot(fig, show_link=False)
3.2.1

Data preparation

In [2]:
df = pd.read_csv('../../data/wiki_machine_learning.csv', sep = ' ')
df = df[df['count'] != 0]
df.head()
Out[2]:
date count lang page rank month title
81 2015-01-01 1414 en Machine_learning 8708 201501 Machine_learning
80 2015-01-02 1920 en Machine_learning 8708 201501 Machine_learning
79 2015-01-03 1338 en Machine_learning 8708 201501 Machine_learning
78 2015-01-04 1404 en Machine_learning 8708 201501 Machine_learning
77 2015-01-05 2264 en Machine_learning 8708 201501 Machine_learning
In [3]:
df.shape
Out[3]:
(383, 7)

Predicting with FB Prophet

We will hold out the last 30 days and predict the number of page views for them.

In [4]:
df.date = pd.to_datetime(df.date)
In [5]:
plotly_df(df.set_index('date')[['count']])
In [6]:
from fbprophet import Prophet
In [7]:
predictions = 30

df = df[['date', 'count']]
df.columns = ['ds', 'y']
df.tail()
Out[7]:
ds y
382 2016-01-16 1644
381 2016-01-17 1836
376 2016-01-18 2983
375 2016-01-19 3389
372 2016-01-20 3559
In [8]:
train_df = df[:-predictions].copy()
In [9]:
m = Prophet()
m.fit(train_df);
INFO:fbprophet.forecaster:Disabling yearly seasonality. Run prophet with yearly_seasonality=True to override this.
INFO:fbprophet.forecaster:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
In [10]:
future = m.make_future_dataframe(periods=predictions)
future.tail()
Out[10]:
ds
378 2016-01-16
379 2016-01-17
380 2016-01-18
381 2016-01-19
382 2016-01-20
In [11]:
forecast = m.predict(future)
forecast.tail()
Out[11]:
ds trend yhat_lower yhat_upper trend_lower trend_upper additive_terms additive_terms_lower additive_terms_upper weekly weekly_lower weekly_upper multiplicative_terms multiplicative_terms_lower multiplicative_terms_upper yhat
378 2016-01-16 2979.319903 1699.549889 2515.614415 2960.111058 3000.391112 -861.693311 -861.693311 -861.693311 -861.693311 -861.693311 -861.693311 0.0 0.0 0.0 2117.626592
379 2016-01-17 2984.727592 1857.732571 2711.387501 2964.373639 3006.562727 -720.705354 -720.705354 -720.705354 -720.705354 -720.705354 -720.705354 0.0 0.0 0.0 2264.022238
380 2016-01-18 2990.135281 2887.943979 3672.427351 2968.268706 3013.634791 281.357669 281.357669 281.357669 281.357669 281.357669 281.357669 0.0 0.0 0.0 3271.492950
381 2016-01-19 2995.542971 3141.643599 3928.242112 2972.105376 3020.139689 541.423837 541.423837 541.423837 541.423837 541.423837 541.423837 0.0 0.0 0.0 3536.966808
382 2016-01-20 3000.950660 3022.330110 3814.884513 2976.363355 3026.585093 425.547911 425.547911 425.547911 425.547911 425.547911 425.547911 0.0 0.0 0.0 3426.498570

Question 1: What is the predicted number of views of the wiki page on January 20? Round to the nearest integer. (A snippet for reading the answer off the forecast follows the options.)

  • 4947
  • 3426 [+]
  • 5229
  • 2744
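A quick way to read this answer off the forecast frame (an illustration, not part of the original solution):

# look up yhat for the last forecasted date; ds is already datetime
jan20_yhat = forecast.loc[forecast['ds'] == '2016-01-20', 'yhat']
print(round(jan20_yhat.values[0]))  # 3426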
In [12]:
m.plot(forecast)
Out[12]:
In [13]:
m.plot_components(forecast)
Out[13]:
In [14]:
cmp_df = forecast.set_index('ds')[['yhat', 'yhat_lower', 'yhat_upper']].join(df.set_index('ds'))
In [15]:
cmp_df['e'] = cmp_df['y'] - cmp_df['yhat']
cmp_df['p'] = 100 * cmp_df['e'] / cmp_df['y']
print('MAPE = ', round(np.mean(abs(cmp_df[-predictions:]['p'])), 2))
print('MAE = ', round(np.mean(abs(cmp_df[-predictions:]['e'])), 2))
MAPE =  34.5
MAE =  599.84

Estimate the quality of the prediction using the last 30 points.

Question 2: What is MAPE equal to?

  • 34.5 [+]
  • 42.42
  • 5.39
  • 65.91

Question 3: What is MAE equal to?

  • 355
  • 4007
  • 600 [+]
  • 903

Predicting with ARIMA

In [16]:
%matplotlib inline
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.api as sm
plt.rcParams['figure.figsize'] = (15, 10)

Question 4: Let's verify the stationarity of the series using the Dickey-Fuller test. Is the series stationary? What is the p-value?

  • Series is stationary, p_value = 0.107
  • Series is not stationary, p_value = 0.107 [+]
  • Series is stationary, p_value = 0.001
  • Series is not stationary, p_value = 0.001
In [17]:
sm.tsa.seasonal_decompose(train_df['y'].values, freq=7).plot();
print("Dickey-Fuller test: p=%f" % sm.tsa.stattools.adfuller(train_df['y'])[1])
Dickey-Fuller test: p=0.107392

But the seasonally differenced series will already be stationary.

In [18]:
train_df.set_index('ds', inplace=True)
In [19]:
train_df['y_diff'] = train_df.y - train_df.y.shift(7)
sm.tsa.seasonal_decompose(train_df.y_diff[7:].values, freq=7).plot();
print("Dickey-Fuller test: p=%f" % sm.tsa.stattools.adfuller(train_df.y_diff[8:])[1])
Dickey-Fuller test: p=0.000000
In [20]:
ax = plt.subplot(211)
sm.graphics.tsa.plot_acf(train_df.y_diff[13:].values.squeeze(), lags=48, ax=ax)

ax = plt.subplot(212)
sm.graphics.tsa.plot_pacf(train_df.y_diff[13:].values.squeeze(), lags=48, ax=ax)
Out[20]:

Initial values, read off the ACF and PACF plots above (q and Q from the significant ACF lags, p and P from the PACF ones); a programmatic cross-check of the significant lags follows the list:

  • Q = 1
  • q = 3
  • P = 3
  • p = 1
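As a cross-check (an illustration, not in the original notebook), the significant lags can also be read off programmatically: a lag matters when its 95% confidence interval excludes zero.

# list the lags whose 95% confidence interval excludes zero
acf_vals, acf_ci = sm.tsa.stattools.acf(train_df.y_diff[13:], nlags=48, alpha=0.05)
pacf_vals, pacf_ci = sm.tsa.stattools.pacf(train_df.y_diff[13:], nlags=48, alpha=0.05)
print('ACF: ', [lag for lag in range(1, 49) if not acf_ci[lag, 0] <= 0 <= acf_ci[lag, 1]])
print('PACF:', [lag for lag in range(1, 49) if not pacf_ci[lag, 0] <= 0 <= pacf_ci[lag, 1]])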
In [21]:
ps = range(0, 2)
ds = range(0, 2)
qs = range(0, 4)
Ps = range(0, 4)
Ds = range(0, 3)
Qs = range(0, 2)
In [22]:
from itertools import product

parameters = product(ps, ds, qs, Ps, Ds, Qs)
parameters_list = list(parameters)
len(parameters_list)
Out[22]:
384
In [23]:
%%time
import warnings
from tqdm import tqdm
results1 = []
best_aic = float("inf")
warnings.filterwarnings('ignore')

for param in tqdm(parameters_list):
    # try/except is necessary because the model cannot be trained on some parameter sets
    try:
        model = sm.tsa.statespace.SARIMAX(train_df['y'], order=(param[0], param[1], param[2]),
                                          seasonal_order=(param[3], param[4], param[5], 7)).fit(disp=-1)
    # skip the parameter sets on which the model fails to train and move on to the next one
    except (ValueError, np.linalg.LinAlgError):
        continue
    aic = model.aic
    #save the best model, aic, parameters
    if aic < best_aic:
        best_model = model
        best_aic = aic
        best_param = param
    results1.append([param, model.aic])
100%|██████████| 384/384 [07:13<00:00,  1.13s/it]
CPU times: user 28min 58s, sys: 50min 9s, total: 1h 19min 8s
Wall time: 7min 13s

In [24]:
result_table1 = pd.DataFrame(results1)
result_table1.columns = ['parameters', 'aic']
print(result_table1.sort_values(by = 'aic', ascending=True).head())
             parameters          aic
138  (0, 1, 2, 3, 2, 1)  4961.632628
271  (1, 1, 1, 3, 2, 1)  4962.829098
161  (0, 1, 3, 3, 2, 1)  4969.534608
182  (1, 0, 0, 3, 2, 1)  4973.212241
77   (0, 0, 3, 3, 2, 1)  4978.036900

If we consider only the variants proposed in the web form:

In [25]:
result_table1[result_table1['parameters'].isin([(1, 0, 2, 3, 1, 0),
                                                (1, 1, 2, 3, 2, 1),
                                                (1, 1, 2, 3, 1, 1),
                                                (1, 0, 2, 3, 0, 0)])]
Out[25]:
parameters aic
215 (1, 0, 2, 3, 1, 0) 5022.312524
284 (1, 1, 2, 3, 1, 1) 5019.555903
286 (1, 1, 2, 3, 2, 1) 4988.963972

Now let's do the same for the Box-Cox-transformed series.

In [26]:
import scipy.stats
train_df['y_box'], lmbda = scipy.stats.boxcox(train_df['y']) 
print("The optimal Box-Cox transformation parameter: %f" % lmbda)
The optimal Box-Cox transformation parameter: 0.732841
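As a sanity check (an illustration, not in the original notebook): for a non-zero lambda, scipy.stats.boxcox computes (y**lmbda - 1) / lmbda, which we can verify directly.

# verify the Box-Cox definition against scipy's output
assert np.allclose(train_df['y_box'], (train_df['y'] ** lmbda - 1) / lmbda)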
In [27]:
results2 = []
best_aic = float("inf")

for param in tqdm(parameters_list):
    # try/except is necessary because the model cannot be trained on some parameter sets
    try:
        model = sm.tsa.statespace.SARIMAX(train_df['y_box'], order=(param[0], param[1], param[2]),
                                          seasonal_order=(param[3], param[4], param[5], 7)).fit(disp=-1)
    # skip the parameter sets on which the model fails to train and move on to the next one
    except (ValueError, np.linalg.LinAlgError):
        continue
    aic = model.aic
    #save the best model, aic, parameters
    if aic < best_aic:
        best_model = model
        best_aic = aic
        best_param = param
    results2.append([param, model.aic])
    
warnings.filterwarnings('default')
100%|██████████| 384/384 [05:39<00:00,  1.13it/s]
In [28]:
result_table2 = pd.DataFrame(results2)
result_table2.columns = ['parameters', 'aic']
print(result_table2.sort_values(by = 'aic', ascending=True).head())
             parameters          aic
222  (1, 0, 2, 3, 2, 1)  3528.650815
181  (1, 0, 0, 3, 2, 1)  3530.524249
201  (1, 0, 1, 3, 2, 1)  3532.092279
276  (1, 1, 1, 3, 2, 1)  3534.434781
290  (1, 1, 2, 3, 2, 1)  3534.540192

If we consider only the variants proposed in the web form:

In [29]:
result_table2[result_table2['parameters'].isin([(1, 0, 2, 3, 1, 0),
                                                (1, 1, 2, 3, 2, 1),
                                                (1, 1, 2, 3, 1, 1),
                                                (1, 0, 2, 3, 0, 0)])].sort_values(by='aic')
Out[29]:
parameters aic
290 (1, 1, 2, 3, 2, 1) 3534.540192
219 (1, 0, 2, 3, 1, 0) 3556.880030
289 (1, 1, 2, 3, 1, 1) 3557.849670

Next, we turn to the construction of the SARIMAX model (sm.tsa.statespace.SARIMAX).
Question 5: Which of the proposed parameter sets is best for the model according to the AIC criterion? (A note on reading the tuples follows the options.)

  • D = 1, d = 0, Q = 0, q = 2, P = 3, p = 1
  • D = 2, d = 1, Q = 1, q = 2, P = 3, p = 1 [+]
  • D = 1, d = 1, Q = 1, q = 2, P = 3, p = 1
  • D = 0, d = 0, Q = 0, q = 2, P = 3, p = 1
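Note that best_param over the full grid on the Box-Cox series is (1, 0, 2, 3, 2, 1) (that model is summarized below), while the question asks only about the variants listed in the web form, among which (1, 1, 2, 3, 2, 1) has the lowest AIC. For readability, here is how a tuple maps onto the letters in the question (an illustration, not part of the original solution):

# unpack the winning tuple into named SARIMA orders
p, d, q, P, D, Q = (1, 1, 2, 3, 2, 1)
print('SARIMAX(p=%d, d=%d, q=%d) x (P=%d, D=%d, Q=%d, s=7)' % (p, d, q, P, D, Q))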

Let's inspect the best model by AIC: its summary, residuals, and in-sample fit.

In [30]:
print(best_model.summary())
                                 Statespace Model Results                                
=========================================================================================
Dep. Variable:                             y_box   No. Observations:                  353
Model:             SARIMAX(1, 0, 2)x(3, 2, 1, 7)   Log Likelihood               -1756.325
Date:                           Mon, 19 Nov 2018   AIC                           3528.651
Time:                                   11:46:52   BIC                           3559.259
Sample:                                        0   HQIC                          3540.848
                                           - 353                                         
Covariance Type:                             opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.8196      0.118      6.963      0.000       0.589       1.050
ma.L1         -0.3354      0.124     -2.703      0.007      -0.579      -0.092
ma.L2         -0.2021      0.091     -2.221      0.026      -0.380      -0.024
ar.S.L7       -0.6465      0.039    -16.430      0.000      -0.724      -0.569
ar.S.L14      -0.4304      0.059     -7.332      0.000      -0.545      -0.315
ar.S.L21      -0.2661      0.043     -6.182      0.000      -0.350      -0.182
ma.S.L7       -0.9996      3.032     -0.330      0.742      -6.943       4.944
sigma2      1633.0342   4932.868      0.331      0.741   -8035.210    1.13e+04
===================================================================================
Ljung-Box (Q):                       52.85   Jarque-Bera (JB):               528.34
Prob(Q):                              0.08   Prob(JB):                         0.00
Heteroskedasticity (H):               0.74   Skew:                             0.95
Prob(H) (two-sided):                  0.12   Kurtosis:                         8.81
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
In [31]:
plt.subplot(211)
best_model.resid[13:].plot()
plt.ylabel(u'Residuals')

ax = plt.subplot(212)
sm.graphics.tsa.plot_acf(best_model.resid[13:].values.squeeze(), lags=48, ax=ax)

print("Student's test: p=%f" % stats.ttest_1samp(best_model.resid[13:], 0)[1])
print("Dickey-Fuller test: p=%f" % sm.tsa.stattools.adfuller(best_model.resid[13:])[1])
Student's test: p=0.114851
Dickey-Fuller test: p=0.000000
In [32]:
def invboxcox(y, lmbda):
    # inverse Box-Cox transformation
    if lmbda == 0:
        return np.exp(y)
    else:
        return np.exp(np.log(lmbda * y + 1) / lmbda)
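A quick round-trip check (an illustration, not part of the original solution): invboxcox should undo scipy.stats.boxcox for the fitted lmbda.

# boxcox followed by invboxcox should give back the original series
assert np.allclose(invboxcox(train_df['y_box'].values, lmbda), train_df['y'].values)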
In [33]:
train_df['arima_model'] = invboxcox(best_model.fittedvalues, lmbda)

train_df.y.tail(200).plot()
train_df.arima_model[13:].tail(200).plot(color='r')
plt.ylabel('wiki pageviews');