%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
import itertools
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pandas_profiling as pdp
import statsmodels.api as sm
from plotly import tools
from plotly.graph_objs import Bar, Figure, Layout, Scatter
from plotly.offline import init_notebook_mode, iplot
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.arima_model import ARIMA, ARMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
init_notebook_mode(connected=True)
save_image = None  # set to 'png' to save images
Two pitfalls to be aware of before fitting. If pv is left as int64, differencing inside the model raises

TypeError: Cannot cast ufunc subtract output from dtype('float64') to dtype('int64') with casting rule 'same_kind'

and if the index carries no frequency, forecasting by date raises

ValueError: There is no frequency for these dates and date 2018-04-01 00:00:00 is not in dates index. Try giving a date that is in the dates index or use an integer

The cast to float64 and the reindex onto a daily date_range below avoid both.
data_file = '../data/processed/data.csv'
df = pd.read_csv(data_file)
df.date = pd.to_datetime(df.date)
df = df.set_index('date')
date_index = pd.date_range('2017-10-01', end='2018-04-30', freq='D')
df = df.reindex(date_index)
df.pv = df.pv.astype(np.float64)
df['log_pv'] = np.log(df.pv)
print(df.pv.head())
2017-10-01     2.0
2017-10-02     9.0
2017-10-03     3.0
2017-10-04    21.0
2017-10-05    13.0
Freq: D, Name: pv, dtype: float64
df.head()
| date | pv | entry | pv_per_entry | log_pv |
|---|---|---|---|---|
| 2017-10-01 | 2.0 | 8 | 0.250 | 0.693147 |
| 2017-10-02 | 9.0 | 8 | 1.125 | 2.197225 |
| 2017-10-03 | 3.0 | 8 | 0.375 | 1.098612 |
| 2017-10-04 | 21.0 | 8 | 2.625 | 3.044522 |
| 2017-10-05 | 13.0 | 8 | 1.625 | 2.564949 |
df.dtypes
pv              float64
entry             int64
pv_per_entry    float64
log_pv          float64
dtype: object
df_train = df[df.index < '2018-04-01']
df_test = df[df.index >= '2018-04-01']
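A quick sanity check on the split sizes (the profile below reports 212 rows in total, and the model summaries further down show 182 training observations):

# Expect 182 training days (2017-10-01 through 2018-03-31) and 30 test days (April 2018).
print(len(df_train), len(df_test))  # 182 30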
pdp.ProfileReport(df)
Dataset info

| Statistic | Value |
|---|---|
| Number of variables | 5 |
| Number of observations | 212 |
| Total Missing (%) | 0.0% |
| Total size in memory | 8.4 KiB |
| Average record size in memory | 40.4 B |

Variable types

| Type | Count |
|---|---|
| Numeric | 3 |
| Categorical | 0 |
| Boolean | 0 |
| Date | 1 |
| Text (Unique) | 0 |
| Rejected | 1 |
| Unsupported | 0 |
Warnings

pv_per_entry is highly correlated with pv (ρ = 0.98508): Rejected

entry
Numeric
| Statistic | Value |
|---|---|
| Distinct count | 29 |
| Unique (%) | 13.7% |
| Missing (%) | 0.0% |
| Missing (n) | 0 |
| Infinite (%) | 0.0% |
| Infinite (n) | 0 |
| Mean | 25.599 |
| Minimum | 8 |
| Maximum | 36 |
| Zeros (%) | 0.0% |

Quantile statistics

| Quantile | Value |
|---|---|
| Minimum | 8 |
| 5-th percentile | 10 |
| Q1 | 17 |
| Median | 29.5 |
| Q3 | 33 |
| 95-th percentile | 35 |
| Maximum | 36 |
| Range | 28 |
| Interquartile range | 16 |

Descriptive statistics

| Statistic | Value |
|---|---|
| Standard deviation | 8.963 |
| Coef of variation | 0.35013 |
| Kurtosis | -1.1647 |
| Mean | 25.599 |
| MAD | 8.0162 |
| Skewness | -0.56399 |
| Sum | 5427 |
| Variance | 80.336 |
| Memory size | 1.7 KiB |

Common values

| Value | Count | Frequency (%) |
|---|---|---|
| 33 | 40 | 18.9% |
| 35 | 24 | 11.3% |
| 32 | 10 | 4.7% |
| 31 | 9 | 4.2% |
| 36 | 8 | 3.8% |
| 34 | 8 | 3.8% |
| 11 | 8 | 3.8% |
| 25 | 8 | 3.8% |
| 13 | 7 | 3.3% |
| 16 | 7 | 3.3% |
| Other values (19) | 83 | 39.2% |

Minimum 5 values

| Value | Count | Frequency (%) |
|---|---|---|
| 8 | 6 | 2.8% |
| 9 | 4 | 1.9% |
| 10 | 3 | 1.4% |
| 11 | 8 | 3.8% |
| 12 | 6 | 2.8% |

Maximum 5 values

| Value | Count | Frequency (%) |
|---|---|---|
| 32 | 10 | 4.7% |
| 33 | 40 | 18.9% |
| 34 | 8 | 3.8% |
| 35 | 24 | 11.3% |
| 36 | 8 | 3.8% |
index
Date
| Statistic | Value |
|---|---|
| Distinct count | 212 |
| Unique (%) | 100.0% |
| Missing (%) | 0.0% |
| Missing (n) | 0 |
| Infinite (%) | 0.0% |
| Infinite (n) | 0 |
| Minimum | 2017-10-01 00:00:00 |
| Maximum | 2018-04-30 00:00:00 |
log_pv
Numeric
| Statistic | Value |
|---|---|
| Distinct count | 164 |
| Unique (%) | 77.4% |
| Missing (%) | 0.0% |
| Missing (n) | 0 |
| Infinite (%) | 0.0% |
| Infinite (n) | 0 |
| Mean | 4.5803 |
| Minimum | 0 |
| Maximum | 6.9412 |
| Zeros (%) | 0.9% |

Quantile statistics

| Quantile | Value |
|---|---|
| Minimum | 0 |
| 5-th percentile | 2.4849 |
| Q1 | 3.801 |
| Median | 4.9452 |
| Q3 | 5.5225 |
| 95-th percentile | 6.0012 |
| Maximum | 6.9412 |
| Range | 6.9412 |
| Interquartile range | 1.7214 |

Descriptive statistics

| Statistic | Value |
|---|---|
| Standard deviation | 1.2433 |
| Coef of variation | 0.27144 |
| Kurtosis | 1.2054 |
| Mean | 4.5803 |
| MAD | 0.99438 |
| Skewness | -1.0761 |
| Sum | 971.02 |
| Variance | 1.5457 |
| Memory size | 1.7 KiB |

Common values

| Value | Count | Frequency (%) |
|---|---|---|
| 2.5649493574615367 | 4 | 1.9% |
| 4.61512051684126 | 3 | 1.4% |
| 2.4849066497880004 | 3 | 1.4% |
| 3.1354942159291497 | 3 | 1.4% |
| 3.4339872044851463 | 3 | 1.4% |
| 3.5263605246161616 | 3 | 1.4% |
| 5.869296913133774 | 3 | 1.4% |
| 0.0 | 2 | 0.9% |
| 5.135798437050262 | 2 | 0.9% |
| 4.48863636973214 | 2 | 0.9% |
| Other values (154) | 184 | 86.8% |

Minimum 5 values

| Value | Count | Frequency (%) |
|---|---|---|
| 0.0 | 2 | 0.9% |
| 0.6931471805599453 | 1 | 0.5% |
| 1.0986122886681098 | 2 | 0.9% |
| 1.3862943611198906 | 1 | 0.5% |
| 2.1972245773362196 | 2 | 0.9% |

Maximum 5 values

| Value | Count | Frequency (%) |
|---|---|---|
| 6.100318952020064 | 1 | 0.5% |
| 6.124683390894205 | 1 | 0.5% |
| 6.272877006546167 | 1 | 0.5% |
| 6.428105272684596 | 1 | 0.5% |
| 6.9411900550683745 | 1 | 0.5% |
pv
Numeric
| Statistic | Value |
|---|---|
| Distinct count | 164 |
| Unique (%) | 77.4% |
| Missing (%) | 0.0% |
| Missing (n) | 0 |
| Infinite (%) | 0.0% |
| Infinite (n) | 0 |
| Mean | 165 |
| Minimum | 1 |
| Maximum | 1034 |
| Zeros (%) | 0.0% |

Quantile statistics

| Quantile | Value |
|---|---|
| Minimum | 1 |
| 5-th percentile | 12 |
| Q1 | 44.75 |
| Median | 140.5 |
| Q3 | 250.25 |
| 95-th percentile | 403.9 |
| Maximum | 1034 |
| Range | 1033 |
| Interquartile range | 205.5 |

Descriptive statistics

| Statistic | Value |
|---|---|
| Standard deviation | 143.5 |
| Coef of variation | 0.86966 |
| Kurtosis | 5.4718 |
| Mean | 165 |
| MAD | 111.47 |
| Skewness | 1.5815 |
| Sum | 34981 |
| Variance | 20592 |
| Memory size | 1.7 KiB |

Common values

| Value | Count | Frequency (%) |
|---|---|---|
| 13.0 | 4 | 1.9% |
| 23.0 | 3 | 1.4% |
| 354.0 | 3 | 1.4% |
| 101.0 | 3 | 1.4% |
| 12.0 | 3 | 1.4% |
| 31.0 | 3 | 1.4% |
| 34.0 | 3 | 1.4% |
| 43.0 | 2 | 0.9% |
| 181.0 | 2 | 0.9% |
| 67.0 | 2 | 0.9% |
| Other values (154) | 184 | 86.8% |

Minimum 5 values

| Value | Count | Frequency (%) |
|---|---|---|
| 1.0 | 2 | 0.9% |
| 2.0 | 1 | 0.5% |
| 3.0 | 2 | 0.9% |
| 4.0 | 1 | 0.5% |
| 9.0 | 2 | 0.9% |

Maximum 5 values

| Value | Count | Frequency (%) |
|---|---|---|
| 446.0 | 1 | 0.5% |
| 457.0 | 1 | 0.5% |
| 530.0 | 1 | 0.5% |
| 619.0 | 1 | 0.5% |
| 1034.0 | 1 | 0.5% |
pv_per_entry

Highly correlated: this variable is highly correlated with pv (correlation 0.98508) and should be ignored for analysis.
data = [Scatter(x=df.index, y=df.pv, name='pv')]
fig = Figure(data=data, layout=Layout(title='PV数'))
iplot(fig, image=save_image, filename='PV数')
fig = tools.make_subplots(rows=3, cols=1)
fig.append_trace(Scatter(x=df.index, y=df.pv, name='pv'), 1, 1)
fig.append_trace(Scatter(x=df.index, y=df.log_pv, name='pv(log)'), 2, 1)
fig.append_trace(Scatter(x=df.index, y=df.pv.diff(), name='pv(diff)'), 3, 1)
fig['layout'].update(title='PV数の推移')
iplot(fig, image=save_image, filename='PV数の推移')
df.pv.index[:-1]
DatetimeIndex(['2017-10-01', '2017-10-02', '2017-10-03', '2017-10-04',
               '2017-10-05', '2017-10-06', '2017-10-07', '2017-10-08',
               '2017-10-09', '2017-10-10',
               ...
               '2018-04-20', '2018-04-21', '2018-04-22', '2018-04-23',
               '2018-04-24', '2018-04-25', '2018-04-26', '2018-04-27',
               '2018-04-28', '2018-04-29'],
              dtype='datetime64[ns]', length=211, freq='D')
data = [Scatter(x=df.pv[:-1], y=df.pv[1:], mode='markers')]
corr = np.corrcoef(df.pv[:-1], df.pv[1:])[0][1]
fig = Figure(data=data, layout=Layout(title='ラグ1の散布図(corr={:.2})'.format(corr)))
iplot(fig, image=save_image, filename='ラグ1の散布図')
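pandas offers the same lag-1 correlation directly via Series.autocorr; a one-line cross-check on the np.corrcoef value used in the title:

print('lag-1 autocorrelation = {:.2}'.format(df.pv.autocorr(lag=1)))  # should match corr above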
n = 40
fig, (ax1, ax2) = plt.subplots(nrows=2, sharex=True)
sm.graphics.tsa.plot_acf(df.pv, lags=n, ax=ax1)
sm.graphics.tsa.plot_pacf(df.pv, lags=n, ax=ax2)
if save_image:
fig.savefig('コレログラム1.png')
acf = sm.tsa.stattools.acf(df.pv, nlags=n)
pacf = sm.tsa.stattools.pacf(df.pv, nlags=n, method='ywm')
fig = tools.make_subplots(rows=2, cols=1)
fig.append_trace(Bar(y=acf, name='自己相関'), 1, 1)
fig.append_trace(Bar(y=pacf, name='偏自己相関'), 2, 1)
fig['layout'].update(
title='コレログラム',
shapes=[{
'xref': 'x1',
'yref': 'y1',
'type': 'rect',
'x0': 0,
'y0': -1.96 / np.sqrt(len(df)),
'x1': n,
'y1': 1.96 / np.sqrt(len(df)),
'fillcolor': '#808080',
'opacity': 0.5,
'line': {
'width': 0
}
},{
'xref': 'x2',
'yref': 'y2',
'type': 'rect',
'x0': 0,
'y0': -1.96 / np.sqrt(len(df)),
'x1': n,
'y1': 1.96 / np.sqrt(len(df)),
'fillcolor': '#808080',
'opacity': 0.5,
'line': {
'width': 0
}
}])
iplot(fig, image=save_image, filename='コレログラム2')
res = sm.tsa.stattools.adfuller(df.pv)
print('p-value = {:.4}'.format(res[1]))
p-value = 0.746
res = sm.tsa.stattools.adfuller(df.log_pv)
print('p-value = {:.4}'.format(res[1]))
p-value = 0.02229
res = sm.tsa.stattools.adfuller(df.pv.diff().dropna())
print('p-value = {:.4}'.format(res[1]))
p-value = 2.166e-06
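The three tests above can also be run in one loop; a small convenience sketch over the same transforms:

# ADF null hypothesis: a unit root is present. pv fails to reject it,
# while log_pv and the first difference reject it at the 5% level.
for name, series in [('pv', df.pv),
                     ('log_pv', df.log_pv),
                     ('pv.diff()', df.pv.diff().dropna())]:
    print('{}: p-value = {:.4}'.format(name, sm.tsa.stattools.adfuller(series)[1]))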
Use seasonal_decompose to split the series into trend, seasonal, and residual components via moving averages (freq=7 captures the weekly cycle).
res = sm.tsa.seasonal_decompose(df.pv, freq=7)
fig = tools.make_subplots(rows=4, cols=1)
fig.append_trace(Scatter(x=res.observed.index, y=res.observed, name='Original'), 1, 1)
fig.append_trace(Scatter(x=res.trend.index, y=res.trend, name='Trend'), 2, 1)
fig.append_trace(Scatter(x=res.seasonal.index, y=res.seasonal, name='Seasonal'), 3, 1)
fig.append_trace(Scatter(x=res.resid.index, y=res.resid, name='Resid'), 4, 1)
fig['layout'].update(title='成分分解')
iplot(fig, image=save_image, filename='成分分解')
res = sm.tsa.seasonal_decompose(df.log_pv, freq=7)
fig = tools.make_subplots(rows=4, cols=1)
fig.append_trace(Scatter(x=res.observed.index, y=res.observed, name='Original'), 1, 1)
fig.append_trace(Scatter(x=res.trend.index, y=res.trend, name='Trend'), 2, 1)
fig.append_trace(Scatter(x=res.seasonal.index, y=res.seasonal, name='Seasonal'), 3, 1)
fig.append_trace(Scatter(x=res.resid.index, y=res.resid, name='Resid'), 4, 1)
fig['layout'].update(title='成分分解(対数)')
iplot(fig, image=save_image, filename='成分分解(対数)')
data = [Scatter(x=df.index, y=df.pv.diff(), name='pv(diff)')]
fig = Figure(data=data, layout=Layout(title='差分データ'))
iplot(fig, image=save_image, filename='差分データ')
res = sm.tsa.seasonal_decompose(df.pv, freq=7)
s = df.pv - res.seasonal
data = [Scatter(x=s.index, y=s, name='pv(seasonal)')]
fig = Figure(data=data, layout=Layout(title='季節調整済みデータ'))
iplot(fig, image=save_image, filename='季節調整済みデータ')
res = sm.tsa.seasonal_decompose(df.pv, freq=7)
s_diff = (df.pv - res.seasonal).diff().dropna()
data = [Scatter(x=s_diff.index, y=s_diff, name='pv(seasonal+diff)')]
fig = Figure(data=data, layout=Layout(title='差分+季節調整済みデータ'))
iplot(fig, image=save_image, filename='差分+季節調整済みデータ')
res = sm.tsa.stattools.adfuller(s_diff)
print('p-value = {:.4}'.format(res[1]))
p-value = 3.963e-10
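As a complementary check, the KPSS test takes stationarity as its null (the opposite of ADF), so failing to reject there would agree with the tiny ADF p-value above. A sketch, assuming the default regression='c':

# KPSS null: the series is level-stationary; a large p-value here is
# consistent with the ADF rejection of a unit root above.
stat, pvalue, nlags, crit = sm.tsa.stattools.kpss(s_diff)
print('KPSS p-value = {:.4}'.format(pvalue))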
n = 40
fig, (ax1, ax2) = plt.subplots(nrows=2, sharex=True)
sm.graphics.tsa.plot_acf(s_diff, lags=n, ax=ax1)
sm.graphics.tsa.plot_pacf(s_diff, lags=n, ax=ax2)
if save_image:
fig.savefig('コレログラム(s_diff).png')
data = [
Scatter(x=df.index, y=df.pv_per_entry, name='pv / entry'),
Scatter(x=df.index, y=df.entry, name='entry')
]
fig = Figure(data=data, layout=Layout(title='1エントリあたりのPV数'))
iplot(fig, image=save_image, filename='1エントリあたりのPV数')
res = sm.tsa.seasonal_decompose(df.pv_per_entry, freq=7)
fig = tools.make_subplots(rows=4, cols=1)
fig.append_trace(Scatter(x=res.observed.index, y=res.observed, name='Original'), 1, 1)
fig.append_trace(Scatter(x=res.trend.index, y=res.trend, name='Trend'), 2, 1)
fig.append_trace(Scatter(x=res.seasonal.index, y=res.seasonal, name='Seasonal'), 3, 1)
fig.append_trace(Scatter(x=res.resid.index, y=res.resid, name='Resid'), 4, 1)
fig['layout'].update(title='成分分解(PV / エントリ数)')
iplot(fig, image=save_image, filename='成分分解(PV / エントリ数)')
ts = df.pv
ts_train = df_train.pv
ts_test = df_test.pv
def fit_model(ts_train, order, seasonal_order):
model = SARIMAX(
ts_train,
order=order,
seasonal_order=seasonal_order,
enforce_stationarity=False,
enforce_invertibility=False)
result = model.fit()
return result
def eval_model(ts_train, ts_test, result):
train_pred = result.predict()
test_pred = result.forecast(len(ts_test))
test_pred_ci = result.get_forecast(len(ts_test)).conf_int()
train_rmse = np.sqrt(mean_squared_error(ts_train, train_pred))
test_rmse = np.sqrt(mean_squared_error(ts_test, test_pred))
print('RMSE(train):\t{:.5}\nRMSE(test):\t{:.5}'.format(
train_rmse, test_rmse))
return train_pred, test_pred, test_pred_ci
def make_fig(ts, train_pred, test_pred, test_pred_ci, result, title='SARIMA'):
data = [
Scatter(x=ts.index, y=ts, name='Original', line=dict(dash='dot')),
Scatter(x=train_pred.index, y=train_pred, name='Predict(train)'),
Scatter(x=test_pred.index, y=test_pred, name='Predict(test)'),
# Scatter(x=result.resid.index, y=result.resid, name='Resid')
]
fig1 = Figure(data=data, layout=Layout(title=title))
fig2, ax = plt.subplots()
ts.plot(ax=ax, label='Original', linestyle="dashed")
train_pred.plot(ax=ax, label='Predict(train)')
test_pred.plot(ax=ax, label='Predict(test)')
ax.fill_between(
test_pred_ci.index,
test_pred_ci.iloc[:, 0],
test_pred_ci.iloc[:, 1],
color='k',
alpha=.2)
ax.legend()
ax.set_title(title)
fig3, ax = plt.subplots()
ts['2018-04-01':].plot(ax=ax, label='Original', linestyle="dashed")
test_pred.plot(ax=ax, label='Predict(test)', color='g')
ax.fill_between(
test_pred_ci.index,
test_pred_ci.iloc[:, 0],
test_pred_ci.iloc[:, 1],
color='k',
alpha=.2)
ax.legend()
ax.set_title(title)
return fig1, fig2, fig3
warnings.filterwarnings('ignore')
s_diff_train = s_diff[s_diff.index < '2018-04-01']
res = sm.tsa.arma_order_select_ic(s_diff_train, max_ar=5, max_ma=7, ic='aic')
res.aic_min_order
(2, 6)
# s_diff is the once-differenced, seasonally adjusted series, so d = 1 with (p, q) = (2, 6)
order = (2, 1, 6)
seasonal_order = (1, 1, 1, 7)  # weekly (period 7) seasonality
result = fit_model(ts_train, order, seasonal_order)
train_pred, test_pred, test_pred_ci = eval_model(ts_train, ts_test, result)
result.summary()
RMSE(train):	58.447
RMSE(test):	150.03
| Dep. Variable: | pv | No. Observations: | 182 |
|---|---|---|---|
| Model: | SARIMAX(2, 1, 6)x(1, 1, 1, 7) | Log Likelihood | -890.102 |
| Date: | Sun, 06 May 2018 | AIC | 1802.205 |
| Time: | 21:06:17 | BIC | 1837.449 |
| Sample: | 10-01-2017 | HQIC | 1816.492 |
| | - 03-31-2018 | | |
| Covariance Type: | opg | | |
| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| ar.L1 | -0.6489 | 4.479 | -0.145 | 0.885 | -9.427 | 8.129 |
| ar.L2 | 0.3797 | 4.560 | 0.083 | 0.934 | -8.558 | 9.317 |
| ma.L1 | 0.1962 | 13.725 | 0.014 | 0.989 | -26.704 | 27.096 |
| ma.L2 | -0.9070 | 14.113 | -0.064 | 0.949 | -28.568 | 26.754 |
| ma.L3 | -0.1296 | 2.655 | -0.049 | 0.961 | -5.332 | 5.073 |
| ma.L4 | -0.0245 | 1.562 | -0.016 | 0.987 | -3.085 | 3.036 |
| ma.L5 | -0.0177 | 0.970 | -0.018 | 0.985 | -1.918 | 1.883 |
| ma.L6 | -0.0197 | 0.420 | -0.047 | 0.963 | -0.844 | 0.804 |
| ar.S.L7 | -0.2623 | 0.231 | -1.135 | 0.256 | -0.715 | 0.190 |
| ma.S.L7 | -0.6646 | 0.232 | -2.868 | 0.004 | -1.119 | -0.210 |
| sigma2 | 3837.1275 | 5.99e+04 | 0.064 | 0.949 | -1.14e+05 | 1.21e+05 |
| Ljung-Box (Q): | 18.96 | Jarque-Bera (JB): | 24679.04 |
|---|---|---|---|
| Prob(Q): | 1.00 | Prob(JB): | 0.00 |
| Heteroskedasticity (H): | 18.05 | Skew: | 6.31 |
| Prob(H) (two-sided): | 0.00 | Kurtosis: | 62.52 |
title = 'SARIMA order={}, seasonal_order={}'.format(order, seasonal_order)
fig1, fig2, fig3 = make_fig(ts, train_pred, test_pred, test_pred_ci, result, title)
iplot(fig1, image=save_image, filename='SARIMA1-1')
if save_image:
fig2.savefig('SARIMA1-2')
fig3.savefig('SARIMA1-3')
n = 40
fig, (ax1, ax2) = plt.subplots(nrows=2, sharex=True)
sm.graphics.tsa.plot_acf(result.resid, lags=n, ax=ax1)
sm.graphics.tsa.plot_pacf(result.resid, lags=n, ax=ax2)
if save_image:
fig.savefig('残差1')
p_range = range(0, 6)
d_range = range(0, 4)
q_range = range(0, 8)
sp_range = range(0, 2)
sd_range = range(0, 2)
sq_range = range(0, 2)
history = pd.DataFrame()
for p, d, q, sp, sd, sq in itertools.product(p_range, d_range, q_range, sp_range, sd_range, sq_range):
if p == 0 and q == 0:
continue
model = SARIMAX(
ts_train,
order=(p, d, q),
seasonal_order=(sp, sd, sq, 7),
enforce_stationarity=False,
enforce_invertibility=False)
result = model.fit()
train_pred = result.predict()
test_pred = result.forecast(len(df_test))
history = history.append(
pd.Series({
'p': p,
'd': d,
'q': q,
'sp': sp,
'sd': sd,
'sq': sq,
'aic': result.aic,
'bic': result.bic,
'train-rmse': np.sqrt(mean_squared_error(ts_train, train_pred)),
'test-rmse': np.sqrt(mean_squared_error(ts_test, test_pred))
}),
ignore_index=True)
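With roughly 1,500 parameter combinations, some fits may fail to converge or raise numerical errors; a guarded variant of the fitting step inside the loop, as a sketch:

# Guarded fit (sketch): skip combinations that error out instead of aborting the search.
try:
    result = model.fit(disp=False)  # disp=False silences optimizer output
except Exception:
    continue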
history.sort_values(by='aic').head()
| | aic | bic | d | p | q | sd | sp | sq | test-rmse | train-rmse |
|---|---|---|---|---|---|---|---|---|---|---|
| 607 | 1788.577241 | 1827.025322 | 1.0 | 2.0 | 7.0 | 1.0 | 1.0 | 1.0 | 213.233196 | 56.551452 |
| 859 | 1788.745876 | 1827.193957 | 1.0 | 3.0 | 7.0 | 1.0 | 0.0 | 1.0 | 113.702525 | 56.803243 |
| 1115 | 1789.378411 | 1831.030497 | 1.0 | 4.0 | 7.0 | 1.0 | 0.0 | 1.0 | 135.805266 | 56.125318 |
| 1119 | 1789.942875 | 1834.798969 | 1.0 | 4.0 | 7.0 | 1.0 | 1.0 | 1.0 | 157.814221 | 56.008527 |
| 111 | 1790.310009 | 1822.350076 | 1.0 | 0.0 | 7.0 | 1.0 | 1.0 | 1.0 | 156.144135 | 58.798796 |
history.sort_values(by='train-rmse').head()
| | aic | bic | d | p | q | sd | sp | sq | test-rmse | train-rmse |
|---|---|---|---|---|---|---|---|---|---|---|
| 1113 | 1859.642120 | 1901.294207 | 1.0 | 4.0 | 7.0 | 0.0 | 0.0 | 1.0 | 113.475712 | 53.923458 |
| 1369 | 1856.598062 | 1901.454155 | 1.0 | 5.0 | 7.0 | 0.0 | 0.0 | 1.0 | 112.341062 | 53.937692 |
| 1373 | 1858.913330 | 1906.973430 | 1.0 | 5.0 | 7.0 | 0.0 | 1.0 | 1.0 | 115.611538 | 53.990284 |
| 1361 | 1869.243645 | 1910.895732 | 1.0 | 5.0 | 6.0 | 0.0 | 0.0 | 1.0 | 112.119264 | 53.996537 |
| 1372 | 1886.285779 | 1931.141873 | 1.0 | 5.0 | 7.0 | 0.0 | 1.0 | 0.0 | 126.945872 | 54.306539 |
history.sort_values(by='test-rmse').head()
| | aic | bic | d | p | q | sd | sp | sq | test-rmse | train-rmse |
|---|---|---|---|---|---|---|---|---|---|---|
| 226 | 2033.558278 | 2039.966291 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 59.381694 | 80.715324 |
| 234 | 2024.857090 | 2034.469110 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 59.520844 | 80.706696 |
| 482 | 2024.849634 | 2034.461654 | 0.0 | 2.0 | 0.0 | 1.0 | 0.0 | 0.0 | 59.561665 | 80.705050 |
| 554 | 2018.883701 | 2031.699728 | 1.0 | 2.0 | 1.0 | 1.0 | 0.0 | 0.0 | 61.536567 | 80.813581 |
| 306 | 2008.047838 | 2020.863864 | 1.0 | 1.0 | 2.0 | 1.0 | 0.0 | 0.0 | 62.031098 | 80.816687 |
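The best-AIC and best-test-RMSE configurations differ sharply. As a sketch reusing the helpers defined above, the top row here, order (1, 0, 0) with seasonal_order (0, 1, 0, 7), can be refit the same way; keep in mind that selecting hyperparameters by test RMSE leaks test information into the model choice:

# Sketch: refit the configuration with the lowest test RMSE from the grid search.
result_alt = fit_model(ts_train, (1, 0, 0), (0, 1, 0, 7))
_ = eval_model(ts_train, ts_test, result_alt)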
# Refit with the lowest-AIC combination from the grid search
order = (2, 1, 7)
seasonal_order = (1, 1, 1, 7)
result = fit_model(ts_train, order, seasonal_order)
train_pred, test_pred, test_pred_ci = eval_model(ts_train, ts_test, result)
result.summary()
RMSE(train):	56.551
RMSE(test):	213.23
| Dep. Variable: | pv | No. Observations: | 182 |
|---|---|---|---|
| Model: | SARIMAX(2, 1, 7)x(1, 1, 1, 7) | Log Likelihood | -882.289 |
| Date: | Sun, 06 May 2018 | AIC | 1788.577 |
| Time: | 21:31:03 | BIC | 1827.025 |
| Sample: | 10-01-2017 | HQIC | 1804.164 |
| | - 03-31-2018 | | |
| Covariance Type: | opg | | |
| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| ar.L1 | -0.7341 | 0.602 | -1.220 | 0.223 | -1.913 | 0.445 |
| ar.L2 | -0.6721 | 0.353 | -1.902 | 0.057 | -1.365 | 0.021 |
| ma.L1 | 0.3005 | 0.646 | 0.465 | 0.642 | -0.965 | 1.566 |
| ma.L2 | 0.1380 | 0.287 | 0.482 | 0.630 | -0.424 | 0.699 |
| ma.L3 | -0.7480 | 0.381 | -1.961 | 0.050 | -1.496 | -0.000 |
| ma.L4 | -0.2445 | 0.425 | -0.575 | 0.565 | -1.077 | 0.588 |
| ma.L5 | -0.1649 | 0.322 | -0.512 | 0.609 | -0.796 | 0.466 |
| ma.L6 | -0.1059 | 0.260 | -0.408 | 0.684 | -0.615 | 0.404 |
| ma.L7 | 0.1592 | 0.357 | 0.446 | 0.656 | -0.540 | 0.859 |
| ar.S.L7 | -0.6176 | 0.249 | -2.484 | 0.013 | -1.105 | -0.130 |
| ma.S.L7 | -0.6251 | 0.271 | -2.309 | 0.021 | -1.156 | -0.095 |
| sigma2 | 4736.6494 | 822.241 | 5.761 | 0.000 | 3125.087 | 6348.212 |
| Ljung-Box (Q): | 17.50 | Jarque-Bera (JB): | 14827.61 |
|---|---|---|---|
| Prob(Q): | 1.00 | Prob(JB): | 0.00 |
| Heteroskedasticity (H): | 14.90 | Skew: | 5.27 |
| Prob(H) (two-sided): | 0.00 | Kurtosis: | 49.12 |
title = 'SARIMA order={}, seasonal_order={}'.format(order, seasonal_order)
fig1, fig2, fig3 = make_fig(ts, train_pred, test_pred, test_pred_ci, result, title)
iplot(fig1, image=save_image, filename='SARIMA2-1')
if save_image:
fig2.savefig('SARIMA2-2')
fig3.savefig('SARIMA2-3')
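The residual correlogram check applied after the first fit is worth repeating for this model; the same block, with a hypothetical output name mirroring 残差1:

n = 40
fig, (ax1, ax2) = plt.subplots(nrows=2, sharex=True)
sm.graphics.tsa.plot_acf(result.resid, lags=n, ax=ax1)
sm.graphics.tsa.plot_pacf(result.resid, lags=n, ax=ax2)
if save_image:
    fig.savefig('残差2')  # hypothetical filename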