This tutorial we add the ARIMA model to our analysis of the Australian inflation data from a previous tutorial.
Data: Australian CPI inflation
ARIMA model identification
Estimation
Model diagnostics
Model validation
Forecast
This notebook relies on the following imports and settings.
# Packages
import warnings
warnings.filterwarnings("ignore")
import numpy as np
from scipy import stats
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
# Plot settings
sns.set_context('notebook')
sns.set_style('ticks')
red='#D62728'
blue='#1F77B4'
%matplotlib inline
data=pd.read_csv('inflation.csv', index_col='Date', parse_dates=True, dayfirst=True)
data.index=data.index.to_period(freq='Q') # converting the index to quarterly period instead of dates
data=data['01-1980':] # filtering the use data from Jan/1980 onwards
data.tail()
Inflation | |
---|---|
Date | |
2016Q2 | 0.4 |
2016Q3 | 0.7 |
2016Q4 | 0.5 |
2017Q1 | 0.5 |
2017Q2 | 0.2 |
y=data['Inflation']
fig, ax= plt.subplots(figsize=(8,5))
y.plot(color=red)
ax.set_xlabel('')
ax.set_ylabel('Inflation')
ax.set_title('Australian Quarterly CPI Inflation')
ax.set_xticks([], minor=True) # I prefer to remove the minor ticks for a cleaner plot
sns.despine()
plt.show()
The next cell plots the sample autocorrelations (ACF) and partial autocorrelations (PACF) for the entire series. The autocorrelations decay slowly, reflecting the non-stationarity of the series.
fig, ax = plt.subplots(1,2, figsize=(12,5))
sm.graphics.tsa.plot_acf(y, lags=40, ax=ax[0])
sm.graphics.tsa.plot_pacf(y, lags=40, ax=ax[1])
sns.despine()
fig.tight_layout()
plt.show()
We then consider a first difference transform, which seems to lead to a stationary series for this data. The pandas shift method is the tool for obtaining lagged values of a column.
The ACF plot for the first difference series has a cut-off after lag one, while the PACF shows a gradual decrease, suggesting a MA(1) for the data.
diff=y-y.shift(1)
diff=diff.dropna()
fig, ax= plt.subplots(figsize=(8,5))
diff.plot(color=red)
ax.set_xlabel('')
ax.set_title('First differenced series')
ax.set_xticks([], minor=True) # I prefer to remove the minor ticks for a cleaner plot
sns.despine()
plt.show()
fig, ax = plt.subplots(1,2, figsize=(12,5))
sm.graphics.tsa.plot_acf(diff, lags=40, ax=ax[0])
sm.graphics.tsa.plot_pacf(diff, lags=40, ax=ax[1])
sns.despine()
fig.tight_layout()
plt.show()
In the EDA, we saw that the series behaves differently in the post-80s period, so that we can also consider focusing on this part of the dataset only. The need for first differencing is less clear in this case, so that we will consider specifications with and without the transformation
fig, ax = plt.subplots(1,2, figsize=(12,5))
sm.graphics.tsa.plot_acf(y['1991':], lags=40, ax=ax[0])
sm.graphics.tsa.plot_pacf(y['1991':], lags=40, ax=ax[1])
sns.despine()
fig.tight_layout()
plt.show()
With the first difference, the ACF and PACF are again consistent with an MA(1) process.
fig, ax = plt.subplots(1,2, figsize=(12,5))
sm.graphics.tsa.plot_acf(diff['1991':], lags=40, ax=ax[0])
sm.graphics.tsa.plot_pacf(diff['1991':], lags=40, ax=ax[1])
sns.despine()
fig.tight_layout()
plt.show()
We now estimate the ARIMA(0,1,1) model for the full data using the statsmodels package.
arima1 = sm.tsa.ARIMA(y, order=(0, 1, 1)).fit(trend='nc') # trend='nc' option estimates a model without a drift
print(arima1.summary())
ARIMA Model Results ============================================================================== Dep. Variable: D.Inflation No. Observations: 149 Model: ARIMA(0, 1, 1) Log Likelihood -148.146 Method: css-mle S.D. of innovations 0.652 Date: Thu, 19 Oct 2017 AIC 300.293 Time: 08:10:28 BIC 306.301 Sample: 06-30-1980 HQIC 302.734 - 06-30-2017 ===================================================================================== coef std err z P>|z| [0.025 0.975] ------------------------------------------------------------------------------------- ma.L1.D.Inflation -0.7091 0.072 -9.858 0.000 -0.850 -0.568 Roots ============================================================================= Real Imaginary Modulus Frequency ----------------------------------------------------------------------------- MA.1 1.4102 +0.0000j 1.4102 0.0000 -----------------------------------------------------------------------------
Restricting attention to the data since 1991, we consider two specifications: ARIMA(1,0,1) and ARIMA(0,1,1). Note that for the ARIMA(1,0,1) estimation, I remove the first observation to make it comparable to the ARIMA(0,1,1) output (which loses the first observation due to first differencing).
The AIC and BIC criteria select the ARIMA(0,1,1) specification. Using only the 1991-2016 period substantially changes the estimate of the MA(1) coefficient.
arima2 = sm.tsa.ARIMA(y['1991':].iloc[1:], order=(1, 0, 1)) .fit(trend='nc')
print(arima2.summary())
ARMA Model Results ============================================================================== Dep. Variable: Inflation No. Observations: 105 Model: ARMA(1, 1) Log Likelihood -84.753 Method: css-mle S.D. of innovations 0.538 Date: Thu, 19 Oct 2017 AIC 175.506 Time: 08:10:28 BIC 183.468 Sample: 06-30-1991 HQIC 178.732 - 06-30-2017 =================================================================================== coef std err z P>|z| [0.025 0.975] ----------------------------------------------------------------------------------- ar.L1.Inflation 0.9963 0.009 112.880 0.000 0.979 1.014 ma.L1.Inflation -0.9170 0.087 -10.514 0.000 -1.088 -0.746 Roots ============================================================================= Real Imaginary Modulus Frequency ----------------------------------------------------------------------------- AR.1 1.0037 +0.0000j 1.0037 0.0000 MA.1 1.0905 +0.0000j 1.0905 0.0000 -----------------------------------------------------------------------------
arima3 = sm.tsa.ARIMA(y['1991':], order=(0, 1, 1)).fit(trend='nc')
print(arima3.summary())
ARIMA Model Results ============================================================================== Dep. Variable: D.Inflation No. Observations: 105 Model: ARIMA(0, 1, 1) Log Likelihood -85.183 Method: css-mle S.D. of innovations 0.540 Date: Thu, 19 Oct 2017 AIC 174.367 Time: 08:10:28 BIC 179.674 Sample: 06-30-1991 HQIC 176.517 - 06-30-2017 ===================================================================================== coef std err z P>|z| [0.025 0.975] ------------------------------------------------------------------------------------- ma.L1.D.Inflation -0.9190 0.080 -11.556 0.000 -1.075 -0.763 Roots ============================================================================= Real Imaginary Modulus Frequency ----------------------------------------------------------------------------- MA.1 1.0881 +0.0000j 1.0881 0.0000 -----------------------------------------------------------------------------
To obtain the residual series:
resid=arima3.resid
As diagnostics, we plot the residual series, the ACF and PACF plots, and the histogram of the residual data. We find no autocorrelation patterns in the residuals, confirming that the ARIMA(0,1,1) adequately captures the time series dependence in the series. The histogram shows that the residual series is non-Gaussian, so that we should not rely on this assumption for building forecast intervals.
fig, ax= plt.subplots(figsize=(8,5))
resid.plot(color=blue)
ax.set_xlabel('')
ax.set_xticks([], minor=True)
ax.set_title('Residual plot')
sns.despine()
plt.show()
fig, ax = plt.subplots(1,2, figsize=(12,5))
sm.graphics.tsa.plot_acf(resid, lags=40, ax=ax[0])
sm.graphics.tsa.plot_pacf(resid, lags=40, ax=ax[1])
plt.tight_layout()
plt.show()
from forecast import histogram, qq_plot
histogram(resid)
plt.show()
qq_plot(resid)
plt.show()
We implement a real time forecasting exercise to select the best model for the final forecast.
import forecast
# Real time forecasting
validation=y['2004Q1':].index # the validation period is Q1 2004 onwards
start = y.index.get_loc('2004Q1') # numerical index corresponding to Q1 2004
recent = y.index.get_loc('2000Q1')
results=pd.DataFrame(0.0, index=validation, columns=['RW', 'SES', 'Trend corrected',
'ARIMA', 'ARIMA (1991-)','Actual'])
results['Actual'] = y.iloc[start:]
for i in range(start, len(y)):
j=i-start
# random walk forecast
results.iloc[j,0]=y.iloc[i-1]
# simple exponential smoothing
model = forecast.ses(y.iloc[:i])
model.fit()
results.iloc[j,1]= model.forecast(1)[0]
# trend corrrected
model = forecast.holt(y.iloc[:i])
model.fit()
results.iloc[j,2]=model.forecast(1)[0]
# ARIMA
model = sm.tsa.ARIMA(y.iloc[:i], order=(0, 1, 1)).fit()
results.iloc[j,3]=model.forecast()[0][0]
# ARIMA (recent sample)
model = sm.tsa.ARIMA(y.iloc[recent:i], order=(0, 1, 1)).fit()
results.iloc[j,4]= model.forecast()[0][0]
from statlearning import rmse_jack
table = pd.DataFrame(0.0, index=results.columns[:-1], columns=['RMSE','SE'])
for i in range(5):
table.iloc[i,0], table.iloc[i,1] = rmse_jack(results.iloc[:,i], results.iloc[:,-1])
table.round(3)
RMSE | SE | |
---|---|---|
RW | 0.538 | 0.058 |
SES | 0.499 | 0.055 |
Trend corrected | 0.499 | 0.055 |
ARIMA | 0.467 | 0.055 |
ARIMA (1991-) | 0.467 | 0.046 |
We compute the final forecast using the ARIMA(0,1,1) model estimated on the full sample.
h=8
pred, stderr, intv1 = arima1.forecast(steps=h, alpha=0.2)
pred, stderr, intv2 = arima1.forecast(steps=h, alpha=0.1)
pred, stderr, intv3 = arima1.forecast(steps=h, alpha=0.01)
test=pd.period_range(start=y.index[-1]+1, periods=h, freq='Q')
pred=pd.Series(pred, index=test)
intv1=pd.DataFrame(intv1, index=test)
intv2=pd.DataFrame(intv2, index=test)
intv3=pd.DataFrame(intv3, index=test)
fig, ax = forecast.fanchart(y['2012':], pred, intv1, intv2, intv3)
ax.set_xlabel('')
ax.set_xticks([], minor=True)
ax.set_ylabel('Quartely inflation')
sns.despine()
plt.title('ARIMA(0,1,1) CPI inflation forecast')
plt.show()