Using Google Trends Index and High-frequency State-level Data to Predict National Unemployment Rates¶

A common issue with macro-economic data is that it is reported at low frequencies, generally quarterly: GDP and unemployment data are the most prominent examples of such macroeconomic indicators. However, policy makers, financial institutions and investors are often interested in having more frequent data on these indicators to be able to make more informed decisions in their respective fields.

In this exercise, I will focus on unemployment rates specifically. Previous literature has forecasted unemployment rates using either structural models (such as DSGE models), or non-structural ones which make use of some econometric tools (such as VAR models) to create predictive models. In a 2017 paper, CH17 use multiple neural network architectures to predict civilian unemployment rates in the US using solely the unemployment rate series. The neural networks that CH17 use are LSTM, Convolutional, Fully Connected and Encoder-Decoder.

Instead, I use a simpler architecture, Multi-Layer Perceptron, and expand their research by using the following data:

State-level monthly unemployment data from the Bureau of Labour Statistics. This data is retrieved using the Quandl API, and is expressed in percentages.
Google Trends Index for unemployment-related keywords. This data is expressed as a number from 1 to 100, where 100 represents the popularity of the keyword relative to other word searches within a period and a given location.

The Google trends method for predicting unemployment is inspired by the The Research Institute of the Finnish Economy, which uses unemployment-related keywords to make nowcasts of European unemployment rates (KU16).

To test whether the Google trends data truly has any predictive power, I first run a model with state-level unemployment data which predicts the US national unemployment rate. Then, I use a model which includes Google trends index results for every US state. Lastly, in model 3 I include present and past months of state unemployment and Google Index data to make predictions.

Findings:

Using Google Trends data reduces the mean squared error (MSE) at every forecast horizon (1-, 2-, and 3- month).
Adding past and present values of state unemployment and Google Trends data to the model considerably improves the prediction.

References¶

[CH17] Thomas R. Cook and Aaron Smalter Hall (2017) Macroeconomic Indicator Forecasting with Deep Neural Networks.

[KU16] Tuhkuri, Joonas (2016) ETLAnow: A Model for Forecasting with Big Data – Forecasting Unemployment with Google Searches in Europe.

Download and Clean Data¶

Unemployment Data¶

First, I use Quandl's API to download monthly unadjusted unemployment rate data for each US state since 2010. I choose January 2010 as my starting data because Quandl's access to Bureau of Labour Statistics data is restricted to 2010-present.

In [0]:

%%capture
!pip install quandl 
import quandl 

In [0]:

# Get US national monthly unadjusted unemployment rate data 
quandl.ApiConfig.api_key = '2BYtKNrxRuVRhzkA9_2s'
unrate_us = quandl.get('FRED/UNRATE', start_date='2010-01-01', end_date='2019-04-01')
unrate_us.columns = ["US"]
unrate_us.index = unrate_us.index.strftime("%Y-%m")

In [0]:

# Second, we get the monthly state unemployment data 
codes_states = ['01', '02', '04', '05', '06', '08', '09', '10', '11', '12',
              '13', '15', '16', '17', '18', '19', '20', '21', '22', '23',
              '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', 
              '34', '35', '36', '37', '38', '39', '40', '41', '42', '44', 
              '45', '46', '47', '48', '49', '50', '51', '53', '54', '55', 
               '56']
list_states = ("Alabama, Alaska, Arizona, Arkansas, California, "
       "Colorado, Connecticut, Delaware, District of Columbia, "
       "Florida, Georgia, Hawaii, Idaho, Illinois, Indiana, Iowa, "
       "Kansas, Kentucky, Louisiana, Maine, Maryland, Massachusetts, "
       "Michigan, Minnesota, Mississippi, Missouri, Montana, Nebraska, "
       "Nevada, New Hampshire, New Jersey, New Mexico, New York, North Carolina, "
       "North Dakota, Ohio, Oklahoma, Oregon, Pennsylvania, Rhode Island, "
       "South Carolina, South Dakota, Tennessee, Texas, Utah, Vermont, Virginia, "
       "Washington, West Virginia, Wisconsin, Wyoming").split(", ")

api_codes = ['BLSE/LAUST{}0000000000003'.format(x) for x in codes_states]
quandl.ApiConfig.api_key = '2BYtKNrxRuVRhzkA9_2s'
unrate_states = quandl.get(api_codes, start_date='2004-01-01', end_date='2019-04-01')
unrate_states.columns = list_states
unrate_states.index = unrate_states.index.strftime("%Y-%m")

In [6]:

# Merge the two datasets
unrate = unrate_us.merge(unrate_states, left_index = True, right_index = True)
unrate.tail()

Out[6]:

	US	Alabama	Alaska	Arizona	Arkansas	California	Colorado	Connecticut	Delaware	District of Columbia	Florida	Georgia	Hawaii	Idaho	Illinois	Indiana	Iowa	Kansas	Kentucky	Louisiana	Maine	Maryland	Massachusetts	Michigan	Minnesota	Mississippi	Missouri	Montana	Nebraska	Nevada	New Hampshire	New Jersey	New Mexico	New York	North Carolina	North Dakota	Ohio	Oklahoma	Oregon	Pennsylvania	Rhode Island	South Carolina	South Dakota	Tennessee	Texas	Utah	Vermont	Virginia	Washington	West Virginia	Wisconsin	Wyoming
2018-11	3.7	3.3	6.4	4.7	3.2	3.9	3.5	3.1	3.0	5.1	3.3	3.4	2.7	2.7	4.0	3.3	1.9	3.0	3.5	4.3	3.3	3.4	2.6	3.4	2.3	4.2	2.5	3.6	2.3	4.2	2.2	3.3	4.8	3.5	3.6	2.1	4.2	2.9	4.0	3.6	3.9	3.0	2.8	3.0	3.5	2.8	2.4	2.5	4.4	4.6	2.5	4.0
2018-12	3.9	3.5	6.5	4.9	3.6	4.1	3.8	3.2	3.1	4.9	3.3	3.7	2.3	2.9	4.4	3.4	2.5	3.1	3.8	4.4	3.4	3.5	2.7	4.1	3.2	4.7	3.1	3.9	2.7	4.3	2.1	3.6	4.8	3.9	3.7	2.4	4.8	3.0	4.3	3.9	3.8	3.3	3.0	3.0	3.6	2.9	2.3	2.6	4.8	4.9	2.8	4.0
2019-01	4.0	4.4	7.4	5.7	4.5	4.8	4.2	4.7	3.7	6.0	3.9	4.5	2.9	4.1	5.2	4.1	3.3	3.8	4.5	5.2	4.3	4.3	3.6	4.6	4.3	5.2	3.8	5.0	3.1	4.6	2.9	4.6	5.5	4.6	4.5	3.0	5.4	3.8	5.0	4.2	4.9	3.6	3.3	3.7	4.2	3.2	2.8	3.2	5.3	6.0	3.5	4.5
2019-02	3.8	4.0	7.5	4.7	4.3	4.4	3.5	4.4	3.4	6.0	3.4	3.9	2.8	3.7	4.7	4.1	2.7	3.7	4.3	4.4	3.9	4.1	3.2	4.1	4.0	4.8	3.7	4.4	3.1	4.1	2.9	4.7	5.2	4.4	4.2	3.0	4.7	3.5	4.7	4.2	4.3	3.3	3.3	3.2	3.9	3.1	2.5	3.2	5.2	6.1	3.3	3.8
2019-03	3.8	3.8	7.2	4.6	3.4	4.6	3.0	4.2	3.2	5.6	3.3	3.7	2.8	3.4	4.5	3.9	2.8	3.7	4.4	3.9	3.8	3.9	3.1	4.4	4.1	4.8	3.9	4.1	3.2	3.7	2.8	4.1	4.6	4.1	4.1	3.0	4.1	3.2	4.6	3.9	3.7	3.3	3.1	3.3	3.5	3.0	2.3	3.0	5.2	5.3	3.3	3.8

Google Trends Data¶

In [0]:

# API library retrieved from: https://github.com/GeneralMills/pytrends
%%capture
!pip install pytrends # --upgrade #--user git+https://github.com/GeneralMills/pytrends
from pytrends.request import TrendReq
pytrends = TrendReq(hl='en-US', tz=360)

In [0]:

# I use the top 5 most popular terms (extracted from the list of 13 most
# searched Google phrases as shows in Tuhkuri(2016) )

keywords = ('unemployment benefits, unemployment office, '
            'unemployment compensation, unemployment insurance, '
            'unemployment eligibility').split(', ')

pytrends.build_payload(kw_list = keywords, geo = 'US', timeframe = '2010-01-01 2019-01-31')
us = pytrends.interest_over_time()
us_mean = us.mean(axis = 1)
us_mean = us_mean.to_frame(name = 'US')

In [9]:

import matplotlib.pyplot as plt
plt.style.use('ggplot')

fig, ax = plt.subplots(nrows = 1, ncols = 2, figsize = (10,5), sharey = True)
us.plot(ax = ax[0])
us_mean.plot(ax = ax[1], label = 'Mean Index', color = 'cornflowerblue')
ax[0].legend(['Unemployment Benefits', 'Unemployment Office', 'Unemployment Compensation',
              'Unemployment Insurance', 'Unemployment Eligibility'])
ax[0].set_ylabel('Google Trends Index')
ax[0].set_xlabel('Year')
ax[1].legend(['Mean Index'])
ax[1].set_xlabel('Year')

plt.tight_layout()

In [0]:

# Now, I use the mean index for each US state, and merge the time series into one dataset

state_codes = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE", "FL", "GA", 
               "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", 
               "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", 
               "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", 
               "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]
state_list = ['US-{}'.format(x) for x in state_codes]

google_index = us_mean 

for i,j in enumerate(state_list):
  pytrends.build_payload(kw_list = keywords, geo = state_list[i], timeframe = '2010-01-01 2019-01-31')
  state = pytrends.interest_over_time()
  state_mean = state.mean(axis = 1)
  state_mean = state_mean.to_frame(name = j)
  google_index = google_index.merge(state_mean, right_index = True, left_index = True)

google_index.columns = ["US"] + list_states 

In [11]:

# Merge the google index data to the unemployment data 
import pandas as pd
pd.options.display.max_columns = None

data = unrate.merge(google_index, left_index = True, right_index = True)
data.head()

Out[11]:

	US_x	Alabama_x	Alaska_x	Arizona_x	Arkansas_x	California_x	Colorado_x	Connecticut_x	Delaware_x	District of Columbia_x	Florida_x	Georgia_x	Hawaii_x	Idaho_x	Illinois_x	Indiana_x	Iowa_x	Kansas_x	Kentucky_x	Louisiana_x	Maine_x	Maryland_x	Massachusetts_x	Michigan_x	Minnesota_x	Mississippi_x	Missouri_x	Montana_x	Nebraska_x	Nevada_x	New Hampshire_x	New Jersey_x	New Mexico_x	New York_x	North Carolina_x	North Dakota_x	Ohio_x	Oklahoma_x	Oregon_x	Pennsylvania_x	Rhode Island_x	South Carolina_x	South Dakota_x	Tennessee_x	Texas_x	Utah_x	Vermont_x	Virginia_x	Washington_x	West Virginia_x	Wisconsin_x	Wyoming_x	US_y	Alabama_y	Alaska_y	Arizona_y	Arkansas_y	California_y	Colorado_y	Connecticut_y	Delaware_y	District of Columbia_y	Florida_y	Georgia_y	Hawaii_y	Idaho_y	Illinois_y	Indiana_y	Iowa_y	Kansas_y	Kentucky_y	Louisiana_y	Maine_y	Maryland_y	Massachusetts_y	Michigan_y	Minnesota_y	Mississippi_y	Missouri_y	Montana_y	Nebraska_y	Nevada_y	New Hampshire_y	New Jersey_y	New Mexico_y	New York_y	North Carolina_y	North Dakota_y	Ohio_y	Oklahoma_y	Oregon_y	Pennsylvania_y	Rhode Island_y	South Carolina_y	South Dakota_y	Tennessee_y	Texas_y	Utah_y	Vermont_y	Virginia_y	Washington_y	West Virginia_y	Wisconsin_y	Wyoming_y
2010-01-01	9.8	12.3	8.9	11.1	9.0	12.7	8.9	9.5	9.3	10.2	11.5	10.9	7.2	10.5	12.2	11.8	7.3	7.7	11.5	8.2	9.2	8.1	9.6	14.4	8.5	11.7	10.3	8.4	5.4	13.5	7.0	10.3	8.1	9.6	12.0	4.8	12.0	7.6	11.9	9.5	12.1	12.4	5.9	11.2	8.6	8.4	7.0	7.9	11.3	9.4	10.1	8.1	24.333333	30.833333	29.500000	17.666667	17.666667	22.333333	23.166667	26.166667	26.500000	17.000000	16.166667	17.500000	22.500000	18.500000	22.666667	18.166667	22.000000	29.666667	25.166667	33.666667	15.666667	24.833333	21.000000	20.166667	27.666667	11.500000	25.000000	33.333333	18.000000	24.000000	21.166667	17.833333	32.166667	20.166667	24.500000	18.166667	22.333333	28.333333	38.166667	27.333333	27.000000	29.333333	31.500000	20.833333	16.833333	25.666667	15.000000	27.000000	26.166667	35.500000	44.333333	25.666667
2010-02-01	9.8	12.1	9.1	10.7	9.0	12.6	8.9	9.5	9.4	9.9	11.2	10.9	6.9	10.4	11.9	11.8	7.1	7.5	11.9	7.2	9.5	8.3	9.2	14.2	8.4	11.3	10.5	8.4	5.3	13.5	6.9	10.3	8.1	9.5	12.0	4.8	11.9	7.6	11.9	9.5	11.9	12.1	5.9	10.8	8.4	8.3	6.9	7.9	11.2	10.0	10.4	7.8	21.833333	26.666667	22.666667	13.166667	12.333333	20.166667	25.166667	22.333333	23.833333	16.666667	15.333333	16.500000	31.166667	17.166667	21.500000	14.000000	28.500000	18.833333	23.166667	30.666667	20.500000	19.666667	16.666667	19.333333	23.500000	12.666667	20.500000	22.666667	18.666667	17.166667	18.166667	17.000000	24.666667	19.500000	24.333333	25.500000	20.666667	26.333333	26.666667	25.166667	15.500000	22.833333	27.500000	17.666667	15.333333	21.833333	18.666667	19.000000	21.500000	26.166667	37.833333	30.166667
2010-03-01	9.9	11.5	8.9	10.4	8.3	12.6	9.1	9.2	9.0	9.5	11.0	10.4	6.9	10.0	11.5	11.6	6.9	7.5	11.0	7.3	9.2	8.1	8.9	14.1	8.6	10.8	10.2	8.4	5.2	13.6	6.7	10.1	8.2	9.1	11.5	4.8	11.4	7.0	11.9	9.2	11.7	11.5	5.9	10.6	8.3	8.4	7.0	7.7	10.9	9.3	10.3	7.6	24.166667	25.166667	14.833333	16.333333	12.500000	20.833333	20.000000	23.833333	25.000000	18.000000	16.666667	18.333333	23.333333	15.500000	25.500000	16.666667	22.500000	26.166667	25.833333	26.166667	17.500000	26.166667	24.166667	23.333333	21.333333	11.833333	19.333333	19.333333	22.833333	14.333333	24.166667	19.500000	14.000000	24.500000	28.000000	15.166667	22.500000	31.500000	27.333333	29.000000	17.666667	24.166667	20.666667	22.833333	17.166667	19.166667	18.833333	21.666667	23.833333	21.333333	30.500000	29.333333
2010-04-01	9.9	10.4	8.3	10.1	7.8	12.0	8.5	8.9	8.5	8.7	10.6	9.9	6.7	9.4	10.5	10.6	5.9	6.9	10.0	7.1	8.5	7.3	8.3	12.9	7.5	9.8	9.3	7.5	4.6	13.4	5.9	9.4	7.9	8.5	10.9	3.8	10.4	6.5	11.0	8.3	11.2	10.9	4.9	9.6	7.9	7.8	6.6	7.0	10.0	8.4	9.0	7.0	21.333333	27.333333	18.666667	13.833333	12.000000	20.000000	20.000000	18.833333	23.000000	11.666667	15.333333	20.000000	20.333333	15.333333	22.166667	13.666667	19.166667	17.833333	22.000000	27.000000	19.333333	24.166667	17.500000	18.666667	16.333333	10.666667	19.500000	28.000000	19.666667	15.166667	17.500000	17.166667	28.833333	19.666667	21.833333	19.166667	20.166667	22.000000	21.333333	23.500000	22.666667	27.000000	19.333333	19.666667	18.333333	20.333333	14.666667	20.833333	19.166667	24.333333	34.333333	15.000000
2010-05-01	9.6	9.9	7.8	10.0	8.0	11.7	8.5	8.9	8.1	8.9	10.5	10.1	6.7	8.3	10.1	10.3	5.6	6.8	10.0	7.6	8.1	7.4	8.2	12.6	7.0	10.4	9.1	6.8	4.5	13.3	5.6	9.2	8.1	8.2	10.9	3.4	9.9	6.8	10.4	8.6	11.2	10.7	4.8	9.3	7.9	7.8	5.9	7.1	9.9	8.3	8.5	6.5	18.166667	26.500000	18.000000	12.833333	14.666667	16.666667	18.000000	19.166667	18.500000	8.333333	15.333333	14.333333	18.166667	9.666667	17.833333	13.000000	17.000000	14.333333	19.666667	26.833333	11.666667	20.166667	13.333333	14.833333	13.666667	12.166667	17.000000	18.000000	11.000000	17.833333	20.666667	13.666667	19.000000	15.500000	20.666667	9.666667	15.833333	27.500000	21.333333	21.000000	9.333333	25.500000	35.166667	17.000000	15.166667	10.666667	12.000000	15.666667	19.833333	19.000000	28.000000	20.500000

It would be interesting to see how correlated Google Index results are correlated with the unemployment rates. If the correlation is positive and strong, there is a good reason to use it as an additional explanatory variable in our model. However, if the index is fully correlated with unemployment, it will not add any more information.

In [12]:

# Graphically

import matplotlib.pyplot as plt
fig, ax = plt.subplots()
plt.style.use('ggplot')

ax2 = ax.twinx()


data.plot(kind = 'line', y = 'US_x', ax = ax, color = 'red', label='US Unemployment Rate (Left)')
data.plot(kind = 'line', y = 'US_y', ax = ax2, color = 'cornflowerblue', label='Google Trends Index (Right)')

ax.set_ylabel('Unemployment Rate (%)')
ax2.set_ylabel('Google Trends Index (out of 100)')

ax2.grid(None)

'''Legend'''
lines, labels = ax.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax2.get_legend().remove()
ax.get_legend().remove()
plt.legend(lines + lines2, labels + labels2, loc = 'right')

Out[12]:

<matplotlib.legend.Legend at 0x7f1c4867b898>

In [13]:

# Numerically

corr_vector = []
for i in list_states:
  corr = data['{}_x'.format(i)].corr(data['{}_y'.format(i)])
  corr_vector.append(corr)

corr = pd.Series(corr_vector).to_frame(name = 'Correlation')
c = corr.mean(axis = 0)
print('Mean Correlation between Google Trends Index and State-level Unemployment Rates is: \n', c) 

Mean Correlation between Google Trends Index and State-level Unemployment Rates is: 
 Correlation    0.762996
dtype: float64

Models¶

The outcome variable of our models will always be US national unemployment rate. All models use the Multi-Layer Perceptron network.

Model 1 uses current values of US state unemployment to forecast US national unemployment in the next months
Model 2 uses current values of US state unemployment and Google Index values to forecast US national unemployment in the next months
Model 3 uses current and past values of US state unemployment rate and Google Index values to forecast US national unemployment in the next months

Choice of parameters:

Number of layers: I chose 2 layers (1 hidden) because this simple neural network architecture can be a universal approximator for any given function in $R^n$.
Number of inputs: The number of inputs per layer is not very important as long as it is large enough. I tested numbers above 50 and the results are unchanged. Only when choosing values closer to 10 or 20 is there any clear deterioration in the MSE results.
Alpha penalization parameter: Initially, this parameter was set at 0.001, which lead to very low MSE on the train data (over-fitting). Every time I multiplied it by a factor of 10, the MSE train improved and it stopped improving for alpha's greater than 1. An interesting outcome of increasing the alpha is that the MSE of Models 1 and 2 started converging.
Activation function: I chose the logistic function as it is part of the standard sigmoid function family.
Max Iterations: Initially, I chose 100 iterations, then varied the number up to 10,000. There is no significant improvement after 1,000 iterations.
Solver: The LBFGS algorithm works better for small datasets.

Model 1¶

In [0]:

import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn import neural_network
from sklearn import metrics

In [0]:

def model1(lag):
  
  df = data.copy() 
  
  '''Forecast Horizon''' 
  # First, choose the forecast period (1, 2, or 3 months ahead), and lag the US  
  # national unemployment rate by that period

  df['US_x'] = df['US_x'].shift(-lag)
   
  '''Scaling the data'''
  # Notice that I apply the Scaler to the Train and Test data separately 
  scaler = StandardScaler()

  '''Prepare the train data'''
  split_date = pd.Timestamp('2017-01-01')
  train = df.loc[ :split_date] 
  train_clean = train.drop(['US_y']+['{}_y'.format(x) for x in list_states], axis = 1).copy()
  train_sc = scaler.fit_transform(train_clean)
  train_sc = pd.DataFrame(train_sc)
  train_sc['date'] = train.index
  train_sc.set_index('date', inplace = True)
  
  X_train = train_sc.drop([0], axis = 1).copy() 
  y_train = train_sc.drop(list(range(1,52)), axis = 1).copy()

  '''Prepare the test data'''
  test = df.loc[split_date: ]
  test_clean = test.drop(['US_y']+['{}_y'.format(x) for x in list_states], axis = 1).copy()
  test_sc = scaler.fit_transform(test_clean)
  test_sc = pd.DataFrame(test_sc)
  test_sc['date'] = test.index
  test_sc.set_index('date', inplace = True)
  test_sc = test_sc[:-lag]

  X_test = test_sc.drop([0], axis = 1).copy()
  y_test = test_sc.drop(list(range(1,52)), axis = 1).copy()

  '''Parameters for Neural Network'''
  _alpha = 1
  _activation = "logistic"
  _max_iter = 10000
  _solver = 'lbfgs'

  '''Fitting the MLP model'''
  nn_model = neural_network.MLPRegressor((50, 50), activation = _activation, 
                                         solver = _solver, alpha = _alpha,  max_iter = _max_iter)
  
  return (nn_model, X_train, y_train, X_test, y_test)

In [0]:

def model1_mse(lag):
  
  nn_model, X_train, y_train, X_test, y_test = model1(lag)
  nn_model.fit(X_train, y_train)
  
  '''View MSE for Train and Test data'''
  mse_train = metrics.mean_squared_error(y_train, nn_model.predict(X_train))
  mse_test = metrics.mean_squared_error(y_test, nn_model.predict(X_test))
  
  return [mse_train, mse_test]

def model1_plt(lag):
  
  nn_model, X_train, y_train, X_test, y_test = model1(lag)
  nn_model.fit(X_train, y_train)
  
  '''Plot the True and Predicted values in the Test data'''
  nn_y_pred_test = nn_model.predict(X_test)
  nn_y_pred_test = pd.DataFrame(nn_y_pred_test)
   
  nn_y_pred_test['date'] = y_test.index
  nn_y_pred_test.set_index('date', inplace = True)
 
  plt.plot(y_test.shift(periods = lag), label = 'True', color = 'red')
  plt.plot(nn_y_pred_test.shift(periods = lag), label = '{}-Month Forecast'.format(lag), 
          color = 'mediumturquoise')

Model 2¶

In [0]:

def model2(lag):
  
  df = data.copy() 
  
  '''Forecast Horizon''' 
  # First, choose the forecast period (1, 2, or 3 months ahead), and lag the US  
  # national unemployment rate by that period

  df['US_x'] = df['US_x'].shift(-lag)
   
  '''Scaling the data'''
  # Notice that I apply the Scaler to the Train and Test data separately 
  scaler = StandardScaler()

  '''Prepare the train data'''
  split_date = pd.Timestamp('2017-01-01')
  train = df.loc[ :split_date] 
  train_clean = train.drop(['US_y'], axis = 1).copy()
  train_sc = scaler.fit_transform(train_clean)
  train_sc = pd.DataFrame(train_sc)
  train_sc['date'] = train.index
  train_sc.set_index('date', inplace = True)

  X_train = train_sc.drop([0], axis = 1).copy() 
  y_train = train_sc.drop(list(range(1,103)), axis = 1).copy()

  '''Prepare the test data'''
  test = df.loc[split_date: ]
  test_clean = test.drop(['US_y'], axis = 1).copy()
  test_sc = scaler.fit_transform(test_clean)
  test_sc = pd.DataFrame(test_sc)
  test_sc['date'] = test.index
  test_sc.set_index('date', inplace = True)
  test_sc = test_sc[:-lag]

  X_test = test_sc.drop([0], axis = 1).copy()
  y_test = test_sc.drop(list(range(1,103)), axis = 1).copy()

  '''Parameters for Neural Network'''
  _alpha = 1
  _activation = "logistic"
  _max_iter = 10000 
  _solver = 'lbfgs'

  '''Fitting the MLP model'''
  nn_model = neural_network.MLPRegressor((50, 50), activation = _activation, 
                                         solver = _solver, alpha = _alpha,  max_iter = _max_iter)
  
  return (nn_model, X_train, y_train, X_test, y_test)

In [0]:

def model2_mse(lag):
  
  nn_model, X_train, y_train, X_test, y_test = model2(lag)
  nn_model.fit(X_train, y_train)
  
  '''View MSE for Train and Test data'''
  mse_train = metrics.mean_squared_error(y_train, nn_model.predict(X_train))
  mse_test = metrics.mean_squared_error(y_test, nn_model.predict(X_test))
  
  return [mse_train, mse_test]

def model2_plt(lag):
  
  nn_model, X_train, y_train, X_test, y_test = model2(lag)
  nn_model.fit(X_train, y_train)
  
  '''Plot the True and Predicted values in the Test data'''
  nn_y_pred_test = nn_model.predict(X_test)
  nn_y_pred_test = pd.DataFrame(nn_y_pred_test)
   
  nn_y_pred_test['date'] = y_test.index
  nn_y_pred_test.set_index('date', inplace = True)
 
  plt.plot(nn_y_pred_test.shift(periods = lag), label = '{}-Month Forecast - Google Index'.format(lag), 
          color = 'cornflowerblue')

Model 3¶

In [0]:

def model3(lag):
  
  df = data.copy() 
  df = df.drop(['US_y'], axis = 1)

  '''Create data for t-1 and t-2 '''
  df1 = df.shift(1)
  name_col = df1.columns 
  name_col = [s + "_1" for s in name_col]
  df1.columns = name_col
  
  df2 = df.shift(2)
  name_col = df2.columns 
  name_col = [s + "_2" for s in name_col]
  df2.columns = name_col
  
  df = pd.concat([df, df1, df2], axis=1)
  
  '''Forecast Horizon''' 
  # First, choose the forecast period (1, 2, or 3 months ahead), and lag the US  
  # national unemployment rate by that period
  
  df['US_x'] = df['US_x'].shift(-lag)
  df.dropna(axis = 0, inplace = True)
  
  '''Scaling the data'''
  # Notice that I apply the Scaler to the Train and Test data separately 
  scaler = StandardScaler()
  
  '''Prepare the train data'''
  split_date = pd.Timestamp('2017-01-01')
  train = df.loc[ :split_date] 
  train_sc = scaler.fit_transform(train)
  train_sc = pd.DataFrame(train_sc)
  train_sc['date'] = train.index
  train_sc.set_index('date', inplace = True)

  X_train = train_sc.drop([0], axis = 1).copy() 
  y_train = train_sc.drop(list(range(1,309)), axis = 1).copy()

  '''Prepare the test data'''
  test = df.loc[split_date: ]
  test_sc = scaler.fit_transform(test)
  test_sc = pd.DataFrame(test_sc)
  test_sc['date'] = test.index
  test_sc.set_index('date', inplace = True)
  test_sc = test_sc[:-lag]

  X_test = test_sc.drop([0], axis = 1).copy()
  y_test = test_sc.drop(list(range(1,309)), axis = 1).copy()

  '''Parameters for Neural Network'''
  _alpha = 1
  _activation = "logistic"
  _max_iter = 10000 
  _solver = 'lbfgs'

  '''Fitting the MLP model'''
  nn_model = neural_network.MLPRegressor((50, 50), activation = _activation, 
                                         solver = _solver, alpha = _alpha,  max_iter = _max_iter)
  
  return (nn_model, X_train, y_train, X_test, y_test)

In [0]:

def model3_mse(lag):
  
  nn_model, X_train, y_train, X_test, y_test = model3(lag)
  nn_model.fit(X_train, y_train)
  
  '''View MSE for Train and Test data'''
  mse_train = metrics.mean_squared_error(y_train, nn_model.predict(X_train))
  mse_test = metrics.mean_squared_error(y_test, nn_model.predict(X_test))
  
  return [mse_train, mse_test]

def model3_plt(lag):
  
  nn_model, X_train, y_train, X_test, y_test = model3(lag)
  nn_model.fit(X_train, y_train)
  
  '''Plot the True and Predicted values in the Test data'''
  nn_y_pred_test = nn_model.predict(X_test)
  nn_y_pred_test = pd.DataFrame(nn_y_pred_test)
   
  nn_y_pred_test['date'] = y_test.index
  nn_y_pred_test.set_index('date', inplace = True)
 
  plt.plot(nn_y_pred_test.shift(periods = lag), label = '{}-Month Forecast - Google Index - Past Values'.format(lag), 
          color = 'tab:purple')

MSE Comparison¶

In [21]:

import warnings
warnings.filterwarnings('ignore')

## MSE results 
 
mse_list1 = []
for _lag in range(1,4):
  mse = model1_mse(lag = _lag)
  mse_list1.append(mse)

mse_list1 = pd.DataFrame(mse_list1)
mse_list1.index = ['1-month forecast', '2-month forecast', '3-month forecast']
mse_list1.columns = ['MSE Train', 'MSE Test']
mse_list1

mse_list2 = []
for _lag in range(1,4):
  mse = model2_mse(lag = _lag)
  mse_list2.append(mse)

mse_list2 = pd.DataFrame(mse_list2)
mse_list2.index = ['1-month forecast', '2-month forecast', '3-month forecast']
mse_list2.columns = ['MSE Train', 'MSE Test']

mse_list3 = []
for _lag in range(1,4):
  mse = model3_mse(lag = _lag)
  mse_list3.append(mse)

mse_list3 = pd.DataFrame(mse_list3)
mse_list3.index = ['1-month forecast', '2-month forecast', '3-month forecast']
mse_list3.columns = ['MSE Train', 'MSE Test']

mse = mse_list1.merge(mse_list2, left_index = True, right_index = True)
mse = mse.merge(mse_list3, left_index = True, right_index = True)

models = ['Model1', 'Model2', 'Model3']
mse.columns = pd.MultiIndex.from_product([models, mse_list1.columns])
mse

Out[21]:

	Model1		Model2		Model3
	MSE Train	MSE Test	MSE Train	MSE Test	MSE Train	MSE Test
1-month forecast	0.022589	0.491333	0.018063	0.442910	0.008238	0.386611
2-month forecast	0.025131	0.724105	0.018332	0.652488	0.008118	0.476660
3-month forecast	0.025455	0.824748	0.019124	0.746085	0.008547	0.379550

Graphic Comparison¶

In [22]:

## One-month forecasts

import matplotlib.pyplot as plt
%matplotlib inline

fig, ax = plt.subplots(figsize = (12,6))
model1_plt(lag = 1)
model2_plt(lag = 1)
model3_plt(lag = 1)
ax.legend()
ax.set_ylabel('Unemployment Rate (%)')


ax.set_yticklabels(list(np.round(np.linspace(3.7, 4.7, 10), decimals = 1)))
ax.spines['right'].set_visible(False)

In [23]:

## Three-month forecasts

import matplotlib.pyplot as plt
%matplotlib inline

fig, ax = plt.subplots(figsize = (12,6))
model1_plt(lag = 3)
model2_plt(lag = 3)
model3_plt(lag = 3)

ax.legend()
ax.set_ylabel('Unemployment Rate (%)')

ax.set_yticklabels(list(np.round(np.linspace(3.6, 4.4, 8), decimals = 1)))
ax.spines['right'].set_visible(False)

Conclusion¶

Model 2 is an improvement on Model 1, which points to the usefulness of Google Index data in forecating short-term unemployment rates.

Model 3 is a leap from the previous two models, showing that historical data is very useful in MLP networks that do not have memory features.

A potential area of improvement would be to use an LSTM model which has 'memory' of previous patterns and is more suitable to time series data.