Concept

The data used here combines various outcomes from firms' financial statements with information about data breaches suffered by those firms. I will split each firm's data at some relative point in time (e.g., 75% of the way through that firm's observations), then use the first section as a training set and the second section as a test set to see how well the models perform.

In [1]:
import pandas as pd
import numpy as np
import warnings
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn import metrics
from sklearn import neural_network
from sklearn import pipeline
from sklearn import model_selection
from sklearn import preprocessing
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor 
warnings.filterwarnings('ignore')
In [2]:
comp = pd.read_csv('data/COMPUSTAT_vars.csv')
comp['datadate'] = pd.to_datetime(comp['datadate'])
comp['breachdate'] = pd.to_datetime(comp['Date.Made.Public'])

This shows the number of firms we have data for.

In [4]:
len(comp['GVKEY'].unique())
Out[4]:
414

Create a time variable that is between 0 and 1 for every firm. Split each firm's timeline at some arbitrary cutoff, then use the first part for training and the second part for validation/testing. Also convert the date of the data breach into this scaled time: the breach date is useful information, but as a non-numeric column it would otherwise be thrown out when non-numeric data is dropped later.
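
As a toy illustration of this scaling (hypothetical dates, not taken from the data):

# A firm observed from 2000-01-01 to 2009-12-31 with a breach on 2007-07-02
start, end = pd.Timestamp('2000-01-01'), pd.Timestamp('2009-12-31')
breach = pd.Timestamp('2007-07-02')
(breach - start) / (end - start)  # 0.75: the breach happens three quarters of the way through the firm's window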

In [5]:
def make_scaled_time(sl):
    start_date = sl['datadate'].min()
    end_date = sl['datadate'].max()
    length = end_date - start_date
    scaled_time = (sl['datadate'] - start_date)/length
    breachdate_scaled = (sl['breachdate'] - start_date)/length
    
    return pd.concat([scaled_time, breachdate_scaled, sl['datadate'], sl['GVKEY']], axis=1)
In [6]:
comp = comp.reset_index()
scaled_time = comp.groupby(by='GVKEY').apply(make_scaled_time)
scaled_time.columns = ['scaled_time', 'breachdate_scaled', 'datadate', 'GVKEY']
comp = pd.merge(comp, scaled_time, how='left', on=['datadate', 'GVKEY'])

# 5 companies have only one observation and therefore scaled_time is nan
# Just going to remove these because they aren't interesting anyway
comp = comp[comp['scaled_time'].notnull()]
comp = comp[comp['breachdate_scaled'].notnull()]
comp['scaled_time'] = comp['scaled_time'].astype(float)
comp['breachdate_scaled'] = comp['breachdate_scaled'].astype(float)

# Get rid of any useless columns left over from the CSV and the reset_index above
comp = comp.drop(columns=['Unnamed: 0', 'index'], errors='ignore')

Functions to build and test the ML models

In [7]:
# Fit a model and return its train, test, and cross-validation MSE along with the fitted model
def fit_and_report_mses(mod, X_train, X_test, y_train, y_test, cv=5):
    mod.fit(X_train, y_train)
    return dict(
        mse_train=metrics.mean_squared_error(y_train, mod.predict(X_train)),
        mse_test=metrics.mean_squared_error(y_test, mod.predict(X_test)),
        mse_cv=-np.mean(cross_val_score(mod, X_train, y_train, cv=cv, scoring='neg_mean_squared_error')),
        model=mod
    )

# Neural network pipeline with input scaling
def make_nn_model(params, activation='logistic', solver='adam', alpha=0.0, learning_rate='constant'):
    nn_scaled_model = pipeline.make_pipeline(
        preprocessing.StandardScaler(),
        neural_network.MLPRegressor(params,
                                    activation=activation,
                                    solver=solver,
                                    alpha=alpha,
                                    learning_rate=learning_rate)
    )
    return nn_scaled_model
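
As a quick sanity check, these helpers can be exercised on synthetic data; a hypothetical smoke test, not part of the analysis below:

# Hypothetical smoke test of the helpers on random data
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ rng.normal(size=5) + rng.normal(scale=0.1, size=200)
X_tr, X_te, y_tr, y_te = model_selection.train_test_split(X_demo, y_demo, random_state=0)
demo = fit_and_report_mses(make_nn_model([10]), X_tr, X_te, y_tr, y_te, cv=3)
print(demo['mse_train'], demo['mse_cv'], demo['mse_test'])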

Now we are ready to create the training and test sets. Because I have a lot of input data, I effectively create two data sets with different input variables. The "firm" data includes all variables in the data that are firm specific, for example quarterly revenue. With this as the input it should be very easy to predict the outcome (profit), but not particularly useful, as we are unlikely to know everything about a firm except its profit. In other words, this is basically a warm-up to make sure the models work when they have a large amount of information available to them. The "(ext)ernal" data includes all data that is external to the firm, for example GDP; these are mainly macroeconomic controls that I collected for my thesis. They contain much less information, so I expect the models to fit less well, but they are much more useful: if these models are good, they could allow us to predict a firm's future profit without knowing anything about the firm itself in the future. For both sets I also include information about data breaches, since the breaches happen within the sample and information about them should be useful to the models.

In [8]:
breach_data = comp.columns[397:405].tolist() + comp.columns[410:418].tolist() + ['scaled_time']
firm_data = comp.columns[0:386].tolist() + ['scaled_time']
ext_data = comp.columns[419:442].tolist() + ['GVKEY', 'scaled_time'] 
In [9]:
# Use 75% of data for training
training_cutoff = 0.75
outcomes = ['niq'] # Profit will be the outcome
excluded = ['fyr', 'datadate', 'indfmt', 'consol', 'popsrc', 'datafmt',
            'cusip', 'curcdq', 'curncdq', 'currtrq', 'datacqtr', 'datafqtr',
            'Date.Made.Public', 'fic', 'Type.of.organization', 'Type.of.breach', 
            'costat', 'tic']

inputs_firm = [x for x in set(breach_data).union(set(firm_data)) if x not in outcomes and x not in excluded]
inputs_ext = [x for x in set(breach_data).union(set(ext_data)) if x not in outcomes and x not in excluded]

# Remove non-numeric data, none of these are actually useful anyway
comp = comp.select_dtypes(include=['float64', 'int64'])
# Deal with NAs
try:
    comp = comp.interpolate(method='nearest')
except ValueError:
    # entire column is empty -> ignore and fill with zeros
    pass
comp = comp.fillna(0)

X_firm_train = comp[comp['scaled_time'] < training_cutoff][inputs_firm]
X_firm_test = comp[comp['scaled_time'] >= training_cutoff][inputs_firm]
X_ext_train = comp[comp['scaled_time'] < training_cutoff][inputs_ext]
X_ext_test = comp[comp['scaled_time'] >= training_cutoff][inputs_ext]

y_train = comp[comp['scaled_time'] <  training_cutoff][outcomes[0]]
y_test = comp[comp['scaled_time'] >= training_cutoff][outcomes[0]]

comp_train = comp[comp['scaled_time'] < training_cutoff]
comp_test = comp[comp['scaled_time'] >= training_cutoff]
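
A quick, optional check that the split behaves as intended (not an original cell of this notebook):

# The cutoff should put roughly three quarters of the observations in the training set
print(len(X_firm_train), len(X_firm_test))
print(X_firm_train['scaled_time'].max(), X_firm_test['scaled_time'].min())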

Finally, we will begin training the models: first the neural net, then the random forest, using first the firm-specific data and then the external data. For the neural networks I use a small grid search over two possible network shapes and several values of the L2 regularization parameter (alpha) in order to get a better estimate of the optimal model; the final model I choose is the one with the lowest cross-validation MSE. For the random forests, the CV MSE is more or less monotonically decreasing in the number of trees, so it is sufficient to fit a single model with a large number of trees.
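
As an aside (not what the cells below actually run), the same search could be expressed with scikit-learn's GridSearchCV; a minimal sketch assuming the same pipeline layout as make_nn_model:

from sklearn.model_selection import GridSearchCV

nn_pipe = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    neural_network.MLPRegressor(activation='logistic', solver='adam')
)
grid = GridSearchCV(
    nn_pipe,
    param_grid={'mlpregressor__hidden_layer_sizes': [(100,), (50, 25)],
                'mlpregressor__alpha': [10 ** -i for i in range(1, 6)]},
    scoring='neg_mean_squared_error',
    cv=5
)
# grid.fit(X_firm_train, y_train); grid.best_params_ would then give the selected shape and alpha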

In [10]:
params = [[100], [50, 25]]
alphas = [10 ** -i for i in range(1, 6)]
firm_training_results = []
models = []
for param in params:
    for alpha in alphas:
        models.append({
            'param': param,
            'alpha': alpha,
            'model': make_nn_model(param, alpha=alpha, learning_rate='constant')
        })
                
for d in models:
    print('Training network with shape params {} and alpha {}'.format(d['param'], d['alpha']))
    trained = fit_and_report_mses(d['model'], X_firm_train, X_firm_test, y_train, y_test)
    trained['params'] = d['param']
    trained['alpha'] = d['alpha']
    firm_training_results.append(trained)
    
firm_mse_nn = pd.DataFrame(firm_training_results)
Training network with shape params [100] and alpha 0.1
Training network with shape params [100] and alpha 0.01
Training network with shape params [100] and alpha 0.001
Training network with shape params [100] and alpha 0.0001
Training network with shape params [100] and alpha 1e-05
Training network with shape params [50, 25] and alpha 0.1
Training network with shape params [50, 25] and alpha 0.01
Training network with shape params [50, 25] and alpha 0.001
Training network with shape params [50, 25] and alpha 0.0001
Training network with shape params [50, 25] and alpha 1e-05
In [11]:
optimal_firm_nn = firm_mse_nn.iloc[firm_mse_nn['mse_cv'].idxmin()]
print('Optimal neural network parameters {} with alpha {}'.format(optimal_firm_nn['params'], optimal_firm_nn['alpha']))
optimal_firm_nn_model = optimal_firm_nn['model']
comp_train['firm_nn_predicted_niq'] = optimal_firm_nn_model.predict(X_firm_train)
comp_test['firm_nn_predicted_niq'] = optimal_firm_nn_model.predict(X_firm_test)
Optimal neural network parameters [100] with alpha 1e-05
In [12]:
print('Training Random Forest with {} trees on firm specific data'.format(50))
firm_rf = fit_and_report_mses(RandomForestRegressor(n_estimators=50), X_firm_train, X_firm_test, y_train, y_test)
firm_rf_model = firm_rf['model']
comp_train['firm_rf_predicted_niq'] = firm_rf_model.predict(X_firm_train)
comp_test['firm_rf_predicted_niq'] = firm_rf_model.predict(X_firm_test)
Training Random Forest with 50 trees on firm specific data
In [13]:
ext_training_results = []

# Refit the same grid of pipelines, this time on the external inputs
for d in models:
    print('Training network with shape params {} and alpha {}'.format(d['param'], d['alpha']))
    trained = fit_and_report_mses(d['model'], X_ext_train, X_ext_test, y_train, y_test)
    trained['params'] = d['param']
    trained['alpha'] = d['alpha']
    ext_training_results.append(trained)
    
ext_mse_nn = pd.DataFrame(ext_training_results)
Training network with shape params [100] and alpha 0.1
Training network with shape params [100] and alpha 0.01
Training network with shape params [100] and alpha 0.001
Training network with shape params [100] and alpha 0.0001
Training network with shape params [100] and alpha 1e-05
Training network with shape params [50, 25] and alpha 0.1
Training network with shape params [50, 25] and alpha 0.01
Training network with shape params [50, 25] and alpha 0.001
Training network with shape params [50, 25] and alpha 0.0001
Training network with shape params [50, 25] and alpha 1e-05
In [14]:
optimal_ext_nn = ext_mse_nn.iloc[ext_mse_nn['mse_cv'].idxmin()]
print('Optimal neural network parameters {} with alpha {}'.format(optimal_ext_nn['params'], optimal_ext_nn['alpha']))
optimal_ext_nn_model = optimal_ext_nn['model']
comp_train['ext_nn_predicted_niq'] = optimal_ext_nn_model.predict(X_ext_train)
comp_test['ext_nn_predicted_niq'] = optimal_ext_nn_model.predict(X_ext_test)
Optimal neural network parameters [100] with alpha 0.1
In [15]:
print('Training Random Forest with {} trees on external data'.format(50))
ext_rf = fit_and_report_mses(RandomForestRegressor(n_estimators=50), X_ext_train, X_ext_test, y_train, y_test)
ext_rf_model = ext_rf['model']
comp_train['ext_rf_predicted_niq'] = ext_rf_model.predict(X_ext_train)
comp_test['ext_rf_predicted_niq'] = ext_rf_model.predict(X_ext_test)
Training Random Forest with 50 trees on external data
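
Before plotting, the errors from the four models can be lined up side by side; an optional sketch using the result objects defined above (not an original cell of this notebook):

# Optional side-by-side comparison of train, CV, and test MSE for the four models
mse_cols = ['mse_train', 'mse_cv', 'mse_test']
summary = pd.DataFrame({
    'NN (firm data)': optimal_firm_nn[mse_cols],
    'RF (firm data)': pd.Series({k: firm_rf[k] for k in mse_cols}),
    'NN (external data)': optimal_ext_nn[mse_cols],
    'RF (external data)': pd.Series({k: ext_rf[k] for k in mse_cols}),
})
print(summary)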

Now that we have trained the four models, we can plot actual and predicted profit against scaled time to visualize how well the models perform. The code below plots each model's predictions for a randomly selected firm, which changes each time the cell is re-run.

In [63]:
# Pick a random firm to show each time this is run
gvkeys = len(comp_train['GVKEY'].unique())
rand_key = np.random.randint(low=0, high=gvkeys)
gvkey = comp_train['GVKEY'].unique()[rand_key]
# Uncomment to get the same firm every time
# gvkey = 1230

colors = ['red', 'blue']
fig, axes = plt.subplots(2, 2, figsize=(10, 8))

comp_train[comp_train['GVKEY'] == gvkey].plot(y='niq', x='scaled_time', c=colors[0], ax=axes[0,0])
comp_test[comp_test['GVKEY'] == gvkey].plot(y='niq', x='scaled_time', c=colors[0], ax=axes[0,0], label='_nolegend_')
comp_train[comp_train['GVKEY'] == gvkey].plot(y='firm_nn_predicted_niq', x='scaled_time', c=colors[1], ax=axes[0,0])
comp_test[comp_test['GVKEY'] == gvkey].plot(y='firm_nn_predicted_niq', x='scaled_time', c=colors[1], ax=axes[0,0], label='_nolegend_')

comp_train[comp_train['GVKEY'] == gvkey].plot(y='niq', x='scaled_time', c=colors[0], ax=axes[0,1])
comp_test[comp_test['GVKEY'] == gvkey].plot(y='niq', x='scaled_time', c=colors[0], ax=axes[0,1], label='_nolegend_')
comp_train[comp_train['GVKEY'] == gvkey].plot(y='firm_rf_predicted_niq', x='scaled_time', c=colors[1], ax=axes[0,1])
comp_test[comp_test['GVKEY'] == gvkey].plot(y='firm_rf_predicted_niq', x='scaled_time', c=colors[1], ax=axes[0,1], label='_nolegend_')

comp_train[comp_train['GVKEY'] == gvkey].plot(y='niq', x='scaled_time', c=colors[0], ax=axes[1,0])
comp_test[comp_test['GVKEY'] == gvkey].plot(y='niq', x='scaled_time', c=colors[0], ax=axes[1,0], label='_nolegend_')
comp_train[comp_train['GVKEY'] == gvkey].plot(y='ext_nn_predicted_niq', x='scaled_time', c=colors[1], ax=axes[1,0])
comp_test[comp_test['GVKEY'] == gvkey].plot(y='ext_nn_predicted_niq', x='scaled_time', c=colors[1], ax=axes[1,0], label='_nolegend_')

comp_train[comp_train['GVKEY'] == gvkey].plot(y='niq', x='scaled_time', c=colors[0], ax=axes[1,1])
comp_test[comp_test['GVKEY'] == gvkey].plot(y='niq', x='scaled_time', c=colors[0], ax=axes[1,1], label='_nolegend_')
comp_train[comp_train['GVKEY'] == gvkey].plot(y='ext_rf_predicted_niq', x='scaled_time', c=colors[1], ax=axes[1,1])
comp_test[comp_test['GVKEY'] == gvkey].plot(y='ext_rf_predicted_niq', x='scaled_time', c=colors[1], ax=axes[1,1], label='_nolegend_')

for ax in axes.flat:
    ax.axvline(x=training_cutoff, color='black')
    ax.set_xlabel('Scaled Time')
    ax.set_ylabel('Profit')
    ax.legend(['Actual Profit', 'Predicted Profit'])
    ax.spines['right'].set_visible(False)
    ax.spines['top'].set_visible(False)

axes[0,0].set_title('Neural Network (Firm Data)', size=16)
axes[0,1].set_title('Random Forest (Firm Data)', size=16)
axes[1,0].set_title('Neural Network (External Data)', size=16)
axes[1,1].set_title('Random Forest (External Data)', size=16)

axes[0,0].annotate('Training Data', xy=(0.5, 0))
axes[0,0].annotate('Test Data', xy=(0.8, 0))

fig.tight_layout()

Since this only shows data for one firm at a time, we can also plot the predicted and actual values for all of the firms at once to see how the models perform more generally.

In [62]:
fig, axes = plt.subplots(2, 2, figsize=(20,16))

comp_train.plot(y='niq', x='scaled_time', kind='scatter', c=colors[0], alpha=0.3, ax = axes[0,0])
comp_train.plot(y='firm_nn_predicted_niq', x='scaled_time', kind='scatter', c=colors[1], alpha=0.3, ax=axes[0,0])
comp_test.plot(y='niq', x='scaled_time', kind='scatter', c=colors[0], alpha=0.3, ax = axes[0,0], label='_nolegend_')
comp_test.plot(y='firm_nn_predicted_niq', x='scaled_time', kind='scatter', c=colors[1], alpha=0.3, ax=axes[0,0], label='_nolegend_')

comp_train.plot(y='niq', x='scaled_time', kind='scatter', c=colors[0], alpha=0.3, ax = axes[0,1])
comp_train.plot(y='firm_rf_predicted_niq', x='scaled_time', kind='scatter', c=colors[1], alpha=0.3, ax=axes[0,1])
comp_test.plot(y='niq', x='scaled_time', kind='scatter', c=colors[0], alpha=0.3, ax = axes[0,1], label='_nolegend_')
comp_test.plot(y='firm_rf_predicted_niq', x='scaled_time', kind='scatter', c=colors[1], alpha=0.3, ax=axes[0,1], label='_nolegend_')

comp_train.plot(y='niq', x='scaled_time', kind='scatter', c=colors[0], alpha=0.3, ax = axes[1,0])
comp_train.plot(y='ext_nn_predicted_niq', x='scaled_time', kind='scatter', c=colors[1], alpha=0.3, ax=axes[1,0])
comp_test.plot(y='niq', x='scaled_time', kind='scatter', c=colors[0], alpha=0.3, ax = axes[1,0], label='_nolegend_')
comp_test.plot(y='ext_nn_predicted_niq', x='scaled_time', kind='scatter', c=colors[1], alpha=0.3, ax=axes[1,0], label='_nolegend_')

comp_train.plot(y='niq', x='scaled_time', kind='scatter', c=colors[0], alpha=0.3, ax = axes[1,1])
comp_train.plot(y='ext_rf_predicted_niq', x='scaled_time', kind='scatter', c=colors[1], alpha=0.3, ax=axes[1,1])
comp_test.plot(y='niq', x='scaled_time', kind='scatter', c=colors[0], alpha=0.3, ax = axes[1,1], label='_nolegend_')
comp_test.plot(y='ext_rf_predicted_niq', x='scaled_time', kind='scatter', c=colors[1], alpha=0.3, ax=axes[1,1], label='_nolegend_')

for ax in axes.flat:
    ax.set_ylim(-5000, 5000)
    ax.axvline(x=training_cutoff, color='black', label='_nolegend_')
    ax.set_xlabel('Scaled Time', size=20)
    ax.set_ylabel('Profit', size=20)
    ax.legend(['Actual Profit', 'Predicted Profit'])
    ax.spines['right'].set_visible(False)
    ax.spines['top'].set_visible(False)

axes[0,0].set_title('Neural Network (Firm Data)', size=24)
axes[0,1].set_title('Random Forest (Firm Data)', size=24)
axes[1,0].set_title('Neural Network (External Data)', size=24)
axes[1,1].set_title('Random Forest (External Data)', size=24)
axes[0,0].annotate('Training Data', xy=(0.5, -2000), size=16)
axes[0,0].annotate('Test Data', xy=(0.8, -2000), size=16)

fig.tight_layout()