Simple Machine Learning Example - Data Explanation

This notebook shows the data used to train the model in the algorithm by Grant Klein (May 31, 2014).

We use the daily trading scenario here because it is easier to follow. The goal is to show exactly which data is used to train the machine learning model. Please point out any issues.

The issue

In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
import seaborn as sns

from scipy import stats
from zipline.data import load_from_yahoo

Let's grab more than 400 days' worth of data from Yahoo for the following funds, as listed in the original algorithm.

context.stocks = [ sid(19662),  # XLY Consumer Discretionary SPDR Fund
                   sid(19656),  # XLF Financial SPDR Fund
                   sid(19658),  # XLK Technology SPDR Fund
                   sid(19655),  # XLE Energy SPDR Fund
                   sid(19661),  # XLV Health Care SPDR Fund
                   sid(19657),  # XLI Industrial SPDR Fund
                   sid(19659),  # XLP Consumer Staples SPDR Fund
                   sid(19654),  # XLB Materials SPDR Fund
                   sid(19660),  # XLU Utilities SPDR Fund
                   sid(8554) ]  # SPY SPDR S&P 500 ETF Trust
In [2]:
stocks = ['XLY', 'XLF', 'XLK', 'XLE', 'XLV', 'XLI', 'XLP', 'XLB', 'XLU', 'SPY']
In [3]:
end_date = pd.datetools.datetime(2014, 5, 1)

# Ensure there is enough data (number of days: 800)
start_date = end_date - pd.DateOffset(n=800)
In [4]:
data = load_from_yahoo(stocks=stocks, start=start_date, end=end_date)
XLY
XLF
XLK
XLE
XLV
XLI
XLP
XLB
XLU
SPY
In [5]:
data.plot()
Out[5]:
<matplotlib.axes.AxesSubplot at 0x10bc681d0>

Check to ensure there is enough price data.

In [6]:
print data.index[0]
print data.index[-1]
2012-02-21 00:00:00+00:00
2014-05-01 00:00:00+00:00

Take the first 400 trading days of the data (as in the Quantopian example).

In [7]:
prices = data.as_matrix()[:400]
In [8]:
prices.shape
Out[8]:
(400, 10)

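Calculate the z-score of each value relative to the mean and sample standard deviation (ddof=1) of its own column, one column per fund. Note that the input here is the price matrix, so these are z-scores of price levels even though the result is named changes_all.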

In [9]:
changes_all = stats.zscore(prices, axis=0, ddof=1)
In [10]:
changes_all.shape # Quick look at the shape
Out[10]:
(400, 10)
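
As a minimal sketch of what the column-wise z-score with ddof=1 amounts to, the same quantity can be written out with plain NumPy. The toy_prices values below are made up for illustration and are not taken from the Yahoo data.

```python
import numpy as np

# Toy price matrix (made-up numbers): 5 days x 2 tickers.
toy_prices = np.array([[10.0, 100.0],
                       [11.0, 102.0],
                       [12.0,  98.0],
                       [13.0, 101.0],
                       [14.0,  99.0]])

# Column-wise mean and sample standard deviation (ddof=1 divides by n - 1).
col_mean = toy_prices.mean(axis=0)
col_std = toy_prices.std(axis=0, ddof=1)

# Each entry becomes its distance from the column mean, in standard deviations.
toy_z = (toy_prices - col_mean) / col_std

print(toy_z.shape)
```

After this transformation every column has mean 0 and sample standard deviation 1, so columns with very different price levels become comparable.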

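A quick walk through the next line:

changes_all[:,0:-1]: the z-scores of every fund except SPY, shape (400, 9).
changes_all[:,-1]: the SPY z-scores, shape (400,).
np.tile(changes_all[:,-1], (len(stocks)-1,1)): the SPY z-scores repeated len(stocks)-1 = 9 times as rows, shape (9, 400). The trailing .T transposes this to (400, 9).

Therefore, this line simply subtracts the SPY z-score from the z-score of every other fund, day by day.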

In [11]:
changes = changes_all[:,0:-1] - np.tile(changes_all[:,-1], (len(stocks)-1,1)).T
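
The tile-and-transpose step can also be expressed with NumPy broadcasting. The sketch below uses a small made-up matrix (toy_z, with the benchmark in the last column) to check that both forms give the same result. It is an illustration only, not a change to the original algorithm.

```python
import numpy as np

# Toy z-score matrix (made-up numbers): 4 days x 3 columns, benchmark last.
toy_z = np.array([[ 0.5, -1.0,  0.2],
                  [ 1.5,  0.0,  1.0],
                  [-0.5,  0.3, -0.1],
                  [-1.5,  0.7, -1.1]])
n_other = toy_z.shape[1] - 1

# The notebook's form: repeat the benchmark column as rows, then transpose.
tiled = toy_z[:, 0:-1] - np.tile(toy_z[:, -1], (n_other, 1)).T

# Broadcasting form: keep the benchmark as a (4, 1) column and let NumPy stretch it.
broadcast = toy_z[:, :-1] - toy_z[:, [-1]]

print(np.allclose(tiled, broadcast))
```

Indexing with [-1] inside a list keeps the benchmark two-dimensional, which is what lets the subtraction broadcast across the remaining columns.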

Create a boolean matrix indicating whether each fund's z-score is above the SPY z-score on that day (that is, whether the difference is positive), then store it as 1s and 0s.

In [12]:
changes = changes > 0
changes = changes.astype(int)  # np.int was only an alias for the built-in int

Check once more that the shape is as expected. There are 400 changes for each of the 9 funds in our portfolio (SPY is no longer a column).

In [13]:
changes.shape
Out[13]:
(400, 9)

The training data

Here we look at what exactly we are using to train the model.

For each fund in the portfolio, independently of the others, split its 400 changes into 20 training samples of 20 values each and take the last sample as the labels. The original code:

for k in range(len(context.stocks)-1):
    X = np.split(changes[:,k],20)
    Y = X[-1]

    context.classifier.fit(X, Y)

    context.prediction[k] = context.classifier.predict(Y)

We are simply splitting the changes into 20 training samples of 20 consecutive values each.

In [14]:
X = np.split(changes[:,0], 20)
X
Out[14]:
[array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1]),
 array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1]),
 array([0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0]),
 array([0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
 array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
 array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
 array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1]),
 array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]),
 array([0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1]),
 array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
 array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
 array([1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0]),
 array([1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
 array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]),
 array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0]),
 array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])]

To confirm, we can inspect the raw column and see that this is a straightforward sequential split of the changes.

In [15]:
changes[:,0]
Out[15]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0])

Y is the last 20 changes - the final split.

In [16]:
Y = X[-1]
Y
Out[16]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

A model is then trained on the split changes, with the last 20 changes acting as the labels/classes. However, this does not seem right. To see where each training sample and its label come from, we can run the dates through the same split in place of the data.
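
To make the shapes concrete, here is a minimal sketch of how a fit(X, Y) call would read this input, assuming context.classifier follows the scikit-learn convention of one row per sample. The toy_changes array is random made-up data standing in for changes[:,k].

```python
import numpy as np

# Toy stand-in for changes[:, k]: 400 made-up binary values with a fixed seed.
rng = np.random.RandomState(0)
toy_changes = rng.randint(0, 2, size=400)

# Same split as the original code: 20 consecutive blocks of 20 values.
X = np.split(toy_changes, 20)
Y = X[-1]

# A scikit-learn style fit(X, Y) would read X as a (20, 20) matrix:
# 20 samples (rows) with 20 features each, and Y as one label per row.
X_matrix = np.asarray(X)
print(X_matrix.shape)
print(Y.shape)

# The labels are literally the last row of the feature matrix.
print(np.array_equal(X_matrix[-1], Y))
```

Under that assumption, element i of Y is the label for row i of X, so each 20-day block is paired with a single day from the final block.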

Quickly reduce the dates to a list of easily readable strings.

In [17]:
full_dates = data.index[:400]
dates = []

for date in full_dates:
    new_date = str(date)[:10]
    dates.append(new_date)

dates = np.array(dates)

Check what they look like.

In [18]:
dates[:10]
Out[18]:
array(['2012-02-21', '2012-02-22', '2012-02-23', '2012-02-24',
       '2012-02-27', '2012-02-28', '2012-02-29', '2012-03-01',
       '2012-03-02', '2012-03-05'], 
      dtype='|S10')

Split the dates in the same way as the training data.

In [19]:
dates_split = np.array(np.split(dates, 20))
In [20]:
dates_split.shape
Out[20]:
(20, 20)

We can see that each split is a set of 20 consecutive days.

In [21]:
dates_split[:2]
Out[21]:
array([['2012-02-21', '2012-02-22', '2012-02-23', '2012-02-24',
        '2012-02-27', '2012-02-28', '2012-02-29', '2012-03-01',
        '2012-03-02', '2012-03-05', '2012-03-06', '2012-03-07',
        '2012-03-08', '2012-03-09', '2012-03-12', '2012-03-13',
        '2012-03-14', '2012-03-15', '2012-03-16', '2012-03-19'],
       ['2012-03-20', '2012-03-21', '2012-03-22', '2012-03-23',
        '2012-03-26', '2012-03-27', '2012-03-28', '2012-03-29',
        '2012-03-30', '2012-04-02', '2012-04-03', '2012-04-04',
        '2012-04-05', '2012-04-09', '2012-04-10', '2012-04-11',
        '2012-04-12', '2012-04-13', '2012-04-16', '2012-04-17']], 
      dtype='|S10')

The labels are the last split.

In [22]:
dates_labels = dates_split[-1]
dates_labels
Out[22]:
array(['2013-08-26', '2013-08-27', '2013-08-28', '2013-08-29',
       '2013-08-30', '2013-09-03', '2013-09-04', '2013-09-05',
       '2013-09-06', '2013-09-09', '2013-09-10', '2013-09-11',
       '2013-09-12', '2013-09-13', '2013-09-16', '2013-09-17',
       '2013-09-18', '2013-09-19', '2013-09-20', '2013-09-23'], 
      dtype='|S10')

View these labels alongside the dates. We can see that the changes from February-March 2012 are being used to predict the change on 26 August 2013. Then March-April 2012 is used to predict 27 August 2013, and April-May 2012 to predict 28 August 2013.

This gap slowly closes, with the changes for July-August 2013 used to predict 20 September 2013. Finally, August-September 2013 is used to predict 23 September 2013. A crucial note: the last training sample actually includes its own target value.

The labels (Y) are then passed to the classifier's predict call, as in the original code, to produce the prediction for the next day.

As far as we can tell, this is the structure of the data used for training the model and predicting the next value.
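
The same pairing can be checked with day positions instead of dates. This is a small sketch that assumes only the 400-row, 20-way split used in this notebook; the names day_index, windows, targets, gaps and leaks are introduced here for illustration.

```python
import numpy as np

# Use day positions 0..399 instead of dates or z-score signs.
day_index = np.arange(400)
windows = np.split(day_index, 20)   # 20 blocks of 20 consecutive days
targets = windows[-1]               # label days: positions 380..399

# Trading-day gap between the end of each training window and its target day.
gaps = np.array([target - window[-1] for window, target in zip(windows, targets)])

# Rows whose training window contains their own target day.
leaks = [i for i, (window, target) in enumerate(zip(windows, targets))
         if target in window]

print(gaps[0])    # 361 trading days between the first window and its target
print(gaps[-1])   # 0: the last window ends on its own target day
print(leaks)      # [19]: only the final row contains its own label
```

The gap shrinks by 19 trading days per row, from 361 down to 0, which matches the dates printed below.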

In [23]:
for x_dates, label in zip(dates_split, dates_labels):
    print 'Target:', label
    print 'Training:', x_dates
    print ''
Target: 2013-08-26
Training: ['2012-02-21' '2012-02-22' '2012-02-23' '2012-02-24' '2012-02-27'
 '2012-02-28' '2012-02-29' '2012-03-01' '2012-03-02' '2012-03-05'
 '2012-03-06' '2012-03-07' '2012-03-08' '2012-03-09' '2012-03-12'
 '2012-03-13' '2012-03-14' '2012-03-15' '2012-03-16' '2012-03-19']

Target: 2013-08-27
Training: ['2012-03-20' '2012-03-21' '2012-03-22' '2012-03-23' '2012-03-26'
 '2012-03-27' '2012-03-28' '2012-03-29' '2012-03-30' '2012-04-02'
 '2012-04-03' '2012-04-04' '2012-04-05' '2012-04-09' '2012-04-10'
 '2012-04-11' '2012-04-12' '2012-04-13' '2012-04-16' '2012-04-17']

Target: 2013-08-28
Training: ['2012-04-18' '2012-04-19' '2012-04-20' '2012-04-23' '2012-04-24'
 '2012-04-25' '2012-04-26' '2012-04-27' '2012-04-30' '2012-05-01'
 '2012-05-02' '2012-05-03' '2012-05-04' '2012-05-07' '2012-05-08'
 '2012-05-09' '2012-05-10' '2012-05-11' '2012-05-14' '2012-05-15']

Target: 2013-08-29
Training: ['2012-05-16' '2012-05-17' '2012-05-18' '2012-05-21' '2012-05-22'
 '2012-05-23' '2012-05-24' '2012-05-25' '2012-05-29' '2012-05-30'
 '2012-05-31' '2012-06-01' '2012-06-04' '2012-06-05' '2012-06-06'
 '2012-06-07' '2012-06-08' '2012-06-11' '2012-06-12' '2012-06-13']

Target: 2013-08-30
Training: ['2012-06-14' '2012-06-15' '2012-06-18' '2012-06-19' '2012-06-20'
 '2012-06-21' '2012-06-22' '2012-06-25' '2012-06-26' '2012-06-27'
 '2012-06-28' '2012-06-29' '2012-07-02' '2012-07-03' '2012-07-05'
 '2012-07-06' '2012-07-09' '2012-07-10' '2012-07-11' '2012-07-12']

Target: 2013-09-03
Training: ['2012-07-13' '2012-07-16' '2012-07-17' '2012-07-18' '2012-07-19'
 '2012-07-20' '2012-07-23' '2012-07-24' '2012-07-25' '2012-07-26'
 '2012-07-27' '2012-07-30' '2012-07-31' '2012-08-01' '2012-08-02'
 '2012-08-03' '2012-08-06' '2012-08-07' '2012-08-08' '2012-08-09']

Target: 2013-09-04
Training: ['2012-08-10' '2012-08-13' '2012-08-14' '2012-08-15' '2012-08-16'
 '2012-08-17' '2012-08-20' '2012-08-21' '2012-08-22' '2012-08-23'
 '2012-08-24' '2012-08-27' '2012-08-28' '2012-08-29' '2012-08-30'
 '2012-08-31' '2012-09-04' '2012-09-05' '2012-09-06' '2012-09-07']

Target: 2013-09-05
Training: ['2012-09-10' '2012-09-11' '2012-09-12' '2012-09-13' '2012-09-14'
 '2012-09-17' '2012-09-18' '2012-09-19' '2012-09-20' '2012-09-21'
 '2012-09-24' '2012-09-25' '2012-09-26' '2012-09-27' '2012-09-28'
 '2012-10-01' '2012-10-02' '2012-10-03' '2012-10-04' '2012-10-05']

Target: 2013-09-06
Training: ['2012-10-08' '2012-10-09' '2012-10-10' '2012-10-11' '2012-10-12'
 '2012-10-15' '2012-10-16' '2012-10-17' '2012-10-18' '2012-10-19'
 '2012-10-22' '2012-10-23' '2012-10-24' '2012-10-25' '2012-10-26'
 '2012-10-31' '2012-11-01' '2012-11-02' '2012-11-05' '2012-11-06']

Target: 2013-09-09
Training: ['2012-11-07' '2012-11-08' '2012-11-09' '2012-11-12' '2012-11-13'
 '2012-11-14' '2012-11-15' '2012-11-16' '2012-11-19' '2012-11-20'
 '2012-11-21' '2012-11-23' '2012-11-26' '2012-11-27' '2012-11-28'
 '2012-11-29' '2012-11-30' '2012-12-03' '2012-12-04' '2012-12-05']

Target: 2013-09-10
Training: ['2012-12-06' '2012-12-07' '2012-12-10' '2012-12-11' '2012-12-12'
 '2012-12-13' '2012-12-14' '2012-12-17' '2012-12-18' '2012-12-19'
 '2012-12-20' '2012-12-21' '2012-12-24' '2012-12-26' '2012-12-27'
 '2012-12-28' '2012-12-31' '2013-01-02' '2013-01-03' '2013-01-04']

Target: 2013-09-11
Training: ['2013-01-07' '2013-01-08' '2013-01-09' '2013-01-10' '2013-01-11'
 '2013-01-14' '2013-01-15' '2013-01-16' '2013-01-17' '2013-01-18'
 '2013-01-22' '2013-01-23' '2013-01-24' '2013-01-25' '2013-01-28'
 '2013-01-29' '2013-01-30' '2013-01-31' '2013-02-01' '2013-02-04']

Target: 2013-09-12
Training: ['2013-02-05' '2013-02-06' '2013-02-07' '2013-02-08' '2013-02-11'
 '2013-02-12' '2013-02-13' '2013-02-14' '2013-02-15' '2013-02-19'
 '2013-02-20' '2013-02-21' '2013-02-22' '2013-02-25' '2013-02-26'
 '2013-02-27' '2013-02-28' '2013-03-01' '2013-03-04' '2013-03-05']

Target: 2013-09-13
Training: ['2013-03-06' '2013-03-07' '2013-03-08' '2013-03-11' '2013-03-12'
 '2013-03-13' '2013-03-14' '2013-03-15' '2013-03-18' '2013-03-19'
 '2013-03-20' '2013-03-21' '2013-03-22' '2013-03-25' '2013-03-26'
 '2013-03-27' '2013-03-28' '2013-04-01' '2013-04-02' '2013-04-03']

Target: 2013-09-16
Training: ['2013-04-04' '2013-04-05' '2013-04-08' '2013-04-09' '2013-04-10'
 '2013-04-11' '2013-04-12' '2013-04-15' '2013-04-16' '2013-04-17'
 '2013-04-18' '2013-04-19' '2013-04-22' '2013-04-23' '2013-04-24'
 '2013-04-25' '2013-04-26' '2013-04-29' '2013-04-30' '2013-05-01']

Target: 2013-09-17
Training: ['2013-05-02' '2013-05-03' '2013-05-06' '2013-05-07' '2013-05-08'
 '2013-05-09' '2013-05-10' '2013-05-13' '2013-05-14' '2013-05-15'
 '2013-05-16' '2013-05-17' '2013-05-20' '2013-05-21' '2013-05-22'
 '2013-05-23' '2013-05-24' '2013-05-28' '2013-05-29' '2013-05-30']

Target: 2013-09-18
Training: ['2013-05-31' '2013-06-03' '2013-06-04' '2013-06-05' '2013-06-06'
 '2013-06-07' '2013-06-10' '2013-06-11' '2013-06-12' '2013-06-13'
 '2013-06-14' '2013-06-17' '2013-06-18' '2013-06-19' '2013-06-20'
 '2013-06-21' '2013-06-24' '2013-06-25' '2013-06-26' '2013-06-27']

Target: 2013-09-19
Training: ['2013-06-28' '2013-07-01' '2013-07-02' '2013-07-03' '2013-07-05'
 '2013-07-08' '2013-07-09' '2013-07-10' '2013-07-11' '2013-07-12'
 '2013-07-15' '2013-07-16' '2013-07-17' '2013-07-18' '2013-07-19'
 '2013-07-22' '2013-07-23' '2013-07-24' '2013-07-25' '2013-07-26']

Target: 2013-09-20
Training: ['2013-07-29' '2013-07-30' '2013-07-31' '2013-08-01' '2013-08-02'
 '2013-08-05' '2013-08-06' '2013-08-07' '2013-08-08' '2013-08-09'
 '2013-08-12' '2013-08-13' '2013-08-14' '2013-08-15' '2013-08-16'
 '2013-08-19' '2013-08-20' '2013-08-21' '2013-08-22' '2013-08-23']

Target: 2013-09-23
Training: ['2013-08-26' '2013-08-27' '2013-08-28' '2013-08-29' '2013-08-30'
 '2013-09-03' '2013-09-04' '2013-09-05' '2013-09-06' '2013-09-09'
 '2013-09-10' '2013-09-11' '2013-09-12' '2013-09-13' '2013-09-16'
 '2013-09-17' '2013-09-18' '2013-09-19' '2013-09-20' '2013-09-23']

