Application of Neural Networks in Finance¶

Even the most simple NN's can be useful. This notebook will demonstrate a simple approach for a predictive trading model.

This notebook does the following:¶

-Dynamically load daily price history from Yahoo / Google Finance
-Apply some feature engineering
-Train XGBoost and NN Models to predict if the direction of the next day's price move
-Re-Train based on the most important Feature Columns
-Predict using the trained models and evaluate the financial performance

Learn how to:¶

-Do Basic Feature Engineering in Pandas
-Use XGBoost to detect most important features
-Stop Training a Model when the metrics fail to improve, and re-set model to what it was after the best epoch
-Visualise Model input and outputs
-Plot loss and accuracy charts
-Plot some cool charts like Auto Correlation & Scatter Matrix
-Make Money applying NN to Finance !

In [9]:

import numpy as np
from datetime import datetime
import operator
import pandas as pd
from pandas.tools.plotting import autocorrelation_plot
from pandas.tools.plotting import scatter_matrix
import matplotlib.pyplot as plt
from matplotlib.pyplot import legend
from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten, Reshape, Dropout, Convolution2D, MaxPooling2D, LSTM
from keras.layers.normalization import BatchNormalization
from keras.optimizers import Adam, RMSprop
from keras.callbacks import ModelCheckpoint, EarlyStopping
from sklearn.linear_model import LinearRegression
from keras.utils import np_utils
import itertools
import copy
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
import pandas_datareader as pdr
from IPython import display

Global Settings & Variables¶

In [10]:

useGoogle = True        # Use either Google or Yahoo Finance
#rates                  # DataFrame of Raw price data 
INST='^GSPC'            # Financial Instrument Code  eg IBM, F, FB  
useLSTM=False           # Use the LSTM model instead of the Dense NN
plotall=True            # Display all the Charts
epochs=1000             # Max number of epochs to run
waitEpochs=100          # Max number of epochs to wait for an improvement before stopping training
#AA                     # feature engineered DataFrame
#x, xi, xt              # np arrays for x, xi is the train set, xt is the test set  
#y, yi, yt              # np arrays for y, yi is the train set, yt is the test set
#split                  # array index the separates the train from test sets
#inputCols              # list of columns from AA that will be used to build x
#m                      # current model
#window=0               # LSTM window
#offset=0               # Ignore the first n rows of x
#posVect                # position predictions 1="LONG"  -1 = "SHORT"              
#predictions            # raw model predictions
#pnlVect                # profit & loss vector

Data Load & Inspection Functions¶

In [11]:

def GetData():
    global rates
    start = datetime(1990, 1, 1)
    end = datetime.now()
    if useGoogle:
        rates= pdr.get_data_google(INST, start, end)
        rates['Adj Close']=rates['Close']           #### Note Google does not have adj price for corporate actions
    else:
        rates= pdr.get_data_yahoo(INST, start, end)
    rates.rename(columns={'Open': 'OPEN', 'High': 'HIGH', 'Low': 'LOW', 'Adj Close': 'CLOSE'}, inplace=True)
    rates['Volume'].fillna(0, inplace=True)
    rates = rates.reset_index()

In [12]:

def Start(inst):  
    global INST
    INST=inst
    GetData()
    if plotall :
        fig, ax = plt.subplots(1,1) 
        ax.plot(rates['Date'],rates['CLOSE'])
        ax.set_title(inst)
        plt.show()

In [13]:

def PlotAutoCorrel():
    fig = plt.figure()
    _ = autocorrelation_plot(rates['CLOSE'], label=INST)
    plt.show()
    print "Describing " + INST
    display.display(rates.describe())

Feature Engineering Functions¶

In [14]:

def divMax(data,column):    # scale 
    data[column]=data[column]/max(data[column])   

def mavg(data,column,periods): #  moving average
    c=pd.Series.rolling(data[column],periods).mean()
    data[column+"mavg"+str(periods)]=c        
    
def logColumn(data,column):  # log
    data[column]=np.log(data[column]) 
    
def pct(data,column):         # % change 
        data[column+"pct"]=((data[column]-data[column].shift())/data[column])

def mom(data,column,MomPeriodOffset):   # Momentum 
    x=data[column].as_matrix()
    res=np.zeros(len(data))
    for i in range(len(x)):
        if (i >(MomPeriodOffset-1)):                
            iqr = np.subtract(*np.percentile(x[(i-3):i], [75, 25]))                                             
            if (iqr<=0.000000001):
                res[i]=0.0
            else:
                res[i]=(x[i]-x[i-MomPeriodOffset])/ iqr #np.std(x[i-3:i]) # todo divide by interquartile to avoid large and small values 
    data[column+"mom"+str(MomPeriodOffset)]=res

def bolWidth(data,column,windowsize):    # Bollenger Band Width 
    x=data[column].as_matrix()
    res=np.zeros(len(data))
    for i in range(len(x)):
        if (i >windowsize):
            std=np.std(x[i-windowsize-1:i])
            mean=np.mean(x[i-windowsize-1:i])
            bolup=mean+2*std
            boldown=mean-2*std
            bolwidth=(bolup-boldown)/mean
            res[i]=bolwidth
    data[column+"bolW"+str(windowsize)]=res
    
def TR (data):  #  True Range
    h=data['HIGH'].as_matrix()
    l=data['LOW'].as_matrix()
    pc=data['CLOSE'].as_matrix()
    res=np.zeros(len(data))
    for i in range(len(h)):
        if (i >1): 
            t=[(h[i]-l[i]),abs((h[i]-pc[i-1])),abs((l[i]-pc[i-1])) ]

            res[i]=np.amax(t)
    data['TR']=res

def ATR(data, window):   # Average True Range
    c=data['TR'].ewm(span = window, min_periods = window).mean()
    data["ATR"+str(window)]=c
    

In [15]:

def LR(y):   # Linear Regression
    X = np.matrix(range(len(y))).T
    m = LinearRegression()
    m.fit(X, y)
    p=np.array([len(y)]).reshape(-1,1)
    return m.predict(p)

def TSF(data,column, window):
    x=data[column].as_matrix()
    res=np.zeros(len(data))
    for i in range(len(data)):
        if (i>window):
            res[i]=LR(x[i-(window-1):i])
    data["TSF"+str(window)]=res
     

In [16]:

def norm(data,column,window):   # rolling normalisation function
    x=data[column].as_matrix()
    res=np.zeros(len(data))
    x[np.isnan(x)] = 0
    for i in range(len(data)):
        if (i>window):
            sub=x[i-window:i]
            avg=np.mean(sub)
            ma=np.amax(sub)
            mi=np.amin(sub)
            res[i]=(x[i]-avg)/(ma-mi)
            #res[i]=(x[i]-avg)/np.std(sub)
    data[column]=res

In [17]:

def objective(data):           # used if we want have a regressor rather than a classifier 
    x=data['CLOSE'].as_matrix()
    x2=data['ATR100'].as_matrix()
    res=np.zeros(len(data))
    for i in range(len(data)):
        if( i<len(x)-1):
            res[i]=(x[i+1]-x[i])/x2[i+1] 
    data['objective']= res

In [18]:

def Except(full_list, excludes):    # list "except"  function
    s = set(excludes)
    return (x for x in full_list if x not in s)

In [19]:

def PrepData(XColumns=""):
    global AA
    global split
    global inputCols
    AA=rates.copy()  
    pct(AA,'CLOSE')
    norm(AA,'CLOSE',100)
    norm(AA,'OPEN',100)
    norm(AA,'HIGH',100)
    norm(AA,'LOW',100)
    mom(AA,'CLOSE',3)
    mom(AA,'CLOSE',5)
    mom(AA,'CLOSE',10) 
    mom(AA,'CLOSE',30)     
    mom(AA,'CLOSE',100)
    TR(AA)     
    ATR(AA, 7)   
    ATR(AA, 10)   
    ATR(AA, 20)   
    ATR(AA, 100)   
    bolWidth(AA,'CLOSE',20)
    TSF(AA,'CLOSE',10)
    mavg(AA,'CLOSE',10)
    mavg(AA,'CLOSE',30)
    mavg(AA,'CLOSE',100)
    mavg(AA,'CLOSE',200)
    AA=AA.loc[200:,] # drop the first 200 rows
    AA.reset_index(drop=True, inplace=True)
    AA['ATR10v20']=AA['ATR10'] / AA['ATR20']
    AA['ATR10v100']=AA['ATR10']/AA['ATR100']
    AA['DeltaBolW20']= AA['CLOSEbolW20'].diff()
    if XColumns=="":  # use the default columns except Date and Closepct
        inputCols=list(Except(AA.columns.tolist(),['Date', 'CLOSEpct' ] ))# dont normalize the date and pct CLOSE columns
    else:
        inputCols = XColumns
    for c in Except(inputCols,['OPEN', 'HIGH', 'LOW','CLOSE' ] ): 
        norm(AA,c,200)
    objective(AA)   # add the objective column
    AA.reset_index(drop=True, inplace=True)
    split= int(len(AA)*.8)

In [20]:

def PlotAutoCorrelAfter():
    tmp=AA['CLOSE']
    tmp=tmp[201:]
    plt.plot(tmp)
    plt.show()
    _ = autocorrelation_plot(tmp, label=INST)
    plt.show()

Use XGBoost to Evaluate Feature Importance¶

In [21]:

def GetFeatureImportance(): 
    model = XGBClassifier()
    x=AA[inputCols]
    y=np.sign(AA['CLOSEpct'].shift(-1)) # set Y to be  tomorrow's close px  0 for down 1 for up 
    y[y<0]=0
    xi=x.loc[:split]
    yi=y.loc[:split]
    model.fit(xi, yi)
    featureImp = model.feature_importances_
    if plotall:
        xgb.plot_importance(model)
        plt.show()
    preds=model.predict(x.loc[split:(len(y)-2)])
    print "XGBoost Accuracy is : " + str(accuracy_score(preds,y.loc[split:(len(y)-2)].as_matrix()))
    return featureImp

Deep Learning Woohoo !¶

In [22]:

def MakeModel(cols):     #number of cols 
    model = Sequential()
    model.add(Dense(cols, input_dim=cols, kernel_initializer="normal", activation='relu'))
    model.add(BatchNormalization())
    model.add(Dropout(0.2))
    model.add(Dense(1, kernel_initializer="normal", activation='sigmoid')) #
    optimizer = RMSprop(lr=0.001)
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model

In [23]:

def MakeLSTMModel(window, cols):
    model = Sequential()
    model.add(LSTM(100, input_shape=( window, cols)))   
    model.add(Dense(1, kernel_initializer="normal",  activation='sigmoid')) #
    optimizer = RMSprop(lr=0.001)
    model.compile(loss='binary_crossentropy',optimizer=optimizer, metrics=['accuracy'])
    return model

In [24]:

def TrainModel(cols):
    global yt
    global xt
    global x
    global y
    x=AA[inputCols].as_matrix()
    y=np.sign(AA['CLOSEpct'].shift(-1).as_matrix()) # set Y to be  tomorrow's close price change  0 for down 1 for up 
    y[y<0]=0
    filepath="weights-improvement.hdf5"
    if useLSTM:
        #x, y = SetLSTMInputs()
        x=np.reshape(x, (x.shape[0], 1, x.shape[1])) # add another dimension
        m=MakeLSTMModel(window, cols)
    else:
        m =MakeModel(cols)
    xi=x[201:split]     
    xt=x[split:len(x)-2]     
    yi=y[201:split]   
    yt=y[split:len(y)-2]    
    checkpoints = [ModelCheckpoint(filepath, monitor='val_acc', verbose=0, save_best_only=True, mode='max'),
                   EarlyStopping(monitor='val_acc', patience=waitEpochs, verbose=0)
                  ]    
    hist = m.fit(xi,yi,
            validation_data =(xt,yt),     
            epochs=epochs,
            verbose=0,
            callbacks=checkpoints
            )
    m.load_weights(filepath)
    return m, hist 

Helper Functions¶

In [25]:

def SetLSTMInputs():
    samps=len(x)-offset
    xn=[]
    yn=[]
    for i in range (samps):
        xn.append (x[(offset-window)+i:offset+i])
        yn.append (y[offset+i])    
    return np.array(xn) , np.array(yn)    

In [26]:

def GetNNAccuracy(history):
    global predictions
    predictions=m.predict(xt)
    if plotall:
        plt.plot(history.history['acc'])
        plt.plot(history.history['val_acc'])
        plt.title('model accuracy')
        plt.ylabel('accuracy')
        plt.xlabel('epoch')
        plt.legend(['train', 'test'], loc='upper left')
        plt.show()
        # summarize history for loss
        plt.plot(history.history['loss'])
        plt.plot(history.history['val_loss'])
        plt.title('model loss')
        plt.ylabel('loss')
        plt.xlabel('epoch')
        plt.legend(['train', 'test'], loc='upper left')
        plt.show()
    ytmp=np.expand_dims(yt, axis=1)
    for i in [0 , 0.05, 0.1, 0.2, 0.3, 0.4 ]:
        print "" 
        print "Signal Threshold " + str(i)
        idxs=np.any([predictions > (0.5+i), predictions < (0.5-i)], axis=0)
        preds= predictions[idxs]
        print "Matching Rows " + str(len(preds))
        preds[preds>0.5]=1
        preds[preds<0.5]=0
        print "NN Accuracy " + str (accuracy_score(preds,ytmp[idxs]) )   

In [27]:

def CheckPerformance(preds, thresh, holdPrevPos=True):
    period=len(preds)
    global posVect
    global predictions
    global pnlVect
    predictions=preds
    pnlVect=np.zeros(period)
    posVect=np.zeros(period)
    for i in range(period):
        if preds[i]>(0.5+thresh):
                posVect[i] = 1
        if preds[i]<(0.5-thresh):
                posVect[i] = -1
        if posVect[i]==0:
            if holdPrevPos:
                posVect[i]=posVect[i-1]
            else:
                posVect[i]=0
        pnlVect[i]=posVect[i]*AA['CLOSEpct'].shift(-1)[offset+split+i]
    return  np.cumsum(pnlVect),posVect 

In [28]:

def ChartNNPerformance(thresh,  holdPrevPos=True):
    pnl, positions = CheckPerformance(predictions,thresh, holdPrevPos)
    plt.plot(pnl)
    plt.plot(np.cumsum(AA['CLOSEpct'][offset+split:]).shift(-1).as_matrix())
    legend(["Model perf"] + [INST + " perf"], loc=2)
    plt.show()

In [29]:

def ChartCrossCorrel():
    x=AA[inputCols]
    colv= [x[c]for c in x.columns]              
    colv =pd.concat(colv, axis=1)              
    _ = scatter_matrix(colv, figsize=(20, 20), diagonal='kde')
    plt.show()

In [30]:

def ExamineInputs():
    print "Charts for last 200 days"  
    print ""
    print "Input Colums"  
    print inputCols
    for c in inputCols:
        plt.plot(AA.tail(200)[c])
    plt.show()
    print "predicted positions"  
    plt.bar(range(200),posVect[-200:])
    plt.show()

Functions to run the Analysis¶

In [31]:

def RunFullAnalysis(StockCode='^GSPC'):
    global m
    bestnCols=6
    Start(StockCode)
    if plotall:
        PlotAutoCorrel()
    PrepData()
    if plotall:
        PlotAutoCorrelAfter()
    featureImp = GetFeatureImportance()
    res = TrainModel(len(inputCols))
    m=res[0]
    GetNNAccuracy(res[1])
    ChartNNPerformance(0.0)
    dd ={}
    for i  in range(len(inputCols)):
        dd[inputCols[i]]=featureImp[i]
    
    dd = list(sorted(dd.items(), key=operator.itemgetter(1), reverse=True))
    MostImportantCols=[list(t)[0] for t in dd[:bestnCols]]
    print MostImportantCols
    PrepData(MostImportantCols)
    if plotall:
        ChartCrossCorrel()
    #res = TrainModel(bestnCols)
    #m=res[0]    
    #print "Using Restricted Columns"
    #GetNNAccuracy(res[1])
    #ChartNNPerformance(0.0)

In [32]:

def RunSingleModelWithFixedColumns(StockCode, Columns):
    global m
    Start(StockCode)
    PrepData(Columns)
    res = TrainModel(len(Columns))
    m=res[0]
    GetNNAccuracy(res[1])
    ChartNNPerformance(0.0)

Run Full Analysis¶

Try some random US stocks: NYSE:CAT, IBM, MS , GS, NYSE: XOM

Dense¶

In [50]:

useLSTM=False           
plotall=True           
epochs=1000            
waitEpochs=100  
window=0                
offset=0
RunFullAnalysis(StockCode='IBM') # Dense network
ExamineInputs()

Describing IBM

	OPEN	HIGH	LOW	Close	Volume	CLOSE
count	4000.000000	4000.000000	4000.000000	4000.000000	4.000000e+03	4000.000000
mean	129.121212	130.221605	128.128470	129.205377	6.149336e+06	129.205377
std	41.941123	42.078383	41.826448	41.961273	3.288254e+06	41.961273
min	55.070000	56.700000	54.010000	55.070000	0.000000e+00	55.070000
25%	90.222500	91.142500	89.495000	90.342500	3.962703e+06	90.342500
50%	121.780000	123.245000	120.735000	121.860000	5.382808e+06	121.860000
75%	165.255000	166.685000	164.107500	165.505000	7.498475e+06	165.505000
max	215.380000	215.900000	214.300000	215.800000	4.038760e+07	215.800000

XGBoost Accuracy is : 0.499341238472

Signal Threshold 0
Matching Rows 758
NN Accuracy 0.519788918206

Signal Threshold 0.05
Matching Rows 278
NN Accuracy 0.510791366906

Signal Threshold 0.1
Matching Rows 65
NN Accuracy 0.415384615385

Signal Threshold 0.2
Matching Rows 8
NN Accuracy 0.25

Signal Threshold 0.3
Matching Rows 3
NN Accuracy 0.333333333333

Signal Threshold 0.4
Matching Rows 2
NN Accuracy 0.5

['CLOSEmom5', 'TR', 'Volume', 'ATR100', 'CLOSEmom3', 'CLOSEmom100']

Charts for last 200 days

Input Colums
['CLOSEmom5', 'TR', 'Volume', 'ATR100', 'CLOSEmom3', 'CLOSEmom100']

predicted positions

In [58]:

#ChartNNPerformance(0.0,True)

LSTM¶

In [35]:

#useLSTM=True          
#plotall=True          
#epochs=20             
#waitEpochs=10  
#window=1                
#offset=0
#RunFullAnalysis(StockCode='NYSE:CAT') 
#ExamineInputs()

Fixed Set of Columns¶

Dense¶

In [36]:

#useLSTM=False           
#plotall=False           
#epochs=5000             
#waitEpochs=500  
#window=0                
#offset=0
#RunSingleModelWithFixedColumns(StockCode='NYSE:CAT', Columns=['ATR10v100', 'DeltaBolW20', 'TSF10', 'ATR7', 'CLOSEmom3', 'CLOSEmavg100']) 
#ExamineInputs()

LSTM¶

In [37]:

#useLSTM=True           
#plotall=True           
#epochs=200             
#waitEpochs=100  
#window=1                
#offset=0
#RunSingleModelWithFixedColumns(StockCode='NYSE:CAT', Columns=['CLOSEbolW20', 'CLOSEmom3', 'TSF10', 'ATR7', 'CLOSE']) 
#ExamineInputs()