This project builds on my 4th-year thesis, which examined time series momentum in cryptocurrency returns. In the thesis, I found evidence via basic linear regressions that past returns have some predictive power for future returns at the monthly and weekly horizons. In this project, I aim to go further and look at the impact of past returns combined with past trading volume, and see how well these predict future daily returns. The project focuses mainly on the machine learning models I've been exposed to during this class.
The dataset is from Yahoo Finance - https://finance.yahoo.com/quote/BTC-USD/
First, we import the required libraries and the dataset.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression, Lasso
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn.ensemble import GradientBoostingClassifier
import sklearn as sk
import sklearn.metrics
df = pd.read_csv('BTC-USD.csv')
#recognize the date column as a datetime object in pandas
df['Date'] = pd.to_datetime(df['Date'])
#set the date as the index
df.set_index('Date', inplace=True)
#we'll keep only the open price and volume
df.drop(columns=['High', 'Low', 'Close', 'Adj Close'], inplace=True)
Calculate the daily returns and the percentage change in volume. Working with returns and percentage changes, rather than price and volume levels, helps make the data stationary, which facilitates more accurate prediction and inference.
df['daily_pct_chg'] = df['Open'].pct_change(periods=1)
df['daily_pct_chg_vol'] = df['Volume'].pct_change(periods=1)
df.dropna(inplace=True)
print(df.head())
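As a quick check of the stationarity assumption, an augmented Dickey-Fuller test can be applied to both series; a minimal sketch, assuming statsmodels is available:
#augmented Dickey-Fuller test: small p-values support treating the series as stationary
from statsmodels.tsa.stattools import adfuller
for col in ['daily_pct_chg', 'daily_pct_chg_vol']:
    stat, pvalue = adfuller(df[col])[:2]
    print(f'{col}: ADF statistic = {stat:.2f}, p-value = {pvalue:.4f}')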
Let's see how the returns look with a few graphs.
fig, ax = plt.subplots(ncols=2,nrows=2,figsize=(30,30))
sns.set(style="whitegrid", font_scale=2.5)
#first image
image1 = sns.boxplot(data=df['daily_pct_chg'], orient='v',ax=ax[0,0])
image1.set(ylabel='Returns (Decimal)', title='Bitcoin Daily Returns Boxplot')
#second image
image2 = sns.distplot(df['daily_pct_chg'],kde=False,norm_hist=True,bins=100,ax=ax[0,1])
image2.set(ylabel='Frequency', title='Bitcoin Daily Returns Distribution',xlabel='Returns')
#third image
image3 = sns.boxplot(data=df['daily_pct_chg_vol'], orient='v',ax=ax[1,0])
image3.set(ylabel='Percentage Change (Decimal)', title='Bitcoin Daily Volume % Change Boxplot')
#fourth image
image4 = sns.distplot(df['daily_pct_chg_vol'],kde=False,norm_hist=True,bins=100,ax=ax[1,1])
image4.set(ylabel='Frequency', title='Bitcoin Daily Volume % Change Distribution',xlabel='Change in Volume')
plt.show()
As we can see from the graphs above, both the daily returns and the daily changes in volume are centered around 0, which is what we would expect. The vertical spread of the boxplots in the first column also shows that the volatility of volume changes is much greater than the volatility of returns. This volatility in traded volume is part of the motivation for including it as an indicator alongside time series momentum.
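One quick way to quantify this is to compare the standard deviations of the two series:
#daily volume changes are far more dispersed than daily returns
print(df[['daily_pct_chg', 'daily_pct_chg_vol']].std())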
Let's also take a look at how returns vary by day of the week.
df['Day'] = df.index.day_name()
fig, ax = plt.subplots(figsize=(7.5,7.5))
df.groupby('Day')['daily_pct_chg'].mean().plot(kind='bar',ax=ax)
plt.title('Bitcoin Daily Returns')
plt.xlabel('Day of Week')
plt.ylabel('Mean Returns (Decimal)')
From the plot above, we can see that there is quite a bit of variability in mean returns across days. I will therefore add dummy variables indicating the day of the week to try to extract more information from the dataset for the models.
#creating 6 dummies instead of 7 by specifying drop_first=True to avoid the dummy variable trap
df = pd.concat([df,pd.get_dummies(df['Day'],drop_first=True)],axis=1)
df.head()
Now we create lagged values for the daily returns.
for i in range(1,5):
    column_name = 'lag_return_' + str(i)
    df[column_name] = df['daily_pct_chg'].shift(i)
print(df.head(5))
We repeat this for the lagged volume indicators.
for i in range(1,5):
    column_name = 'lag_return_vol_' + str(i)
    df[column_name] = df['daily_pct_chg_vol'].shift(i)
df.dropna(inplace=True)
print(df.head(5))
The next part of this project investigates the predictive accuracy of several models, including logistic regression and neural networks. To make it easier to compare between models, I've decided to frame the problem as a classification problem, where I predict whether the daily return moves up or down, a binary outcome, instead of how large the return is. I've printed a few error scores for the models below after each cell, and I also compare them in a graph later on.
#create a binary column for daily returns where 0 represents negative returns and 1 represents positive returns
def fn_binary(value):
    if value < 0:
        return 0
    else:
        return 1

df['target'] = df['daily_pct_chg'].apply(fn_binary)
print(df.columns)
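Before modelling, it is also worth checking how balanced the two classes are, since the majority-class frequency is the baseline any classifier has to beat:
#proportion of up days (1) and down days (0)
print(df['target'].value_counts(normalize=True))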
First we'll make the training and test data for our models. We'll use a random 90% of the data for training and the remaining 10% of the data to see how our models generalize to data they haven't seen before.
#Split and sort into train, test
dftrain, dftest = train_test_split(df, test_size=0.1, random_state=101)
#training data, drop the columns we don't need
trainy = pd.to_numeric(dftrain['target'])
trainx = (dftrain.drop(['Open','Volume','daily_pct_chg','Day','target'], axis=1))
#Test data
testy = pd.to_numeric(dftest['target'])
testx = dftest.drop(['Open','Volume','daily_pct_chg','Day', 'target'], axis=1)
print(trainx.shape)
print(testx.shape)
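As a side note, this is a random split; because the rows are ordered in time, a chronological split (training on earlier data, testing on later data) is a common alternative. A minimal sketch of that variant, not used in the rest of this notebook:
#chronological 90/10 split as an alternative to the random split above
split_idx = int(len(df) * 0.9)
dftrain_chrono, dftest_chrono = df.iloc[:split_idx], df.iloc[split_idx:]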
Let's first train a logistic regression and analyze the results.
model_logistic = LogisticRegression()
model_logistic.fit(trainx, trainy)
logistic_train_accuracy = model_logistic.score(trainx,trainy)
logistic_test_accuracy = model_logistic.score(testx,testy)
print(f'Logistic Train Accuracy is: {logistic_train_accuracy}')
print(f'Logistic Test Accuracy is: {logistic_test_accuracy}')
#note: the log loss here is computed on the hard 0/1 class predictions rather than on predicted probabilities (predict_proba)
log_loss_logistic_train = sk.metrics.log_loss(trainy, model_logistic.predict(trainx))
log_loss_logistic_test = sk.metrics.log_loss(testy, model_logistic.predict(testx))
print(f'Logistic LogLoss Train is: {log_loss_logistic_train}')
print(f'Logistic LogLoss Test is: {log_loss_logistic_test}')
Let's try a neural network classifier. We should scale the data to make it easier for the model to train.
scaler = StandardScaler()
#fit and transform on the input training data
trainx = scaler.fit_transform(trainx)
#use the same scaler to transform the input test data
testx = scaler.transform(testx)
#early stopping to try and avoid overfitting on the training data
#this should lead to better generalization to test data
model_nn = MLPClassifier(early_stopping=True, hidden_layer_sizes=(20,20,20))
model_nn.fit(trainx, trainy)
nn_train_accuracy = model_nn.score(trainx,trainy)
nn_test_accuracy = model_nn.score(testx,testy)
print(f'Neural Network Train Accuracy is: {nn_train_accuracy}')
print(f'Neural Network Test Accuracy is: {nn_test_accuracy}')
log_loss_nn_train = sk.metrics.log_loss(trainy, model_nn.predict(trainx))
log_loss_nn_test = sk.metrics.log_loss(testy, model_nn.predict(testx))
print(f'Neural Network LogLoss Train is: {log_loss_nn_train}')
print(f'Neural Network LogLoss Test is: {log_loss_nn_test}')
Now I'll try a Gradient Boosting Classifier, using grid search cross-validation to tune some of the model's parameters. When a model like this is used to trade real money, the uncertainty around each prediction often influences how much money is invested, so I also report a log loss error for the models. Log loss penalizes the model heavily when it predicts a high probability and is wrong, and therefore rewards well-calibrated confidence.
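To make the metric concrete, here is a small illustrative sketch with hypothetical probabilities (not model output): log loss is small for confident correct forecasts and very large for confident wrong ones. The scores reported in this notebook are computed on hard 0/1 class predictions, which pushes them towards the clipped maximum.
#illustrative example of log loss with hypothetical probability forecasts
from sklearn.metrics import log_loss
y_true = [1, 1, 0]
print(log_loss(y_true, [0.99, 0.98, 0.01]))  #near-certain and correct -> very small loss
print(log_loss(y_true, [0.01, 0.02, 0.99]))  #near-certain and wrong -> very large loss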
#default values, overwritten by the grid search below
opt_max_depth=2
opt_n_estimators=500
opt_learning_rate=0.1
#Optimize some of the parameters with a grid search
param_test1 = {'n_estimators':[100,200,600],'learning_rate':[0.001,0.01,0.1],'max_depth':[1,3,4]}
gsearch1 = GridSearchCV(estimator = GradientBoostingClassifier(),
param_grid = param_test1,scoring='accuracy',
cv=5, verbose=0)
gsearch1.fit(trainx, trainy)
opt_n_estimators=gsearch1.best_params_['n_estimators']
opt_learning_rate=gsearch1.best_params_['learning_rate']
opt_max_depth=gsearch1.best_params_['max_depth']
model_tree = GradientBoostingClassifier(n_estimators=opt_n_estimators,
max_depth=opt_max_depth,learning_rate=opt_learning_rate)
model_tree.fit(trainx, trainy)
tree_train_accuracy = model_tree.score(trainx,trainy)
tree_test_accuracy = model_tree.score(testx,testy)
print(f'Tree Train Accuracy is: {tree_train_accuracy}')
print(f'Tree Test Accuracy is: {tree_test_accuracy}')
log_loss_tree_train = sk.metrics.log_loss(trainy, model_tree.predict(trainx))
log_loss_tree_test = sk.metrics.log_loss(testy, model_tree.predict(testx))
print(f'Tree LogLoss Train is: {log_loss_tree_train}')
print(f'Tree LogLoss Test is: {log_loss_tree_test}')
Now we'll compare the error scores for the three models.
train_accuracy = (logistic_train_accuracy, nn_train_accuracy, tree_train_accuracy)
test_accuracy = (logistic_test_accuracy, nn_test_accuracy, tree_test_accuracy)
train_logloss = (log_loss_logistic_train, log_loss_nn_train, log_loss_tree_train)
test_logloss = (log_loss_logistic_test, log_loss_nn_test, log_loss_tree_test)
ind = np.arange(len(train_accuracy)) # the x locations for the groups
width = 0.35 # the width of the bars
fig, ax = plt.subplots(figsize=(20,20),nrows=2)
rects1 = ax[0].bar(ind - width/2, train_accuracy, width,
label='train')
rects2 = ax[0].bar(ind + width/2, test_accuracy, width,
label='test')
rects3 = ax[1].bar(ind - width/2, train_logloss, width,
label='train')
rects4 = ax[1].bar(ind + width/2, test_logloss, width,
label='test')
# Add some text for labels, title and custom x-axis tick labels, etc.
ax[0].set_ylabel('Accuracy')
ax[0].set_title('Accuracy by Model')
ax[0].set_xticks(ind)
ax[0].set_xticklabels(('Logistic', 'NN', 'Tree'))
ax[0].legend(loc='lower right')
ax[1].set_ylabel('Log Loss')
ax[1].set_title('LogLoss by Model')
ax[1].set_xticks(ind)
ax[1].set_xticklabels(('Logistic', 'NN', 'Tree'))
ax[1].legend(loc='lower right')
#annotate each bar with its value
for x, val in enumerate(train_accuracy):
    ax[0].text(ind[x]-(width), val, str(val)[0:5])
for x, val in enumerate(test_accuracy):
    ax[0].text(ind[x]+(width/2), val, str(val)[0:5])
for x, val in enumerate(train_logloss):
    ax[1].text(ind[x]-(width), val, str(val)[0:5])
for x, val in enumerate(test_logloss):
    ax[1].text(ind[x]+(width/2), val, str(val)[0:5])
From the graphs above, we see that the logistic regression performs best in terms of accuracy on both the train and test sets. Since this is a classification problem, accuracy is simply the proportion of predictions the model gets right, so an accuracy of 0.525 on the test set means it predicts the direction correctly 52.5% of the time. Although this may not seem like a good score, financial markets are difficult to predict, and accuracy above 50% could potentially support a profitable trading strategy; that, however, is beyond the scope of this project.
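For context, the accuracy of a naive classifier that always predicts the majority class on the test set gives a sense of how much the models add:
#baseline accuracy from always predicting the more common class in the test set
majority_baseline = max(testy.mean(), 1 - testy.mean())
print(f'Majority-class baseline accuracy is: {majority_baseline}')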
In terms of log loss, which may be a more appropriate measure for a trading strategy, the Gradient Boosting Classifier (tree model) performed best on the test set with a log loss of 16.03. It should be noted that a grid search was only implemented for the Gradient Boosting Classifier and not for the neural network, due to the limited computational resources available for the Jupyter notebook and scikit-learn.
Now I will switch from a classification problem back to a regression problem, which allows me to predict the magnitude of the future daily return. I will compare the results of a linear regression model and a neural network.
#Split and sort into train, test
dftrain, dftest = train_test_split(df, test_size=0.1, random_state=101)
#training data, drop the columns we don't need
trainy = pd.to_numeric(dftrain['daily_pct_chg'])
trainx = dftrain.drop(['Open','Volume','target','Day','daily_pct_chg'], axis=1)
#Test data
testy = pd.to_numeric(dftest['daily_pct_chg'])
testx = dftest.drop(['Open','Volume','target','Day','daily_pct_chg'], axis=1)
linear_model= LinearRegression()
linear_model.fit(trainx, trainy)
mse_linear_train = sk.metrics.mean_squared_error(trainy, linear_model.predict(trainx))
mse_linear_test = sk.metrics.mean_squared_error(testy, linear_model.predict(testx))
#MSE values are scaled by 1000 for readability
print(f'Linear MSE Train (x1000) is: {mse_linear_train*1000}')
print(f'Linear MSE Test (x1000) is: {mse_linear_test*1000}')
nn_model= MLPRegressor(hidden_layer_sizes=(20,20,20))
nn_model.fit(trainx, trainy)
mse_nn_train = sk.metrics.mean_squared_error(trainy, nn_model.predict(trainx))
mse_nn_test = sk.metrics.mean_squared_error(testy, nn_model.predict(testx))
print(f'Neural Network MSE Train (x1000) is: {mse_nn_train*1000}')
print(f'Neural Network MSE Test (x1000) is: {mse_nn_test*1000}')
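Since squared errors of decimal returns are hard to read directly, the root mean squared error puts the scores back in units of daily return:
#root mean squared error in units of daily return (decimal)
print(f'Linear RMSE Test is: {np.sqrt(mse_linear_test)}')
print(f'Neural Network RMSE Test is: {np.sqrt(mse_nn_test)}')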
The linear model performs better in terms of mean squared error on both the training and test sets. Let's now visualize how each model's predictions compare with the actual returns.
df_results = pd.DataFrame(data={'Linear':linear_model.predict(testx),
'Neural Network':nn_model.predict(testx),
'Actual':testy})
f, ax = plt.subplots(figsize=(20,20))
ax.scatter(testy, linear_model.predict(testx),color='#2ca02c', label='Linear Model')#green
ax.scatter(testy, nn_model.predict(testx), color='#9467bd', label='Neural Network Model')#purple
ax.legend()
ax.set_ylabel('Predicted Return')
ax.set_title('Predicted vs Actual Returns (Test Data)')
ax.set_xlabel('Actual Return')
plt.show()
From the graph above, we see that the linear model (green) predicts values much closer to 0, while the neural network model (purple) has a much wider range to its predictions. Although the linear model does have lower error scores, it's unclear whether either model would be sufficiently accurate to implement in a trading system.
To conclude, this project was mostly about gaining experience working with a dataset and using some of Python's prediction tools as an extension to my thesis. A simple logistic model performs well on the classification problem and a linear regression performs well on the regression problem; however, hyperparameter tuning of the neural network and gradient boosting models is an art of its own, so we cannot conclude which model is definitively the best.