In this case study, we will use the dimensionality reduction approach to enhance the "bitcoin trading strategy" case study discussed in Chapter 6.
The data and variables used in this case study are the same as in the case study presented in the classification chapter. The data is bitcoin data for the period January 2012 to October 2017, with minute-by-minute updates of OHLC (Open, High, Low, Close), volume in BTC, volume in the indicated currency, and the weighted bitcoin price.
# Load libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import read_csv, set_option
from pandas.plotting import scatter_matrix
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from mpl_toolkits.mplot3d import Axes3D
import re
from collections import OrderedDict
from time import time
import sqlite3
from scipy.linalg import svd
from scipy import stats
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE
import warnings
warnings.filterwarnings('ignore')
from ipywidgets import interactive, fixed
dataset = pd.read_csv(r'../../Chapter 6 - Sup. Learning - Classification models/CaseStudy3 - Bitcoin Trading Strategy/BitstampData.csv')
# shape
dataset.shape
(2841377, 8)
# peek at data
set_option('display.width', 100)
dataset.tail(5)
Timestamp | Open | High | Low | Close | Volume_(BTC) | Volume_(Currency) | Weighted_Price | |
---|---|---|---|---|---|---|---|---|
2841372 | 1496188560 | 2190.49 | 2190.49 | 2181.37 | 2181.37 | 1.700166 | 3723.784755 | 2190.247337 |
2841373 | 1496188620 | 2190.50 | 2197.52 | 2186.17 | 2195.63 | 6.561029 | 14402.811961 | 2195.206304 |
2841374 | 1496188680 | 2195.62 | 2197.52 | 2191.52 | 2191.83 | 15.662847 | 34361.023647 | 2193.791712 |
2841375 | 1496188740 | 2195.82 | 2216.00 | 2195.82 | 2203.51 | 27.090309 | 59913.492565 | 2211.620837 |
2841376 | 1496188800 | 2201.70 | 2209.81 | 2196.98 | 2208.33 | 9.961835 | 21972.308955 | 2205.648801 |
# describe data
set_option('display.precision', 3)
dataset.describe()
Timestamp | Open | High | Low | Close | Volume_(BTC) | Volume_(Currency) | Weighted_Price | |
---|---|---|---|---|---|---|---|---|
count | 2.841e+06 | 1.651e+06 | 1.651e+06 | 1.651e+06 | 1.651e+06 | 1.651e+06 | 1.651e+06 | 1.651e+06 |
mean | 1.411e+09 | 4.959e+02 | 4.962e+02 | 4.955e+02 | 4.959e+02 | 1.188e+01 | 5.316e+03 | 4.959e+02 |
std | 4.938e+07 | 3.642e+02 | 3.645e+02 | 3.639e+02 | 3.643e+02 | 4.094e+01 | 1.998e+04 | 3.642e+02 |
min | 1.325e+09 | 3.800e+00 | 3.800e+00 | 1.500e+00 | 1.500e+00 | 0.000e+00 | 0.000e+00 | 3.800e+00 |
25% | 1.368e+09 | 2.399e+02 | 2.400e+02 | 2.398e+02 | 2.399e+02 | 3.828e-01 | 1.240e+02 | 2.399e+02 |
50% | 1.411e+09 | 4.200e+02 | 4.200e+02 | 4.199e+02 | 4.200e+02 | 1.823e+00 | 6.146e+02 | 4.200e+02 |
75% | 1.454e+09 | 6.410e+02 | 6.417e+02 | 6.402e+02 | 6.410e+02 | 8.028e+00 | 3.108e+03 | 6.410e+02 |
max | 1.496e+09 | 2.755e+03 | 2.760e+03 | 2.752e+03 | 2.755e+03 | 5.854e+03 | 1.866e+06 | 2.754e+03 |
# Checking for any null values
print('Null Values =',dataset.isnull().values.any())
Null Values = True
Given that there are null values, we clean the data by forward-filling the NaNs with the last available values.
dataset[dataset.columns.values] = dataset[dataset.columns.values].ffill()
dataset=dataset.drop(columns=['Timestamp'])
We attach a label to each movement: 1 when the short-term moving average is above the long-term moving average (a buy signal), and 0 otherwise (a sell signal).
# Create short simple moving average over the short window
dataset['short_mavg'] = dataset['Close'].rolling(window=10, min_periods=1, center=False).mean()
# Create long simple moving average over the long window
dataset['long_mavg'] = dataset['Close'].rolling(window=60, min_periods=1, center=False).mean()
# Create signals
dataset['signal'] = np.where(dataset['short_mavg'] > dataset['long_mavg'], 1.0, 0.0)
dataset.tail()
Open | High | Low | Close | Volume_(BTC) | Volume_(Currency) | Weighted_Price | short_mavg | long_mavg | signal | |
---|---|---|---|---|---|---|---|---|---|---|
2841372 | 2190.49 | 2190.49 | 2181.37 | 2181.37 | 1.700 | 3723.785 | 2190.247 | 2179.259 | 2189.616 | 0.0 |
2841373 | 2190.50 | 2197.52 | 2186.17 | 2195.63 | 6.561 | 14402.812 | 2195.206 | 2181.622 | 2189.877 | 0.0 |
2841374 | 2195.62 | 2197.52 | 2191.52 | 2191.83 | 15.663 | 34361.024 | 2193.792 | 2183.605 | 2189.943 | 0.0 |
2841375 | 2195.82 | 2216.00 | 2195.82 | 2203.51 | 27.090 | 59913.493 | 2211.621 | 2187.018 | 2190.204 | 0.0 |
2841376 | 2201.70 | 2209.81 | 2196.98 | 2208.33 | 9.962 | 21972.309 | 2205.649 | 2190.712 | 2190.510 | 1.0 |
We perform feature engineering to construct the technical indicators that will be used as predictors, along with the output variable.
The current bitcoin data consists of date, open, high, low, close, and volume. Using this data we calculate the following technical indicators:
# Calculation of exponential moving average
def EMA(df, n):
    EMA = pd.Series(df['Close'].ewm(span=n, min_periods=n).mean(), name='EMA_' + str(n))
    return EMA
dataset['EMA10'] = EMA(dataset, 10)
dataset['EMA30'] = EMA(dataset, 30)
dataset['EMA200'] = EMA(dataset, 200)
dataset.head()
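For reference, ewm(span=n) uses the smoothing factor alpha = 2 / (n + 1); up to pandas' initial weighting adjustment, this corresponds to the standard recursion EMA_t = alpha * P_t + (1 - alpha) * EMA_{t-1}.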
# Calculation of rate of change
def ROC(df, n):
    M = df.diff(n - 1)
    N = df.shift(n - 1)
    ROC = pd.Series(((M / N) * 100), name='ROC_' + str(n))
    return ROC
dataset['ROC10'] = ROC(dataset['Close'], 10)
dataset['ROC30'] = ROC(dataset['Close'], 30)
# Calculation of price momentum
def MOM(df, n):
    MOM = pd.Series(df.diff(n), name='Momentum_' + str(n))
    return MOM
dataset['MOM10'] = MOM(dataset['Close'], 10)
dataset['MOM30'] = MOM(dataset['Close'], 30)
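As implemented, momentum is the raw price change MOM_n = P_t - P_{t-n}, while the rate of change is its percentage counterpart, ROC_n = (P_t - P_{t-(n-1)}) / P_{t-(n-1)} * 100 (note the n - 1 offset used in the code above).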
# Calculation of relative strength index
def RSI(series, period):
    delta = series.diff().dropna()
    u = delta * 0
    d = u.copy()
    u[delta > 0] = delta[delta > 0]
    d[delta < 0] = -delta[delta < 0]
    u[u.index[period-1]] = np.mean(u[:period])  # first value is the average of the gains over the period
    u = u.drop(u.index[:(period-1)])
    d[d.index[period-1]] = np.mean(d[:period])  # first value is the average of the losses over the period
    d = d.drop(d.index[:(period-1)])
    rs = u.ewm(com=period-1, adjust=False).mean() / \
         d.ewm(com=period-1, adjust=False).mean()
    return 100 - 100 / (1 + rs)
dataset['RSI10'] = RSI(dataset['Close'], 10)
dataset['RSI30'] = RSI(dataset['Close'], 30)
dataset['RSI200'] = RSI(dataset['Close'], 200)
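The function above implements the standard definition RSI = 100 - 100 / (1 + RS), where RS is the ratio of the exponentially smoothed average gain to the exponentially smoothed average loss over the chosen period.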
# Calculation of stochastic oscillator
def STOK(close, low, high, n):
    STOK = ((close - low.rolling(n).min()) / (high.rolling(n).max() - low.rolling(n).min())) * 100
    return STOK

def STOD(close, low, high, n):
    STOK = ((close - low.rolling(n).min()) / (high.rolling(n).max() - low.rolling(n).min())) * 100
    STOD = STOK.rolling(3).mean()
    return STOD
dataset['%K10'] = STOK(dataset['Close'], dataset['Low'], dataset['High'], 10)
dataset['%D10'] = STOD(dataset['Close'], dataset['Low'], dataset['High'], 10)
dataset['%K30'] = STOK(dataset['Close'], dataset['Low'], dataset['High'], 30)
dataset['%D30'] = STOD(dataset['Close'], dataset['Low'], dataset['High'], 30)
dataset['%K200'] = STOK(dataset['Close'], dataset['Low'], dataset['High'], 200)
dataset['%D200'] = STOD(dataset['Close'], dataset['Low'], dataset['High'], 200)
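Both functions follow the standard definitions: %K = 100 * (Close - L_n) / (H_n - L_n), where L_n and H_n are the lowest low and highest high over the last n periods, and %D is the 3-period simple moving average of %K.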
# Calculation of moving average
def MA(df, n):
    MA = pd.Series(df['Close'].rolling(n, min_periods=n).mean(), name='MA_' + str(n))
    return MA

# Note: the column names below are retained for consistency with the tables in
# this chapter, but the rolling windows actually used are 10, 30 and 200 minutes.
dataset['MA21'] = MA(dataset, 10)
dataset['MA63'] = MA(dataset, 30)
dataset['MA252'] = MA(dataset, 200)
dataset.tail()
Open | High | Low | Close | Volume_(BTC) | Volume_(Currency) | Weighted_Price | short_mavg | long_mavg | signal | ... | RSI200 | %K10 | %D10 | %K30 | %D30 | %K200 | %D200 | MA21 | MA63 | MA252 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2841372 | 2190.49 | 2190.49 | 2181.37 | 2181.37 | 1.700 | 3723.785 | 2190.247 | 2179.259 | 2189.616 | 0.0 | ... | 46.613 | 56.447 | 73.774 | 47.883 | 59.889 | 16.012 | 18.930 | 2179.259 | 2182.291 | 2220.727 |
2841373 | 2190.50 | 2197.52 | 2186.17 | 2195.63 | 6.561 | 14402.812 | 2195.206 | 2181.622 | 2189.877 | 0.0 | ... | 47.638 | 93.687 | 71.712 | 93.805 | 65.119 | 26.697 | 20.096 | 2181.622 | 2182.292 | 2220.295 |
2841374 | 2195.62 | 2197.52 | 2191.52 | 2191.83 | 15.663 | 34361.024 | 2193.792 | 2183.605 | 2189.943 | 0.0 | ... | 47.395 | 80.995 | 77.043 | 81.350 | 74.346 | 23.850 | 22.186 | 2183.605 | 2182.120 | 2219.802 |
2841375 | 2195.82 | 2216.00 | 2195.82 | 2203.51 | 27.090 | 59913.493 | 2211.621 | 2187.018 | 2190.204 | 0.0 | ... | 48.213 | 74.205 | 82.963 | 74.505 | 83.220 | 32.602 | 27.716 | 2187.018 | 2182.337 | 2219.396 |
2841376 | 2201.70 | 2209.81 | 2196.98 | 2208.33 | 9.962 | 21972.309 | 2205.649 | 2190.712 | 2190.510 | 1.0 | ... | 48.545 | 82.810 | 79.337 | 84.344 | 80.066 | 36.440 | 30.964 | 2190.712 | 2182.715 | 2218.980 |
5 rows × 29 columns
#excluding columns that are not needed for our prediction.
dataset=dataset.drop(['High','Low','Open', 'Volume_(Currency)','short_mavg','long_mavg'], axis=1)
dataset = dataset.dropna(axis=0)
dataset.tail()
Close | Volume_(BTC) | Weighted_Price | signal | EMA10 | EMA30 | EMA200 | ROC10 | ROC30 | MOM10 | ... | RSI200 | %K10 | %D10 | %K30 | %D30 | %K200 | %D200 | MA21 | MA63 | MA252 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2841372 | 2181.37 | 1.700 | 2190.247 | 0.0 | 2181.181 | 2182.376 | 2211.244 | 0.431 | -0.649 | 8.42 | ... | 46.613 | 56.447 | 73.774 | 47.883 | 59.889 | 16.012 | 18.930 | 2179.259 | 2182.291 | 2220.727 |
2841373 | 2195.63 | 6.561 | 2195.206 | 0.0 | 2183.808 | 2183.231 | 2211.088 | 1.088 | -0.062 | 23.63 | ... | 47.638 | 93.687 | 71.712 | 93.805 | 65.119 | 26.697 | 20.096 | 2181.622 | 2182.292 | 2220.295 |
2841374 | 2191.83 | 15.663 | 2193.792 | 0.0 | 2185.266 | 2183.786 | 2210.897 | 1.035 | -0.235 | 19.83 | ... | 47.395 | 80.995 | 77.043 | 81.350 | 74.346 | 23.850 | 22.186 | 2183.605 | 2182.120 | 2219.802 |
2841375 | 2203.51 | 27.090 | 2211.621 | 0.0 | 2188.583 | 2185.058 | 2210.823 | 1.479 | 0.297 | 34.13 | ... | 48.213 | 74.205 | 82.963 | 74.505 | 83.220 | 32.602 | 27.716 | 2187.018 | 2182.337 | 2219.396 |
2841376 | 2208.33 | 9.962 | 2205.649 | 1.0 | 2192.174 | 2186.560 | 2210.798 | 1.626 | 0.516 | 36.94 | ... | 48.545 | 82.810 | 79.337 | 84.344 | 80.066 | 36.440 | 30.964 | 2190.712 | 2182.715 | 2218.980 |
5 rows × 23 columns
dataset[['Weighted_Price']].plot(grid=True)
plt.show()
fig = plt.figure()
plot = dataset.groupby(['signal']).size().plot(kind='barh', color='red')
plt.show()
The predicted variable is 1 (buy) for 52.87% of the observations, meaning the number of buy signals was slightly higher than the number of sell signals.
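As a quick check, the exact class balance can be computed directly from the signal column (a minimal sketch using the dataset defined above):
# Fraction of buy (1.0) vs. sell (0.0) labels in the predicted variable
print(dataset['signal'].value_counts(normalize=True))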
We take the most recent 10,000 observations and split them into an 80% training set and a 20% test set.
# split out validation dataset for the end
subset_dataset= dataset.iloc[-10000:]
Y= subset_dataset["signal"]
X = subset_dataset.loc[:, dataset.columns != 'signal']
validation_size = 0.2
seed = 1
X_train, X_validation, Y_train, Y_validation = train_test_split(X, Y, test_size=validation_size, random_state=seed)
As a preprocessing step, let's standardize the feature values so they have zero mean and unit variance. This puts the features on a comparable scale and prepares the data for singular value decomposition.
scaler = StandardScaler().fit(X_train)
rescaledDataset = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns, index=X_train.index)
# drop any remaining NaNs (defensive; the earlier cleaning should have removed them)
X_train.dropna(how='any', inplace=True)
rescaledDataset.dropna(how='any', inplace=True)
rescaledDataset.head(2)
Close | Volume_(BTC) | Weighted_Price | EMA10 | EMA30 | EMA200 | ROC10 | ROC30 | MOM10 | MOM30 | ... | RSI200 | %K10 | %D10 | %K30 | %D30 | %K200 | %D200 | MA21 | MA63 | MA252 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2834071 | 1.072 | -0.367 | 1.040 | 1.064 | 1.077 | 1.014 | 0.005 | -0.159 | 0.009 | -0.183 | ... | -0.325 | 1.322 | 0.427 | -0.205 | -0.412 | 0.714 | 0.673 | 1.061 | 1.086 | 0.895 |
2836517 | -1.738 | 1.126 | -1.714 | -1.687 | -1.653 | -1.733 | -0.533 | -0.597 | -0.066 | -0.416 | ... | -0.465 | -1.620 | -0.511 | -1.283 | -0.970 | -0.988 | -0.788 | -1.685 | -1.643 | -1.662 |
2 rows × 22 columns
We want to reduce the dimensionality of the problem to make it more manageable, but at the same time we want to preserve as much information as possible.
Hence, we use a technique called singular value decomposition (SVD), which is one way of performing PCA. SVD is a matrix factorization commonly used in signal processing and data compression. We use the TruncatedSVD method from the sklearn package.
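To make the link between the factorization and TruncatedSVD concrete, here is a minimal sketch on synthetic data (the array X below is a stand-in for the standardized feature matrix, not the actual dataset):
# Truncated SVD keeps the top-k singular triplets of X = U * S * Vt;
# sklearn's transform returns U_k * S_k (up to per-component sign flips).
import numpy as np
from sklearn.decomposition import TruncatedSVD
X = np.random.rand(100, 22)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 5
X_k = U[:, :k] * s[:k]
svd_check = TruncatedSVD(n_components=k, algorithm='arpack').fit(X)
print(np.allclose(np.abs(svd_check.transform(X)), np.abs(X_k)))  # expected: True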
from matplotlib.ticker import MaxNLocator
ncomps = 5
svd = TruncatedSVD(n_components=ncomps)
svd_fit = svd.fit(rescaledDataset)
plt_data = pd.DataFrame(svd_fit.explained_variance_ratio_.cumsum()*100)
plt_data.index = np.arange(1, len(plt_data) + 1)
Y_pred = svd.fit_transform(rescaledDataset)
ax = plt_data.plot(kind='line', figsize=(10, 4))
ax.xaxis.set_major_locator(MaxNLocator(integer=True))
ax.set_xlabel("Number of components")
ax.set_ylabel("Percentage of variance explained")
ax.get_legend().remove()
print('Variance preserved by first 5 components == {:.2%}'.format(svd_fit.explained_variance_ratio_.cumsum()[-1]))
Variance preserved by first 5 components == 92.75%
We can preserve 92.75% of the variance by using just 5 components rather than the full 22 original features.
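If we instead wanted the smallest number of components that meets a target variance threshold, a sketch along these lines (reusing rescaledDataset from above, with an illustrative 95% threshold) would do it:
# Fit with more components and pick the smallest k whose cumulative
# explained variance reaches the threshold (assumes it is reachable here)
svd_full = TruncatedSVD(n_components=10).fit(rescaledDataset)
cumvar = svd_full.explained_variance_ratio_.cumsum()
k_target = int(np.argmax(cumvar >= 0.95)) + 1
print('Components needed for 95% variance:', k_target)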
dfsvd = pd.DataFrame(Y_pred, columns=['c{}'.format(c) for c in range(ncomps)], index=rescaledDataset.index)
print(dfsvd.shape)
dfsvd.head()
(8000, 5)
c0 | c1 | c2 | c3 | c4 | |
---|---|---|---|---|---|
2834071 | -2.252 | 1.920 | 0.538 | -0.019 | -0.967 |
2836517 | 5.303 | -1.689 | -0.678 | 0.473 | 0.643 |
2833945 | -2.315 | -0.042 | 1.697 | -1.704 | 1.672 |
2835048 | -0.977 | 0.782 | 3.706 | -0.697 | 0.057 |
2838804 | 2.115 | -1.915 | 0.475 | -0.174 | -0.299 |
Let's attempt to visualize the data using the compressed dataset, represented by the top five SVD components.
svdcols = [c for c in dfsvd.columns if c[0] == 'c']
Pairs-plots are a simple representation: a grid of 2D scatterplots plotting each component against every other component, with the data points colored according to their signal type.
plotdims = 5
ploteorows = 1
dfsvdplot = dfsvd[svdcols].iloc[:,:plotdims]
dfsvdplot['signal']=Y_train
ax = sns.pairplot(dfsvdplot.iloc[::ploteorows, :], hue='signal', height=1.8)
Observation:
There is a clear segregation of the orange and blue dots, meaning that data points from the same type of signal tend to cluster together.
However, a pairs-plot only compares two components at a time, requiring the reader to hold the cross-component comparisons in their head while viewing.
As an alternative to the pairs-plots, we can view a 3D scatterplot, which at least lets us see three dimensions at once and get an interactive feel for the data.
def scatter_3D(A, elevation=30, azimuth=120):
    maxpts = 1000
    fig = plt.figure(1, figsize=(9, 9))
    ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=elevation, azim=azimuth)
    ax.set_xlabel('component 0')
    ax.set_ylabel('component 1')
    ax.set_zlabel('component 2')
    # plot a random subset of at most maxpts points
    rndpts = np.sort(np.random.choice(A.shape[0], min(maxpts, A.shape[0]), replace=False))
    # map each signal label to a normalized color index
    coloridx = np.unique(A.iloc[rndpts]['signal'], return_inverse=True)
    colors = coloridx[1] / len(coloridx[0])
    sp = ax.scatter(A.iloc[rndpts, 0], A.iloc[rndpts, 1], A.iloc[rndpts, 2],
                    c=colors, cmap="jet", marker='o', alpha=0.6,
                    s=50, linewidths=0.8, edgecolor='#BBBBBB')
    plt.show()
dfsvd['signal'] = Y_train
interactive(scatter_3D, A=fixed(dfsvd), elevation=30, azimuth=120)
interactive(children=(IntSlider(value=30, description='elevation', max=90, min=-30), IntSlider(value=120, desc…
Observation:
The IPython notebook interactive package lets us create an interactive plot with controls for elevation and azimuth. We can use these controls to interactively change the view of the top three components and investigate their relationships. This certainly appears to be more informative than the pairs-plots.
However, we still suffer from the same major limitations as the pairs-plots: we lose the variance carried by the remaining components and still have to hold a lot in our heads when viewing.
In this step, we implement another dimensionality reduction technique, t-SNE, and look at the related visualization. We will use the basic implementation available in scikit-learn.
tsne = TSNE(n_components=2, random_state=0)
Z = tsne.fit_transform(dfsvd[svdcols])
dftsne = pd.DataFrame(Z, columns=['x','y'], index=dfsvd.index)
dftsne['signal'] = Y_train
g = sns.lmplot(x='x', y='y', data=dftsne, hue='signal', fit_reg=False, height=8,
               scatter_kws={'alpha': 0.7, 's': 60})
g.axes.flat[0].set_title('Scatterplot of a multi-dimensional dataset reduced to 2D using t-SNE')
Text(0.5, 1.0, 'Scatterplot of a multi-dimensional dataset reduced to 2D using t-SNE')
Observation:
This is quite an interesting way of visualizing the trading signal data. The plot above shows a good degree of clustering for the trading signals. Although there is some overlap between the long and short signals, they can be distinguished quite well using the reduced number of features.
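To put a rough number on this visual impression, one could fit a simple classifier on just the two t-SNE coordinates; note this is a sanity check of in-sample separability only, since t-SNE coordinates are not defined for new data:
# Hypothetical check: cross-validated accuracy of k-NN on the 2D embedding
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
tsne_scores = cross_val_score(knn, dftsne[['x', 'y']], dftsne['signal'], cv=5)
print('Mean CV accuracy on t-SNE coordinates: %.3f' % tsne_scores.mean())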
In review:
We have analyzed the bitcoin trading signal dataset and reduced it to a handful of SVD components. As a final step, we compare the accuracy and training time of a classification model trained with and without dimensionality reduction.
# test options for classification
num_folds = 10
scoring = 'accuracy'
# k-fold cross-validation setup (defined here so this cell runs standalone)
kfold = KFold(n_splits=num_folds)
import time
start_time = time.time()
# spot check a random forest on the full feature set
models = RandomForestClassifier(n_jobs=-1)
cv_results_XTrain = cross_val_score(models, X_train, Y_train, cv=kfold, scoring=scoring)
print("Time Without Dimensionality Reduction--- %s seconds ---" % (time.time() - start_time))
Time Without Dimensionality Reduction--- 7.781347990036011 seconds ---
start_time = time.time()
X_SVD= dfsvd[svdcols].iloc[:,:5]
cv_results_SVD = cross_val_score(models, X_SVD, Y_train, cv=kfold, scoring=scoring)
print("Time with Dimensionality Reduction--- %s seconds ---" % (time.time() - start_time))
Time with Dimensionality Reduction--- 2.281977653503418 seconds ---
print("Result without dimensionality Reduction: %f (%f)" % (cv_results_XTrain.mean(), cv_results_XTrain.std()))
print("Result with dimensionality Reduction: %f (%f)" % (cv_results_SVD.mean(), cv_results_SVD.std()))
Result without dimensionality Reduction: 0.936375 (0.010774)
Result with dimensionality Reduction: 0.887500 (0.012698)
Looking at the model results, accuracy does not deviate much, decreasing only from 93.6% to 88.7%. However, there is a roughly four-fold improvement in the time taken, which is significant.
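The accuracy/time trade-off could be explored further by varying the number of components, along the lines of this sketch (reusing rescaledDataset, Y_train, kfold, and scoring from above):
# Hypothetical sweep: cross-validated accuracy vs. number of SVD components
for k in [2, 5, 10]:
    X_k = TruncatedSVD(n_components=k).fit_transform(rescaledDataset)
    k_scores = cross_val_score(RandomForestClassifier(n_jobs=-1), X_k, Y_train, cv=kfold, scoring=scoring)
    print('k=%d components: accuracy %.3f' % (k, k_scores.mean()))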
Conclusion:
With dimensionality reduction, we achieved almost the same accuracy with a four-fold improvement in speed. In trading strategy development, when the datasets are huge and the number of features is large, such an improvement in time can lead to a significant speed-up of the entire process.
We also demonstrated that both SVD and t-SNE provide quite interesting ways of visualizing the trading signal data, allowing the long and short signals of a trading strategy to be distinguished with a reduced number of features.