In [1]:
import sys
sys.path.append('Z:\\GitHub\\PyArb') # Change the path accordingly

Statistical arbitrage with Python

NOTE: For the impatient/curious, the model gives about 12% annual returns with a Sharpe ratio of 5 and a maximum drawdown of 0.5%, with \$0.005 per share transaction costs. Jump to the end of this notebook to see the plots.

1 - Introduction

I started writing this notebook as a learning project on equity statistical arbitrage. The model is a very elementary system of stochastic differential equations intended to describe the behaviour of equity prices and, specifically, their interactions. It can be viewed as a generalization of the Ornstein-Uhlenbeck process (except that I'm using multiplicative noise, as in GBM), which is widely used to model the behaviour of a two-stock portfolio in pairs trading. The model is defined as

$$ d X_t^i = A^i_j X_t^j dt + X_t^i d W_t^i , $$

where $X_t^i$ is the price of stock $i$ at time $t$, $A$ is an interaction matrix to be modelled, $dW$ is a white noise with a covariance matrix $\mathbb E\left( dW_t^i dW_t^j \right) = \rho^{ij} dt$, where $\rho$ is also modelled from data.

The model can be viewed as a simple generalization of geometric brownian motion to a multivariate case, so I will just call it the Vector Geometric Brownian Motion (VGBM).

A finite time step (1 min bar) equation can be written as

$$ X_{t+1}^i = (\mathbb{1} + A \,\Delta t)^i_{\ j} X_t^j + X_t^i \,\Delta W_t^i ,$$

from which one can hopefully see the similarity to e.g. VAR(1), except that the noise term is multiplicative.
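To make the discretization concrete, here is a minimal NumPy sketch of that update rule (for illustration only, written for Python 3 / recent NumPy, and independent of the pa.VGBM class used later in the notebook; the matrix values are arbitrary):

import numpy as np

def simulate_vgbm(A, rho, X0, n_steps, dt=1.0/391, seed=0):
    """Euler scheme for dX^i = A^i_j X^j dt + X^i dW^i with cov(dW) = rho*dt."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(rho)               # correlate the Brownian increments
    X = np.empty((n_steps + 1, len(X0)))
    X[0] = X0
    for t in range(n_steps):
        dW = (L @ rng.standard_normal(len(X0))) * np.sqrt(dt)
        X[t + 1] = X[t] + (A @ X[t]) * dt + X[t] * dW
    return X

# two assets pulled towards each other by the off-diagonal couplings
A_toy = np.array([[-0.5, 0.5], [0.5, -0.5]])
rho_toy = 1e-4 * np.eye(2)
paths = simulate_vgbm(A_toy, rho_toy, X0=np.array([1.0, 1.1]), n_steps=391)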

The model probably can't be claimed to describe the behaviour of the stock market very well (why would the price change of asset $i$ depend on a linear combination of the other prices, $A^i_j X^j$?), but I think it should be able to capture the short-term behaviour at least to some degree. In any case, consider it a toy model from which it is easy to generalize to better models while still reusing most of the code below.

A nice thing about the model is that it is easy to form a _mean reverting portfolio_ as $\xi \cdot X_t$, where $\xi$ is a left eigenvector of $A$. It is then a simple matter to update the matrix $A$, and therefore also $\xi$ and the portfolio, at each 1 min time step (currently the code works with 1 min bars only). Then, when the value of the portfolio reaches a pre-set value below or above the mean (e.g. one standard deviation), we buy or short (respectively) the portfolio. When that happens, we lock the eigenvector/portfolio (i.e. stop updating it) and track it until it either reaches a "close position" band (at e.g. 0.15 standard deviations) or hits a time limit, at which point we close the position and continue updating a new portfolio.
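As a rough sketch of that idea (assuming $A$ has already been estimated; the helper name is mine and not part of PyArb, and the locking/unlocking logic is handled by the backtest classes further below):

import numpy as np

def mean_reverting_portfolio(A, X_t, mode=0):
    """Form phi_t = xi . X_t from the `mode`-th most mean-reverting left eigenvector of A."""
    eigvals, left_vecs = np.linalg.eig(A.T)   # left eigenvectors of A = right eigenvectors of A^T
    order = np.argsort(eigvals.real)          # most negative real part first
    xi = left_vecs[:, order[mode]].real       # keep the real part only
    xi /= np.abs(xi).sum()                    # normalize the weights (an arbitrary convention)
    return xi, xi @ X_t                       # weights and portfolio value phi_t

Note that the sign of an eigenvector is arbitrary, so in a rolling setting one would also want to fix the sign for continuity between consecutive minutes.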

The idea is to use as high-frequency data as possible and to try to trade within, say, one day, since it is reasonable to expect that the model can't stay valid for long. Currently I have access to 1 min data only (from The Bonnot Gang and eoddata.com), which should be adequate for a thorough backtest. On the other hand, (moderately) high frequency data will be susceptible to the Epps effect, i.e. the degradation of correlation matrices due to noisy data (asynchronous price information etc.). 1 min data will of course contain more information than e.g. 5 min data, provided one can somehow take care of the noise.

One elegant way to do this could be to resort to Random Matrix Theory (RMT). Basically the idea is to identify the noisy eigenvalues of correlation matrices (in my case, of the matrix A in fact) and just not use them in trading. A more straightforward approach is to just "look" at the statistics of the bunch of lowest eigenvalues in a training period and try to pick a low eigenvalue that's above the noisiest ones (more details below). NOTE: Actually it seems that the eoddata.com data is much less noisy than TBG data and the lowest eigenvalue turns out to be the best, so no noise cleaning is necessary...
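For reference, the standard RMT recipe for a correlation matrix (I don't end up using it in the final backtest, since the lowest mode turned out to be fine, and applying it to $A$ itself would need a bit more care) compares the eigenvalues to the Marchenko-Pastur bulk expected from pure noise:

import numpy as np

def marchenko_pastur_bounds(n_assets, n_samples, sigma2=1.0):
    """Edges of the eigenvalue bulk of a pure-noise correlation matrix."""
    q = n_assets / float(n_samples)
    lam_minus = sigma2 * (1.0 - np.sqrt(q)) ** 2
    lam_plus = sigma2 * (1.0 + np.sqrt(q)) ** 2
    return lam_minus, lam_plus

def noise_mask(corr, n_samples):
    """True for eigenvalues that are indistinguishable from noise."""
    lam_minus, lam_plus = marchenko_pastur_bounds(corr.shape[0], n_samples)
    eigvals = np.linalg.eigvalsh(corr)
    return (eigvals > lam_minus) & (eigvals < lam_plus)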

The code takes into account transaction costs but no slippage. In any case, slippage is pretty hard to implement in a backtest (at least without some data from a walk-forward), so the principle I'm using is that if the backtest is profitable with relatively high transaction costs, it's worth testing in paper trading and eventually live trading. Another thing that is so far missing is any news regarding a company's worth, such as quarterly reports etc. Usually one also has to pay a fixed price for a short sale contract, which is not included here either: basically I'm assuming that the transactions are so large that the contract price is negligible. Another option might be to consider synthetic short positions, see e.g. here

If you have any questions, feel free to email me at [email protected]

2 - The backtest

Imports:

In [2]:
from __future__ import division
from __future__ import print_function

import pyarb as pa

import numpy as np
import pandas as pd
import os
import glob
import matplotlib.pyplot as plt
import cProfile
In [3]:
os.chdir('Z:\\PythonStuff\\newtech') # Change the path accordingly
In [4]:
os.getcwd()
Out[4]:
'Z:\\PythonStuff\\newtech'

Import data from The Bonnot Gang

The following loads the data and arranges it into a Pandas dataframe. The method "clean" throws out days where at least half of the data is missing (by default; this is adjustable). "normalize" creates a dataframe where all prices start at 1 for easy visualization.

Note that the data import did some resampling of the 1 min data (TBG data is not always logged at exactly even minutes), which introduces a little more noise because of asynchronicity...

Note also that the last minute at 16:00 was removed from the price data as well, to make $X$ and $dX$ the same length within a day.
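PrepareData_TBG lives in the PyArb package, so its internals aren't shown here, but the two steps amount to something like the following hypothetical pandas sketch (the threshold and the exact grouping are illustrative, not the actual implementation):

import pandas as pd

def clean(prices, max_missing_frac=0.5):
    """Drop whole trading days where more than max_missing_frac of the 1 min bars are missing."""
    keep = [day for day, bars in prices.groupby(prices.index.date)
            if bars.isnull().mean().mean() <= max_missing_frac]
    return prices[pd.Index(prices.index.date).isin(keep)]

def normalize(prices):
    """Rescale every column so that its first price is 1, for easy visual comparison."""
    return prices / prices.iloc[0]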

In [5]:
prices = pa.PrepareData_TBG()
prices.clean()
prices.normalize()

Get the prices $X_t^i$ and the differences $dX_t^i$ and also the normalized prices:

In [6]:
Xdata = prices.get_X()
dXdata = prices.get_dX()
normdata = prices.normalized

Evaluating the price data outputs a summary of the Pandas DataFrame:

In [7]:
Xdata
Out[7]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 189540 entries, 2011-09-22 09:30:00 to 2013-09-03 15:59:00
Data columns (total 25 columns):
AAPL_1m     189540  non-null values
AMAT_1m     189540  non-null values
AMD_1m      189540  non-null values
BRCD_1m     189540  non-null values
CMCSA_1m    189540  non-null values
CSCO_1m     189540  non-null values
DELL_1m     189540  non-null values
EBAY_1m     189540  non-null values
EMC_1m      189540  non-null values
GE_1m       189540  non-null values
GLW_1m      189540  non-null values
GOOG_1m     189540  non-null values
HPQ_1m      189540  non-null values
INTC_1m     189540  non-null values
MSFT_1m     189540  non-null values
MU_1m       189540  non-null values
NVDA_1m     189540  non-null values
ORCL_1m     189540  non-null values
QCOM_1m     189540  non-null values
SNDK_1m     189540  non-null values
S_1m        189540  non-null values
T_1m        189540  non-null values
WIN_1m      189540  non-null values
XRX_1m      189540  non-null values
YHOO_1m     189540  non-null values
dtypes: float64(25)

Plot the normalized data:

In [8]:
normdata.plot(grid=True,figsize=(9,6),use_index=False,colormap='Paired')
plt.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc=4,
           ncol=5, mode="expand", borderaxespad=0.)
plt.show()

VGBM simulation:

It's nice to play with the VGBM simulations to see how different A's affect the price interactions. The matrix A below is a discrete version of the Laplacian operator. Its spectrum consists of zero and negative real numbers (different ones in different dimensions). We can adjust the strength of the interaction by multiplying A by a positive constant. Of course, the modelled A is in general very different from a Laplacian.

The eigenvector of the zero eigenvalue is just $(1,1,1)$, which is also an "attractor" (steady state) for the system, i.e. prices starting at any values tend towards that vector but are kicked around it by the noise term. In reality, the largest eigenvalue of $A$ is positive but small and corresponds to the market mode. All the other eigenvalues are negative and correspond to mean reverting portfolios.

In [6]:
A_1 = .1*np.array([[-2,1,1],[1,-2,1],[1,1,-2]])
rho_1 = .0001*np.array([[1,0,0],[0,1,0],[0,0,1]])

vgbm1 = pa.VGBM(A= A_1, rho = rho_1,T=100)
plt.plot(vgbm1)
Out[6]:
[<matplotlib.lines.Line2D at 0x8a79a90>,
 <matplotlib.lines.Line2D at 0x8a79cc0>,
 <matplotlib.lines.Line2D at 0x8a79eb8>]
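As a quick sanity check of the claims about the spectrum (using the A_1 defined in the cell above; numpy is already imported as np):

eigvals, eigvecs = np.linalg.eig(A_1)
print(np.round(eigvals, 4))                               # one eigenvalue 0, the other two -0.3
print(np.round(eigvecs[:, np.argmax(eigvals.real)], 4))   # proportional to (1, 1, 1)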

Note that the returns are Gaussian, i.e. this is not a very realistic-looking model from the perspective of volatility. Of course, it would be possible to replace the noise term with some stochastic-volatility-like alternative, which is one way to generalize the model into something more realistic. From the plot we can see that the prices are cointegrated, i.e. there are linear combinations of the prices that are mean reverting. The market mode (the mean of all prices), however, behaves randomly and cannot be predicted.

VGBM parameter estimation:

I've tried a few different methods for parameter estimation (MLE, QMLE), but IMO by far the best method is Multivariate Least Squares (MLS) (NOTE: I dislike the notation in the link, but google it for better references). It's also the simplest analytically and relies only on basic stochastic calculus. In fact, we then don't need the noise term in the equation at all for forming the mean reverting portfolios (it will be useful for forecasting though). Multiply the VGBM equation by $X_t^j$ and take $\mathbb E$:

$$ \mathbb E \left(d X_t^i X_t^j \right) = A^i_k \mathbb E \left( X_t^k X_t^j \right) dt + \mathbb E \left( X_t^i X_t^j dW_t^i\right)$$

The last term is now zero by Itô, so we get

$$A = Q \cdot C^{-1} ,$$

where $Q^{ij} = \frac{1}{T} \sum\limits_{t=0}^{N T} \Delta X_t^i X_t^j$ and $C^{ij} = \frac{1}{N T} \sum\limits_{t=0}^{N T} X_t^i X_t^j$ are the ensemble averages replaced by sample means, $N=391$ is the number of minutes in a day (note that then $dt = 1/N \approx 0.003$) and $T$ is the number of days in the sample (this is a convenient normalization). Note that these are biased estimators, but the sample sizes are always quite large, so the bias is very small. The toughest part of the estimation is therefore the inversion of the matrix $C$, which is easy to do almost in real time even for, say, 10 000 stocks, which is more than enough (I think). I've compared the estimation to MLE and found that MLS is very close to it (it could even be better?).
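In code, the MLS estimate is just a couple of matrix products. A minimal sketch (not the actual pa.VGBM_moment implementation), assuming X and dX are arrays of shape (n_minutes, n_stocks):

import numpy as np

def estimate_A_mls(X, dX, minutes_per_day=391):
    """Multivariate least squares estimate A = Q . C^{-1} for the VGBM model."""
    n_minutes = X.shape[0]
    T_days = n_minutes / float(minutes_per_day)   # T, the number of days in the sample
    Q = (dX.T @ X) / T_days                       # Q^{ij} = (1/T) sum_t dX_t^i X_t^j
    C = (X.T @ X) / n_minutes                     # C^{ij} = (1/(N T)) sum_t X_t^i X_t^j
    return Q @ np.linalg.inv(C)

In practice one would rather use np.linalg.solve(C.T, Q.T).T than an explicit inverse, but the explicit form mirrors the formula above.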

The function VGBM_moment then simply calculates the parameter estimates for rho and A from the given data. The functions min_eigvec and eigvecs_sorted compute the eigenvalues and eigenvectors of $A$ and sort them. Note that I'm only using the real parts of the complex eigenvalues... currently I don't have a fast way to split the complex eigenvalues/eigenvectors into their real and imaginary parts, so a lot of that information (the imaginary parts) is currently unused, but this can be optimized later.

The eigenvalues and eigenvectors are updated every minute, i.e. they are "rolling" quantities; the eigenvectors can be denoted as $\xi_{t,n} \in \mathbb{R}^{dim}$, where $n$ labels the eigenvalue (0 is the lowest) and $dim=$ "dimension" is the number of stocks in the portfolio. The corresponding eigenportfolio is then denoted as $\phi_{t,n} = \xi_{t,n}\cdot X_t$. From now on I will drop the label $n$ and just assume some eigenvalue/-portfolio.

The bulk of the code consists of the function compute_modes, which computes the eigenmodes and the corresponding portfolios for the entire backtest range. It starts with an initial period of length "caldays" (units are days) and then proceeds to update the estimate each minute by adding 1 min worth of values in the future and removing 1 min of data from the end. It returns the portfolios, the standard deviations, the eigenvectors xi and the prices of the portfolios (it has both long and short positions) as functions of $t$. NOTE: I'm assuming symmetric transaction costs for long/short positions which may not be entirely realistic. I hope it won't introduce a bias of some sort...
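Schematically, compute_modes does something like the following (a heavily simplified sketch reusing estimate_A_mls from above, not the actual implementation; the real code updates the estimate incrementally by adding the newest minute and dropping the oldest instead of re-estimating from scratch, fixes the sign of the eigenvectors for continuity, and also tracks the standard deviations and the long/short portfolio prices):

def rolling_eigenportfolios(X, dX, caldays, mode=0, minutes_per_day=391):
    """Re-estimate A on a sliding window each minute and track phi_t = xi_t . X_t."""
    window = caldays * minutes_per_day
    phis, xis = [], []
    for t in range(window, X.shape[0]):
        A = estimate_A_mls(X[t - window:t], dX[t - window:t], minutes_per_day)
        eigvals, left_vecs = np.linalg.eig(A.T)            # left eigenvectors of A
        xi = left_vecs[:, np.argsort(eigvals.real)[mode]].real
        xis.append(xi)
        phis.append(xi @ X[t])
    return np.array(phis), np.array(xis)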

The class VGBM_backtest has methods to then compute the actual p&l for given open/close bands and transaction costs. This version closes all trades at the end of day.

In [7]:
day0=50 # first day of backtest
test_1=pa.VGBM_backtest(Xdata, dXdata,day1=day0,day2=day0+1,caldays=4)
test_1.compute()
test_1.write_compute_data('testdata')
In [8]:
test_pnl = test_1.pnl(mode = 0,cls = .15, exit = 50., \
                    opn = 4, wait = 0, tr_percent = 0.,tr_pershare = 0.005)

The method "show" below displays the tracked portfolio $\phi_t = \xi_t\cdot X_t$ as a function of $t$ in light green. It also displays the in sample and out of sample behavior for a calibration window $(T-\text{caldays}*390, T)$ for a given $T$ (below $T=240$) in blue, i.e. blue $= \xi_T \cdot X(t)$. The light green shaded area is the in sample region $ t

When at time $T$ the tracked portfolio $\phi_t$ hits the value $\pm \text{opn} \times$ standard deviation ($\text{opn}$ is chosen appropriately, e.g. $\text{opn}=1$), we short/buy the portfolio $\phi_t$ with all of our capital. After this we lock the portfolio (otherwise we would have to rebalance it each minute) and follow it, i.e. we follow $\widetilde \phi_t \doteq \xi_T \cdot X(t)$ for $t>T$. Then as the locked portfolio $\widetilde \phi_t$ (hopefully) hits the "close position" band at $\text{cls} \times$ standard deviation, the position is closed.
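In pseudocode-ish Python, the band logic amounts to something like this (a bare-bones sketch, with the time limit of the overnight version below folded in as close_after; it glosses over the locking of $\xi_T$ by assuming phi is already the tracked series, and the actual pnl method also handles the hard exit band and transaction costs):

import numpy as np

def band_signals(phi, mean, std, opn=4.0, cls=0.15, close_after=1000):
    """+1 = long the portfolio, -1 = short it, 0 = flat."""
    position, held_for = 0, 0
    signals = np.zeros(len(phi), dtype=int)
    for t in range(len(phi)):
        z = (phi[t] - mean[t]) / std[t]
        if position == 0:
            if z <= -opn:
                position = +1            # portfolio cheap relative to its mean: buy it
            elif z >= opn:
                position = -1            # portfolio rich: short it
            held_for = 0
        else:
            held_for += 1
            if abs(z) <= cls or held_for >= close_after:
                position = 0             # hit the close band or the time limit
        signals[t] = position
    return signals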

In [10]:
test_1.show(240, mode = 0, day=0, past = 0, future = 1)

It takes about 1.34 hours to estimate the parameters for the whole 430-day dataset on my home PC, so it's a good idea to write the data to a file. Below I'm testing the model with 4, 25 and 50 day calibration windows.

In [14]:
day0=50 # first day of backtest
vgbm1=pa.VGBM_backtest(Xdata,dXdata,day1=day0,day2=day0+50,caldays=4)
#vgbm1.compute()
#vgbm1.write_compute_data('days.50.100.4')
vgbm2=pa.VGBM_backtest(Xdata,dXdata,day1=day0,day2=day0+50,caldays=25)
#vgbm2.compute()
#vgbm2.write_compute_data('days.50.100.25')
vgbm3=pa.VGBM_backtest(Xdata,dXdata,day1=day0,day2=day0+50,caldays=50)
#vgbm3.compute()
#vgbm3.write_compute_data('days.50.100.50')

Instead of computing the estimations again, we can just read the saved data arrays:

In [15]:
os.getcwd()
Out[15]:
'Z:\\PythonStuff\\newtech'
In [75]:
os.chdir('Z:\\PythonStuff\\newtech')
In [16]:
vgbm1.read_compute_data('days.50.100.4')
vgbm2.read_compute_data('days.50.100.25')
vgbm3.read_compute_data('days.50.100.50')

Flat-rate transaction costs at Interactive Brokers are \$0.005 per share. With enough trading it's possible to get this down to about \$0.0035 or lower by adding liquidity etc., but I'll just go with \$0.005 (with a comparison to zero transaction costs).

Compare different modes:

In [17]:
cal_4_pnl = []
for n in np.arange(0,20):
    cal_4_pnl.append(vgbm1.pnl(mode = n,cls = .15, exit = 50., \
                    opn = 4, wait = 0, tr_percent = 0.,tr_pershare = 0.005))
In [18]:
plt.figure(figsize=(12, 9))
plt.grid()
plt.xticks(np.arange(0,391*430,5*391))
for n in np.arange(0, 7):
    plt.plot(cal_4_pnl[n], label = 'mode = %g' %n)
plt.legend()
Out[18]:
<matplotlib.legend.Legend at 0xbcf4cf8>

Clearly mode 0 is quite bad! Modes 4, 5 and 6 seem OK!

In [19]:
cal_50_pnl = []
for n in np.arange(0,20):
    cal_50_pnl.append(vgbm3.pnl(mode = n,cls = .15, exit = 50., \
                    opn = 1,wait = 0, tr_percent = 0.,tr_pershare = 0.005))
In [20]:
plt.figure(figsize=(12, 9))
plt.grid()
plt.xticks(np.arange(0,391*430,5*391))
for n in np.arange(0, 7):
    plt.plot(cal_50_pnl[n], label = 'mode = %g' %n)
plt.legend()
Out[20]:
<matplotlib.legend.Legend at 0xbc0ef28>

Modest profits for the first 50 days. Note that trades occur quite rarely. Lower values of opn would lead to more frequent trades, but they would be eaten by the transaction costs. Long calibration (50 days) leads to slow trading.

Now the full 430 day period:

In [18]:
day0=50
vgbm_full=pa.VGBM_backtest(Xdata,dXdata,day1=day0,day2=480,caldays=4)
#vgbm_full.compute()
#vgbm_full.write_compute_data('days.50.480.4')
In [19]:
vgbm_full.read_compute_data('days.50.480.4')
In [ ]:
cal_4_pnl_full = []
for n in np.arange(0,20):
    cal_4_pnl_full.append(vgbm_full.pnl(mode = n,cls = .15, exit = 50., \
                    opn = 5, wait = 0, tr_percent = 0.,tr_pershare = 0.005))
In [30]:
plt.figure(figsize=(12, 9))
plt.grid()
plt.xticks(np.arange(0,391*430,5*391))
for n in np.arange(3, 10):
    plt.plot(cal_4_pnl_full[n], label = 'mode = %g' %n)
plt.legend()
Out[30]:
<matplotlib.legend.Legend at 0xc006240>

Doesn't quite float... mode 13 does well, but that has to be just luck!

We see that the transaction costs eat the day trading profits pretty badly. It's actually possible to adjust the trade frequency and amplitude/magnitude by increasing the calibration time, but then it would be wise to also allow positions to be held overnight. That's what the class below does.

We now have an extra parameter, close_after. Its value in minutes tells the p&l algorithm to close the position after close_after minutes.

In [22]:
test1 = pa.VGBM_backtest_overnight(Xdata,dXdata,day1=50,day2=51,caldays=50)
In [33]:
%timeit test1.compute()
1 loops, best of 3: 7.29 s per loop
In [7]:
day0=50
overnight1=pa.VGBM_backtest_overnight(Xdata,dXdata,day1=day0,day2=day0+50,caldays=4)
#overnight1.compute()
#overnight1.write_compute_data('on_days.50.100.4')
overnight2=pa.VGBM_backtest_overnight(Xdata,dXdata,day1=day0,day2=day0+50,caldays=25)
#overnight2.compute()
#overnight2.write_compute_data('on_days.50.100.25')
overnight3=pa.VGBM_backtest_overnight(Xdata,dXdata,day1=day0,day2=day0+50,caldays=50)
#overnight3.compute()
#overnight3.write_compute_data('on_days.50.100.50')

Again, alternatively just read the data:

In [8]:
overnight1.read_compute_data('on_days.50.100.4')
overnight2.read_compute_data('on_days.50.100.25')
overnight3.read_compute_data('on_days.50.100.50')

The methods "show_dynamic" below show the portfolio processess $\phi_t$ in sample for the different calibration times. We see that the long calibration times yield more slowly fluctuating process with a bigger std. This is why it fares better with transaction costs. I didn't consider longer than 50 day calibrations because of the limited amount of data...

In [25]:
overnight1.show_dynamic(day=0, past = 0, future = 5)
overnight3.show_dynamic(day=0, past = 0, future = 5)
In [9]:
pnl_4_50 = []
for n in np.arange(0,10):
    pnl_4_50.append(overnight1.pnl(mode = n,cls = .15, exit = 50., \
                    opn = 6, tr_percent = 0.,tr_pershare = 0.005, close_after = 1000))
In [10]:
plt.figure(figsize=(12, 9))
plt.grid()
plt.xticks(np.arange(0,391*430,5*391))
for n in np.arange(3, 10):
    plt.plot(pnl_4_50[n], label = 'mode = %g' %n)
plt.legend(loc = 3)
Out[10]:
<matplotlib.legend.Legend at 0xbb605f8>

Modes 5, 6 and 7 seem OK with opn = 4..6 and close_after = 1000..2000.

In [11]:
pnl_50_50 = []
for n in np.arange(0,10):
    pnl_50_50.append(overnight3.pnl(mode = n,cls = .15, exit = 50., \
                    opn = 1.75, tr_percent = 0.,tr_pershare = 0.005, close_after = 500))
In [12]:
plt.figure(figsize=(12, 9))
plt.grid()
plt.xticks(np.arange(0,391*430,5*391))
for n in np.arange(0, 7):
    plt.plot(pnl_50_50[n], label = 'mode = %g' %n)
plt.legend()
Out[12]:
<matplotlib.legend.Legend at 0xb57e668>

50 day calibration does not look good at all!

Now let's try the whole period:

In [14]:
overnight_full_4=pa.VGBM_backtest_overnight(Xdata,dXdata,day1=50,day2=480,caldays=4)
#overnight_full_4.compute()
#overnight_full_4.write_compute_data('on_days.50.480.4')

overnight_full_50=pa.VGBM_backtest_overnight(Xdata,dXdata,day1=50,day2=480,caldays=50)
#overnight_full_50.compute()
#overnight_full_50.write_compute_data('on_days.50.480.50')
In [16]:
overnight_full_4.read_compute_data('on_days.50.480.4')
overnight_full_50.read_compute_data('on_days.50.480.50')
In [17]:
pnl_480_4 = []
for n in np.arange(0,14):
    pnl_480_4.append(overnight_full_4.pnl(mode = n,cls = .15, exit = 50., \
                    opn = 6, tr_percent = 0.,tr_pershare = 0.005, close_after = 500))
In [19]:
plt.figure(figsize=(12, 9))
plt.grid()
plt.xticks(np.arange(0,391*430,10*391), rotation=55)
for n in np.arange(3, 8):
    plt.plot(pnl_480_4[n], label = 'mode = %g' %n)
plt.legend(loc = 2)
Out[19]:
<matplotlib.legend.Legend at 0x8ae0a20>