In [1]:

```
import sys
sys.path.append('Z:\\GitHub\\PyArb')  # Change the path accordingly
```

**12% annual returns** with a **Sharpe ratio of 5** and a **maximum drawdown of 0.5%**, assuming **\$0.005 per share transaction costs**. Jump to the end of this notebook to see the plots.

I started writing this notebook as a learning project about equity statistical arbitrage. The model is a very elementary system of stochastic differential equations that is supposed to model the behaviour of equity prices, and specifically their *interactions*. It can be viewed as a generalization of the Ornstein-Uhlenbeck process (except that I'm using multiplicative noise as in GBM), which is widely used in modelling the behaviour of a portfolio of two stocks in pairs trading. The model is defined as

$$ dX_t^i = A^i_j X_t^j \, dt + X_t^i \, dW_t^i ,$$

where $X_t^i$ is the price of stock $i$ at time $t$, $A$ is an interaction matrix to be estimated, and $dW$ is white noise with covariance matrix $\mathbb E\left( dW_t^i dW_t^j \right) = \rho^{ij} dt$, where $\rho$ is also estimated from data.

The model can be viewed as a simple generalization of geometric Brownian motion to the multivariate case, so I will just call it the **Vector Geometric Brownian Motion (VGBM)**.

A finite time step (1 min bar) equation can be written as

$$ X_{t+1}^i = (\mathbb{1} + A \Delta t)_{ij} X_t^j + X_t^i \, dW_t^i ,$$

from which one can hopefully see the similarity to e.g. VAR(1), except that the noise term is multiplicative.
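To get a feel for the dynamics, the finite time step equation can be simulated directly. A minimal sketch (the helper `simulate_vgbm` is my own illustration, not part of pyarb; correlated increments are drawn via a Cholesky factor of $\rho\,\Delta t$):

```python
import numpy as np

def simulate_vgbm(A, rho, X0, n_steps, dt=1/391, seed=0):
    """Euler scheme for X_{t+1} = (1 + A dt) X_t + X_t dW_t,
    with dW ~ N(0, rho dt) applied componentwise (multiplicative noise)."""
    rng = np.random.default_rng(seed)
    d = len(X0)
    L = np.linalg.cholesky(rho * dt)      # correlated Gaussian increments
    X = np.empty((n_steps + 1, d))
    X[0] = X0
    for t in range(n_steps):
        dW = L @ rng.standard_normal(d)
        X[t + 1] = (np.eye(d) + A * dt) @ X[t] + X[t] * dW
    return X

# the discrete Laplacian interaction used later in this notebook
A = 0.1 * np.array([[-2., 1., 1.], [1., -2., 1.], [1., 1., -2.]])
rho = 1e-4 * np.eye(3)
paths = simulate_vgbm(A, rho, X0=np.array([1.0, 1.2, 0.8]), n_steps=391)
```

One day (391 bars) of three interacting price paths, starting from unequal prices that get pulled back together by the drift.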

The model probably can't be claimed to describe the behaviour of the stock market very well (why would the *price change* of asset $i$ depend on a linear combination of the other *prices*, $A^i_j X^j$?), but I think it should be able to capture the short term behavior at least to some degree. In any case, consider it a toy model from which it is easy to generalize to better models while still reusing most of the code below.

A nice thing about the model is that it is easy to form a *mean reverting portfolio* as $\xi \cdot X_t$, where $\xi$ is a left eigenvector of $A$. It is then a simple matter to update the matrix $A$, and therefore also $\xi$ and the portfolio, at each 1 min time step (currently the code works with 1 min bars only). Then, when the value of the portfolio reaches a pre-set level below or above the mean (e.g. one standard deviation), we buy or short (respectively) the portfolio. When that happens, we *lock* the eigenvector/portfolio (i.e. stop updating it) and track it until it either reaches a "close position" band (at e.g. 0.15 standard deviations) or hits a time limit, at which point we close the position *and* resume updating a new portfolio.
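In code, the left eigenvectors can be read off from the eigendecomposition of $A^T$; a sketch (my own illustration, the pyarb internals may differ):

```python
import numpy as np

def mean_reverting_portfolio(A, k=0):
    """Return (eigenvalue, xi): the left eigenvector of A belonging to the
    k-th most negative eigenvalue; xi . X_t is then mean reverting."""
    # left eigenvectors of A are right eigenvectors of A.T
    eigvals, eigvecs = np.linalg.eig(A.T)
    order = np.argsort(eigvals.real)           # most negative first
    xi = eigvecs[:, order[k]].real             # keep the real part only
    return eigvals[order[k]].real, xi / np.linalg.norm(xi)

# the Laplacian toy matrix from the simulation section below
A = 0.1 * np.array([[-2., 1., 1.], [1., -2., 1.], [1., 1., -2.]])
lam, xi = mean_reverting_portfolio(A)
```

For this toy matrix the most negative eigenvalue is $-0.3$ and its eigenvector sums to zero, i.e. the portfolio is market neutral.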

The idea is to use data of as high a frequency as possible and to try to trade within, say, one day, since it is reasonable to expect that the model can't stay valid for long. Currently I have access to 1 min data only (from The Bonnot Gang and eoddata.com), which should be adequate for a thorough backtest. On the other hand, (moderately) high frequency data will be susceptible to the **Epps effect**, i.e. the degradation of correlation estimates due to noisy data (asynchronous price information etc.). 1 min data *will* of course contain more information than e.g. 5 min data, provided one can somehow take care of the noise.

One elegant way to do this could be to resort to Random Matrix Theory (RMT). The idea is to identify the noisy eigenvalues of the correlation matrices (in my case, of the matrix $A$ in fact) and simply not use them in trading. A more straightforward approach is to just "look" at the statistics of the lowest eigenvalues in a training period and pick a low eigenvalue that sits above the noisiest ones (more details below). **NOTE: Actually it seems that the eoddata.com data is much less noisy than the TBG data and the lowest eigenvalue turns out to be the best, so no noise cleaning is necessary...**
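For reference, the standard RMT recipe compares the spectrum against the Marchenko-Pastur upper edge and discards everything below it. A sketch (not part of the notebook's pipeline; strictly speaking the MP law applies to sample correlation matrices rather than to $A$ itself):

```python
import numpy as np

def mp_upper_edge(n_samples, n_assets, sigma2=1.0):
    """Marchenko-Pastur upper edge: eigenvalues of a pure-noise sample
    correlation matrix lie below this bound (asymptotically)."""
    q = n_assets / n_samples
    return sigma2 * (1.0 + np.sqrt(q)) ** 2

def denoise_eigs(eigvals, n_samples):
    """Keep only the eigenvalues sticking out above the MP noise band."""
    return eigvals[eigvals > mp_upper_edge(n_samples, len(eigvals))]
```

For example, with 100 assets and 1000 observations the noise band extends up to about $1.73$, so only eigenvalues above that would be trusted.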

The code takes into account transaction costs but not slippage. In any case, slippage is pretty hard to implement in a backtest (at least without some data from a walk-forward), so the principle I'm using is that if the backtest is profitable with relatively high transaction costs, it's worth testing in paper trading and eventually live trading. Another thing that is missing so far is any news regarding a company's worth, such as quarterly reports. Also, one usually has to pay a fixed price for a short sale contract, which is not included here either: basically I'm assuming that the transactions are so large that the contract price is negligible. Another choice might be to consider a synthetic short position, see e.g. here

In [2]:

```
from __future__ import division
from __future__ import print_function
import pyarb as pa
import numpy as np
import pandas as pd
import os
import glob
import matplotlib.pyplot as plt
import cProfile
```

In [3]:

```
os.chdir('Z:\\PythonStuff\\newtech')# Change the path accordingly
```

In [4]:

```
os.getcwd()
```

Out[4]:

The following loads the data and arranges it into a Pandas dataframe. The method "clean" throws out days with more than half of the data missing (by default; the threshold is adjustable). "normalize" creates a dataframe where all prices start at 1 for easy visualization.
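Roughly, the cleaning step can be pictured like this (a hypothetical stand-in for `prices.clean()`, whose actual implementation lives in pyarb):

```python
import pandas as pd

def drop_sparse_days(df, max_missing_frac=0.5):
    """Drop trading days where more than max_missing_frac of the
    1 min bars are NaN."""
    # fraction of missing bars per day, averaged over all tickers
    frac_missing = df.isna().groupby(df.index.date).mean().mean(axis=1)
    bad_days = frac_missing[frac_missing > max_missing_frac].index
    return df[~pd.Index(df.index.date).isin(bad_days)]
```

A day with mostly-NaN bars for its tickers is dropped wholesale; everything else is kept untouched.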

Note that the data import did some resampling of the 1 min data (TBG data is not always logged at exactly even minutes), which introduces a bit more noise because of asynchronicity...

Note also that the last minute at 16:00 was removed also from the price data to make $X$ and $dX$ the same length within a day.

In [5]:

```
prices = pa.PrepareData_TBG()
prices.clean()
prices.normalize()
```

Get the prices $X_t^i$ and the differences $dX_t^i$ and also the normalized prices:

In [6]:

```
Xdata = prices.get_X()
dXdata = prices.get_dX()
normdata = prices.normalized
```

Evaluating the price data outputs a summary of the Pandas DataFrame:

In [7]:

```
Xdata
```

Out[7]:

Plot the normalized data:

In [8]:

```
normdata.plot(grid=True, figsize=(9,6), use_index=False, colormap='Paired')
plt.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc=4,
           ncol=5, mode="expand", borderaxespad=0.)
plt.show()
plt.show()
```

It's nice to play with VGBM simulations to see how different $A$'s affect the price interactions. The matrix $A$ below is a discrete version of the Laplacian operator. Its spectrum consists of a zero and negative real numbers (different ones in different dimensions). We can adjust the strength of the interaction by multiplying $A$ by a positive constant. Of course the estimated $A$ is in general *very* different from a Laplacian.

The eigenvector of the zero eigenvalue is just $(1,1,1)$, which is also an "attractor" (steady state) for the system: prices starting at any values tend towards that vector, but are kicked around it by the noise term. In reality, the largest eigenvalue of $A$ is positive but small and corresponds to the **market mode**. All the other eigenvalues are negative and correspond to mean reverting portfolios.

In [6]:

```
A_1 = .1*np.array([[-2,1,1],[1,-2,1],[1,1,-2]])
rho_1 = .0001*np.array([[1,0,0],[0,1,0],[0,0,1]])
vgbm1 = pa.VGBM(A=A_1, rho=rho_1, T=100)
plt.plot(vgbm1)
```

Out[6]:

We see that the simulated prices are *cointegrated*, i.e. there are linear combinations of the prices that are mean reverting. The *market mode* (the mean of all prices) however behaves randomly and cannot be predicted.

I've tried a few different methods for parameter estimation (MLE, QMLE), but IMO by far the best method is Multivariate Least Squares (MLS) (NOTE: I dislike the notation in the link, but google it for better references). It's also analytically the simplest and relies on stochastic calculus. In fact, we then don't need the noise term in the equation at all for forming the mean reverting portfolios (it will be useful for forecasting though). Multiply the VGBM equation by $X_t^j$ and take $\mathbb E$:

$$ \mathbb E \left(d X_t^i X_t^j \right) = A^i_k \mathbb E \left( X_t^k X_t^j \right) dt + \mathbb E \left( X_t^i X_t^j dW_t^i\right)$$

The last term is now zero by Itô, so we get

$$A = Q \cdot C^{-1} ,$$

where $Q^{ij} = \frac{1}{T} \sum\limits_{t=0}^{N T} \Delta X_t^i X_t^j$ and $C^{ij} = \frac{1}{N T} \sum\limits_{t=0}^{N T} X_t^i X_t^j$ with the ensemble averages replaced by sample means, $N=391$ is the number of minutes in a trading day (note that then $dt = 1/N \approx 0.003$) and $T$ is the number of days in the sample (this is a convenient normalization). Note that these are biased estimators, but the sample sizes are always quite large, so the bias is very small. The toughest part of the estimation is therefore the inversion of the matrix $C$, an easy job to do almost in real time even for, say, 10 000 stocks, which is more than enough (I think). I've compared the estimation to MLE and noticed that MLS is very close to MLE (it could even be better?).
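The estimator is a few lines of NumPy. A sketch with a noise-free sanity check (my own illustration; `estimate_A_mls` is not the pyarb function):

```python
import numpy as np

def estimate_A_mls(X, dX, dt=1/391):
    """Multivariate least squares: A = Q C^{-1}, with Q and C the sample
    moments of dX.X and X.X (rows of X are time steps, columns stocks)."""
    n = len(X)
    Q = dX.T @ X / (n * dt)        # Q^{ij} ~ E(dX^i X^j) / dt
    C = X.T @ X / n                # C^{ij} ~ E(X^i X^j)
    return Q @ np.linalg.inv(C)

# noise-free check: data generated by dX = A X dt is recovered exactly
A_true = 0.5 * np.array([[-1., 1.], [1., -1.]])
dt = 1 / 391
X = np.empty((500, 2)); X[0] = [1.2, 0.8]
for t in range(499):
    X[t + 1] = X[t] + (A_true @ X[t]) * dt
A_hat = estimate_A_mls(X[:-1], np.diff(X, axis=0), dt=dt)
```

With the noise term switched off the identity $Q = A C$ holds exactly, so the estimator returns $A$ up to floating point error; with noise it converges as the sample grows.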

**min_eigvec** and **eigvecs_sorted** compute the eigenvalues and eigenvectors of $A$ and sort them. Note that I'm only using the real parts of the complex eigenvalues... currently I don't know how to quickly sort the complex eigs into real and imaginary part eigs, which means a lot of the eigenvalues (the Im parts) are currently missing, but this can be optimized later.
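One way to stop discarding the imaginary parts would be to split each complex-conjugate eigenpair into its real and imaginary part vectors, which yields two real portfolios per pair. A sketch of that idea (my own, not implemented in pyarb as far as this notebook goes):

```python
import numpy as np

def real_portfolios_from_eig(A):
    """Left eigenpairs of A, sorted by real part (most mean reverting
    first); each complex-conjugate pair contributes two real vectors
    (Re xi and Im xi), real eigenpairs pass through unchanged."""
    eigvals, eigvecs = np.linalg.eig(A.T)      # left eigenvectors of A
    vals, vecs = [], []
    for k in np.argsort(eigvals.real):
        lam, xi = eigvals[k], eigvecs[:, k]
        if abs(lam.imag) < 1e-12:              # genuinely real eigenpair
            vals.append(lam.real); vecs.append(xi.real)
        elif lam.imag > 0:                     # take one of each conjugate pair
            vals.extend([lam.real, lam.real])
            vecs.extend([xi.real, xi.imag])
    return np.array(vals), np.array(vecs).T
```

For a matrix with a complex pair, e.g. $A$ with eigenvalues $-1 \pm 2i$, this returns two real portfolios with decay rate $1$ instead of dropping the mode.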

In the backtest one then picks *some* eigenvalue/portfolio to trade (the mode parameter in the code below).

The bulk of the code consists of the function **compute_modes**, which computes the eigenmodes and the corresponding portfolios for the entire backtest range. It starts with an initial period of length "caldays" (units are days) and then proceeds to update the estimate each minute by adding 1 min worth of new data and dropping the oldest 1 min from the window. It returns the portfolios, the standard deviations, the eigenvectors xi and the prices of the portfolios (it has both long and short positions) as functions of $t$. **NOTE:** I'm assuming symmetric transaction costs for long/short positions, which may not be entirely realistic. I hope it won't introduce a bias of some sort...

The class VGBM_backtest has methods to then compute the actual p&l for given open/close bands and transaction costs. This version closes all trades at the end of day.


In [7]:

```
day0=50 # first day of backtest
test_1=pa.VGBM_backtest(Xdata, dXdata,day1=day0,day2=day0+1,caldays=4)
test_1.compute()
test_1.write_compute_data('testdata')
```

In [8]:

```
test_pnl = test_1.pnl(mode = 0,cls = .15, exit = 50., \
opn = 4, wait = 0, tr_percent = 0.,tr_pershare = 0.005)
```

The method "show" below displays the tracked portfolio $\phi_t = \xi_t\cdot X_t$ as a function of $t$ in light green. It also displays the in sample and out of sample behavior for a calibration window $(T-\text{caldays}*390, T)$ for a given $T$ (below $T=240$) in blue, i.e. blue $= \xi_T \cdot X(t)$. The light green shaded area is the in sample region $t \le T$.

When at time $T$ the tracked portfolio $\phi_t$ hits the value $\pm \text{opn} \times$ standard deviation ($\text{opn}$ is chosen appropriately, e.g. $\text{opn}=1$), we short/buy the portfolio $\phi_t$ with all of our capital. After this we lock the portfolio (otherwise we would have to rebalance it each minute) and follow it, i.e. we follow $\widetilde \phi_t \doteq \xi_T \cdot X(t)$ for $t>T$. Then as the locked portfolio $\widetilde \phi_t$ (hopefully) hits the "close position" band at $\text{cls} \times$ standard deviation, the position is closed.
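The open/close band logic can be illustrated on a single pre-computed portfolio series. A toy sketch (my own; it ignores the locking of $\xi$, capital scaling and the exit/time-limit cases handled by the real pnl method):

```python
import numpy as np

def band_trade_pnl(phi, sigma, opn=1.0, cls=0.15, tr_cost=0.0):
    """Short the portfolio when phi rises above opn*sigma, buy when it
    falls below -opn*sigma, close when it re-enters the +/- cls*sigma
    band. Returns total p&l in portfolio-value units."""
    pos, entry, pnl = 0, 0.0, 0.0
    for x in phi:
        if pos == 0:
            if x > opn * sigma:
                pos, entry = -1, x          # open short
            elif x < -opn * sigma:
                pos, entry = +1, x          # open long
        elif abs(x) < cls * sigma:          # close band hit
            pnl += pos * (x - entry) - tr_cost
            pos = 0
    return pnl
```

E.g. a path that spikes to $1.2\sigma$ and then reverts inside the close band yields the full reversion as profit on the short side.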

In [10]:

```
test_1.show(240, mode = 0, day=0, past = 0, future = 1)
```

In [14]:

```
day0=50 # first day of backtest
vgbm1=pa.VGBM_backtest(Xdata,dXdata,day1=day0,day2=day0+50,caldays=4)
#vgbm1.compute()
#vgbm1.write_compute_data('days.50.100.4')
vgbm2=pa.VGBM_backtest(Xdata,dXdata,day1=day0,day2=day0+50,caldays=25)
#vgbm2.compute()
#vgbm2.write_compute_data('days.50.100.25')
vgbm3=pa.VGBM_backtest(Xdata,dXdata,day1=day0,day2=day0+50,caldays=50)
#vgbm3.compute()
#vgbm3.write_compute_data('days.50.100.50')
```

Instead of computing the estimations again, we can just read the saved data arrays:

In [15]:

```
os.getcwd()
```

Out[15]:

In [75]:

```
os.chdir('Z:\\PythonStuff\\newtech')
```

In [16]:

```
vgbm1.read_compute_data('days.50.100.4')
vgbm2.read_compute_data('days.50.100.25')
vgbm3.read_compute_data('days.50.100.50')
```

Compare different modes:

In [17]:

```
```
cal_4_pnl = []
for n in np.arange(0, 20):
    cal_4_pnl.append(vgbm1.pnl(mode = n, cls = .15, exit = 50., \
        opn = 4, wait = 0, tr_percent = 0., tr_pershare = 0.005))
```

In [18]:

```
plt.figure(figsize=(12, 9))
plt.grid()
plt.xticks(np.arange(0, 391*430, 5*391))
for n in np.arange(0, 7):
    plt.plot(cal_4_pnl[n], label = 'mode = %g' % n)
plt.legend()
```

Out[18]:

Clearly mode 0 is quite bad! Modes **4, 5 and 6** seem OK!

In [19]:

```
```
cal_50_pnl = []
for n in np.arange(0, 20):
    cal_50_pnl.append(vgbm3.pnl(mode = n, cls = .15, exit = 50., \
        opn = 1, wait = 0, tr_percent = 0., tr_pershare = 0.005))
```

In [20]:

```
plt.figure(figsize=(12, 9))
plt.grid()
plt.xticks(np.arange(0, 391*430, 5*391))
for n in np.arange(0, 7):
    plt.plot(cal_50_pnl[n], label = 'mode = %g' % n)
plt.legend()
```

Out[20]:

Modest profits for the first 50 days. Note that trades occur quite rarely. Lower values of opn would lead to more frequent trades, but they would be eaten by the transaction costs. Long calibration (50 days) leads to slow trading.

Now the full 430 day period:

In [18]:

```
day0=50
vgbm_full = pa.VGBM_backtest(Xdata, dXdata, day1=day0, day2=480, caldays=4)
#vgbm_full.compute()
#vgbm_full.write_compute_data('days.50.480.4')
```

In [19]:

```
vgbm_full.read_compute_data('days.50.480.4')
```

In [ ]:

```
```
cal_4_pnl_full = []
for n in np.arange(0, 20):
    cal_4_pnl_full.append(vgbm_full.pnl(mode = n, cls = .15, exit = 50., \
        opn = 5, wait = 0, tr_percent = 0., tr_pershare = 0.005))
```

In [30]:

```
plt.figure(figsize=(12, 9))
plt.grid()
plt.xticks(np.arange(0, 391*430, 5*391))
for n in np.arange(3, 10):
    plt.plot(cal_4_pnl_full[n], label = 'mode = %g' % n)
plt.legend()
```

Out[30]:

Doesn't quite float... mode 13 does well, but that has to be just luck!

We see that the transaction costs eat the day trading profits pretty badly. It's actually possible to adjust the trade frequency and magnitude by increasing the calibration time, but then it would be wise to also allow positions to be held overnight. That's what the class below does.

We now have an extra parameter, close_after. Its value in minutes tells the p&l algorithm to close the position after close_after minutes.

In [22]:

```
test1 = pa.VGBM_backtest_overnight(Xdata,dXdata,day1=50,day2=51,caldays=50)
```

In [33]:

```
%timeit test1.compute()
```

In [7]:

```
day0=50
overnight1=pa.VGBM_backtest_overnight(Xdata,dXdata,day1=day0,day2=day0+50,caldays=4)
#overnight1.compute()
#overnight1.write_compute_data('on_days.50.100.4')
overnight2=pa.VGBM_backtest_overnight(Xdata,dXdata,day1=day0,day2=day0+50,caldays=25)
#overnight2.compute()
#overnight2.write_compute_data('on_days.50.100.25')
overnight3=pa.VGBM_backtest_overnight(Xdata,dXdata,day1=day0,day2=day0+50,caldays=50)
#overnight3.compute()
#overnight3.write_compute_data('on_days.50.100.50')
```

Again, alternatively just read the data:

In [8]:

```
overnight1.read_compute_data('on_days.50.100.4')
overnight2.read_compute_data('on_days.50.100.25')
overnight3.read_compute_data('on_days.50.100.50')
```

In [25]:

```
overnight1.show_dynamic(day=0, past = 0, future = 5)
overnight3.show_dynamic(day=0, past = 0, future = 5)
```

In [9]:

```
```
pnl_4_50 = []
for n in np.arange(0, 10):
    pnl_4_50.append(overnight1.pnl(mode = n, cls = .15, exit = 50., \
        opn = 6, tr_percent = 0., tr_pershare = 0.005, close_after = 1000))
```

In [10]:

```
plt.figure(figsize=(12, 9))
plt.grid()
plt.xticks(np.arange(0, 391*430, 5*391))
for n in np.arange(3, 10):
    plt.plot(pnl_4_50[n], label = 'mode = %g' % n)
plt.legend(loc = 3)
```

Out[10]:

**Modes 5, 6 and 7 seem OK with opn = 4..6 and close_after = 1000..2000**

In [11]:

```
```
pnl_50_50 = []
for n in np.arange(0, 10):
    pnl_50_50.append(overnight3.pnl(mode = n, cls = .15, exit = 50., \
        opn = 1.75, tr_percent = 0., tr_pershare = 0.005, close_after = 500))
```

In [12]:

```
plt.figure(figsize=(12, 9))
plt.grid()
plt.xticks(np.arange(0, 391*430, 5*391))
for n in np.arange(0, 7):
    plt.plot(pnl_50_50[n], label = 'mode = %g' % n)
plt.legend()
```

Out[12]:

50 day calibration does not look good at all!

Now let's try the whole period:

In [14]:

```
overnight_full_4=pa.VGBM_backtest_overnight(Xdata,dXdata,day1=50,day2=480,caldays=4)
#overnight_full_4.compute()
#overnight_full_4.write_compute_data('on_days.50.480.4')
overnight_full_50=pa.VGBM_backtest_overnight(Xdata,dXdata,day1=50,day2=480,caldays=50)
#overnight_full_50.compute()
#overnight_full_50.write_compute_data('on_days.50.480.50')
```

In [16]:

```
overnight_full_4.read_compute_data('on_days.50.480.4')
overnight_full_50.read_compute_data('on_days.50.480.50')
```

In [17]:

```
```
pnl_480_4 = []
for n in np.arange(0, 14):
    pnl_480_4.append(overnight_full_4.pnl(mode = n, cls = .15, exit = 50., \
        opn = 6, tr_percent = 0., tr_pershare = 0.005, close_after = 500))
```

In [19]:

```
plt.figure(figsize=(12, 9))
plt.grid()
plt.xticks(np.arange(0, 391*430, 10*391), rotation=55)
for n in np.arange(3, 8):
    plt.plot(pnl_480_4[n], label = 'mode = %g' % n)
plt.legend(loc = 2)
```

Out[19]:
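For completeness, the headline statistics quoted at the top can be computed from any of the cumulative p&l curves above. A sketch assuming a per-minute equity curve (these helpers are my own, not pyarb methods):

```python
import numpy as np

def annualized_sharpe(equity, bars_per_day=391, days_per_year=252):
    """Annualized Sharpe ratio from a per-minute equity curve."""
    r = np.diff(equity) / equity[:-1]
    return np.sqrt(bars_per_day * days_per_year) * r.mean() / r.std()

def max_drawdown(equity):
    """Largest peak-to-trough drop, as a fraction of the running peak."""
    peak = np.maximum.accumulate(equity)
    return ((peak - equity) / peak).max()
```

Both take the equity curve as a 1-D array; the Sharpe annualization simply rescales minute-level returns by $\sqrt{391 \times 252}$.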