Financial Analysis in Python¶

What you will learn in this tutorial¶

• Timeseries analysis using Pandas
• Using Google Trends to predict market movements
• Mean-reversion

Different stages towards a successful trading strategy¶

1. Data analysis / Idea generation
• Pandas + Matplotlib (Part I)
2. Backtest strategy
• Pandas can be used but many limitations (Part II)
• -> Zipline (Part III)
3. Optimize
4. Forward test strategy (paper trading)
• Trivial on Quantopian (Part IV)
• Coming soon...

Pandas basics (Data analysis / Idea generation)¶

In [1]:
from IPython.core.display import HTML
HTML("<iframe src=http://pandas.pydata.org width=800 height=350></iframe>")
Out[1]:
In [2]:
import datetime

import pandas as pd
import pandas.io.data
from pandas import Series, DataFrame
pd.__version__
Out[2]:
'0.11.0'
In [3]:
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rc('figure', figsize=(8, 7))
mpl.__version__
Out[3]:
'1.2.1'

• From Python structures
• From Yahoo finance
• From CSV files

From Python structures¶

In [4]:
labels = ['a', 'b', 'c', 'd', 'e']
s = Series([1, 2, 3, 4, 5], index=labels)
s
Out[4]:
a    1
b    2
c    3
d    4
e    5
dtype: int64
In [5]:
'b' in s
Out[5]:
True
In [6]:
s['b']
Out[6]:
2
In [7]:
mapping = s.to_dict()
mapping
Out[7]:
{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
In [8]:
Series(mapping)
Out[8]:
a    1
b    2
c    3
d    4
e    5
dtype: int64

From Yahoo finance¶

In [9]:
aapl = pd.io.data.get_data_yahoo('AAPL',
start=datetime.datetime(2006, 10, 1),
end=datetime.datetime(2012, 1, 1))
Out[9]:
Open High Low Close Volume Adj Close
Date
2006-10-02 75.10 75.87 74.30 74.86 25451400 73.29
2006-10-03 74.45 74.95 73.19 74.08 28239600 72.52
2006-10-04 74.10 75.46 73.16 75.38 29610100 73.80
2006-10-05 74.53 76.16 74.13 74.83 24424400 73.26
2006-10-06 74.42 75.04 73.81 74.22 16677100 72.66

From CSV files¶

In [10]:
aapl.to_csv('data/aapl_ohlc.csv')
2006-10-02,75.1,75.87,74.3,74.86,25451400,73.29
2006-10-03,74.45,74.95,73.19,74.08,28239600,72.52
2006-10-04,74.1,75.46,73.16,75.38,29610100,73.8
2006-10-05,74.53,76.16,74.13,74.83,24424400,73.26
2006-10-06,74.42,75.04,73.81,74.22,16677100,72.66
2006-10-09,73.8,75.08,73.53,74.63,15650800,73.06
2006-10-10,74.54,74.58,73.08,73.81,18985300,72.26
2006-10-11,73.42,73.98,72.6,73.23,20423400,71.69
2006-10-12,73.61,75.39,73.6,75.26,21173400,73.68
In [11]:
Out[11]:
Open High Low Close Volume Adj Close
Date
2006-10-02 75.10 75.87 74.30 74.86 25451400 73.29
2006-10-03 74.45 74.95 73.19 74.08 28239600 72.52
2006-10-04 74.10 75.46 73.16 75.38 29610100 73.80
2006-10-05 74.53 76.16 74.13 74.83 24424400 73.26
2006-10-06 74.42 75.04 73.81 74.22 16677100 72.66
In [12]:
df.index
Out[12]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2006-10-02 00:00:00, ..., 2011-12-30 00:00:00]
Length: 1323, Freq: None, Timezone: None

Series and DataFrame: First steps¶

• Indexing
• Slicing
• Creating new columns
In [13]:
ts = df['Close'][-10:]
ts
Out[13]:
Date
2011-12-16    381.02
2011-12-19    382.21
2011-12-20    395.95
2011-12-21    396.45
2011-12-22    398.55
2011-12-23    403.33
2011-12-27    406.53
2011-12-28    402.64
2011-12-29    405.12
2011-12-30    405.00
Name: Close, dtype: float64

A DataFrame is a collection of Series objects. Slicing out a DataFrame column returns a Series.

In [14]:
type(ts)
Out[14]:
pandas.core.series.TimeSeries
In [15]:
date = ts.index[5]
date
Out[15]:
<Timestamp: 2011-12-23 00:00:00>
In [16]:
ts[date]
Out[16]:
403.32999999999998
In [17]:
ts[5]
Out[17]:
403.32999999999998

We can also select multiple columns.

In [18]:
Out[18]:
Open Close
Date
2006-10-02 75.10 74.86
2006-10-03 74.45 74.08
2006-10-04 74.10 75.38
2006-10-05 74.53 74.83
2006-10-06 74.42 74.22

New columns can be added on the fly.

In [19]:
df['diff'] = df.Open - df.Close
Out[19]:
Open High Low Close Volume Adj Close diff
Date
2006-10-02 75.10 75.87 74.30 74.86 25451400 73.29 0.24
2006-10-03 74.45 74.95 73.19 74.08 28239600 72.52 0.37
2006-10-04 74.10 75.46 73.16 75.38 29610100 73.80 -1.28
2006-10-05 74.53 76.16 74.13 74.83 24424400 73.26 -0.30
2006-10-06 74.42 75.04 73.81 74.22 16677100 72.66 0.20

...and deleted on the fly.

In [20]:
del df['diff']
Out[20]:
Open High Low Close Volume Adj Close
Date
2006-10-02 75.10 75.87 74.30 74.86 25451400 73.29
2006-10-03 74.45 74.95 73.19 74.08 28239600 72.52
2006-10-04 74.10 75.46 73.16 75.38 29610100 73.80
2006-10-05 74.53 76.16 74.13 74.83 24424400 73.26
2006-10-06 74.42 75.04 73.81 74.22 16677100 72.66

Common Financial Computations¶

• Moving Average
• Returns
In [21]:
In [22]:
mavg = pd.rolling_mean(close_px, 40)
mavg[-10:]
Out[22]:
Date
2011-12-16    380.53500
2011-12-19    380.27400
2011-12-20    380.03350
2011-12-21    380.00100
2011-12-22    379.95075
2011-12-23    379.91750
2011-12-27    379.95600
2011-12-28    379.90350
2011-12-29    380.11425
2011-12-30    380.30000
dtype: float64

Returns defined as:  $$r_t = \frac{p_t - p_{t-1}}{p_{t-1}} = \frac{p_t}{p_{t-1}} - 1$$

In [23]:
rets = close_px / close_px.shift(1) - 1
Out[23]:
Date
2006-10-02         NaN
2006-10-03   -0.010506
2006-10-04    0.017650
2006-10-05   -0.007317
2006-10-06   -0.008190

Or alternatively .pct_change().

In [24]:
Out[24]:
Date
2006-10-02         NaN
2006-10-03   -0.010506
2006-10-04    0.017650
2006-10-05   -0.007317
2006-10-06   -0.008190

Plotting Basics¶

Series and DataFrames have an associated .plot() command (uses Matplotlib behind the scenes).

In [25]:
close_px.plot(label='AAPL')
mavg.plot(label='mavg')
plt.legend()
Out[25]:
<matplotlib.legend.Legend at 0xa17cd8c>

Optional: If there is time...¶

In [26]:
df = pd.io.data.get_data_yahoo(['AAPL', 'GE', 'GOOG', 'IBM', 'KO', 'MSFT', 'PEP'],
start=datetime.datetime(2010, 1, 1),
Out[26]:
AAPL GE GOOG IBM KO MSFT PEP
Date
2010-01-04 209.51 13.81 626.75 124.58 25.77 28.29 55.08
2010-01-05 209.87 13.88 623.99 123.07 25.46 28.30 55.75
2010-01-06 206.53 13.81 608.26 122.27 25.45 28.12 55.19
2010-01-07 206.15 14.53 594.10 121.85 25.39 27.83 54.84
2010-01-08 207.52 14.84 602.02 123.07 24.92 28.02 54.66
In [27]:
rets = df.pct_change()
In [28]:
plt.scatter(rets.PEP, rets.KO)
plt.xlabel('Returns PEP')
plt.ylabel('Returns KO')
Out[28]:
<matplotlib.text.Text at 0xa1b5d8c>
In [29]:
pd.scatter_matrix(rets, diagonal='kde', figsize=(10, 10));
In [30]:
corr = rets.corr()
corr
Out[30]:
AAPL GE GOOG IBM KO MSFT PEP
AAPL 1.000000 0.457186 0.533405 0.501582 0.388537 0.472015 0.309023
GE 0.457186 1.000000 0.462540 0.609544 0.593366 0.600145 0.532517
GOOG 0.533405 0.462540 1.000000 0.470798 0.404247 0.472152 0.329978
IBM 0.501582 0.609544 0.470798 1.000000 0.602032 0.645565 0.490668
KO 0.388537 0.593366 0.404247 0.602032 1.000000 0.523689 0.636500
MSFT 0.472015 0.600145 0.472152 0.645565 0.523689 1.000000 0.476464
PEP 0.309023 0.532517 0.329978 0.490668 0.636500 0.476464 1.000000
In [31]:
plt.imshow(corr, cmap='hot', interpolation='none')
plt.colorbar()
plt.xticks(range(len(corr)), corr.columns)
plt.yticks(range(len(corr)), corr.columns);

One thing we are often interested in is the relationship of expected returns and the risk we are taking one. Often there is a trade-off between the two.

Here we use plt.annotate to put labels on the scatterplot.

In [32]:
plt.scatter(rets.mean(), rets.std())
plt.xlabel('Expected returns')
plt.ylabel('Risk')
for label, x, y in zip(rets.columns, rets.mean(), rets.std()):
plt.annotate(
label,
xy = (x, y), xytext = (20, -20),
textcoords = 'offset points', ha = 'right', va = 'bottom',
bbox = dict(boxstyle = 'round,pad=0.5', fc = 'yellow', alpha = 0.5),
arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))

Data alignment and dealing with missing values¶

• Join, merge, concatenate
• Filling NaNs

Lets pull some more securities from Yahoo finance and build a DataFrame of the adjusted closing prices.

In [33]:
series_list = []
securities = ['AAPL', 'GOOG', 'IBM', 'MSFT']
for security in securities:
s = pd.io.data.get_data_yahoo(security, start=datetime.datetime(2011, 10, 1), end=datetime.datetime(2013, 1, 1))['Adj Close']
s.name = security # Rename series to match security name
series_list.append(s)

pd.concat can be used to concatenate multiple Series into one DataFrame.

In [34]:
df = pd.concat(series_list, axis=1)
Out[34]:
AAPL GOOG IBM MSFT
Date
2011-10-03 366.72 495.52 168.24 23.32
2011-10-04 364.67 501.90 169.65 24.09
2011-10-05 370.30 504.70 171.69 24.61
2011-10-06 369.44 514.71 176.39 25.04
2011-10-07 362.02 515.12 177.07 24.95
In [35]:
df.ix[0, 'AAPL'] = np.nan
df.ix[1, ['GOOG', 'IBM']] = np.nan
df.ix[[1, 2, 3], 'MSFT'] = np.nan

Out[35]:
AAPL GOOG IBM MSFT
Date
2011-10-03 NaN 495.52 168.24 23.32
2011-10-04 364.67 NaN NaN NaN
2011-10-05 370.30 504.70 171.69 NaN
2011-10-06 369.44 514.71 176.39 NaN
2011-10-07 362.02 515.12 177.07 24.95

Pandas has great support for computing with missing values.

In [36]:
Out[36]:
Date
2011-10-03       NaN
2011-10-04       NaN
2011-10-05    875.00
2011-10-06    884.15
2011-10-07    877.14
dtype: float64

One common approacht to impute missing values in time series is forward filling.

In [37]: