# The Path of the PyData Ninja

### Dr. Thomas Wiecki

• We back the best investment algorithms with investor capital, trading operations, and technology.
• Do your research in our hosted IPython environment using stock price history, corporate fundamental data, and other data sets.
• When you enter the contest, your algorithm will also be considered for our hedge fund.
• We're hiring in Düsseldorf: Operations Engineer!

## Why use Python for data analysis?

• Python is a general-purpose language -> no hodge-podge of Perl, Bash, MATLAB, and Fortran.
• Very easy to learn.
• Quality and quantity of data analysis libraries is very high and growing at a rapid pace.
• What are the alternatives?
• R: "The best thing about R is that it was written by statisticians. The worst thing about R is that it was written by statisticians." — Bo Cowgill
• MATLAB: \$, not open source

## The PyData Stack

Source: [Jake VanderPlas: State of the Tools](https://www.youtube.com/watch?v=5GlNDD7qbP4)

# Python basics

### Interpreted and interactive

In [1]:
3 * 4

Out[1]:
12

### Lists

In [2]:
x = [1, 2, 3]
print(x)

[1, 2, 3]

In [3]:
x.append(4)
print(x)

[1, 2, 3, 4]


### Dictionaries

In [4]:
measurements = {'height': [1.70, 1.80, 1.50], 'weight': [60, 120, 50]}
measurements

Out[4]:
{'height': [1.7, 1.8, 1.5], 'weight': [60, 120, 50]}
In [5]:
measurements['height']

Out[5]:
[1.7, 1.8, 1.5]

### Comprehensions

In [6]:
x = [1, 2, 3, 4]
[i**2 for i in x]

Out[6]:
[1, 4, 9, 16]
In [7]:
def calc_bmi(weight, height):
    return weight / height**2

[calc_bmi(w, h) for w, h in zip(measurements['weight'], measurements['height'])]

Out[7]:
[20.761245674740486, 37.03703703703704, 22.22222222222222]

## Pandas

In [8]:
import pandas as pd
import numpy as np

In [9]:
s = pd.Series([1,3,5,np.nan,6,8])
s

Out[9]:
0     1
1     3
2     5
3   NaN
4     6
5     8
dtype: float64
In [10]:
dates = pd.date_range('20130101', periods=6)

df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df

Out[10]:
A B C D
2013-01-01 -1.825304 0.016285 -0.306601 0.004272
2013-01-02 0.192045 -0.291267 1.951792 1.708918
2013-01-03 -0.672820 -2.051800 0.113415 -0.594967
2013-01-04 -1.453007 1.282767 -0.113163 -1.824503
2013-01-05 0.824425 2.156586 -0.047000 -1.042518
2013-01-06 0.359821 -0.896814 0.617897 -0.723848
In [11]:
df[df.A > 0]

Out[11]:
A B C D
2013-01-02 0.192045 -0.291267 1.951792 1.708918
2013-01-05 0.824425 2.156586 -0.047000 -1.042518
2013-01-06 0.359821 -0.896814 0.617897 -0.723848
In [12]:
df.mean()

Out[12]:
A   -0.429140
B    0.035960
C    0.369390
D   -0.412108
dtype: float64
In [13]:
df.mean(axis='columns')

Out[13]:
2013-01-01   -0.527837
2013-01-02    0.890372
2013-01-03   -0.801543
2013-01-04   -0.526977
2013-01-05    0.472873
2013-01-06   -0.160736
Freq: D, dtype: float64

## Mixed types

In [14]:
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})

df2

Out[14]:
A B C D E F
0 1 2013-01-02 1 3 test foo
1 1 2013-01-02 1 3 train foo
2 1 2013-01-02 1 3 test foo
3 1 2013-01-02 1 3 train foo
In [15]:
df2.dtypes

Out[15]:
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

### Grouping

In [16]:
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})

In [17]:
df

Out[17]:
A B C D
0 foo one 0.343733 -0.387430
1 bar one 2.061690 2.454234
2 foo two 0.626621 -1.106039
3 bar three 1.511600 0.963582
4 foo two -1.187094 -1.310974
5 bar two -1.065222 -0.983320
6 foo one -0.924636 0.349164
7 foo three 2.093810 -0.913885
In [18]:
df.groupby('A').sum()

Out[18]:
C D
A
bar 2.508068 2.434496
foo 0.952434 -3.369164
In [19]:
df.groupby(['A', 'B']).sum()

Out[19]:
C D
A B
bar one 2.061690 2.454234
three 1.511600 0.963582
two -1.065222 -0.983320
foo one -0.580903 -0.038266
three 2.093810 -0.913885
two -0.560473 -2.417013

## Seaborn: Generating statistical plots

In [20]:
%matplotlib inline
import seaborn as sns

In [21]:
x = np.random.normal(size=100)
sns.distplot(x);


### 2D distributions

In [22]:
mean, cov = [0, 1], [(1, .5), (.5, 1)]
data = np.random.multivariate_normal(mean, cov, 200)
df = pd.DataFrame(data, columns=["x", "y"])
df

Out[22]:
x y
0 1.086599 0.232294
1 -0.443143 -0.354076
2 0.652826 1.165967
3 -0.903281 0.170521
4 -1.148413 2.050707
5 -0.176674 2.375160
6 0.845552 1.879006
7 -0.962104 0.340105
8 0.038327 0.011257
9 -0.704808 -0.132214
10 -0.265136 0.426827
11 1.327651 3.319972
12 -0.044100 0.249657
13 0.451112 0.740770
14 0.596272 1.360725
15 -0.137805 1.041698
16 1.168201 2.713771
17 -1.319453 -0.503621
18 0.461970 2.389124
19 -0.334954 1.328563
20 -0.649478 1.537904
21 0.310419 1.523426
22 0.710832 0.943226
23 -0.823799 1.775166
24 0.992667 1.022176
25 0.568753 0.557493
26 -0.301274 1.079018
27 -1.733951 0.567187
28 0.952436 1.852690
29 -1.174882 0.873883
... ... ...
170 0.563523 2.138955
171 1.033686 4.141070
172 0.292181 1.408222
173 -2.301801 -0.042128
174 0.104059 0.542621
175 0.588103 0.567352
176 -1.073328 -0.120820
177 1.853767 0.099758
178 1.075379 2.611514
179 0.559673 3.366675
180 0.747945 2.053162
181 -2.927040 -0.621391
182 0.134816 0.727113
183 0.015388 2.829295
184 0.742623 2.058462
185 -0.229863 1.139025
186 -1.033777 -0.642099
187 0.428791 0.762164
188 -0.596019 2.079460
189 1.819033 1.434940
190 0.254380 1.494345
191 2.276169 0.408099
192 -0.076959 -1.321194
193 0.496929 -0.174779
194 0.723700 1.561495
195 1.097657 2.438591
196 0.460952 2.506192
197 0.458422 1.522599
198 1.411305 1.632619
199 -1.533196 2.016695

200 rows × 2 columns

In [23]:
sns.jointplot(x="x", y="y", data=df, kind="kde");


### All pairwise combinations

In [24]:
iris = sns.load_dataset("iris")
sns.pairplot(iris);


## Seaborn: Regressions

In [25]:
tips = sns.load_dataset("tips")

In [26]:
tips.head()

Out[26]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
In [27]:
sns.lmplot(x="total_bill", y="tip", hue="smoker", data=tips);

In [28]:
sns.lmplot(x="total_bill", y="tip", col="day", data=tips,
           col_wrap=2, size=3);

In [29]:
sns.factorplot(x="time", y="total_bill", hue="smoker",
               col="day", data=tips, kind="box", size=4, aspect=.5);


## Lots of machine learning and stats libraries

• SciPy: comprehensive library of numerical routines like optimizers, integrators, FFT.
• scikit-learn: the go-to machine learning library
• statsmodels: Frequentist statistics
• SymPy: Symbolic Math
• PyMC3: Probabilistic programming in Python
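As a quick taste of the first bullet, SciPy's optimizers are one import away. A minimal sketch (the quadratic objective here is just a toy example, not from the talk):

```python
from scipy import optimize

# Minimize a simple 1-D function: f(x) = (x - 2)**2 + 1
result = optimize.minimize_scalar(lambda x: (x - 2)**2 + 1)
print(result.x)    # minimizer, close to 2.0
print(result.fun)  # minimum value, close to 1.0
```

The same `scipy.optimize` module also offers multivariate minimizers, root finders, and least-squares routines behind a uniform interface.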

## scikit-learn

In [30]:
from sklearn import svm
X = [[0, 0], [1, 1]]
y = [0, 1]
clf = svm.SVC()
clf.fit(X, y)

Out[30]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
kernel='rbf', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False)
In [31]:
clf.predict([[0, .5]])

Out[31]:
array([0])

### Advanced example: Grid Search with Cross-Validation to find hyperparameters

In [32]:
from sklearn import datasets
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC


In [33]:
import matplotlib.pyplot as plt

# Load the handwritten digits dataset and display the last digit
digits = datasets.load_digits()
plt.figure(1, figsize=(3, 3))
plt.imshow(digits.images[-1], cmap=plt.cm.gray_r, interpolation='nearest')
plt.grid(False)

In [34]:
n_samples = len(digits.images)
X = digits.images.reshape((n_samples, -1))
y = digits.target

# Split the dataset in two equal parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

In [35]:
# Set the parameters by cross-validation
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

clf = GridSearchCV(SVC(C=1), tuned_parameters, cv=5)
clf.fit(X_train, y_train)

Out[35]:
GridSearchCV(cv=5, error_score='raise',
estimator=SVC(C=1, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
kernel='rbf', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False),
fit_params={}, iid=True, loss_func=None, n_jobs=1,
param_grid=[{'C': [1, 10, 100, 1000], 'kernel': ['rbf'], 'gamma': [0.001, 0.0001]}, {'C': [1, 10, 100, 1000], 'kernel': ['linear']}],
pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,
verbose=0)
In [36]:
print(clf.best_params_)

{'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}

In [37]:
y_true, y_pred = y_test, clf.predict(X_test)
ax = sns.heatmap(confusion_matrix(y_true, y_pred))
ax.set(xlabel='predicted label', ylabel='true label');


## Python is slow!

• Interpreted Python code is indeed quite slow (just as MATLAB and R are slow)

## Great tools to generate C-code

• Cython: Write Python-like syntax that can be translated to fast C-code and called from Python.
• Numba: Directly write Python and auto-translate to LLVM.
• Theano: Write numerical expressions in a NumPy-like syntax to build a compute graph that can be compiled.
• PyCUDA: GPU programming.

## Comparing Python, Cython and Numba

In [38]:
import numpy as np
X = np.random.random((1000, 3))

In [39]:
def pairwise_python(X):
    M = X.shape[0]
    N = X.shape[1]
    D = np.empty((M, M), dtype=np.float64)
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            D[i, j] = np.sqrt(d)
    return D
%timeit pairwise_python(X)

1 loops, best of 3: 6.34 s per loop

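Before reaching for a compiler, note that the same distance matrix can often be expressed with vectorized NumPy, which already eliminates the Python-level loops. A sketch (not part of the original benchmark; timings will vary):

```python
import numpy as np

def pairwise_numpy(X):
    # Broadcast to shape (M, M, N), square the differences,
    # then sum over the feature axis and take the square root
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

X = np.random.random((1000, 3))
D = pairwise_numpy(X)
```

The trade-off is memory: broadcasting materializes an (M, M, N) intermediate array, which the loop-based Cython and Numba versions below avoid.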

## Cython

In [40]:
%load_ext cython

In [41]:
%%cython
import numpy as np
cimport cython
from libc.math cimport sqrt

@cython.boundscheck(False)
@cython.wraparound(False)
def pairwise_cython(double[:, ::1] X):
    cdef int M = X.shape[0]
    cdef int N = X.shape[1]
    cdef int i, j, k
    cdef double tmp, d
    cdef double[:, ::1] D = np.empty((M, M), dtype=np.float64)
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            D[i, j] = sqrt(d)
    return np.asarray(D)

In [42]:
%timeit pairwise_cython(X)

100 loops, best of 3: 7.11 ms per loop


## Numba

In [43]:
from numba.decorators import autojit

pairwise_numba = autojit(pairwise_python)

# Run once to compile before timing
pairwise_numba(X)

%timeit pairwise_numba(X)

100 loops, best of 3: 7.31 ms per loop


# Lots of things happening!

• Ibis
• PySpark
• bcolz

• Bokeh
• Plotly
• pyxley

## Work interactively on Big Data with Dask

In [44]:
!ls -lahL POIWorld.csv

-rw-rw-r-- 1 wiecki wiecki 774M Sep  8 14:43 POIWorld.csv

In [45]:
from dask import dataframe as dd
columns = ["name", "amenity", "Longitude", "Latitude"]
data = dd.read_csv('POIWorld.csv', usecols=columns)
data

Out[45]:
dd.DataFrame<read-csv-POIWorld.csv-74f130b10caf7d8ca68886970e302ba2, divisions=(None, None, None, ..., None, None)>
In [46]:
with_name = data[data.name.notnull()]

In [47]:
is_starbucks = with_name.name.str.contains('[Ss]tarbucks')
is_dunkin = with_name.name.str.contains('[Dd]unkin')

starbucks = with_name[is_starbucks]
dunkin = with_name[is_dunkin]

In [49]:
from dask.diagnostics import ProgressBar

with ProgressBar():
    starbucks_count, dunkin_count = dd.compute(starbucks.name.count(), dunkin.name.count())

[########################################] | 100% Completed |  1min  5.8s

In [50]:
starbucks_count, dunkin_count

Out[50]:
(5416, 1320)
In [51]:
locs = dd.compute(starbucks.Longitude,
                  starbucks.Latitude,
                  dunkin.Longitude,
                  dunkin.Latitude)

# extract arrays of values from the series:
lon_s, lat_s, lon_d, lat_d = [loc.values for loc in locs]

In [52]:
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

def draw_USA():
    """Initialize a Basemap centered on the continental USA."""
    plt.figure(figsize=(14, 10))
    return Basemap(projection='lcc', resolution='l',
                   llcrnrlon=-119, urcrnrlon=-64,
                   llcrnrlat=22, urcrnrlat=49,
                   lat_1=33, lat_2=45, lon_0=-95,
                   area_thresh=10000)

m = draw_USA()

# Draw map background
m.fillcontinents(color='white', lake_color='#eeeeee')
m.drawstates(color='lightgray')
m.drawcoastlines(color='lightgray')
m.drawcountries(color='lightgray')
m.drawmapboundary(fill_color='#eeeeee')

# Plot the values in Starbucks Green and Dunkin Donuts Orange
style = dict(s=5, marker='o', alpha=0.5, zorder=2)
m.scatter(lon_s, lat_s, latlon=True,
label="Starbucks", color='#00592D', **style)
m.scatter(lon_d, lat_d, latlon=True,
label="Dunkin' Donuts", color='#FC772A', **style)
plt.legend(loc='lower left', frameon=False);


## Interactive data visualization with Bokeh

In [56]:
show(grid)