Introduction to Data Analysis with Python

The Path of the PyData Ninja


Dr. Thomas Wiecki


Lead Data Scientist

About me

  • We back the best investment algorithms with investor capital, trading operations, and technology.
  • Do your research in our hosted IPython environment using stock price history, corporate fundamental data, and other data sets.
  • Write your algorithm in your browser. Then backtest it, for free, over 13 years of minute-level data.
  • When you enter the contest, your algorithm will also be considered for our hedge fund.
  • We're hiring in Düsseldorf: Operations Engineer!

Why use Python for data analysis?

  • Python is a general-purpose language -> no hodge-podge of Perl, Bash, MATLAB, and Fortran.
  • Very easy to learn.
  • The quality and quantity of data analysis libraries are very high and growing at a rapid pace.
  • What are the alternatives?
    • R: "The best thing about R is that it was written by statisticians. The worst thing about R is that it was written by statisticians." Bo Cowgill
    • MATLAB: $$$, not open source

Jobs!

The PyData Stack

Source: [Jake VanderPlas: State of the Tools](https://www.youtube.com/watch?v=5GlNDD7qbP4)


Level 0: n00b

How to get started

Python basics

Interpreted and interactive

In [1]:
3 * 4
Out[1]:
12

Lists

In [2]:
x = [1, 2, 3]
print(x)
[1, 2, 3]
In [3]:
x.append(4)
print(x)
[1, 2, 3, 4]

Dictionaries

In [4]:
measurements = {'height': [1.70, 1.80, 1.50], 'weight': [60, 120, 50]}
measurements
Out[4]:
{'height': [1.7, 1.8, 1.5], 'weight': [60, 120, 50]}
In [5]:
measurements['height']
Out[5]:
[1.7, 1.8, 1.5]

Comprehensions

In [6]:
x = [1, 2, 3, 4]
[i**2 for i in x]
Out[6]:
[1, 4, 9, 16]
In [7]:
def calc_bmi(weight, height):
    return weight / height**2

[calc_bmi(w, h) for w, h in zip(measurements['weight'], measurements['height'])]
Out[7]:
[20.761245674740486, 37.03703703703704, 22.22222222222222]

Level 1: "The Pandas Wrangler"

How to become a "Pandas Wrangler"

Why not start with NumPy and Matplotlib?
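
One way to read this slide (my own gloss, not from the talk): pandas keeps labels attached to the data, while with raw NumPy arrays you have to track column positions yourself. A minimal sketch:

import numpy as np
import pandas as pd

data = np.array([[1.70, 60.], [1.80, 120.], [1.50, 50.]])

# With plain NumPy you index by position and must remember which column is which.
heights = data[:, 0]

# With pandas the labels travel with the data.
df = pd.DataFrame(data, columns=['height', 'weight'])
df['height']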

Pandas

In [8]:
import pandas as pd
import numpy as np
In [9]:
s = pd.Series([1,3,5,np.nan,6,8])
s
Out[9]:
0     1
1     3
2     5
3   NaN
4     6
5     8
dtype: float64
In [10]:
dates = pd.date_range('20130101', periods=6)

df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df
Out[10]:
A B C D
2013-01-01 -1.825304 0.016285 -0.306601 0.004272
2013-01-02 0.192045 -0.291267 1.951792 1.708918
2013-01-03 -0.672820 -2.051800 0.113415 -0.594967
2013-01-04 -1.453007 1.282767 -0.113163 -1.824503
2013-01-05 0.824425 2.156586 -0.047000 -1.042518
2013-01-06 0.359821 -0.896814 0.617897 -0.723848
In [11]:
df[df.A > 0]
Out[11]:
A B C D
2013-01-02 0.192045 -0.291267 1.951792 1.708918
2013-01-05 0.824425 2.156586 -0.047000 -1.042518
2013-01-06 0.359821 -0.896814 0.617897 -0.723848
In [12]:
df.mean()
Out[12]:
A   -0.429140
B    0.035960
C    0.369390
D   -0.412108
dtype: float64
In [13]:
df.mean(axis='columns')
Out[13]:
2013-01-01   -0.527837
2013-01-02    0.890372
2013-01-03   -0.801543
2013-01-04   -0.526977
2013-01-05    0.472873
2013-01-06   -0.160736
Freq: D, dtype: float64

Mixed types

In [14]:
df2 = pd.DataFrame({ 'A' : 1.,
                     'B' : pd.Timestamp('20130102'),
                     'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                     'D' : np.array([3] * 4,dtype='int32'),
                     'E' : pd.Categorical(["test","train","test","train"]),
                     'F' : 'foo' })

df2
Out[14]:
A B C D E F
0 1 2013-01-02 1 3 test foo
1 1 2013-01-02 1 3 train foo
2 1 2013-01-02 1 3 test foo
3 1 2013-01-02 1 3 train foo
In [15]:
df2.dtypes
Out[15]:
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

Grouping

In [16]:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
In [17]:
df
Out[17]:
A B C D
0 foo one 0.343733 -0.387430
1 bar one 2.061690 2.454234
2 foo two 0.626621 -1.106039
3 bar three 1.511600 0.963582
4 foo two -1.187094 -1.310974
5 bar two -1.065222 -0.983320
6 foo one -0.924636 0.349164
7 foo three 2.093810 -0.913885
In [18]:
df.groupby('A').sum()
Out[18]:
C D
A
bar 2.508068 2.434496
foo 0.952434 -3.369164
In [19]:
df.groupby(['A', 'B']).sum()
Out[19]:
C D
A B
bar one 2.061690 2.454234
three 1.511600 0.963582
two -1.065222 -0.983320
foo one -0.580903 -0.038266
three 2.093810 -0.913885
two -0.560473 -2.417013

Seaborn: Generating statistical plots

In [20]:
%matplotlib inline
import seaborn as sns
In [21]:
x = np.random.normal(size=100)
sns.distplot(x);

2D distributions

In [22]:
mean, cov = [0, 1], [(1, .5), (.5, 1)]
data = np.random.multivariate_normal(mean, cov, 200)
df = pd.DataFrame(data, columns=["x", "y"])
df
Out[22]:
x y
0 1.086599 0.232294
1 -0.443143 -0.354076
2 0.652826 1.165967
3 -0.903281 0.170521
4 -1.148413 2.050707
... ... ...
199 -1.533196 2.016695

200 rows × 2 columns

In [23]:
sns.jointplot(x="x", y="y", data=df, kind="kde");

All pairwise combinations

In [24]:
iris = sns.load_dataset("iris")
sns.pairplot(iris);

Seaborn: Regressions

In [25]:
tips = sns.load_dataset("tips")
In [26]:
tips.head()
Out[26]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
In [27]:
sns.lmplot(x="total_bill", y="tip", hue="smoker", data=tips);
In [28]:
sns.lmplot(x="total_bill", y="tip", col="day", data=tips,
           col_wrap=2, size=3);
In [29]:
sns.factorplot(x="time", y="total_bill", hue="smoker",
               col="day", data=tips, kind="box", size=4, aspect=.5);

Level 2: "The Kaggle top scorer"

Lots of machine learning and stats libraries

  • SciPy: comprehensive library of numerical routines like optimizers, integrators, and FFTs (see the short sketch after this list).
  • scikit-learn: the standard machine-learning library
  • statsmodels: Frequentist statistics
  • SymPy: Symbolic Math
  • PyMC3: Probabilistic programming in Python
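
A taste of the SciPy bullet above (my own minimal example, not from the talk): a one-line numerical optimization.

from scipy import optimize

# Minimize a simple quadratic; the optimizer should land near x = 3.
result = optimize.minimize(lambda x: (x - 3) ** 2, x0=0.0)
result.x
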
In [30]:
from sklearn import svm
X = [[0, 0], [1, 1]]
y = [0, 1]
clf = svm.SVC()
clf.fit(X, y)
Out[30]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)
In [31]:
clf.predict([[0, .5]])
Out[31]:
array([0])
In [32]:
from sklearn import datasets
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC

digits = datasets.load_digits()
In [33]:
import matplotlib.pyplot as plt
# Display the last digit in the dataset
plt.figure(1, figsize=(3, 3))
plt.imshow(digits.images[-1], cmap=plt.cm.gray_r, interpolation='nearest')
plt.grid(False)
In [34]:
n_samples = len(digits.images)
X = digits.images.reshape((n_samples, -1))
y = digits.target

# Split the dataset in two equal parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)
In [35]:
# Set the parameters by cross-validation
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

clf = GridSearchCV(SVC(C=1), tuned_parameters, cv=5)
clf.fit(X_train, y_train)
Out[35]:
GridSearchCV(cv=5, error_score='raise',
       estimator=SVC(C=1, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False),
       fit_params={}, iid=True, loss_func=None, n_jobs=1,
       param_grid=[{'C': [1, 10, 100, 1000], 'kernel': ['rbf'], 'gamma': [0.001, 0.0001]}, {'C': [1, 10, 100, 1000], 'kernel': ['linear']}],
       pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,
       verbose=0)
In [36]:
print(clf.best_params_)
{'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
In [37]:
y_true, y_pred = y_test, clf.predict(X_test)
ax = sns.heatmap(confusion_matrix(y_true, y_pred))
ax.set(xlabel='predicted label', ylabel='true label');

Level 3: "Lord of Speed"

Python is slow!

  • The interpreted language is indeed quite slow (just like MATLAB and R are slow).
  • Vectorizing computations (i.e. the MATLAB way) can lead to unreadable code (see the sketch below).
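
A minimal sketch of what "vectorizing" means here (my addition, not from the talk): the pairwise-distance benchmark from the Cython/Numba comparison below, written with NumPy broadcasting instead of explicit loops. Fast, but the intent is harder to read than the plain triple loop.

import numpy as np

def pairwise_numpy(X):
    # Broadcast to an (M, M, N) array of differences, then reduce over the last axis.
    diff = X[:, np.newaxis, :] - X[np.newaxis, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))
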

Great tools to generate C-code

  • Cython: Write Python-like syntax that can be translated to fast C-code and called from Python.
  • Numba: Write plain Python and have it auto-compiled to LLVM.
  • Theano: Write numerical expressions in a NumPy-like syntax to build up a compute graph that can be compiled (see the short sketch after this list).
  • PyCUDA: GPU programming.
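
A minimal Theano sketch (my addition, assuming Theano is installed): build a small symbolic expression and compile it into a callable function.

import theano
import theano.tensor as T

x = T.dvector('x')           # symbolic vector
y = (x ** 2).sum()           # build up the compute graph
f = theano.function([x], y)  # compile it
f([1.0, 2.0, 3.0])           # -> array(14.0)
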

Comparing Python, Cython and Numba

Taken from https://jakevdp.github.io/blog/2013/06/15/numba-vs-cython-take-2/

In [38]:
import numpy as np
X = np.random.random((1000, 3))
In [39]:
def pairwise_python(X):
    M = X.shape[0]
    N = X.shape[1]
    D = np.empty((M, M), dtype=np.float64)
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            D[i, j] = np.sqrt(d)
    return D
%timeit pairwise_python(X)
1 loops, best of 3: 6.34 s per loop

Cython

In [40]:
%load_ext cython
In [41]:
%%cython
import numpy as np
cimport cython
from libc.math cimport sqrt

@cython.boundscheck(False)
@cython.wraparound(False)
def pairwise_cython(double[:, ::1] X):
    cdef int M = X.shape[0]
    cdef int N = X.shape[1]
    cdef double tmp, d
    cdef double[:, ::1] D = np.empty((M, M), dtype=np.float64)
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            D[i, j] = sqrt(d)
    return np.asarray(D)
In [42]:
%timeit pairwise_cython(X)
100 loops, best of 3: 7.11 ms per loop

Numba

In [43]:
from numba.decorators import autojit

pairwise_numba = autojit(pairwise_python)

# Run once to compile before timing
pairwise_numba(X)

%timeit pairwise_numba(X)
100 loops, best of 3: 7.31 ms per loop

Level 4: "High Priest of Big Data"

Lots of things happening!

Big Data

  • Dask
  • Ibis
  • PySpark
  • bcolz

Interactive data visualization

  • Bokeh
  • Plotly
  • pyxley

Work interactively on Big Data with Dask

Taken from https://jakevdp.github.io/blog/2015/08/14/out-of-core-dataframes-in-python/

In [44]:
!ls -lahL POIWorld.csv
-rw-rw-r-- 1 wiecki wiecki 774M Sep  8 14:43 POIWorld.csv
In [45]:
from dask import dataframe as dd
columns = ["name", "amenity", "Longitude", "Latitude"]
data = dd.read_csv('POIWorld.csv', usecols=columns)
data
Out[45]:
dd.DataFrame<read-csv-POIWorld.csv-74f130b10caf7d8ca68886970e302ba2, divisions=(None, None, None, ..., None, None)>
In [46]:
with_name = data[data.name.notnull()]
In [47]:
is_starbucks = with_name.name.str.contains('[Ss]tarbucks')
is_dunkin = with_name.name.str.contains('[Dd]unkin')

starbucks = with_name[is_starbucks]
dunkin = with_name[is_dunkin]
In [49]:
from dask.diagnostics import ProgressBar

with ProgressBar():
    starbucks_count, dunkin_count = dd.compute(starbucks.name.count(), dunkin.name.count())
[########################################] | 100% Completed |  1min  5.8s
In [50]:
starbucks_count, dunkin_count
Out[50]:
(5416, 1320)
In [51]:
locs = dd.compute(starbucks.Longitude,
                  starbucks.Latitude,
                  dunkin.Longitude,
                  dunkin.Latitude)

# extract arrays of values from the series:
lon_s, lat_s, lon_d, lat_d = [loc.values for loc in locs]
In [52]:
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

def draw_USA():
    """initialize a basemap centered on the continental USA"""
    plt.figure(figsize=(14, 10))
    return Basemap(projection='lcc', resolution='l',
                   llcrnrlon=-119, urcrnrlon=-64,
                   llcrnrlat=22, urcrnrlat=49,
                   lat_1=33, lat_2=45, lon_0=-95,
                   area_thresh=10000)

m = draw_USA()
    
# Draw map background
m.fillcontinents(color='white', lake_color='#eeeeee')
m.drawstates(color='lightgray')
m.drawcoastlines(color='lightgray')
m.drawcountries(color='lightgray')
m.drawmapboundary(fill_color='#eeeeee')

# Plot the values in Starbucks Green and Dunkin Donuts Orange
style = dict(s=5, marker='o', alpha=0.5, zorder=2)
m.scatter(lon_s, lat_s, latlon=True,
          label="Starbucks", color='#00592D', **style)
m.scatter(lon_d, lat_d, latlon=True,
          label="Dunkin' Donuts", color='#FC772A', **style)
plt.legend(loc='lower left', frameon=False);

Interactive data visualization with Bokeh

In [56]:
show(grid)
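
The cells that build grid are not shown above. A minimal sketch of a comparable Bokeh grid (my own example, assuming a current bokeh install; the original slide's plots may have looked quite different):

import numpy as np
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
from bokeh.layouts import gridplot

output_notebook()  # render plots inline in the notebook

x = np.linspace(0, 10, 200)
p1 = figure(title="sin(x)")
p1.line(x, np.sin(x))
p2 = figure(title="cos(x)")
p2.line(x, np.cos(x))

grid = gridplot([[p1, p2]])
show(grid)
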

Staying up-to-date