The Python Data Science Stack

Dr. Florian Wilhelm

Senior Professional: Data Scientist @ CSC

What is Python?

multi-purpose
focused on readability and productivity
easy to learn
object oriented
interpreted
strongly and dynamically typed
cross platform

Features

indentation is part of the syntax
high level data types (tuples, lists, dictionaries, sets)
Python Standard Library (Batteries included)
- string services, regular expressions
- mathematical modules
- IO, file formats and data persistence
- OS, threading, multiprocessing
- networking, email, html, webserver
- ...
easily extensible with C/C++ (glue language)
tons of external libraries
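
As a rough illustration of the "batteries included" idea, the following sketch combines three standard-library modules (`re`, `collections` and `json`) without any third-party packages. The sample sentence and the variable names are made up for this example.

```python
import json
import re
from collections import Counter

# Hypothetical sample text, chosen only for illustration
text = "Python is easy to learn, and Python is easy to read."

# Regular expressions: extract all words in lowercase
words = re.findall(r"[a-z]+", text.lower())

# collections.Counter: count how often each word occurs
counts = Counter(words)

# json: serialise a small summary for storage or exchange
payload = json.dumps({"n_words": len(words), "python": counts["python"]}, sort_keys=True)
print(payload)
```

The same pattern extends to the other areas in the list, for example `math`, `csv`, `threading` or `http.server`.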

Why Python for Analytics?

Besides the features already mentioned, Python has:

large communities for data science, analytics, visualisation etc.,
many and well-established libraries,
lots of examples and documentation,
huge demand from the industry.

Python 2 vs. 3

Source: LearnToCodeWithMe

Use Python 3! All relevant libraries are ported!

Installation

Linux & Mac

It is already installed! Use virtualenv and pip to set up isolated environments and install more packages. Conda is an alternative.

Windows

A bit trickier! It is best to use the Anaconda distribution from Continuum Analytics to install everything you need to get going.

Primer on Python

Strongly and dynamically typed

In [2]:

x = 23
3*x

Out[2]:

69

In [3]:

x = "Hello "
y = "World!"
print(x + y)

Hello World!

In [4]:

print(x + 1)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-17ee3e2a0896> in <module>()
----> 1 print(x + 1)

TypeError: Can't convert 'int' object to str implicitly
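
The error above is strong typing at work: Python does not silently turn an `int` into a `str`. A minimal sketch of the explicit alternative, and of what "dynamic" means for a name, could look like this (the variable names are chosen for illustration):

```python
x = "Hello "

# Strong typing: str and int are never mixed implicitly, so convert explicitly
greeting = x + str(1)
total = int("23") + 1

# Dynamic typing: the same name can be rebound to a value of another type
value = 23
first_type = type(value).__name__
value = "twenty-three"
second_type = type(value).__name__

print(greeting, total, first_type, second_type)
```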

Indentation matters!

In [5]:

x = 3

if x > 0:
    if x % 2 == 0:
        print("Positive, even number!")
    else:
        print("Positive, odd number!")
else:
    print("Non-positive number!")

Positive, odd number!

In [6]:

def bmi(height, weight):
    return weight / height**2

print("The BMI is: {:.3}".format(bmi(1.85, 79)))

The BMI is: 23.1

Tuples

In [7]:

x = (1, 3, 5)
print(x)

(1, 3, 5)

In [8]:

x[2]

Out[8]:

5

In [9]:

a, b, c = x
print(a + b + c)

9

In [10]:

a, *others = x
print(a, 'and', others)

1 and [3, 5]
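
What sets tuples apart from the lists shown next is that they cannot be changed after creation. The sketch below, with made-up values, shows that property along with two things it enables: tuples as dictionary keys and swapping via unpacking.

```python
point = (1, 3, 5)

# Tuples are immutable: item assignment raises a TypeError
try:
    point[0] = 2
    mutated = True
except TypeError:
    mutated = False

# Being immutable (and hashable when their items are),
# tuples can serve as dictionary keys
distances = {(0, 0): 0.0, (3, 4): 5.0}

# Tuple unpacking also allows a swap without a temporary variable
a, b = 1, 2
a, b = b, a

print(mutated, distances[(3, 4)], a, b)
```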

Lists

In [11]:

x = [1, 3, 5]
print(x)

[1, 3, 5]

In [12]:

x.append(7)
print(x)

[1, 3, 5, 7]

In [13]:

del x[0]
print(x)

[3, 5, 7]

Dictionaries

In [14]:

x = {'a': 1, 'b': 2, 'c': 3}

In [15]:

print(x['b'])

2

In [16]:

x['d'] = 4
print(x)

{'a': 1, 'd': 4, 'b': 2, 'c': 3}

In [17]:

x['dispatch'] = lambda x: x + 41
x['dispatch'](1)

Out[17]:

42

Powerful and easy to use data structures like lists and dictionaries allow declarative programming.
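
One way to read "declarative" here: describe the available behaviour as data and let a lookup pick the right piece, instead of spelling out the control flow. The following sketch extends the `dispatch` idea from the cell above; the `operations` table and the `calculate` helper are hypothetical names for illustration.

```python
# Map operation names to functions instead of writing an if/elif chain
operations = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}

def calculate(name, a, b):
    """Look up the operation by name and apply it to a and b."""
    return operations[name](a, b)

results = [calculate(name, 6, 3) for name in ("add", "sub", "mul")]
print(results)
```

Adding a new operation then only requires a new dictionary entry, not a change to `calculate`.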

Loops and list comprehension

In [18]:

x = []
for i in range(5):
    x.append(i**2)
print(x)

[0, 1, 4, 9, 16]

In [19]:

# better
x = [i**2 for i in range(5)]
print(x)

[0, 1, 4, 9, 16]

Many more high-level concepts are available to express an algorithm as naturally as possible.
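
A few of those concepts, sketched with made-up data: `zip` and `enumerate` for iterating without index bookkeeping, dictionary comprehensions, and generator expressions.

```python
names = ["a", "b", "c"]
values = [1, 2, 3]

# zip pairs up two sequences; a dict comprehension builds a mapping from them
squares = {name: value**2 for name, value in zip(names, values)}

# enumerate yields (index, item) pairs without a manual counter
indexed = [(i, name) for i, name in enumerate(names)]

# A generator expression produces values lazily, without building a list first
total = sum(i**2 for i in range(5))

print(squares, indexed, total)
```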

Python Data Science Stack

NumPy to work efficiently with multi-dimensional arrays and matrices. Includes some high-level mathematical operations.
SciPy extends NumPy with additional modules (optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE solvers etc.).
Pandas builds upon NumPy and provides high-performance, easy-to-use data structures and data analysis tools.
Scikit-Learn provides simple and efficient machine learning tools for data mining and data analysis.
matplotlib provides 2D plotting capabilities. Additionally, use Seaborn for statistical plots.
IPython is a powerful interactive shell and a kernel for Jupyter.
Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text.

Many libraries internally use efficient C/C++ and FORTRAN implementations!
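
In practice this means that an operation on a whole NumPy array is expressed once in Python while the element-wise loop runs in compiled code. A minimal sketch, reusing the squares example from the primer and assuming NumPy is installed:

```python
import numpy as np

values = np.arange(5)

# Vectorised: the loop over the elements runs inside NumPy's compiled code
squares = values**2

# Equivalent pure-Python version for comparison
squares_loop = [i**2 for i in range(5)]

print(squares.tolist())
print(squares.sum())
```

For five elements the difference is irrelevant; the benefit of the vectorised form shows up on large arrays.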

Jupyter/IPython Notebook

Live demonstration

Titanic: Analysis of a Disaster

Painting by Willy Stöwer; source & more information: Kaggle

Can we predict who survived on the Titanic?

Based on the properties of a passenger like:

gender,
age,
passenger class,
number of siblings,
...

Setting things up and reading in the data

In [20]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

df = pd.read_csv('./input/train.csv')

In [22]:

df.head()

Out[22]:

	Alive	Class	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Port
0	0	3	Braund, Mr	male	22	1	A/5 21171	7.2500	NaN	S
1	1	1	Cumings, M	female	38	1	PC 17599	71.2833	C85	C
2	1	3	Heikkinen,	female	26	0	STON/O2. 3101282	7.9250	NaN	S
3	1	1	Futrelle,	female	35	1	113803	53.1000	C123	S
4	0	3	Allen, Mr.	male	35	0	373450	8.0500	NaN	S

Preprocessing

In [23]:

# We drop some hard to use columns and define 'Port', 'Sex' and 'Class' as categories
df = df.drop(['Name', 'Ticket', 'Cabin'], axis=1)
df['Port'] = df['Port'].astype('category')
df['Sex'] = df['Sex'].astype('category')
df['Class'] = df['Class'].astype('category')
df.head()

Out[23]:

	Alive	Class	Sex	Age	SibSp	Fare	Port
0	0	3	male	22	1	7.2500	S
1	1	1	female	38	1	71.2833	C
2	1	3	female	26	0	7.9250	S
3	1	1	female	35	1	53.1000	S
4	0	3	male	35	0	8.0500	S

In [24]:

df.shape

Out[24]:

(891, 8)
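
To see what the `category` dtype from the preprocessing step does, here is a small sketch on toy data (not the Titanic set; the `toy` frame is made up). pandas stores each distinct value once and keeps an integer code per row, which is what `.cat.codes` exposes.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})
toy["Sex"] = toy["Sex"].astype("category")

# Distinct values are stored once (sorted for strings) ...
categories = list(toy["Sex"].cat.categories)

# ... and every row holds the integer position of its category
codes = toy["Sex"].cat.codes.tolist()

print(categories)
print(codes)
```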

Data cleansing

In [25]:

df.describe()

Out[25]:

	Alive	Age	SibSp	Parch	Fare
count	891.000000	714.000000	891.000000	891.000000	891.000000
mean	0.383838	29.699118	0.523008	0.381594	32.204208
std	0.486592	14.526497	1.102743	0.806057	49.693429
min	0.000000	0.420000	0.000000	0.000000	0.000000
25%	0.000000	20.125000	0.000000	0.000000	7.910400
50%	0.000000	28.000000	0.000000	0.000000	14.454200
75%	1.000000	38.000000	1.000000	0.000000	31.000000
max	1.000000	80.000000	8.000000	6.000000	512.329200

In [26]:

df[['Sex', 'Port']].describe()

Out[26]:

	Sex	Port
count	891	889
unique	2	3
top	male	S
freq	577	644

In [27]:

# Fill in missing observations: mean age, and the most frequent port 'S'
age_mean = df['Age'].mean()
df['Age'] = df['Age'].fillna(age_mean)
df['Port'] = df['Port'].fillna('S')

In [28]:

df[['Sex', 'Port']].describe()

Out[28]:

	Sex	Port
count	891	891
unique	2	3
top	male	S
freq	577	646
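
The same cleansing steps can be sketched on a toy frame (made-up values, not the Titanic data): count the missing entries, then fill the numeric column with its mean and the categorical column with its most frequent value. Using `mode()` instead of hard-coding `'S'` is a choice made for this sketch.

```python
import numpy as np
import pandas as pd

# Toy data with one missing entry per column, for illustration only
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, 30.0],
    "Port": ["S", "S", "C", None],
})

# Count missing values per column before filling
missing_before = toy.isna().sum().to_dict()

# Numeric column: fill with the mean of the observed values
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())

# Categorical column: fill with the most frequent value
toy["Port"] = toy["Port"].fillna(toy["Port"].mode()[0])

missing_after = int(toy.isna().sum().sum())
print(missing_before, missing_after)
```

Mean imputation keeps the column average unchanged but shrinks its variance, so it is a simple default rather than a universally good choice.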

Some analysis plots

In [29]:

fig, ax = plt.subplots(1, 1, figsize=(12, 5))
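ax.set_xlim(0, 80)
# histplot with a KDE overlay replaces the deprecated distplot in recent seaborn releases
g = sns.histplot(df['Age'], kde=True, stat="density", color="b", ax=ax)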

In [30]:

# Draw a nested barplot to show survival for class and sex
# catplot is the current name of factorplot; its `size` argument is now `height`
g = sns.catplot(x="Class", y="Alive", hue="Sex", data=df, height=7, kind="bar", palette="muted")
g.set_ylabels("survival probability")
g.set_xlabels("passenger class")

Out[30]:

<seaborn.axisgrid.FacetGrid at 0x7faacdb509b0>

Fitting a simple predictive model

In [31]:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation

# Define features and target variables
X = df.drop('Alive', axis=1)
y = df['Alive']
# Convert categories to integer values
X['Sex'] = X['Sex'].cat.codes
X['Port'] = X['Port'].cat.codes
# Convert to NumPy arrays
X = X.values
y = y.values

In [32]:

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [33]:

# Create and train the model
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

In [34]:

# Make some predictions on the test set and compare with the truth
preds = model.predict(X_test)
print("Accuracy: {:.1%}".format(np.mean(preds == y_test)))

Accuracy: 83.4%
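
The accuracy line works because comparing two NumPy arrays yields an array of booleans, and the mean of booleans is the share of `True` values. A minimal sketch with hypothetical labels and predictions (not the Titanic results):

```python
import numpy as np

# Hypothetical ground truth and predictions, for illustration only
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

# Element-wise comparison: True where the prediction matches the truth
correct = y_pred == y_true

# The mean of the booleans is the fraction of correct predictions
accuracy = np.mean(correct)

print("Accuracy: {:.1%}".format(accuracy))
```

scikit-learn's `sklearn.metrics.accuracy_score` computes the same quantity.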

And there is much, much more!

Additional libraries for Data Science

Visualisation

Bokeh enables interactive visualizations such as plots and dashboards in web browsers.
Pygal for standard but sexy charts in SVG.

High-performance computing

Theano evaluates mathematical expressions involving multi-dimensional arrays efficiently.
Spark is a fast and general-purpose cluster computing system with PySpark as its Python interface.
Cython is an optimising static compiler for writing C-extensions in a Python-like language.
Dask enables parallel computing through task scheduling and blocked algorithms (Out-of-Core).
TensorFlow is a library for numerical computation using data flow graphs, often used for AI.

Miscellaneous

scikit-image provides tools for image processing.
SQLAlchemy is an SQL toolkit and Object Relational Mapper.
SymPy is a library for symbolic mathematics, i.e. a computer algebra system like Maple.
PyMC features Bayesian statistical models including Markov chain Monte Carlo.
NetworkX for working with structure, dynamics, and functions of complex networks.

Questions?

Credits

This presentation was inspired by: