import numpy
numpy.__version__
'1.9.1'
import scipy
scipy.__version__
'0.14.0'
import matplotlib
matplotlib.__version__
'1.4.0'
import sklearn
sklearn.__version__
'0.15.2'
Examples:
The examples shown here are in two dimensions; these techniques become really useful in higher dimensions.
import numpy as np
X = np.random.random((3, 5))
X
array([[ 0.95839979, 0.3972397 , 0.97018192, 0.54791763, 0.7120874 ], [ 0.45727437, 0.23183571, 0.68250897, 0.71685049, 0.98786237], [ 0.44067605, 0.58226782, 0.04361043, 0.77600408, 0.51852528]])
# access the second row
X[1]
array([ 0.45727437, 0.23183571, 0.68250897, 0.71685049, 0.98786237])
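Slicing columns works the same way; a quick sketch using the same X as above:
X[:, 1]      # access the second column (all rows, index 1)
X[[0, 2]]    # access rows 0 and 2 at once with a list of indices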
Turning a row vector into a column vector:
Y = np.linspace(0, 12, 5)
Y
array([ 0., 3., 6., 9., 12.])
Y[:, np.newaxis]
array([[ 0.], [ 3.], [ 6.], [ 9.], [ 12.]])
Y.shape
(5,)
Y[:, np.newaxis].shape
(5, 1)
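An equivalent way to get the column vector is reshape; a quick check:
Y.reshape(-1, 1).shape                            # -1 lets numpy infer the length, same (5, 1) shape
np.allclose(Y.reshape(-1, 1), Y[:, np.newaxis])   # same values, too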
Machine learning is mostly linear algebra (matrix-vector products).
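A minimal sketch of what that means, with placeholder names A and v:
A = np.random.random((3, 5))   # 3 samples with 5 features each
v = np.random.random(5)        # a weight vector of length 5
A.dot(v)                       # matrix-vector product: one number per sample
A.dot(v).shape                 # (3,)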
A broadcasting trick:
%matplotlib inline
import pylab as plt
x = np.linspace(1, 12, 100)
y = x[:, np.newaxis]
im = x * y
im.shape
(100, 100)
plt.imshow(im)
plt.colorbar()
<matplotlib.colorbar.Colorbar at 0x106e0fd68>
im[-1,-1]
144.0
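What happens above: x has shape (100,), y has shape (100, 1), and the product broadcasts both to (100, 100); it is just the outer product of x with itself. A quick check:
np.allclose(im, np.outer(x, x))   # True: broadcasting computed the outer product
x[-1] * x[-1]                     # 12 * 12 = 144, the bottom-right pixel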
Supervised vs unsupervised: "your data is labeled" vs "your data is not labeled"
In sklearn:
Example: the iris dataset, flowers to classify. What are the samples? The features? The labels (note that they are numerical)?
from sklearn.datasets import load_iris
iris = load_iris()
iris.keys()
dict_keys(['data', 'DESCR', 'feature_names', 'target_names', 'target'])
n_samples, n_features = iris.data.shape
n_samples
150
n_features
4
iris.data[0]
array([ 5.1, 3.5, 1.4, 0.2])
iris.target
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
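The targets are plain integers; the mapping to names and the class balance can be checked directly (a quick sketch):
iris.target_names          # ['setosa', 'versicolor', 'virginica']
np.bincount(iris.target)   # 50 samples of each class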
def plot_iris_projection(x_index, y_index):
    # label the colorbar ticks with the first 4 letters of the species name
    formatter = plt.FuncFormatter(lambda i, *args: iris.target_names[int(i)][:4])
    plt.scatter(iris.data[:, x_index], iris.data[:, y_index], c=iris.target)
    plt.colorbar(ticks=[0, 1, 2], format=formatter)
    plt.xlabel(iris.feature_names[x_index])
    plt.ylabel(iris.feature_names[y_index])
plot_iris_projection(0, 1)
from IPython.html.widgets import interact
def plot_iris_projection_widget(first_index, second_index):
    plot_iris_projection(first_index, second_index)

interact(plot_iris_projection_widget,
         first_index=(0, 3),
         second_index=(0, 3))
<function __main__.plot_iris_projection_widget>
plt.figure(figsize=(10, 10))
for i in range(4):
    for j in range(4):
        plt.subplot(4, 4, 4 * i + j + 1)
        plot_iris_projection(i, j)
plt.tight_layout()
Setosa is always a little bit apart. Virginica and versicolor always overlap a little bit; humans and ML algorithms alike will have problems with that.
Unsupervised algorithms are the ones that would cluster these points.
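A minimal sketch of what that could look like, using KMeans with 3 clusters (the cluster numbers are arbitrary and need not match the true labels):
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3)
clusters = kmeans.fit_predict(iris.data)   # labels found without ever looking at iris.target
clusters[:10]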
Datasets come in three families:
datasets.load_* (small datasets shipped with sklearn)
datasets.fetch_* (larger datasets downloaded on first use)
datasets.make_* (synthetic datasets generated on the fly)
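A sketch of how each family is used (fetch_olivetti_faces downloads the data on first use, make_blobs generates random data):
from sklearn import datasets

iris = datasets.load_iris()                  # load_*: small datasets shipped with sklearn
faces = datasets.fetch_olivetti_faces()      # fetch_*: larger datasets downloaded on demand
X, y = datasets.make_blobs(n_samples=100)    # make_*: synthetic data generated on the fly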
digits = sklearn.datasets.load_digits()
digits.keys()
dict_keys(['data', 'DESCR', 'target_names', 'target', 'images'])
n_samples, n_features = digits.data.shape
n_samples
1797
n_features
64
digits.data[0]
array([ 0., 0., 5., 13., 9., 1., 0., 0., 0., 0., 13., 15., 10., 15., 5., 0., 0., 3., 15., 2., 0., 11., 8., 0., 0., 4., 12., 0., 0., 8., 8., 0., 0., 5., 8., 0., 0., 9., 8., 0., 0., 4., 11., 0., 1., 12., 7., 0., 0., 2., 14., 5., 10., 12., 0., 0., 0., 0., 6., 13., 10., 0., 0., 0.])
index = 11
plt.imshow(digits.data[index].reshape((8, 8)), cmap=matplotlib.cm.gray, interpolation='nearest')
print(digits.target_names[digits.target[index]])
1
You can always take the value of each pixel and turn it into a feature.
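That is exactly what digits.data is: each 8x8 image flattened into a 64-dimensional feature vector. A quick check:
digits.images[0].shape                                  # (8, 8)
np.allclose(digits.images[0].ravel(), digits.data[0])   # True: data is just the flattened images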
The digits in this dataset are already preprocessed: in real life, the numbers do not all come nicely centered and scaled. Feature engineering is hard!
There's this really nice magic to load source code: %load. It takes a Python file as argument!
%load generate_readme.py
import os
import sys

if sys.version_info >= (3, 0):
    from urllib.parse import quote
else:
    from urllib import quote


def filename2url(filename):
    return "http://nbviewer.ipython.org/urls/raw.github.com/flothesof/posts/master/{0}".format(
        quote(filename))


if __name__ == "__main__":
    files = os.listdir(os.getcwd())
    files = filter(lambda s: s.endswith('.ipynb'), files)

    header = """posts
=====
This is a sort of blog / work in progress repository for interesting projects that pop into my mind.
![files/xkcd_departments.png](files/xkcd_departments.png)
"""

    with open('README.md', 'w') as f:
        f.write(header)
        for index, filename in enumerate(files):
            f.write("- [%i-%s](%s)\n" % (index + 1,
                                         filename,
                                         filename2url(filename)))
Olivetti faces dataset: you can do PCA on them and get eigenfaces!!! I want to do that!!!
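A minimal sketch of what that could look like (the number of components is an arbitrary choice):
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA

faces = fetch_olivetti_faces()
pca = PCA(n_components=16)
pca.fit(faces.data)
eigenfaces = pca.components_.reshape((16, 64, 64))   # each principal component is an "eigenface"
plt.imshow(eigenfaces[0], cmap=matplotlib.cm.gray)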
Supervised learning: estimators have a "predict" method.
Unsupervised learning: estimators have a "transform" method.
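A sketch of that shared API on the iris data (KNeighborsClassifier and PCA are just example estimators):
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA

# supervised: fit on data and labels, then predict
clf = KNeighborsClassifier()
clf.fit(iris.data, iris.target)
clf.predict(iris.data[:5])

# unsupervised: fit on data only, then transform
pca = PCA(n_components=2)
pca.fit(iris.data)
pca.transform(iris.data[:5])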
Questions: