import numpy
numpy.__version__
'1.9.1'
import scipy
scipy.__version__
'0.14.0'
import matplotlib
matplotlib.__version__
'1.4.0'
import sklearn
sklearn.__version__
'0.15.2'
Examples:
The examples shown here are in two dimensions; these techniques become really useful in higher dimensions.
import numpy as np
X = np.random.random((3, 5))
X
array([[ 0.95839979, 0.3972397 , 0.97018192, 0.54791763, 0.7120874 ], [ 0.45727437, 0.23183571, 0.68250897, 0.71685049, 0.98786237], [ 0.44067605, 0.58226782, 0.04361043, 0.77600408, 0.51852528]])
# access the second row
X[1]
array([ 0.45727437, 0.23183571, 0.68250897, 0.71685049, 0.98786237])
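Slicing columns works the same way; a quick sketch using the same X as above:
X[:, 1]      # access the second column (all rows, index 1)
X[[0, 2]]    # access rows 0 and 2 at once with a list of indices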
Turning a row vector into a column vector:
Y = np.linspace(0, 12, 5)
Y
array([ 0., 3., 6., 9., 12.])
Y[:, np.newaxis]
array([[ 0.], [ 3.], [ 6.], [ 9.], [ 12.]])
Y.shape
(5,)
Y[:, np.newaxis].shape
(5, 1)
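An equivalent way to get the column vector is reshape; a quick check:
Y.reshape(-1, 1).shape                            # -1 lets numpy infer the length, same (5, 1) shape
np.allclose(Y.reshape(-1, 1), Y[:, np.newaxis])   # same values, too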
Machine learning is mostly linear algebra (matrix-vector products).
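A minimal sketch of what that means, with placeholder names A and v:
A = np.random.random((3, 5))   # 3 samples with 5 features each
v = np.random.random(5)        # a weight vector of length 5
A.dot(v)                       # matrix-vector product: one number per sample
A.dot(v).shape                 # (3,)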
A broadcasting trick:
%matplotlib inline
import pylab as plt
x = np.linspace(1, 12, 100)
y = x[:, np.newaxis]
im = x * y
im.shape
(100, 100)
plt.imshow(im)
plt.colorbar()
<matplotlib.colorbar.Colorbar at 0x106e0fd68>
im[-1,-1]
144.0
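What happens above: x has shape (100,), y has shape (100, 1), and the product broadcasts both to (100, 100); it is just the outer product of x with itself. A quick check:
np.allclose(im, np.outer(x, x))   # True: broadcasting computed the outer product
x[-1] * x[-1]                     # 12 * 12 = 144, the bottom-right pixel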
Supervised vs unsupervised: "your data is labeled" vs "your data is not labeled"
In sklearn:
Example: the iris dataset, flowers to classify. What are the samples? The features? The labels (note that they are numerical)?
from sklearn.datasets import load_iris
iris = load_iris()
iris.keys()
dict_keys(['data', 'DESCR', 'feature_names', 'target_names', 'target'])
n_samples, n_features = iris.data.shape
n_samples
150
n_features
4
iris.data[0]
array([ 5.1, 3.5, 1.4, 0.2])
iris.target
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
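The targets are plain integers; the mapping to names and the class balance can be checked directly (a quick sketch):
iris.target_names          # ['setosa', 'versicolor', 'virginica']
np.bincount(iris.target)   # 50 samples of each class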
def plot_iris_projection(x_index, y_index):
    # label the colorbar ticks with the first 4 letters of the species name
    formatter = plt.FuncFormatter(lambda i, *args: iris.target_names[int(i)][:4])
    plt.scatter(iris.data[:, x_index], iris.data[:, y_index], c=iris.target)
    plt.colorbar(ticks=[0, 1, 2], format=formatter)
    plt.xlabel(iris.feature_names[x_index])
    plt.ylabel(iris.feature_names[y_index])
plot_iris_projection(0, 1)
from IPython.html.widgets import interact
def plot_iris_projection_widget(first_index, second_index):
    plot_iris_projection(first_index, second_index)

interact(plot_iris_projection_widget,
         first_index=(0, 3),
         second_index=(0, 3))
<function __main__.plot_iris_projection_widget>
plt.figure(figsize=(10, 10))
for i in range(4):
    for j in range(4):
        plt.subplot(4, 4, 4 * i + j + 1)
        plot_iris_projection(i, j)
plt.tight_layout()
Setosa is always a little bit apart. Virginica and versicolor always overlap a little bit; humans and ML algorithms alike will have problems with that.
Unsupervised algorithms are the ones that would cluster these points.
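A minimal sketch of what that could look like, using KMeans with 3 clusters (the cluster numbers are arbitrary and need not match the true labels):
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3)
clusters = kmeans.fit_predict(iris.data)   # labels found without ever looking at iris.target
clusters[:10]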
Datasets come in three families:
datasets.load_* (small datasets shipped with sklearn)
datasets.fetch_* (larger datasets downloaded on first use)
datasets.make_* (synthetic datasets generated on the fly)
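A sketch of how each family is used (fetch_olivetti_faces downloads the data on first use, make_blobs generates random data):
from sklearn import datasets

iris = datasets.load_iris()                  # load_*: small datasets shipped with sklearn
faces = datasets.fetch_olivetti_faces()      # fetch_*: larger datasets downloaded on demand
X, y = datasets.make_blobs(n_samples=100)    # make_*: synthetic data generated on the fly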
digits = sklearn.datasets.load_digits()
digits.keys()
dict_keys(['data', 'DESCR', 'target_names', 'target', 'images'])
n_samples, n_features = digits.data.shape
n_samples
1797
n_features
64
digits.data[0]
array([ 0., 0., 5., 13., 9., 1., 0., 0., 0., 0., 13., 15., 10., 15., 5., 0., 0., 3., 15., 2., 0., 11., 8., 0., 0., 4., 12., 0., 0., 8., 8., 0., 0., 5., 8., 0., 0., 9., 8., 0., 0., 4., 11., 0., 1., 12., 7., 0., 0., 2., 14., 5., 10., 12., 0., 0., 0., 0., 6., 13., 10., 0., 0., 0.])
index = 11
plt.imshow(digits.data[index].reshape((8, 8)), cmap=matplotlib.cm.gray, interpolation='nearest')
print(digits.target_names[digits.target[index]])
1
You can always take the value of each pixel and turn it into a feature.
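That is exactly what digits.data is: each 8x8 image flattened into a 64-dimensional feature vector. A quick check:
digits.images[0].shape                                  # (8, 8)
np.allclose(digits.images[0].ravel(), digits.data[0])   # True: data is just the flattened images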
The digits in this dataset are already preprocessed: in real life, the numbers do not all come nicely centered and scaled. Feature engineering is hard!
There's this really nice magic to load source code: %load. It takes a Python file as argument!
%load generate_readme.py
import os
import sys

if sys.version_info >= (3, 0):
    from urllib.parse import quote
else:
    from urllib import quote


def filename2url(filename):
    return "http://nbviewer.ipython.org/urls/raw.github.com/flothesof/posts/master/{0}".format(
        quote(filename))


if __name__ == "__main__":
    files = os.listdir(os.getcwd())
    files = filter(lambda s: s.endswith('.ipynb'), files)

    header = """posts
=====
This is a sort of blog / work in progress repository for interesting projects that pop into my mind.
![files/xkcd_departments.png](files/xkcd_departments.png)
"""

    with open('README.md', 'w') as f:
        f.write(header)
        for index, filename in enumerate(files):
            f.write("- [%i-%s](%s)\n" % (index + 1,
                                         filename,
                                         filename2url(filename)))
Olivetti faces dataset: you can do PCA on them and get eigenfaces!!! I want to do that!!!
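A minimal sketch of what that could look like (the number of components is an arbitrary choice):
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA

faces = fetch_olivetti_faces()
pca = PCA(n_components=16)
pca.fit(faces.data)
eigenfaces = pca.components_.reshape((16, 64, 64))   # each principal component is an "eigenface"
plt.imshow(eigenfaces[0], cmap=matplotlib.cm.gray)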
Supervised learning: estimators have a "predict" method.
Unsupervised learning: estimators have a "transform" method.
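A sketch of that shared API on the iris data (KNeighborsClassifier and PCA are just example estimators):
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA

# supervised: fit on data and labels, then predict
clf = KNeighborsClassifier()
clf.fit(iris.data, iris.target)
clf.predict(iris.data[:5])

# unsupervised: fit on data only, then transform
pca = PCA(n_components=2)
pca.fit(iris.data)
pca.transform(iris.data[:5])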
Questions: