$\newcommand{\vec}[1]{\boldsymbol{#1}}$
If you are using nbviewer you can change to slides mode by clicking on the icon:
Programs with parameters that automatically adjust by adapting to previously seen data.
Work in machine learning:<br/> https://stackoverflow.com/insights/survey/2017#salary
</div>
<div class="col-md-3">
<img src="https://upload.wikimedia.org/wikipedia/commons/4/40/Raimundus_Lullus_alchemic_page.jpg"/>
<small>Alchemic treatise of [Ramon Llull](https://en.wikipedia.org/wiki/Ramon_Llull).</small>
</div>
</div>
Intelligent systems find patterns and discover relations that are latent in large volumes of data.
Features of intelligent systems:
Learning is the act of acquiring new knowledge, or modifying and reinforcing existing knowledge, behaviors, skills, values, or preferences, and it may involve synthesizing different types of information.
Example: Evolving cars with genetic algorithms: http://www.boxcar2d.com/.
More formally, machine learning can be described as:
Note: Generally, $\mathcal{D}\subseteq\mathbb{R}^n$; the definition of $\mathcal{I}$ depends on the problem.
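A minimal way to make this concrete (a sketch, using the same symbols $\mathcal{M}$, $\vec{\theta}$ and $\Psi$ that appear in the training criterion later in these slides): a model with adjustable parameters maps inputs to outputs,

$$\mathcal{M}:\mathcal{D}\rightarrow\mathcal{I}\,,\qquad \hat{\vec{y}}=\mathcal{M}\left(\vec{x};\vec{\theta}\right)\,,$$

and learning amounts to adjusting $\vec{\theta}$ using a dataset $\Psi$ of previously observed data.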
Many more: time-series analysis, anomaly detection, imputation, transcription, etc.
An example of a supervised problem (regression)
import random
import numpy as np
import matplotlib.pyplot as plt
# plt.rc('text', usetex=True); plt.rc('font', family='serif')
# plt.rc('text.latex', preamble='\\usepackage{libertine}\n\\usepackage[utf8]{inputenc}')
# numpy - pretty matrix
np.set_printoptions(precision=3, threshold=1000, edgeitems=5, linewidth=80, suppress=True)
import seaborn
seaborn.set(style='whitegrid')
seaborn.set_context('talk')
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
# Fixed seed to make the results replicable - remove in real life!
random.seed(42)
np.random.seed(42)  # np.random.rand below draws from NumPy's generator, not random's
x = np.arange(100)
Let's suppose that we have a phenomenon such that $$y_\text{real} = \sin\left(\frac{\pi x}{50}\right)\,.$$
y_real = np.sin(x*np.pi/50)
Introducing some uniform random noise to simulate measurement noise:
y_measured = y_real + (np.random.rand(100) - 0.5)
plt.scatter(x,y_measured, marker='.', color='b', label='measured')
plt.plot(x,y_real, color='r', label='real')
plt.xlabel('x'); plt.ylabel('y'); plt.legend(frameon=True);
We can now learn from the dataset $\Psi=\left\{\left(x_i, y_{\text{measured},i}\right)\right\}$.
scikit-learn
Training (adjusting) the SVR
from sklearn.svm import SVR
clf = SVR() # using default parameters
clf.fit(x.reshape(-1, 1), y_measured)
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto', kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
We can now see how our SVR models the data.
y_pred = clf.predict(x.reshape(-1, 1))
plt.scatter(x, y_measured, marker='.', color='blue', label='measured')
plt.plot(x, y_pred, 'g--', label='predicted')
plt.xlabel('X'); plt.ylabel('y'); plt.legend(frameon=True);
We observe for the first time an important negative phenomenon: overfitting.
We will be dedicating part of the course to the methods we have for controlling overfitting.
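A quick way to make overfitting visible (a sketch; it regenerates the noisy sine data from the cells above and deliberately uses an overfitting-prone `gamma`) is to hold out part of the data and compare training and test errors:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

np.random.seed(42)
x = np.arange(100)
y = np.sin(x * np.pi / 50) + (np.random.rand(100) - 0.5)

# hold out 30% of the data as a test set
x_train, x_test, y_train, y_test = train_test_split(
    x.reshape(-1, 1), y, test_size=0.3, random_state=42)

clf = SVR(C=1e3, gamma=10)  # gamma deliberately too large -> overfitting
clf.fit(x_train, y_train)

mse_train = mean_squared_error(y_train, clf.predict(x_train))
mse_test = mean_squared_error(y_test, clf.predict(x_test))
# a large gap between train and test error is the telltale sign of overfitting
```

A model that generalizes well has similar errors on both sets; when the training error is far below the test error, the model has memorized the noise.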
clf = SVR(C=1e3, gamma=0.0001)
clf.fit(x.reshape(-1, 1), y_measured)
SVR(C=1000.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma=0.0001, kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
y_pred_ok = clf.predict(x.reshape(-1, 1))
plt.scatter(x, y_measured, marker='.', color='b', label='measured')
plt.plot(x, y_pred, 'g--', label='overfitted')
plt.plot(x, y_pred_ok, 'm-', label='not overfitted')
plt.xlabel('X'); plt.ylabel('y'); plt.legend(frameon=True);
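Instead of hand-picking `C` and `gamma` as above, these hyperparameters are usually selected by a cross-validated search. A sketch with `GridSearchCV` on the same kind of data (the grid values are illustrative, not prescriptive):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

np.random.seed(42)
x = np.arange(100)
y = np.sin(x * np.pi / 50) + (np.random.rand(100) - 0.5)

# illustrative grid; in practice it usually spans several orders of magnitude
param_grid = {'C': [1, 10, 100, 1000], 'gamma': [1e-4, 1e-3, 1e-2]}
search = GridSearchCV(SVR(), param_grid, cv=5)
search.fit(x.reshape(-1, 1), y)
print(search.best_params_)
```

Each combination is scored by 5-fold cross-validation, and `search.best_params_` holds the winner; the refitted best model is then available as `search.best_estimator_`.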
In some cases we can just observe a series of items or values, e.g., $\Psi=\left\{\vec{x}_i\right\}$:
It is necessary to find the hidden structure of unlabeled data.
We need a measure of the correctness of the model that does not require an expected outcome.
Although it may look a bit awkward at first glance, this type of problem is very common.
Let's generate a dataset composed of three groups or clusters of elements, $\vec{x}\in\mathbb{R}^2$.
x_1 = np.random.randn(30,2) + (5,5)
x_2 = np.random.randn(30,2) + (10,0)
x_3 = np.random.randn(30,2) + (0,2)
plt.scatter(x_1[:,0], x_1[:,1], c='red', label='Cluster 1', alpha=0.74)
plt.scatter(x_2[:,0], x_2[:,1], c='blue', label='Cluster 2', alpha=0.74)
plt.scatter(x_3[:,0], x_3[:,1], c='green', label='Cluster 3', alpha=0.74)
plt.legend(frameon=True); plt.xlabel('$x_1$'); plt.ylabel('$x_2$');
plt.title('Three clusters');
Preparing the training dataset.
x = np.concatenate(( x_1, x_2, x_3), axis=0)
x.shape
(90, 2)
plt.scatter(x[:,0], x[:,1], c='m', alpha=0.74)
plt.title('Training dataset');
We can now try to learn what clusters are in the dataset. We are going to use the $k$-means clustering algorithm.
from sklearn.cluster import KMeans
clus = KMeans(n_clusters=3)
clus.fit(x)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300, n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001, verbose=0)
labels_pred = clus.predict(x)
print(labels_pred)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 2 2 2]
cm = iter(plt.cm.Set1(np.linspace(0, 1, len(np.unique(labels_pred)))))
for label in np.unique(labels_pred):
    plt.scatter(x[labels_pred == label][:, 0], x[labels_pred == label][:, 1],
                c=next(cm), alpha=0.74, label='Pred. cluster ' + str(label + 1))
plt.legend(loc='upper right', bbox_to_anchor=(1.45, 1), frameon=True);
plt.xlabel('$x_1$'); plt.ylabel('$x_2$'); plt.title('Clusters predicted');
Having to fix the number of clusters in advance can lead to problems.
clus = KMeans(n_clusters=10)
clus.fit(x)
labels_pred = clus.predict(x)
cm = iter(plt.cm.Set1(np.linspace(0, 1, len(np.unique(labels_pred)))))
for label in np.unique(labels_pred):
    plt.scatter(x[labels_pred == label][:, 0], x[labels_pred == label][:, 1],
                c=next(cm), alpha=0.74, label='Pred. cluster ' + str(label + 1))
plt.legend(loc='upper right', bbox_to_anchor=(1.45, 1), frameon=True)
plt.xlabel('$x_1$'); plt.ylabel('$x_2$'); plt.title('Ten clusters predicted');
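One common heuristic for choosing $k$ without labels is the silhouette score, which measures how well each point sits inside its assigned cluster. A sketch (it regenerates the three-cluster dataset from above and tries several values of $k$):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

np.random.seed(42)
x = np.concatenate((np.random.randn(30, 2) + (5, 5),
                    np.random.randn(30, 2) + (10, 0),
                    np.random.randn(30, 2) + (0, 2)), axis=0)

# fit k-means for several candidate k and record the silhouette score of each
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(x)
    scores[k] = silhouette_score(x, labels)

best_k = max(scores, key=scores.get)  # k with the highest silhouette score
print(best_k, scores[best_k])
```

For well-separated clusters like these, the silhouette score should peak at the true number of groups; for overlapping clusters it is only a heuristic, not a guarantee.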
Define a training criterion.
$$\min_{\theta}\ \mathbf{E}_{\Psi}[\ell(\mathcal{M}(\vec{x};\vec{\theta}), \vec{y})].$$
This is a cornerstone issue of machine learning, and we will be coming back to it.
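For a concrete instance of this criterion: with squared loss $\ell$ and a linear model $\mathcal{M}(x;\vec{\theta})=\theta_0 x+\theta_1$, the expectation over $\Psi$ becomes an average over the training set, which can be minimized by gradient descent. A sketch on synthetic data (the linear model and the learning rate are illustrative choices, not the only option):

```python
import numpy as np

np.random.seed(42)
x = np.random.rand(100)
y = 2.0 * x + 1.0 + 0.1 * np.random.randn(100)  # synthetic linear data

theta = np.zeros(2)  # parameters of M(x; theta) = theta[0]*x + theta[1]
lr = 0.1             # learning rate
for _ in range(2000):
    y_hat = theta[0] * x + theta[1]
    # gradient of the empirical risk E_Psi[(M(x; theta) - y)^2] w.r.t. theta
    grad = np.array([np.mean(2 * (y_hat - y) * x),
                     np.mean(2 * (y_hat - y))])
    theta -= lr * grad
# theta now approximates the true slope 2 and intercept 1
```

The loop repeatedly moves $\vec{\theta}$ against the gradient of the averaged loss, which is exactly the minimization $\min_{\theta}\mathbf{E}_{\Psi}[\ell(\mathcal{M}(\vec{x};\vec{\theta}),\vec{y})]$ stated above.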
The machine learning flowchart
%load_ext version_information
%version_information scipy, numpy, matplotlib
Software | Version |
---|---|
Python | 3.6.2 64bit [GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] |
IPython | 6.1.0 |
OS | Darwin 17.0.0 x86_64 i386 64bit |
scipy | 0.19.1 |
numpy | 1.13.1 |
matplotlib | 2.0.2 |
Tue Aug 22 09:03:13 2017 -03 |
# this code is here for cosmetic reasons
from IPython.core.display import HTML
from urllib.request import urlopen
HTML(urlopen('https://raw.githubusercontent.com/lmarti/jupyter_custom/master/custom.include').read().decode('utf-8'))