$\newcommand{\vec}[1]{\boldsymbol{#1}}$
Programs with parameters that are automatically adjusted by learning from previously seen data.
Work in machine learning: https://stackoverflow.com/insights/survey/2017#salary

![Alchemic treatise of Ramon Llull](https://upload.wikimedia.org/wikipedia/commons/4/40/Raimundus_Lullus_alchemic_page.jpg)

<small>Alchemic treatise of [Ramon Llull](https://en.wikipedia.org/wiki/Ramon_Llull).</small>
Intelligent systems find patterns and discover relations that are latent in large volumes of data.
Features of intelligent systems:
Learning is the act of acquiring new, or modifying and reinforcing, existing knowledge, behaviors, skills, values, or preferences and may involve synthesizing different types of information.
Example: Evolving cars with genetic algorithms: http://www.boxcar2d.com/.
More formally, machine learning can be described as:
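One compact way to state it, reusing the notation of the training criterion given at the end of this section, is the following sketch: learning means finding the parameters $\vec{\theta}$ of a model $\mathcal{M}(\cdot;\vec{\theta}):\mathcal{D}\rightarrow\mathcal{I}$ that minimize the expected loss over a dataset $\Psi$,
$$\vec{\theta}^{*}=\arg\min_{\vec{\theta}}\ \mathbf{E}_{\Psi}\left[\ell\left(\mathcal{M}(\vec{x};\vec{\theta}),\vec{y}\right)\right].$$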
Note: Generally, $\mathcal{D}\subseteq\mathbb{R}^n$; the definition of $\mathcal{I}$ depends on the problem.
Many more: time-series analysis, anomaly detection, imputation, transcription, etc.
An example of a supervised problem (regression)
import random
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(100)  # sample points 0, 1, ..., 99
Let's suppose that we have a phenomenon such that $$y_\text{real} = \sin\left(\frac{\pi x}{50}\right)\,.$$
y_real = np.sin(x*np.pi/50)
Introducing some uniform random noise to simulate measurement noise:
y_measured = y_real + (np.random.rand(100) - 0.5)  # uniform noise in [-0.5, 0.5)
plt.scatter(x,y_measured, marker='.', color='b', label='measured')
plt.plot(x,y_real, color='r', label='real')
plt.xlabel('x'); plt.ylabel('y'); plt.legend(frameon=True);
We can now learn from the dataset $\Psi=\left\{x, y_\text{measured}\right\}$.
scikit-learn
Training (adjusting) a support vector regression (SVR) model
from sklearn.svm import SVR
clf = SVR() # using default parameters
clf.fit(x.reshape(-1, 1), y_measured)
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto', kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
We can now see how our SVR models the data.
y_pred = clf.predict(x.reshape(-1, 1))
plt.scatter(x, y_measured, marker='.', color='blue', label='measured')
plt.plot(x, y_pred, 'g--', label='predicted')
plt.xlabel('X'); plt.ylabel('y'); plt.legend(frameon=True);
We observe for the first time an important negative phenomenon: overfitting.
We will dedicate part of the course to the methods available for controlling overfitting. For now, let's hand-tune the SVR hyperparameters $C$ and $\gamma$ to obtain a smoother fit:
clf = SVR(C=1e3, gamma=0.0001)
clf.fit(x.reshape(-1, 1), y_measured)
SVR(C=1000.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma=0.0001, kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
y_pred_ok = clf.predict(x.reshape(-1, 1))
plt.scatter(x, y_measured, marker='.', color='b', label='measured')
plt.plot(x, y_pred, 'g--', label='overfitted')
plt.plot(x, y_pred_ok, 'm-', label='not overfitted')
plt.xlabel('X'); plt.ylabel('y'); plt.legend(frameon=True);
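How were $C$ and $\gamma$ chosen? Usually by trying candidate values and scoring each one on held-out data, which penalizes overfitted models. A minimal sketch using scikit-learn's `GridSearchCV` (the grid values below are illustrative assumptions, not the ones used above):

from sklearn.model_selection import GridSearchCV
# Candidate hyperparameter values (illustrative choices).
param_grid = {'C': [1, 10, 100, 1000],
              'gamma': [1e-4, 1e-3, 1e-2, 1e-1]}
# 5-fold cross-validation: each candidate is fitted on part of the data
# and scored on the held-out folds.
search = GridSearchCV(SVR(), param_grid, cv=5)
search.fit(x.reshape(-1, 1), y_measured)
print(search.best_params_)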
In some cases we can just observe a series of items or values, e.g., $\Psi=\left\{\vec{x}_i\right\}$:
It is necessary to find the hidden structure of unlabeled data.
We need a measure of correctness of the model that does not require an expected outcome.
Although it may look a bit awkward at first glance, this type of problem is very common.
Let's generate a dataset composed of three groups or clusters of elements, $\vec{x}\in\mathbb{R}^2$.
x_1 = np.random.randn(30,2) + (5,5)   # 30 points around (5, 5)
x_2 = np.random.randn(30,2) + (10,0)  # 30 points around (10, 0)
x_3 = np.random.randn(30,2) + (0,2)   # 30 points around (0, 2)
plt.scatter(x_1[:,0], x_1[:,1], c='red', label='Cluster 1', alpha=0.74)
plt.scatter(x_2[:,0], x_2[:,1], c='blue', label='Cluster 2', alpha=0.74)
plt.scatter(x_3[:,0], x_3[:,1], c='green', label='Cluster 3', alpha=0.74)
plt.legend(frameon=True); plt.xlabel('$x_1$'); plt.ylabel('$x_2$');
plt.title('Three datasets');
Preparing the training dataset.
x = np.concatenate(( x_1, x_2, x_3), axis=0)
x.shape
(90, 2)
plt.scatter(x[:,0], x[:,1], c='m', alpha=0.74)
plt.title('Training dataset');
We can now try to learn what clusters are in the dataset. We are going to use the $k$-means clustering algorithm.
from sklearn.cluster import KMeans
clus = KMeans(n_clusters=3)
clus.fit(x)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300, n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001, verbose=0)
labels_pred = clus.predict(x)
print(labels_pred)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 2 2 2]
cm=iter(plt.cm.Set1(np.linspace(0,1,len(np.unique(labels_pred)))))
for label in np.unique(labels_pred):
    plt.scatter(x[labels_pred==label][:,0], x[labels_pred==label][:,1],
                c=next(cm), alpha=0.74, label='Pred. cluster ' + str(label+1))
plt.legend(loc='upper right', bbox_to_anchor=(1.45,1), frameon=True);
plt.xlabel('$x_1$'); plt.ylabel('$x_2$'); plt.title('Clusters predicted');
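The fitted estimator also exposes the learned centroids through its `cluster_centers_` attribute; a quick sketch overlaying them on the data:

# Re-draw the points and mark each learned centroid with a black cross.
plt.scatter(x[:,0], x[:,1], c='m', alpha=0.3)
plt.scatter(clus.cluster_centers_[:,0], clus.cluster_centers_[:,1],
            c='k', marker='x', s=100, label='Centroids')
plt.legend(frameon=True); plt.xlabel('$x_1$'); plt.ylabel('$x_2$');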
Having to set the number of clusters in advance can lead to problems.
clus = KMeans(n_clusters=10)
clus.fit(x)
labels_pred = clus.predict(x)
cm=iter(plt.cm.Set1(np.linspace(0,1,len(np.unique(labels_pred)))))
for label in np.unique(labels_pred):
    plt.scatter(x[labels_pred==label][:,0], x[labels_pred==label][:,1],
                c=next(cm), alpha=0.74, label='Pred. cluster ' + str(label+1))
plt.legend(loc='upper right', bbox_to_anchor=(1.45,1), frameon=True)
plt.xlabel('$x_1$'); plt.ylabel('$x_2$'); plt.title('Ten clusters predicted');
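One common remedy, which also illustrates the correctness measures without expected outcomes mentioned above, is an internal validity index such as the silhouette coefficient. A minimal sketch that scores several candidate values of $k$:

from sklearn.metrics import silhouette_score
# Fit k-means for each candidate k and report the mean silhouette
# coefficient (range [-1, 1]; higher means better-separated clusters).
for k in range(2, 11):
    labels_k = KMeans(n_clusters=k).fit_predict(x)
    print(k, silhouette_score(x, labels_k))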
Define a training criterion; a common one is to minimize the expected loss of the model over the dataset:
$$\min_{\vec{\theta}}\ \mathbf{E}_{\Psi}[\ell(\mathcal{M}(\vec{x};\vec{\theta}), \vec{y})].$$
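As an illustration, with the squared loss $\ell(\hat{y},y)=(\hat{y}-y)^2$ the expectation is approximated by the sample mean over $\Psi$. A minimal sketch reusing the earlier sine example (the variable names are new, chosen only to keep the cell self-contained):

# Empirical risk: mean squared loss of a fitted SVR over the dataset.
x_reg = np.arange(100)
y_reg = np.sin(x_reg*np.pi/50) + (np.random.rand(100) - 0.5)
model = SVR(C=1e3, gamma=1e-4).fit(x_reg.reshape(-1, 1), y_reg)
print(np.mean((model.predict(x_reg.reshape(-1, 1)) - y_reg)**2))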
This is a cornerstone issue of machine learning and we will be coming back to it.
The machine learning flowchart