$\newcommand{\vec}[1]{\boldsymbol{#1}}$

Programs with **parameters** that **automatically** **adjust** by **adapting** to previously seen **data**.

- Machine learning can be considered a subfield of **artificial intelligence**...
- ...since those algorithms can be seen as building blocks to make computers learn to behave more intelligently.
- **Generalize** instead of just storing and retrieving data items like a database system would do.

- Data is more abundant, and less expensive, than knowledge.
- Professionals from various areas of industry work on a particular philosopher's stone:

Work in machine learning:

https://stackoverflow.com/insights/survey/2017#salary

Alchemic treatise of [Ramon Llull](https://en.wikipedia.org/wiki/Ramon_Llull).

**Intelligent systems** find patterns and discover relations that are latent in large volumes of data.

Features of intelligent systems:

- Learning
- Adaptation
- Flexibility and robustness
- Provide explanations
- Discovery/creativity

Learning is the act of acquiring new, or modifying and reinforcing existing, knowledge, behaviors, skills, values, or preferences, and may involve synthesizing different types of information.

- Construction and study of systems that can learn from data.

- The environment/real world is in constant change.
- The capacity to adapt implies being able to modify what has been learned in order to cope with those changes.
- There are many real-world cases:
  - Changes in the economy
  - Wear of the mechanical parts of a robot

- In many instances the capacity to adapt is essential to solve the problem $\rightarrow$ *continuous learning*.

- It is required to have a robust and consistent system.
- Similar inputs should generate consistent outputs.

- Self-organization
- 'Classical' approaches based on Boolean algebra and logic have limited flexibility.

- Explanations are necessary to validate and find directions for improvement.
- It is not enough to automate the decision-making process.
- In many contexts explanations are necessary: medicine, credit evaluation, etc.

- They are important if a human expert takes part in the decision loop.
- Machine learning can become a research tool.

- Capacity to discover previously unknown processes and/or relations.
- Creation of solutions and artifacts.

Example: Evolving cars with genetic algorithms: http://www.boxcar2d.com/.

More formally, machine learning can be described as:

- Having a **process** $\vec{F}:\mathcal{D}\rightarrow\mathcal{I}$ that **transforms** a given $\vec{x}\in\mathcal{D}$ into a $\vec{y}\in\mathcal{I}$.
- Construct a dataset $\Psi=\left\{\left<\vec{x}_i,\vec{y}_i\right>\right\}$ with $i=1,\ldots,N$.
  - Each $\left<\vec{x}_i,\vec{y}_i\right>$ represents an **input** and its corresponding **expected output**: $\vec{y}_i=\vec{F}\left(\vec{x}_i\right)$.
- **Optimize** a **model** $\mathcal{M}(\vec{x};\vec{\theta})$ by adjusting its parameters $\vec{\theta}$.
  - Make $\mathcal{M}()$ as similar as possible to $\vec{F}()$ by optimizing one or more error (loss) functions.

*Note*: Generally, $\mathcal{D}\subseteq\mathbb{R}^n$; the definition of $\mathcal{I}$ depends on the problem.
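The definitions above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the toy process `F`, the linear form of `model`, and the mean squared error in `mse_loss` are hypothetical choices, not part of the original material.

```python
import numpy as np

def F(x):
    """Toy 'real' process, assumed here only for illustration."""
    return 3.0 * x + 1.0

def model(x, theta):
    """Linear model M(x; theta) = theta[0] + theta[1] * x."""
    return theta[0] + theta[1] * x

def mse_loss(y_hat, y):
    """Mean squared error between model outputs and expected outputs."""
    return float(np.mean((y_hat - y) ** 2))

# Dataset Psi = {<x_i, y_i>} with y_i = F(x_i), i = 1..N
x = np.linspace(0.0, 1.0, 20)
y = F(x)

theta_bad = np.array([0.0, 0.0])    # parameters far from the process
theta_good = np.array([1.0, 3.0])   # parameters that reproduce F

loss_bad = mse_loss(model(x, theta_bad), y)
loss_good = mse_loss(model(x, theta_good), y)
print(loss_bad, loss_good)
```

The loss gives a single number per choice of $\vec{\theta}$, which is what makes it possible to compare two parameter settings and, later, to optimize over them.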

- *Classification*: $\vec{F}: \mathcal{D}\rightarrow\left\{1,\ldots, k\right\}$; $\vec{F}(\cdot)$ defines 'category' or 'class' labels.
- *Regression*: $\vec{F}: \mathbb{R}^n\rightarrow\mathbb{R}$; it is necessary to predict a real-valued output instead of categories.
- *Density estimation*: predict a function $p_\mathrm{model}: \mathbb{R}^n\rightarrow\mathbb{R}$, where $p_\mathrm{model}(\vec{x})$ can be interpreted as a probability density function on the set that the examples were drawn from.
- *Clustering*: group a set of objects in such a way that objects in the same group (*cluster*) are more *similar* to each other than to those in other groups (clusters).
- *Synthesis*: generate new examples that are similar to those in the training data.

Many more: time-series analysis, anomaly detection, imputation, transcription, etc.

- Sometimes we can observe the pairs $\left<\vec{x}_i,\vec{y}_i\right>$:
  - We can use the $\vec{y}_i$'s to provide *scalar feedback* on how good the model $\mathcal{M}(\vec{x};\vec{\theta})$ is.
  - That feedback is known as the *loss function*.
  - Modify the parameters $\vec{\theta}$ so as to improve $\mathcal{M}(\vec{x};\vec{\theta})$ $\rightarrow$ *learning*.


An example of a supervised problem (regression)

In [1]:

```
import random
import numpy as np
import matplotlib.pyplot as plt
```

In [4]:

```
x = np.arange(100)
```

Let's suppose that we have a phenomenon such that $$y_\text{real} = \sin\left(\frac{\pi x}{50}\right)\,.$$

In [5]:

```
y_real = np.sin(x*np.pi/50)
```

Introducing some uniform random noise to simulate measurement noise:

In [6]:

```
y_measured = y_real + (np.random.rand(100) - 0.5)  # uniform noise in [-0.5, 0.5)
```

In [7]:

```
plt.scatter(x,y_measured, marker='.', color='b', label='measured')
plt.plot(x,y_real, color='r', label='real')
plt.xlabel('x'); plt.ylabel('y'); plt.legend(frameon=True);
```

We can now learn from the dataset $\Psi=\left\{x, y_\text{measured}\right\}$.

- We are going to use a support vector regressor from `scikit-learn`.
- Don't get too excited, you will have to program things 'by hand'.

Training (adjusting) SVR

In [8]:

```
from sklearn.svm import SVR
```

In [9]:

```
clf = SVR(gamma='auto')  # defaults of the release that produced Out[9]; newer scikit-learn releases default to gamma='scale'
clf.fit(x.reshape(-1, 1), y_measured)
```

Out[9]:

SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto', kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

We can now see how our SVR models the data.

In [10]:

```
y_pred = clf.predict(x.reshape(-1, 1))
```

In [11]:

```
plt.scatter(x, y_measured, marker='.', color='blue', label='measured')
plt.plot(x, y_pred, 'g--', label='predicted')
plt.xlabel('X'); plt.ylabel('y'); plt.legend(frameon=True);
```

We observe for the first time an important negative phenomenon: *overfitting*.

We will dedicate part of the course to the methods available for controlling overfitting.

In [12]:

```
clf = SVR(C=1e3, gamma=0.0001)
clf.fit(x.reshape(-1, 1), y_measured)
```

Out[12]:

SVR(C=1000.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma=0.0001, kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

In [13]:

```
y_pred_ok = clf.predict(x.reshape(-1, 1))
```

In [14]:

```
plt.scatter(x, y_measured, marker='.', color='b', label='measured')
plt.plot(x, y_pred, 'g--', label='overfitted')
plt.plot(x, y_pred_ok, 'm-', label='not overfitted')
plt.xlabel('X'); plt.ylabel('y'); plt.legend(frameon=True);
```
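The difference between the two fits can also be put into numbers. The following is a hedged sketch rather than part of the original notebook: it regenerates the data with a fixed seed (an assumption made for repeatability), passes `gamma='auto'` explicitly to mimic the first model, and uses a hypothetical helper `mse` to compare each model against both the noisy measurements and the noise-free signal.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)  # fixed seed, assumed here for repeatability

x = np.arange(100)
y_real = np.sin(x * np.pi / 50)
y_measured = y_real + (rng.random(100) - 0.5)   # uniform noise in [-0.5, 0.5)

X = x.reshape(-1, 1)
overfitted = SVR(gamma='auto').fit(X, y_measured)        # gamma = 1 / n_features
smoother = SVR(C=1e3, gamma=0.0001).fit(X, y_measured)   # parameters of the second fit

def mse(a, b):
    """Mean squared error between two arrays of the same shape."""
    return float(np.mean((a - b) ** 2))

for name, m in [('overfitted', overfitted), ('not overfitted', smoother)]:
    y_hat = m.predict(X)
    print(name,
          '| error vs. measured:', mse(y_hat, y_measured),
          '| error vs. real:', mse(y_hat, y_real))
```

A low error against the measurements combined with a higher error against the real signal is the usual signature of a model that has fitted the noise.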

In some cases we can just observe a series of items or values, e.g., $\Psi=\left\{\vec{x}_i\right\}$:

- It is necessary to find the *hidden structure* of *unlabeled data*.
- We need a measure of correctness of the model that does not require an expected outcome.
- Although, at first glance, it may look a bit awkward, this type of problem is very common.

- Related to anomaly detection, clustering, etc.

Let's generate a dataset that is composed of three groups or clusters of elements, $\vec{x}\in\mathbb{R}^2$.

In [15]:

```
x_1 = np.random.randn(30,2) + (5,5)
x_2 = np.random.randn(30,2) + (10,0)
x_3 = np.random.randn(30,2) + (0,2)
```

In [16]:

```
plt.scatter(x_1[:,0], x_1[:,1], c='red', label='Cluster 1', alpha =0.74)
plt.scatter(x_2[:,0], x_2[:,1], c='blue', label='Cluster 2', alpha =0.74)
plt.scatter(x_3[:,0], x_3[:,1], c='green', label='Cluster 3', alpha =0.74)
plt.legend(frameon=True); plt.xlabel('$x_1$'); plt.ylabel('$x_2$');
plt.title('Three datasets');
```

Preparing the training dataset.

In [17]:

```
x = np.concatenate(( x_1, x_2, x_3), axis=0)
x.shape
```

Out[17]:

(90, 2)

In [18]:

```
plt.scatter(x[:,0], x[:,1], c='m', alpha =0.74)
plt.title('Training dataset');
```

We can now try to learn what clusters are in the dataset. We are going to use the $k$-means clustering algorithm.
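For intuition about what the algorithm does, here is a minimal 'by hand' sketch of the classical Lloyd iteration in NumPy. It is an illustration only: the function names `assign` and `k_means`, the random initialization, and the fixed seeds are assumptions, and this is not how `scikit-learn` implements it (for instance, it does not use the `k-means++` initialization).

```python
import numpy as np

def assign(x, centers):
    """Index of the nearest centroid (squared Euclidean distance) per point."""
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def k_means(x, k, n_iter=100, seed=0):
    """Plain Lloyd iteration: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Initialize the centroids with k distinct data points chosen at random.
    centers = x[rng.choice(len(x), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = assign(x, centers)
        new_centers = centers.copy()
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:      # keep the old centroid if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break                     # centroids stopped moving
        centers = new_centers
    return centers, assign(x, centers)

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(size=(30, 2)) + (5, 5),
                       rng.normal(size=(30, 2)) + (10, 0),
                       rng.normal(size=(30, 2)) + (0, 2)])
centers, labels = k_means(data, k=3)
print(centers)
```

Note that no expected output is used anywhere: the only feedback is the distance from each point to its centroid.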

In [19]:

```
from sklearn.cluster import KMeans
```

In [20]:

```
clus = KMeans(n_clusters=3)
clus.fit(x)
```

Out[20]:

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300, n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001, verbose=0)

In [21]:

```
labels_pred = clus.predict(x)
print(labels_pred)
```

In [22]:

```
cm = iter(plt.cm.Set1(np.linspace(0, 1, len(np.unique(labels_pred)))))
for label in np.unique(labels_pred):
    # plot each predicted cluster with its own color
    plt.scatter(x[labels_pred==label][:,0], x[labels_pred==label][:,1],
                color=next(cm), alpha=0.74, label='Pred. cluster ' + str(label+1))
plt.legend(loc='upper right', bbox_to_anchor=(1.45,1), frameon=True);
plt.xlabel('$x_1$'); plt.ylabel('$x_2$'); plt.title('Clusters predicted');
```

Needing to set the number of clusters can lead to problems.

In [23]:

```
clus = KMeans(n_clusters=10)
clus.fit(x)
labels_pred = clus.predict(x)
```

In [24]:

```
cm = iter(plt.cm.Set1(np.linspace(0, 1, len(np.unique(labels_pred)))))
for label in np.unique(labels_pred):
    # plot each predicted cluster with its own color
    plt.scatter(x[labels_pred==label][:,0], x[labels_pred==label][:,1],
                color=next(cm), alpha=0.74, label='Pred. cluster ' + str(label+1))
plt.legend(loc='upper right', bbox_to_anchor=(1.45,1), frameon=True)
plt.xlabel('$x_1$'); plt.ylabel('$x_2$'); plt.title('Ten clusters predicted');
```
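One common heuristic for choosing the number of clusters, sketched here as an assumption rather than as the course's prescribed method, is to fit the model for several values of $k$ and look at how the within-cluster sum of squared distances (the `inertia_` attribute of a fitted `KMeans`) evolves. The data is regenerated with a fixed seed so that the block stands on its own.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)  # fixed seed, assumed for repeatability
x = np.concatenate([rng.normal(size=(30, 2)) + (5, 5),
                    rng.normal(size=(30, 2)) + (10, 0),
                    rng.normal(size=(30, 2)) + (0, 2)])

inertias = {}
for k in range(1, 11):
    clus = KMeans(n_clusters=k, n_init=10, random_state=0).fit(x)
    inertias[k] = clus.inertia_   # sum of squared distances to the closest centroid

for k, value in inertias.items():
    print(k, round(value, 1))
```

The inertia keeps shrinking as $k$ grows, so it cannot be minimized directly; what is usually inspected is the value of $k$ after which the improvement becomes marginal.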

- Obtaining a supervised learning dataset can be expensive.
- Sometimes it can be complemented with a "cheaper" unsupervised learning dataset.
- What if we first learn as much as possible from the unlabeled data and then use the labeled dataset?
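A very small sketch of this idea, under strong simplifying assumptions: two well-separated synthetic classes, a single labeled example per class, and a nearest-centroid rule that is refined with pseudo-labels on the unlabeled points. The data, the variable names, and the choice of nearest-centroid self-training are all illustrative and not taken from the original material.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two classes; only one labeled example per class, the rest are unlabeled.
x_a = rng.normal(size=(50, 2)) + (0, 0)
x_b = rng.normal(size=(50, 2)) + (6, 6)
x_all = np.concatenate([x_a, x_b])
y_true = np.array([0] * 50 + [1] * 50)   # kept only to score the result

labeled_idx = np.array([0, 50])          # indices of the two labeled examples
centroids = x_all[labeled_idx].copy()    # one starting centroid per class

for _ in range(10):
    # Pseudo-label every point with its nearest class centroid...
    d2 = ((x_all[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    y_pseudo = d2.argmin(axis=1)
    y_pseudo[labeled_idx] = y_true[labeled_idx]   # known labels are never overwritten
    # ...then re-estimate each centroid from all points assigned to it.
    centroids = np.array([x_all[y_pseudo == c].mean(axis=0) for c in (0, 1)])

accuracy = float(np.mean(y_pseudo == y_true))
print(accuracy)
```

The two labels only decide which group gets which class name; the position of the centroids is learned almost entirely from unlabeled data.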

- Inspired by behaviorist psychology;
- How to take actions in an environment so as to maximize some notion of cumulative reward?
- Differs from standard supervised learning in that correct input/output pairs are never presented,
- ...nor sub-optimal actions explicitly corrected.
- Involves finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).
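The exploration/exploitation balance can be illustrated with an $\epsilon$-greedy agent on a multi-armed bandit. This is a minimal sketch under assumptions: the reward probabilities in `true_means`, the value of `epsilon`, and the number of steps are hypothetical, and the bandit is a much simpler setting than general reinforcement learning.

```python
import numpy as np

rng = np.random.default_rng(0)

true_means = np.array([0.1, 0.5, 0.9])   # hypothetical reward probabilities, unknown to the agent
n_arms = len(true_means)
epsilon = 0.1                            # probability of exploring at each step
n_steps = 2000

q = np.zeros(n_arms)                     # current estimate of each action's expected reward
counts = np.zeros(n_arms, dtype=int)     # how many times each action was taken
total_reward = 0.0

for _ in range(n_steps):
    if rng.random() < epsilon:
        action = int(rng.integers(n_arms))   # exploration: try a random action
    else:
        action = int(np.argmax(q))           # exploitation: use current knowledge
    # Bernoulli reward: the agent is never told which action was the correct one.
    reward = float(rng.random() < true_means[action])
    counts[action] += 1
    q[action] += (reward - q[action]) / counts[action]   # incremental mean update
    total_reward += reward

print(q, counts, total_reward / n_steps)
```

With `epsilon = 0` the agent could lock onto the first action that ever paid off; with `epsilon = 1` it would never use what it has learned.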

- *A parametrized family of functions* $\mathcal{M}(\vec{x};\vec{\theta})$ describing how the learner will behave on new examples.
  - What output will $\mathcal{M}(\vec{x};\vec{\theta})$ produce given some input $\vec{x}$?

- *A loss function* $\ell()$ describing what scalar loss $\ell(\hat{\vec{y}}, \vec{y})$ is associated with each supervised example $\left<\vec{x},\vec{y}\right>$, as a function of the learner's output $\hat{\vec{y}} = \mathcal{M}(\vec{x};\vec{\theta})$ and the target output $\vec{y}$.
  - Training consists in choosing the parameters $\vec{\theta}$ given some training examples $\Psi=\left\{\left<\vec{x}_i,\vec{y}_i\right>\right\}$ sampled from an unknown data generating distribution $P(X, Y)$.

- Define a *training criterion*.
  - *Ideally*: minimize the expected loss over examples sampled from the unknown data generating distribution.
  - This is not possible because the expectation makes use of the true underlying $P()$...
  - ...but we only have access to a finite number of training examples, $\Psi$.
  - A *training criterion* usually includes an empirical average of the loss over the training set, $$\min_{\vec{\theta}}\ \mathbf{E}_{\Psi}[\ell(\mathcal{M}(\vec{x};\vec{\theta}), \vec{y})].$$
  - Some additional terms (called *regularizers*) can be added to enforce preferences over the choices of $\vec{\theta}$.

- *An optimization procedure* to approximately minimize the training criterion by modifying $\vec{\theta}$.
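The four components can be put together in a short sketch. Everything specific here is an assumption made for illustration: a linear model, a squared loss, an L2 regularizer weighted by `lam`, plain gradient descent as the optimization procedure, and a synthetic training set drawn from a hypothetical process $y = 2x + 1$ plus noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training set Psi drawn from a hypothetical process y = 2x + 1 plus noise.
x = rng.random(200)
y = 2.0 * x + 1.0 + 0.1 * rng.normal(size=200)

def model(x, theta):
    """Parametrized family of functions: M(x; theta) = theta[0] + theta[1] * x."""
    return theta[0] + theta[1] * x

def criterion(theta, lam):
    """Empirical average of the squared loss plus an L2 regularizer."""
    return np.mean((model(x, theta) - y) ** 2) + lam * np.sum(theta ** 2)

def gradient(theta, lam):
    """Gradient of the training criterion with respect to theta."""
    residual = model(x, theta) - y
    grad_data = np.array([2 * np.mean(residual), 2 * np.mean(residual * x)])
    return grad_data + 2 * lam * theta

theta = np.zeros(2)
lam, learning_rate = 1e-3, 0.1
initial = criterion(theta, lam)
for _ in range(2000):
    theta = theta - learning_rate * gradient(theta, lam)   # one optimization step
final = criterion(theta, lam)
print(theta, initial, final)
```

The regularizer pulls the parameters slightly towards zero, so the fitted values are expected to be close to, but not exactly, the parameters of the generating process.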

- It is clear now that we need a dataset for training (or fitting or optimizing) the model.
  - **Training dataset**

- We need another dataset to assess progress and compute the training criterion.
  - **Testing dataset**

- As most ML approaches are stochastic, and to contrast different approaches, we need to have another dataset.
  - **Validation dataset**

This is a cornerstone issue of machine learning and we will be coming back to it.
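As a minimal sketch of how one dataset can be cut into three disjoint subsets, assuming a random shuffle and an arbitrary 60/20/20 proportion (both are illustrative choices, as are the variable names):

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, assumed for repeatability

n = 100
x = np.arange(n)
y = np.sin(x * np.pi / 50)

# Shuffle the indices once, then cut them into three disjoint parts.
idx = rng.permutation(n)
n_train = 60 * n // 100
n_val = 20 * n // 100
train_idx = idx[:n_train]
val_idx = idx[n_train:n_train + n_val]
test_idx = idx[n_train + n_val:]

x_train, y_train = x[train_idx], y[train_idx]
x_val, y_val = x[val_idx], y[val_idx]
x_test, y_test = x[test_idx], y[test_idx]
print(len(x_train), len(x_val), len(x_test))
```

The important property is that no example appears in more than one subset; otherwise the error measured on the held-out parts would be optimistic.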

The machine learning flowchart

- Cellular automata
- **Neural computation**
- Evolutionary computation
- Swarm intelligence
- Artificial immune systems
- Membrane computing
- Amorphous computing

- Different classes of machine learning problems:
  - Classification
  - Regression
  - Clustering

- Different classes of learning scenarios:
  - Supervised,
  - unsupervised,
  - semi-supervised, and
  - reinforcement learning.

- Model, dataset, loss function, optimization.

- Read Chapters 1, 2 and 3 of Ian Goodfellow, Yoshua Bengio and Aaron Courville (2016) Deep Learning, MIT Press.
- Read Chapter 2 of Hastie, Tibshirani and Friedman (2009) The Elements of Statistical Learning (2nd edition) Springer-Verlag.