# Build You a Machine Learning

Given some inputs $x_1, \ldots, x_n \in \mathbb{R}^m$, and corresponding outputs $y_1, \ldots, y_n \in \mathbb{R}$, find a function $f: \mathbb{R}^m \to \mathbb{R}$ so that

$$f(x) \approx y$$

for all $(x, y)$ drawn from the same distribution as $(x_j, y_j)$.

## Definitions!

The points $x_j$ are called *features* and the corresponding $y_j$ are *labels*. Together, they form the *training set*, and since we love matrices and notation gets hellish without them,

$$X = (x_1, \ldots, x_n) = \left( \begin{array}{ccc} x_1^1 & \ldots & x_1^m \\ \vdots & & \vdots \\ x_n^1 & \ldots & x_n^m \end{array} \right)$$

Also, $$\mathbf{y} = (y_1, \ldots, y_n)$$
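In NumPy, the same objects can be built by stacking the feature vectors as rows (a minimal sketch with made-up numbers; the values here are arbitrary):

```python
import numpy as np

# Three training points (n = 3) with two features each (m = 2)
x1 = np.array([1.0, 2.0])
x2 = np.array([3.0, 4.0])
x3 = np.array([5.0, 6.0])

# X stacks the points as rows, exactly as in the matrix above
X = np.vstack([x1, x2, x3])
y = np.array([0.5, 1.5, 2.5])

print(X.shape)  # (3, 2): one row per point, one column per feature
```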

## Some practical notes:

• You don't have to use the features you are given. The literature sometimes calls the transformed features you actually use for model building *basis functions*. Another term worth googling here is *feature selection*.
• There is a lot to be said about overfitting.
• You probably want to reserve some data to use as a test set, and maybe some more for cross validation.
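For that last point, scikit-learn's `train_test_split` handles the bookkeeping. A minimal sketch with toy data (the 70/30 split ratio is just an example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # ten toy points, two features each
y = np.arange(10)

# Hold out 30% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
print(len(X_train), len(X_test))  # 7 3
```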

## More notes

• Regression tasks mean the labels are real valued. Classification tasks have categorical labels.
• A task is supervised if you have labels, unsupervised if you don't, and semi-supervised if you have some, but not tons.

## Linear Regression

$$f(\mathbf{x}) \approx y \Rightarrow \mathbf{x} \cdot \mathbf{w} \approx y$$

For reasons that will become clear, we will choose the vector $\mathbf{w}$ that minimizes $$cost(\mathbf{w}) = \| X\mathbf{w} - \mathbf{y} \|^2$$

## Deriving an exact least squares solution:

\begin{eqnarray} && X\mathbf{w} = \mathbf{y} \\ &\Rightarrow& X^T X \mathbf{w} = X^T \mathbf{y} \\ &\Rightarrow& \mathbf{w} = (X^TX)^{-1} X^T \mathbf{y} \end{eqnarray}

The best part is that this is almost true! ($X\mathbf{w} = \mathbf{y}$ usually has no exact solution, but the final line is the least-squares minimizer whenever $X^TX$ is invertible, i.e. the columns of $X$ are linearly independent.)
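As a sanity check, the normal-equations formula can be compared against NumPy's least-squares solver. A minimal sketch on synthetic data (the coefficients and noise level are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))      # tall matrix: no exact solution in general
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.01 * rng.standard_normal(50)

# Normal equations: w = (X^T X)^{-1} X^T y
# (np.linalg.solve avoids forming the inverse explicitly)
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# NumPy's dedicated least-squares routine should agree
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w_normal, w_lstsq))  # True
```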

## Examples!

Let's generate some random data for supervised regression:

In [2]:
import numpy as np
DIM = 5
COEFS = np.random.randn(DIM)

def generate_feature(x, weights):
    feature = [w * x ** idx for idx, w in enumerate(weights)]
    return feature

def generate_features_and_labels(X, weights, noise=0):
    features = np.array([generate_feature(x, weights) for x in X])
    features += noise * np.random.randn(*features.shape)
    labels = features.sum(axis=1)
    return features, labels

print(COEFS)

[-1.09805639  0.42003036 -2.48283096  0.44664483  0.06329607]


## Model Building

Scikit-learn makes our lives easy:

In [3]:
from sklearn.linear_model import LinearRegression
x = np.random.random(50)
features, y = generate_features_and_labels(x, COEFS, noise=0.05)

reg = LinearRegression().fit(x.reshape(-1, 1), y)
print("y = {:.2f} * x + {:.2f}".format(reg.coef_[0], reg.intercept_))

y = -1.80 * x + -0.71

In [4]:
from talk_utils import Matrix

A = Matrix([list(x), [1 for _ in x]]).T
b = Matrix([list(y)]).T

w = (A.T * A).inverse() * A.T * b
print("y = {:.2f} * x + {:.2f}".format(w.vals[0][0], w.vals[1][0]))

y = -1.80 * x + -0.71


## Plotting with pyplot

We use the model to make a prediction and then plot that prediction:

In [5]:
%matplotlib inline
import seaborn # makes plots pretty
from talk_utils import plotter

t = np.linspace(x.min(), x.max())
plotter(t, reg.predict(t.reshape(-1, 1)), x, y)


## Linear regression can also be curvy!

In [6]:
new_dim = 4 * DIM
high_dim_features, _ = generate_features_and_labels(x, np.ones(new_dim))

overfit_reg = LinearRegression().fit(high_dim_features, y)

t_transformed, _ = generate_features_and_labels(t, np.ones(new_dim))
overfit_preds = overfit_reg.predict(t_transformed)
plotter(t, overfit_preds, x, y)


## Other forms of regression!

### Ridge regression:

Instead of minimizing $$cost(\mathbf{w}) = \| X\mathbf{w} - \mathbf{y} \|^2$$ we will minimize $$cost(\mathbf{w}) = \| X\mathbf{w} - \mathbf{y} \|^2 + \alpha \|\mathbf{w} \|^2$$

• Intuitively: we want to penalize complexity, so having a big weight incurs a penalty
• Notice we now also have to choose $\alpha$
• This is also called $L^2$ regularization
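Ridge also has a closed-form solution: the $\alpha$ term just adds $\alpha I$ inside the inverse, giving $\mathbf{w} = (X^TX + \alpha I)^{-1} X^T \mathbf{y}$. A minimal sketch on synthetic data (the choice $\alpha = 1.0$ is arbitrary; `fit_intercept=False` makes scikit-learn match the bare formula):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = X @ np.array([2.0, -1.0, 0.5])

alpha = 1.0
# Closed form: w = (X^T X + alpha I)^{-1} X^T y
w_closed = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)

# scikit-learn's Ridge (no intercept, to match the formula) agrees
ridge = Ridge(alpha=alpha, fit_intercept=False).fit(X, y)
print(np.allclose(w_closed, ridge.coef_))  # True
```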
In [7]:
from sklearn.linear_model import RidgeCV

ridge_reg = RidgeCV().fit(high_dim_features, y)

ridge_preds = ridge_reg.predict(t_transformed)

plotter(t, ridge_preds, x, y)
ridge_reg.coef_

Out[7]:
array([ 0.        , -0.41349154, -0.58080439, -0.44419022, -0.27891466,
-0.15459739, -0.07354935, -0.02405877,  0.00500718,  0.02155159,
0.03063587,  0.03534176,  0.03748795,  0.03812214,  0.03783417,
0.03694787,  0.0356356 ,  0.03398576,  0.03204177,  0.02982429])

### Lasso regression:

We will minimize $$cost(\mathbf{w}) = \| X\mathbf{w} - \mathbf{y} \|^2 + \alpha \|\mathbf{w} \|_1$$

• This encourages sparse weights: the $L^1$ penalty tends to drive many coefficients exactly to zero
• We still have to choose $\alpha$
• This is also called $L^1$ regularization
In [8]:
from sklearn.linear_model import LassoCV

lasso_reg = LassoCV()
lasso_reg.fit(high_dim_features, y)

lasso_preds = lasso_reg.predict(t_transformed)

plotter(t, lasso_preds, x, y)
print(lasso_reg.score(high_dim_features, y))
print(ridge_reg.score(high_dim_features, y))
print(overfit_reg.score(high_dim_features, y))

0.965889563709
0.966076370214
0.978404025746


## Classification!

In [9]:
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

iris = load_iris()
CMAP = plt.cm.viridis
X = iris.data[:, :2]  # keep only the first two features so we can plot them
_, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], c=iris.target, cmap=CMAP, s=40);

In [11]:
from sklearn.linear_model import LogisticRegression
from talk_utils import visualize_classifier  # helper from this talk's utilities

visualize_classifier(LogisticRegression)

Accuracy is 76.7%

In [12]:
from sklearn.neighbors import KNeighborsClassifier

visualize_classifier(KNeighborsClassifier)

Accuracy is 83.3%

In [13]:
from sklearn.svm import SVC

visualize_classifier(SVC)

Accuracy is 82.7%

In [14]:
from sklearn.naive_bayes import GaussianNB

visualize_classifier(GaussianNB)

Accuracy is 78.0%

In [15]:
from sklearn.tree import DecisionTreeClassifier

visualize_classifier(DecisionTreeClassifier)

Accuracy is 92.7%

In [16]:
from sklearn.ensemble import RandomForestClassifier

visualize_classifier(RandomForestClassifier)

Accuracy is 92.0%