Build You a Machine Learning

Given some inputs $x_1, \ldots, x_n \in \mathbb{R}^m $, and corresponding outputs $y_1, \ldots, y_n \in \mathbb{R}$, find a function $f: \mathbb{R}^m \to \mathbb{R}$ so that

\begin{equation} f(x) \approx y \end{equation} for all $(x, y)$ drawn from the same distribution as $(x_j, y_j)$.

Definitions!

The points $x_j$ are called features and the corresponding $y_j$ are labels. Together they form the training set, and since we love matrices (and notation gets hellish without them),

\begin{equation} X = \left( \begin{array}{c} x_1 \\ \vdots \\ x_n \end{array} \right) = \left( \begin{array}{ccc} x_1^1 & \ldots & x_1^m \\ \vdots & & \vdots \\ x_n^1 & \ldots & x_n^m \end{array} \right) \end{equation}

Also, $$ \mathbf{y} = (y_1, \ldots, y_n) $$
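As a quick sanity check on shapes (a tiny sketch, not from the original talk): in numpy, $X$ lands as an $n \times m$ array with one row per training point, and $\mathbf{y}$ as a length-$n$ vector.

import numpy as np

n, m = 6, 3                  # 6 training points, 3 features each
X = np.random.randn(n, m)    # row j is the feature vector x_j
y = np.random.randn(n)       # entry j is the label y_j

assert X.shape == (n, m) and y.shape == (n,)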

Some practical notes:

  • You don't have to use the features you are given. The literature will sometimes call the transformed features you actually build your model on basis functions. Another term to google here is feature selection.
  • There is a lot to be said about overfitting.
  • You probably want to reserve some data to use as a test set, and maybe some more for cross validation (see the sketch below).
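A minimal version of that split, using scikit-learn's train_test_split (the sizes here are just illustrative):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.randn(100, 5)   # 100 points, 5 features
y = np.random.randn(100)      # 100 labels

# Hold out 20% of the data as a test set; the rest is for training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)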

More notes

  • Regression tasks mean the labels are real valued. Classification tasks have categorical labels.
  • A task is supervised if you have labels, unsupervised if you don't, and semi-supervised if you have some, but not tons.

Linear Regression

$$ f(\mathbf{x}) \approx y \Rightarrow \mathbf{x} \cdot \mathbf{w} \approx y $$

For reasons (not least that it gives a closed-form solution), we will choose the vector $\mathbf{w}$ that minimizes $$ cost(\mathbf{w}) = \| X\mathbf{w} - \mathbf{y} \|^2 $$

Deriving an exact least squares solution:

\begin{eqnarray} && X\mathbf{w} = \mathbf{y} \\ &\Rightarrow& X^T X \mathbf{w} = X^T \mathbf{y} \\ &\Rightarrow& \mathbf{w} = (X^TX)^{-1} X^T \mathbf{y} \end{eqnarray}

The best part is that this is almost true! $X\mathbf{w} = \mathbf{y}$ usually has no exact solution, but the last line (the normal equations) still gives the least-squares minimizer, as long as $X^TX$ is invertible.
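A quick numerical check of that claim (a sketch, not from the talk): the normal-equations formula should agree with numpy's least-squares solver.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)

# Normal equations: w = (X^T X)^{-1} X^T y
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# numpy's built-in least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_normal, w_lstsq))   # True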

Examples!

Let's generate some random data for supervised regression:

In [2]:
import numpy as np
DIM = 5
COEFS = np.random.randn(DIM)

def generate_feature(x, weights):
    # One polynomial term per weight: weights[idx] * x ** idx
    feature = [w * x ** idx for idx, w in enumerate(weights)]
    return feature

def generate_features_and_labels(X, weights, noise=0):
    # One row of polynomial terms per input point, plus optional Gaussian noise
    features = np.array([generate_feature(x, weights) for x in X])
    features += noise * np.random.randn(*features.shape)
    # The label is the (noisy) polynomial evaluated at the point
    labels = features.sum(axis=1)
    return features, labels

print(COEFS)
[-1.09805639  0.42003036 -2.48283096  0.44664483  0.06329607]

Model Building

Scikit-learn makes our lives easy:

In [3]:
from sklearn.linear_model import LinearRegression
x = np.random.random(50)
features, y = generate_features_and_labels(x, COEFS, noise=0.05)

reg = LinearRegression().fit(x.reshape(-1, 1), y)
print("y = {:.2f} * x + {:.2f}".format(reg.coef_[0], reg.intercept_))
y = -1.80 * x + -0.71
In [4]:
from talk_utils import Matrix

# Manually add intercept
A = Matrix([list(x), [1 for _ in x]]).T
b = Matrix([list(y)]).T

w = (A.T * A).inverse() * A.T * b
print("y = {:.2f} * x + {:.2f}".format(w.vals[0][0], w.vals[1][0]))
y = -1.80 * x + -0.71

Plotting with pyplot

We use the model to make a prediction and then plot that prediction:

In [5]:
%matplotlib inline
import seaborn # makes plots pretty
from talk_utils import plotter

t = np.linspace(x.min(), x.max())
plotter(t, reg.predict(t.reshape(-1, 1)), x, y)
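plotter comes from talk_utils and isn't shown in this notebook. A minimal matplotlib stand-in, guessed from how it's called (scatter the data, overlay the prediction curve), might look like:

import matplotlib.pyplot as plt

def plotter(t, predictions, x, y):
    # Scatter the training data and overlay the model's predictions.
    _, ax = plt.subplots()
    ax.scatter(x, y, label="training data")
    ax.plot(t, predictions, color="C1", label="model")
    ax.legend()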

Linear regression can also be curvy!

In [6]:
new_dim = 4 * DIM
high_dim_features, _ = generate_features_and_labels(x, np.ones(new_dim))

overfit_reg = LinearRegression().fit(high_dim_features, y)

t_transformed, _ = generate_features_and_labels(t, np.ones(new_dim))
overfit_preds = overfit_reg.predict(t_transformed)
plotter(t, overfit_preds, x, y)
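This is the same trick hiding inside generate_features_and_labels: plain linear regression, but on nonlinear basis functions of $x$. scikit-learn can build those basis functions for you; a sketch (not from the original talk) using PolynomialFeatures:

from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Degree-19 polynomial basis of the 1-D input, then ordinary least squares.
poly_reg = make_pipeline(PolynomialFeatures(degree=new_dim - 1), LinearRegression())
poly_reg.fit(x.reshape(-1, 1), y)
plotter(t, poly_reg.predict(t.reshape(-1, 1)), x, y)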

Other forms of regression!

Ridge regression:

Instead of minimizing $$ cost(\mathbf{w}) = \| X\mathbf{w} - \mathbf{y} \|^2 $$ we will minimize $$ cost(\mathbf{w}) = \| X\mathbf{w} - \mathbf{y} \|^2 + \alpha \|\mathbf{w} \|^2 $$

  • Intuitively: we want to penalize complexity, so having a big weight incurs a penalty
  • Notice we now also have to choose $\alpha$
  • This is also called $L^2$ regularization
In [7]:
from sklearn.linear_model import RidgeCV

ridge_reg = RidgeCV().fit(high_dim_features, y)

ridge_preds = ridge_reg.predict(t_transformed)

plotter(t, ridge_preds, x, y)
ridge_reg.coef_
Out[7]:
array([ 0.        , -0.41349154, -0.58080439, -0.44419022, -0.27891466,
       -0.15459739, -0.07354935, -0.02405877,  0.00500718,  0.02155159,
        0.03063587,  0.03534176,  0.03748795,  0.03812214,  0.03783417,
        0.03694787,  0.0356356 ,  0.03398576,  0.03204177,  0.02982429])
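Ridge also has a closed-form solution; it is a one-line tweak to the normal equations above (a sketch, assuming any intercept is folded into $X$):

import numpy as np

def ridge_weights(X, y, alpha):
    # Minimize ||Xw - y||^2 + alpha ||w||^2  =>  w = (X^T X + alpha I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

RidgeCV picks $\alpha$ for us by cross validation; the chosen value is stored in ridge_reg.alpha_.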

Lasso regression:

Instead of minimizing $$ cost(\mathbf{w}) = \| X\mathbf{w} - \mathbf{y} \|^2 $$ we will minimize $$ cost(\mathbf{w}) = \| X\mathbf{w} - \mathbf{y} \|^2 + \alpha \|\mathbf{w}\|_1 $$

  • This encourages sparse weights: the $L^1$ penalty tends to push coefficients to exactly zero, so lasso doubles as feature selection
  • We still have to choose $\alpha$
  • This is also called $L^1$ regularization
In [8]:
from sklearn.linear_model import LassoCV

lasso_reg = LassoCV()
lasso_reg.fit(high_dim_features, y)

lasso_preds = lasso_reg.predict(t_transformed)

plotter(t, lasso_preds, x, y)
print(lasso_reg.score(high_dim_features, y))
print(ridge_reg.score(high_dim_features, y))
print(overfit_reg.score(high_dim_features, y))
0.965889563709
0.966076370214
0.978404025746
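Note that these are $R^2$ scores on the training data itself, which is why the unregularized, overfit model comes out on top. To see lasso's sparsity in action (a quick check, not part of the original talk), look at which of the 20 coefficients survived:

import numpy as np

print(np.flatnonzero(lasso_reg.coef_))   # indices of the nonzero coefficients
print(lasso_reg.alpha_)                  # the penalty strength LassoCV selected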

Classification!

In [9]:
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

CMAP = plt.cm.viridis
iris = load_iris()
X = iris.data[:, :2]
_, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], c=iris.target, cmap=CMAP, s=40);
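visualize_classifier is another helper from talk_utils (not shown here). A rough sketch of what it might do, assuming it fits on the two iris features above, reports accuracy on that same data, and shades the decision regions:

import numpy as np

def visualize_classifier(model_class):
    # Fit on the two iris features and report accuracy on the full dataset.
    model = model_class().fit(X, iris.target)
    print("Accuracy is {:.1%}".format(model.score(X, iris.target)))

    # Color a grid by the predicted class, then overlay the actual points.
    xx, yy = np.meshgrid(np.linspace(X[:, 0].min(), X[:, 0].max(), 200),
                         np.linspace(X[:, 1].min(), X[:, 1].max(), 200))
    preds = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    _, ax = plt.subplots()
    ax.contourf(xx, yy, preds, cmap=CMAP, alpha=0.3)
    ax.scatter(X[:, 0], X[:, 1], c=iris.target, cmap=CMAP, s=40)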
In [11]:
from sklearn.linear_model import LogisticRegression

visualize_classifier(LogisticRegression)
Accuracy is 76.7%
In [12]:
from sklearn.neighbors import KNeighborsClassifier

visualize_classifier(KNeighborsClassifier)
Accuracy is 83.3%
In [13]:
from sklearn.svm import SVC

visualize_classifier(SVC)
Accuracy is 82.7%
In [14]:
from sklearn.naive_bayes import GaussianNB

visualize_classifier(GaussianNB)
Accuracy is 78.0%
In [15]:
from sklearn.tree import DecisionTreeClassifier

visualize_classifier(DecisionTreeClassifier)
Accuracy is 92.7%
In [16]:
from sklearn.ensemble import RandomForestClassifier

visualize_classifier(RandomForestClassifier)
Accuracy is 92.0%