The good news is that modern machine learning can be distilled down to a couple of key techniques that are of very wide applicability. Recent studies have shown that the vast majority of datasets can be best modeled with just two methods:
Ensembles of decision trees (i.e. Random Forests and Gradient Boosting Machines), mainly for structured data (such as you might find in a database table at most companies). We looked at random forests in depth as we analyzed the Blue Book for Bulldozers dataset.
Multi-layered neural networks learnt with SGD (i.e. shallow and/or deep learning), mainly for unstructured data (such as audio, vision, and natural language)
In this lesson, we will start on the 2nd approach (a neural network with SGD) by analyzing the MNIST dataset. You may be surprised to learn that logistic regression is actually an example of a simple neural net!
In this lesson, we will be working with MNIST, a classic data set of hand-written digits. Solutions to this problem are used by banks to automatically recognize the amounts on checks, and by the postal service to automatically recognize zip codes on mail.
A matrix can represent an image, by creating a grid where each entry corresponds to a different pixel.
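As a small illustration of this idea, the sketch below builds a tiny grayscale "image" by hand as a NumPy array. The pixel values are made up for illustration and are not taken from MNIST.

```python
import numpy as np

# A made-up 5x5 grayscale image: 0.0 is black, 1.0 is white.
# The bright entries trace a rough vertical stroke, like the digit 1.
img = np.array([
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 1.0, 0.0],
])

print(img.shape)   # (rows, columns) of the pixel grid
print(img[1, 2])   # the pixel in row 1, column 2

# Flattening turns the grid into one long vector of pixel values.
flat = img.reshape(-1)
print(flat.shape)
```

Each entry of the matrix is one pixel, so indexing the array is the same as looking up a pixel by its row and column.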
We will be using the fastai library, which is still in pre-alpha. If you are accessing this course notebook, you probably already have it downloaded, as it is in the same GitHub repo as the course materials.
We use symbolic links (often called symlinks) to make it possible to import these from your current directory. For instance, I ran:
ln -s ../../fastai
in the terminal, within the directory I'm working in, home/fastai/courses/ml1.
%load_ext autoreload
%autoreload 2
%matplotlib inline
from fastai.imports import *
from fastai.torch_imports import *
from fastai.io import *
/home/paperspace/anaconda3/envs/fastai/lib/python3.6/site-packages/sklearn/ensemble/weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release. from numpy.core.umath_tests import inner1d
path = 'data/mnist/'
Let's download, unzip, and format the data.
import os
os.makedirs(path, exist_ok=True)
!ls {path}
mnist.pkl.gz
URL='http://deeplearning.net/data/mnist/'
FILENAME='mnist.pkl.gz'
def load_mnist(filename):
    return pickle.load(gzip.open(filename, 'rb'), encoding='latin-1')
get_data(URL+FILENAME, path+FILENAME)
((x, y), (x_valid, y_valid), _) = load_mnist(path+FILENAME)
type(x), x.shape, type(y), y.shape
(numpy.ndarray, (50000, 784), numpy.ndarray, (50000,))
Many machine learning algorithms behave better when the data is normalized, that is, when the mean is 0 and the standard deviation is 1. We will subtract the mean of our training set and divide by its standard deviation in order to normalize the data:
mean = x.mean()
std = x.std()
x=(x-mean)/std
mean, std, x.mean(), x.std()
(0.13044983, 0.3072898, -3.1638146e-07, 0.99999934)
Note that for consistency (with the parameters we learn when training), we normalize the validation set using the mean and standard deviation of the training set, not its own statistics. This is why the results below are close to, but not exactly, 0 and 1.
x_valid = (x_valid-mean)/std
x_valid.mean(), x_valid.std()
(-0.005850922, 0.99243325)
In any sort of data science work, it's important to look at your data, to make sure you understand the format, how it's stored, what type of values it holds, etc. To make it easier to work with, let's reshape it into 2d images from the flattened 1d format.
def show(img, title=None):
    plt.imshow(img, cmap="gray")
    if title is not None:
        plt.title(title)

def plots(ims, figsize=(12,6), rows=2, titles=None):
    f = plt.figure(figsize=figsize)
    cols = len(ims)//rows
    for i in range(len(ims)):
        sp = f.add_subplot(rows, cols, i+1)
        # sp.axis('Off')
        if titles is not None: sp.set_title(titles[i], fontsize=16)
        plt.imshow(ims[i], cmap='gray')
x_valid.shape
(10000, 784)
x_imgs = np.reshape(x_valid, (-1,28,28)); x_imgs.shape
(10000, 28, 28)
show(x_imgs[0], y_valid[0])
y_valid.shape
(10000,)
It's the digit 3! And that's stored in the y value:
y_valid[0]
3
We can look at part of an image:
x_imgs[0,10:20,10:20]
array([[-0.42452, -0.42452, -0.42452, -0.42452, 0.17294, 2.34669, 2.80432, 2.32126, -0.05587, -0.42452], [-0.42452, -0.42452, -0.42452, 0.78312, 2.43567, 2.80432, 2.68991, 0.40176, -0.42452, -0.42452], [-0.42452, -0.27197, 1.20261, 2.77889, 2.80432, 2.5755 , 0.08396, -0.42452, -0.42452, -0.42452], [-0.42452, 1.76194, 2.80432, 2.80432, 1.73651, 0.31278, -0.42452, -0.42452, -0.42452, -0.42452], [-0.42452, 2.20685, 2.80432, 2.80432, 0.40176, -0.42452, -0.42452, -0.42452, -0.42452, -0.42452], [-0.42452, 1.31702, 2.80432, 2.80432, 2.76618, 1.43143, -0.09401, -0.42452, -0.42452, -0.42452], [-0.42452, -0.31011, 1.77465, 2.42296, 2.80432, 2.80432, 2.49923, 0.47803, -0.42452, -0.42452], [-0.42452, -0.42452, -0.32282, -0.27197, 2.80432, 2.80432, 2.80432, 2.70262, 0.89752, -0.42452], [-0.42452, -0.42452, -0.42452, -0.42452, 0.16023, 1.97804, 2.80432, 2.80432, 2.42296, -0.42452], [-0.42452, -0.42452, -0.42452, -0.42452, -0.42452, -0.20841, 1.80007, 2.80432, 2.80432, -0.10672]], dtype=float32)
show(x_imgs[0,10:20,10:20])
plots(x_imgs[:8], titles=y_valid[:8])
We will take a deep look at logistic regression and how we can program it ourselves. We are going to treat it as a specific example of a shallow neural net.
What is a neural network?
A neural network is an infinitely flexible function, consisting of layers. A layer is a linear function such as matrix multiplication followed by a non-linear function (the activation).
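To make the idea of a layer concrete, here is a minimal NumPy sketch of two stacked layers. The layer sizes, the random inputs, and the choice of ReLU as the activation are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, w, b):
    """One layer: a linear function followed by a non-linearity."""
    linear = x @ w + b             # matrix multiplication plus a bias
    return np.maximum(linear, 0)   # ReLU activation (an assumed choice)

x = rng.normal(size=(4, 3))                     # 4 examples, 3 features each
w1, b1 = rng.normal(size=(3, 5)), np.zeros(5)   # first layer: 3 -> 5
w2, b2 = rng.normal(size=(5, 2)), np.zeros(2)   # second layer: 5 -> 2

# Stacking layers: the output of one layer is the input of the next.
out = layer(layer(x, w1, b1), w2, b2)
print(out.shape)
```

The whole network is still just a function from inputs to outputs; the entries of w1, b1, w2, and b2 are its parameters.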
One of the tricky parts of neural networks is just keeping track of all the vocabulary!
A function takes inputs and returns outputs. For instance, f(x)=3x+5 is an example of a function. If we input 2, the output is 3×2+5=11, or if we input −1, the output is 3×(−1)+5=2.
Functions have parameters. The above function f is ax+b, with parameters a and b set to a=3 and b=5.
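The same idea in code, as a small sketch: the helper name make_line is hypothetical and simply returns the function ax+b for whichever parameters it is given.

```python
def make_line(a, b):
    """Return the function f(x) = a*x + b for the given parameters."""
    def f(x):
        return a * x + b
    return f

f = make_line(3, 5)   # parameters a=3 and b=5
print(f(2))           # 3*2 + 5 = 11
print(f(-1))          # 3*(-1) + 5 = 2
```

Changing a and b gives a different function of the same family, which is exactly what learning the parameters will do.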
Machine learning is often about learning the best values for those parameters. For instance, suppose we have the data points on the chart below. What values should we choose for a and b?
In the above gif (from the intro to SGD notebook in fast.ai's deep learning course), an algorithm called stochastic gradient descent is being used to learn the best parameters to fit the line to the data (note: in the gif, the algorithm is stopping before the absolute best parameters are found). This process is called training or fitting.
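The fitting process can be sketched in a few lines of NumPy. This is a minimal illustration and not the code behind the gif: it assumes synthetic data generated from a=3 and b=5, a mean squared error loss, and arbitrarily chosen values for the learning rate and batch size.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data from a known line, y = 3x + 5, plus a little noise.
x = rng.uniform(-1, 1, size=200)
y = 3 * x + 5 + rng.normal(scale=0.1, size=200)

a, b = 0.0, 0.0    # initial guesses for the parameters
lr = 0.1           # learning rate (step size), chosen by hand
batch_size = 20

for epoch in range(100):
    order = rng.permutation(len(x))            # shuffle once per epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]  # one random mini-batch
        xb, yb = x[idx], y[idx]
        err = (a * xb + b) - yb
        # Gradients of the mean squared error with respect to a and b.
        grad_a = 2 * np.mean(err * xb)
        grad_b = 2 * np.mean(err)
        # Step each parameter a little way downhill.
        a -= lr * grad_a
        b -= lr * grad_b

print(a, b)   # should end up close to 3 and 5
```

The "stochastic" part is that each step uses only a random mini-batch of the data rather than the full dataset.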
Most datasets will not be well-represented by a line. We could use a more complicated function, such as g(x)=ax²+bx+c+sin(d). Now we have 4 parameters to learn: a, b, c, and d. This function is more flexible than f(x)=ax+b and will be able to accurately model more datasets.
Neural networks take this to an extreme, and are infinitely flexible. They often have thousands, or even hundreds of thousands, of parameters. However, the core idea is the same as above. The neural network is a function, and we will learn the best parameters for modeling our data.
We will be using the open source deep learning library, fastai, which provides high level abstractions and best practices on top of PyTorch. This is the highest level, simplest way to get started with deep learning. Please note that fastai requires Python 3 to function. It is currently in pre-alpha, so items may move around and more documentation will be added in the future.
The fastai deep learning library uses PyTorch, a Python framework for dynamic neural networks with GPU acceleration, which was released by Facebook's AI team.
PyTorch has two overlapping, yet distinct, purposes. As described in the PyTorch documentation:
The neural network functionality of PyTorch is built on top of the Numpy-like functionality for fast matrix computations on a GPU. Although the neural network purpose receives way more attention, both are very useful. We'll implement a neural net from scratch today using PyTorch.
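As a hedged illustration of the Numpy-like side of PyTorch, the sketch below runs a small tensor computation on the GPU when one is available and on the CPU otherwise. It assumes a recent PyTorch release in which torch.device is available; the array values are made up.

```python
import numpy as np
import torch

# Use the GPU when one is visible to PyTorch; otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a = torch.from_numpy(np.array([[1., 2.], [3., 4.]])).to(device)
b = torch.from_numpy(np.array([[0., 1.], [1., 0.]])).to(device)

# Tensor arithmetic mirrors NumPy: elementwise operations and reductions.
c = (a * b + 1).sum(dim=0)

# Bring the result back to the CPU before converting it to a NumPy array.
print(c.cpu().numpy())
```

The same code runs on either device; only the location of the tensors changes.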
Further learning: If you are curious to learn what dynamic neural networks are, you may want to watch this talk by Soumith Chintala, Facebook AI researcher and core PyTorch contributor.
If you want to learn more PyTorch, you can try this introductory tutorial or this tutorial to learn by examples.
Graphics processing units (GPUs) allow matrix computations to be done with much greater speed, as long as you have a library such as PyTorch that takes advantage of them. Advances in GPU technology in the last 10-20 years have been a key part of why neural networks are proving so much more powerful now than they were a few decades ago.
You may own a computer that has a GPU which can be used. For the many people who either don't have a GPU or have one that can't be easily accessed by Python, there are a few different options:
from fastai.metrics import *
from fastai.model import *
from fastai.dataset import *
import torch.nn as nn
We will begin with the highest level abstraction: using a neural net defined by PyTorch's Sequential class.
net = nn.Sequential(
    nn.Linear(28*28, 10),  # 784 input pixels -> 10 outputs, one per digit
    nn.LogSoftmax()
).cuda()
Each input is a vector of size 28*28 pixels and our output is of size 10 (since there are 10 digits: 0, 1, ..., 9).
We use the output of the final layer to generate our predictions. Often for classification problems (like MNIST digit classification), the final layer has the same number of outputs as there are classes. In this case, that is 10: one for each digit from 0 to 9. These can be converted to comparative probabilities. For instance, it may be determined that a particular hand-written image is 80% likely to be a 4, 18% likely to be a 9, and 2% likely to be a 3.
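One common way to turn raw outputs into probabilities is the softmax function. The sketch below is a NumPy illustration; the helper name softmax_np and the ten scores are made up for the example and are not outputs of the network above.

```python
import numpy as np

def softmax_np(scores):
    """Convert a vector of raw scores into probabilities that sum to 1."""
    shifted = scores - scores.max()   # subtracting the max avoids overflow
    exps = np.exp(shifted)
    return exps / exps.sum()

# Made-up raw outputs for the ten digit classes 0..9.
scores = np.array([0.1, -1.2, 0.3, 1.5, 4.0, 0.0, -0.5, 0.2, 0.4, 2.5])
probs = softmax_np(scores)

print(probs.round(3))    # one probability per digit
print(probs.sum())       # the probabilities sum to 1
print(probs.argmax())    # index of the most likely digit
```

The predicted digit is simply the index with the highest probability, which is also the index with the highest raw score.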
md = ImageClassifierData.from_arrays(path, (x,y), (x_valid, y_valid))
loss=nn.NLLLoss()
metrics=[accuracy]
opt=optim.SGD(net.parameters(), lr=1e-2)
In machine learning, the loss function (or cost function) represents the price paid for inaccurate predictions.
The loss associated with one example in binary classification is given by:
-(y * log(p) + (1-y) * log(1-p))
where y is the true label of x and p is the probability predicted by our model that the label is 1.
def binary_loss(y, p): #also called Negative Log Loss
    return np.mean(-(y * np.log(p) + (1-y)*np.log(1-p)))
acts = np.array([1, 0, 0, 1])
preds = np.array([0.9, 0.1, 0.2, 0.8])
binary_loss(acts, preds)
0.164252033486018
Note that in our toy example above our accuracy is 100% and our loss is 0.16. Compare that to a loss of 0.03 that we are getting while predicting cats and dogs. Exercise: play with preds to get a lower loss for this example.
Example: Here is how to compute the loss for a single example in a binary classification problem. Suppose an image x has label 1 and your model gives it a prediction of 0.9. In this case the loss should be small, because our model is predicting label 1 with high probability.
loss = -log(0.9) = 0.10
Now suppose x has label 0 but our model is predicting 0.9. In this case our loss should be much larger.
loss = -log(1-0.9) = 2.30
Exercise: how would you rewrite binary_loss using if instead of * and +?
Why not just maximize accuracy? The binary classification loss is an easier function to optimize.
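One way to see why the loss is easier to optimize than accuracy is to compare how each responds as predictions become more confident. The sketch below repeats the binary loss from above so that it runs on its own; the helper accuracy_at_half and the 0.5 decision threshold are assumptions made for illustration.

```python
import numpy as np

def binary_loss(y, p):
    return np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p)))

def accuracy_at_half(y, p):
    # Hypothetical helper: predict label 1 whenever p exceeds 0.5.
    return np.mean((p > 0.5) == y)

acts = np.array([1, 0, 0, 1])

rows = []
for conf in [0.6, 0.8, 0.99]:
    # Every prediction is on the correct side of 0.5, with growing confidence.
    preds = np.where(acts == 1, conf, 1 - conf)
    rows.append((conf, accuracy_at_half(acts, preds), binary_loss(acts, preds)))
    print(rows[-1])
```

Accuracy stays at 100% in all three cases, so it gives no signal about which set of predictions is better, while the loss keeps decreasing as the predictions improve.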
For multi-class classification, we use negative log likelihood (also known as categorical cross entropy), which is exactly the same thing, but summed up over all classes.
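A minimal NumPy sketch of this loss follows, assuming the model outputs log-probabilities and the targets are integer class indices. With one true class per example, the sum over classes reduces to picking out the log-probability of the correct class. The function name nll_loss and the probabilities are made up for illustration.

```python
import numpy as np

def nll_loss(log_probs, targets):
    """Mean negative log likelihood over a batch of examples."""
    # For each row, pick the log-probability assigned to the true class.
    picked = log_probs[np.arange(len(targets)), targets]
    return -picked.mean()

# Made-up predicted probabilities for 3 examples and 3 classes.
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
])
targets = np.array([0, 1, 2])   # true class of each example

loss = nll_loss(np.log(probs), targets)
print(loss)
```

As in the binary case, the loss is small when the model puts high probability on the correct class and grows quickly as that probability approaches 0.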
Fitting is the process by which the neural net learns the best parameters for the dataset.
fit(net, md, n_epochs=1, crit=loss, opt=opt, metrics=metrics)
epoch      trn_loss   val_loss   accuracy
    0      0.392233   0.334639   0.9091
[array([0.33464]), 0.9091]
preds = predict(net, md.val_dl)
preds.shape
(10000, 10)
Question: Why does our output have length 10 (for each image)?
preds.argmax(axis=1)
array([3, 8, 6, ..., 5, 6, 8])
preds = preds.argmax(1)
Let's check how accurate this approach is on our validation set. You may want to compare this against other implementations of logistic regression, such as the one in sklearn. In our testing, this simple PyTorch version is faster and more accurate for this problem!
np.mean(preds == y_valid)
0.9091
Let's see how some of our predictions look!
plots(x_imgs[:8], titles=preds[:8])
Above, we used PyTorch's nn.Linear to create a linear layer. This is defined by a matrix multiplication and then an addition (these are also called affine transformations). Let's try defining this ourselves.
Just as Numpy has np.matmul for matrix multiplication (in Python 3, this is equivalent to the @ operator), PyTorch has torch.matmul.
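As a quick check of that equivalence, the sketch below computes the same affine transformation three ways. The shapes and values are made up for illustration.

```python
import numpy as np
import torch

x = np.arange(6, dtype=np.float32).reshape(2, 3)   # 2 examples, 3 features
w = np.ones((3, 4), dtype=np.float32)              # maps 3 inputs to 4 outputs
b = np.arange(4, dtype=np.float32)                 # one bias per output

out_matmul = np.matmul(x, w) + b    # NumPy function form
out_at = x @ w + b                  # the @ operator
out_torch = torch.matmul(torch.from_numpy(x), torch.from_numpy(w)) + torch.from_numpy(b)

print(out_at)
print(out_torch)
```

In each case the bias vector is broadcast across the rows, so every example gets the same bias added to its outputs.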
Our PyTorch class needs two things: a constructor (which says what the parameters are) and a forward method (which says how to calculate a prediction using those parameters). The forward method describes how the neural net converts inputs to outputs.
In PyTorch, the optimizer knows to try to optimize any attribute of type Parameter.
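The mechanism behind this is that nn.Module registers attributes of type nn.Parameter, and the parameters() method that we pass to the optimizer yields only those. A small sketch, using a hypothetical module named Tiny:

```python
import torch
import torch.nn as nn

class Tiny(nn.Module):
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.randn(3, 2))   # registered as a parameter
        self.scale = torch.ones(2)                 # plain tensor: not registered

    def forward(self, x):
        return (x @ self.w) * self.scale

m = Tiny()
names = [name for name, _ in m.named_parameters()]
print(names)   # only the nn.Parameter attribute is listed
```

Because scale is a plain tensor, an optimizer built from m.parameters() would never update it.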
def get_weights(*dims):
    return nn.Parameter(torch.randn(dims)/dims[0])

def softmax(x):
    return torch.exp(x)/(torch.exp(x).sum(dim=1)[:,None])

class LogReg(nn.Module):
    def __init__(self):
        super().__init__()
        self.l1_w = get_weights(28*28, 10)  # Layer 1 weights
        self.l1_b = get_weights(10)         # Layer 1 bias

    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = (x @ self.l1_w) + self.l1_b  # Linear Layer
        x = torch.log(softmax(x))        # Non-linear (LogSoftmax) Layer
        return x
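As a sanity check, the hand-written log of softmax can be compared against PyTorch's built-in log_softmax. The sketch below repeats the softmax definition so that it runs on its own; the random input is made up, and the comparison assumes a PyTorch release in which log_softmax accepts a dim argument.

```python
import torch
import torch.nn.functional as F

def softmax(x):
    return torch.exp(x) / (torch.exp(x).sum(dim=1)[:, None])

x = torch.randn(5, 10)              # 5 made-up examples, 10 class scores each

manual = torch.log(softmax(x))      # the approach used in LogReg.forward
builtin = F.log_softmax(x, dim=1)   # PyTorch's implementation

print(torch.allclose(manual, builtin, atol=1e-5))
```

The two agree on ordinary inputs. The built-in version is generally the safer choice for very large scores, where exponentiating directly can overflow.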
We create our neural net and the optimizer. (We will use the same loss and metrics from above).
m = LogReg().cuda()
opt=optim.Adam(m.parameters())
fit(m, md, n_epochs=1, crit=loss, opt=opt, metrics=metrics)
epoch      trn_loss   val_loss   accuracy
    0      0.315913   0.282962   0.9215
[array([0.28296]), 0.9215]
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
%time lr.fit(x[:10000], y[:10000])
CPU times: user 1min 23s, sys: 0 ns, total: 1min 23s Wall time: 1min 23s
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1, penalty='l2', random_state=None, solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
%time preds = lr.predict(x_valid)
preds.shape
CPU times: user 68 ms, sys: 0 ns, total: 68 ms Wall time: 22.3 ms
(10000,)
preds[:5]
array([3, 8, 6, 9, 6])
(preds == y_valid).mean()
0.899