2016-06-17, Josh Montague
Motivation, a little history, a naive implementation, and a discussion of neural networks.
Recap of the structural pillars of logistic regression for classification (previous RST).
Let's see an example where logistic regression works. Consider some two-dimensional data that we'd like to classify.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from mlxtend.evaluate import plot_decision_regions
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
samples = 20
X, y = make_blobs(n_samples=samples, n_features=2, cluster_std=0.25,
centers=[(0, 0.5), (1.5, 0.5)], shuffle=False, random_state=1)
# fit the LR model
clf = LogisticRegression().fit(X,y)
# plotting decision regions
plot_decision_regions(X, y, clf=clf, res=0.02)
plt.xlabel('x1'); plt.ylabel('x2'); plt.title('LR (linearly separable)')
print('The model features are weighted according to: {}'.format(clf.coef_))
Consider a schematic reframing of the LR model above. This time we'll treat the inputs as nodes, which connect to an output node via edges that represent the weight coefficients.
The diagram above is a (simplified form of a) single-neuron model in biology.
As a result, this is the same model that is used to demonstrate a computational neural network.
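To make the "neuron" picture concrete: the LR prediction is just a weighted sum of the inputs passed through a sigmoid. A quick sketch verifying that against the fitted model (rebuilding the same blobs data so the snippet stands alone):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# same blobs as above
X, y = make_blobs(n_samples=20, n_features=2, cluster_std=0.25,
                  centers=[(0, 0.5), (1.5, 0.5)], shuffle=False, random_state=1)
clf = LogisticRegression().fit(X, y)

# the "neuron" view: weighted sum of inputs, then a sigmoid
z = X @ clf.coef_.T + clf.intercept_
p = 1 / (1 + np.exp(-z))

# matches the fitted model's probability for class 1
print(np.allclose(p.ravel(), clf.predict_proba(X)[:, 1]))  # True
```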
So that's great. Logistic regression works, why do we need something like a neural network? To start, consider an example where the LR model breaks down:
rng = np.random.RandomState(1)
X = rng.randn(samples, 2)
y = np.array(np.logical_xor(X[:, 0] > 0, X[:, 1] > 0), dtype=int)
clf = LogisticRegression().fit(X,y)
plot_decision_regions(X=X, y=y, clf=clf, res=0.02, legend=2)
plt.xlabel('x1'); plt.ylabel('x2'); plt.title('LR (XOR)')
Why does this matter? Well...
In the 1960s, when the concept of neural networks was first gaining steam, this type of data was a show-stopper. In particular, the reason our model fails on this data is that it's not linearly separable; classifying it requires interaction terms between the features.
This is a specific type of data that is representative of an XOR logic gate. It's not magic, just a well-known, fundamental type of logic in computing. We can say it in words, approximately: "the label is 1 if either x1 or x2 is positive, but not if both are."
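The truth table itself is easy to generate with numpy:

```python
import numpy as np

# the four cases of the XOR truth table
a = np.array([0, 0, 1, 1])
b = np.array([0, 1, 0, 1])
print(np.logical_xor(a, b).astype(int))  # [0 1 1 0]
```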
At the time, this led to an interesting split in computational work in the field: on the one hand, some people set off to design very custom data and feature engineering tactics so that existing models would still work. On the other hand, people set out to design new algorithms; this is approximately the era when the support vector machine was developed, for example. Since progress on neural network models slowed significantly in this era (remember that computers were entire rooms!), it is often referred to as the first "AI winter." Even though the multi-layer network was designed a few years later, and solved the XOR problem, the attention on the field of AI and neural networks had faded.
Today, you might (sensibly) suggest something like an 'rbf-kernel SVM' to solve this problem, and that would totally work! But that's not where we're going today.
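As a quick sanity check (a sketch using scikit-learn's SVC with its default rbf-kernel parameters), the kernelized SVM has no trouble with this kind of data:

```python
import numpy as np
from sklearn.svm import SVC

# XOR-style data, as above (more samples to make the pattern clear)
rng = np.random.RandomState(1)
X = rng.randn(200, 2)
y = np.logical_xor(X[:, 0] > 0, X[:, 1] > 0).astype(int)

# the rbf kernel handles the non-linear decision boundary
svm = SVC(kernel='rbf').fit(X, y)
print(svm.score(X, y))  # well above chance
```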
With the acceleration of computational power in the last decade, there has been a resurgence in the interest (and capability) of neural network computation.
What is a multi-layer model, and how does it help solve this problem? Non-linearity and feature mixing lead to new features that we don't have to encode by hand. In particular, we no longer depend only on combinations of the input features: we combine input features, apply non-linearities, then combine all of those as new features, apply additional non-linearities, and so on until basically forever.
It sounds like a mess, and it pretty much can be. But first, we'll start simply. Imagine that we put just a single layer of "neurons" between our input data and output. How would that change the evaluation approach we looked at earlier?
Reminder: manually writing out algorithms is a terrible idea for using them, but a great idea for learning how they work.
To get a sense for how the diagram above works, let's first write out the "single-layer" version (which we saw above is equivalent to logistic regression and doesn't work!). We just want to see how it looks in the form of forward- and backward-propagation.
Remember, we have a (samples x 2) input matrix, so we need a (2 x 1) matrix of weights. And to save space, we won't use the fully-accurate and correct implementation of backprop and SGD; instead, we'll use a simplified version that's easier to read but has very similar results.
# make the same data as above (just a little closer so it's easier to find)
rng = np.random.RandomState(1)
X = rng.randn(samples, 2)
y = np.array(np.logical_xor(X[:, 0] > 0, X[:, 1] > 0), dtype=int)
def activate(x, deriv=False):
    """sigmoid activation function; with deriv=True, its derivative
    expressed in terms of the sigmoid's *output* (i.e. x = sigmoid(z))"""
    if deriv:
        return x*(1-x)
    return 1/(1+np.exp(-x))
# initialize synapse0 weights randomly with mean 0
syn0 = 2*np.random.random((2,1)) - 1
# nothing to see here... just some numpy vector hijinks for the next code
y = y[None].T
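That "hijinks" just turns y from a flat array into a column vector, so its shape lines up with the network output. A minimal illustration:

```python
import numpy as np

# a flat label vector, shape (4,)
y = np.array([0, 1, 1, 0])
print(y.shape)          # (4,)

# y[None] adds a leading axis -> (1, 4); transposing gives a column vector
print(y[None].T.shape)  # (4, 1)
```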
This is the iterative phase. We propagate the input data forward through the synapse (weights), calculate the errors, and then back-propagate those errors through the synapses (weights) according to the proper gradients. Note that the number of iterations is arbitrary at this point. We'll come back to that.
for i in range(10000):
    # first "layer" is the input data
    l0 = X
    # forward propagation
    l1 = activate(np.dot(l0, syn0))
    ###
    # this is an oversimplified version of backprop + gradient descent
    #
    # how much did we miss?
    l1_error = y - l1
    #
    # how much should we scale the adjustments?
    # (how much we missed by) * (gradient at l1 value)
    # ~an "error-weighted derivative"
    l1_delta = l1_error * activate(l1, deriv=True)
    ###
    # how much should we update the weight matrix (synapse)?
    syn0 += np.dot(l0.T, l1_delta)
    # some insight into the update progress
    if (i % 2000) == 0:
        print("Mean error @ iteration {}: {}".format(i, np.mean(np.abs(l1_error))))
As expected, this basically didn't work at all!
Even though we aren't plotting the actual output, the mean error shows that the model never got much better than random guessing, even after thousands of iterations! But remember, we knew that would be the case, because this single-layer network is functionally the same as vanilla logistic regression, which we saw fail on the XOR data above!
But now that we have the framework and understanding for how to optimize via backpropagation, we can add an additional layer to the network (a so-called "hidden" layer of neurons), which will introduce the kind of feature mixing we need to represent this data.
As we saw above in the diagram (and talked about), introduction of a new layer means that we get an extra step in both the forward- and backward-propagation steps. This new step means we need an additional weight (synapse) matrix, and an additional derivative calculation. Other than that, the code looks pretty much the same.
# hold tight, we'll come back to choosing this number
hidden_layer_width = 3
# initialize synapse (weight) matrices randomly with mean 0
syn0 = 2*np.random.random((2,hidden_layer_width)) - 1
syn1 = 2*np.random.random((hidden_layer_width,1)) - 1
for i in range(60000):
    # forward propagation through layers 0, 1, and 2
    l0 = X
    l1 = activate(np.dot(l0, syn0))
    l2 = activate(np.dot(l1, syn1))
    # how much did we miss the final target value?
    l2_error = y - l2
    # how much should we scale the adjustments?
    l2_delta = l2_error * activate(l2, deriv=True)
    # project l2 error back onto l1 values according to weights
    l1_error = l2_delta.dot(syn1.T)
    # how much should we scale the adjustments?
    l1_delta = l1_error * activate(l1, deriv=True)
    # how much should we update the weight matrices (synapses)?
    syn1 += l1.T.dot(l2_delta)
    syn0 += l0.T.dot(l1_delta)
    if (i % 10000) == 0:
        print("Error @ iteration {}: {}".format(i, np.mean(np.abs(l2_error))))
Ok, this time we started at random guessing (sensible), but notice that we quickly reduced our overall error! That's excellent!
Note: I didn't have time to debug the case where the full XOR data only trained to label one quadrant correctly. To get a sense for how it can look with a smaller set, change the "fall-back data" cell to code, and run the cells starting there!
Knowing that the error is lower is great, but we can also inspect the results of the fit network by looking at the forward propagation results from the trained synapses (weights).
def forward_prop(X):
    """forward-propagate data X through the pre-fit network"""
    l1 = activate(np.dot(X, syn0))
    l2 = activate(np.dot(l1, syn1))
    return l2
# numpy and plotting shenanigans come from:
# http://scikit-learn.org/stable/auto_examples/svm/plot_iris.html
# mesh step size
h = .02
# create a mesh to plot in
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
# calculate the surface (by forward-propagating)
Z = forward_prop(np.c_[xx.ravel(), yy.ravel()])
# reshape the result into a grid
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
# we can use this to inspect the smaller dataset
#plt.plot(X[:, 0], X[:, 1], 'o')
Success! (Possibly... depending on whether Josh debugged the larger network problem :) ). If only one quadrant was trained correctly, go use the smaller dataset!
The stuff in this session is just a very basic start! The limits to increasing complexity are now at the hardware level, and networks can be amazingly complicated. Below is an example from a talk I saw: note how the layers build on each other to represent increasingly complicated structure in the context of facial recognition.
It's not clear how you'd encode "this is a face," but once you see how the first layer's "atomic" components are assembled into abstract parts of a face, and how those parts are combined into representations of kinds of faces, it seems more believable!
And, as you probably guessed, what we've done above isn't how you use these in practice. There are many Python libraries for building and using various neural network models. And, as you might expect, many are built with an object-oriented expressiveness:
# pseudo-code (that is actually very nearly valid)
nn = Network(optimizer='sgd')
nn.add_layer('fully_connected', name='l0', nodes=4)
nn.add_layer('fully_connected', name='l1', nodes=5)
nn.add_layer('fully_connected', name='l2', nodes=2)
nn.compile()
nn.fit(X,y)
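As one concrete taste of an off-the-shelf library, scikit-learn ships a simple multi-layer perceptron that solves the same XOR data with one hidden layer. This is just a sketch; the parameter choices below are illustrative, not tuned:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# XOR-style data, as above
rng = np.random.RandomState(1)
X = rng.randn(200, 2)
y = np.logical_xor(X[:, 0] > 0, X[:, 1] > 0).astype(int)

# one hidden layer of 8 neurons; lbfgs converges reliably on tiny datasets
nn = MLPClassifier(hidden_layer_sizes=(8,), solver='lbfgs',
                   max_iter=2000, random_state=1)
nn.fit(X, y)
print(nn.score(X, y))  # well above chance
```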
In Neural Networks - Part 2, we'll look at some of these libraries and use them for some learning tasks! (hold me to it!)
In addition to using optimized libraries, there are many other issues and topics that go into developing and using neural networks for practical purposes. Below is a bag-of-words approach to some terms and phrases that you'll invariably see when reading about neural networks.
GPU (graphical processing unit)
architecture
batching
training epochs (the iteration counts in the for loops above were chosen arbitrarily; a lot of work has also gone into deciding how to optimize the convergence of network training)
regularization
"deep learning"
To save you some time if you want to learn more, here are some of the references that I found the most helpful while researching for this RST: