$$\newcommand{\xv}{\mathbf{x}} \newcommand{\Xv}{\mathbf{X}} \newcommand{\yv}{\mathbf{y}} \newcommand{\Yv}{\mathbf{Y}} \newcommand{\zv}{\mathbf{z}} \newcommand{\av}{\mathbf{a}} \newcommand{\Wv}{\mathbf{W}} \newcommand{\wv}{\mathbf{w}} \newcommand{\betav}{\mathbf{\beta}} \newcommand{\gv}{\mathbf{g}} \newcommand{\Hv}{\mathbf{H}} \newcommand{\dv}{\mathbf{d}} \newcommand{\Vv}{\mathbf{V}} \newcommand{\vv}{\mathbf{v}} \newcommand{\tv}{\mathbf{t}} \newcommand{\Tv}{\mathbf{T}} \newcommand{\Sv}{\mathbf{S}} \newcommand{\Gv}{\mathbf{G}} \newcommand{\zv}{\mathbf{z}} \newcommand{\Zv}{\mathbf{Z}} \newcommand{\Norm}{\mathcal{N}} \newcommand{\muv}{\boldsymbol{\mu}} \newcommand{\sigmav}{\boldsymbol{\sigma}} \newcommand{\phiv}{\boldsymbol{\phi}} \newcommand{\Phiv}{\boldsymbol{\Phi}} \newcommand{\Sigmav}{\boldsymbol{\Sigma}} \newcommand{\Lambdav}{\boldsymbol{\Lambda}} \newcommand{\half}{\frac{1}{2}} \newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}} \newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}} \newcommand{\dimensionbar}[1]{\underset{#1}{\operatorname{|}}} \newcommand{\dimensionbar}[1]{\underset{#1}{\operatorname{|}}} \newcommand{\grad}{\mathbf{\nabla}} \newcommand{\ebx}[1]{e^{\betav_{#1}^T \xv_n}} \newcommand{\eby}[1]{e^{y_{n,#1}}} \newcommand{\Tiv}{\mathbf{Ti}} \newcommand{\Fv}{\mathbf{F}} \newcommand{\ones}[1]{\mathbf{1}_{#1}} $$

Classification with Nonlinear Logistic Regression Using Neural Networks

Motivation and Setup

To classify a sample as being a member of 1 of 3 different classes, we could use integers 1, 2, and 3 as target outputs.

A linear function of $x$ seems to match the data fairly well. Why is this not a good idea?

We must convert the continuous y-axis value to discrete integers 1, 2, or 3. Without adding more parameters, we are forced to use the general solution of splitting at 1.5 and 2.5.

Rats! Boundaries are not where we want them.

To allow more flexibility, we need to decouple the modeling of the boundaries. The problem is due to using one value to represent all classes. Instead, let's use three values, one for each class. Binary-valued variables are adequate: Class 1 = $(1,0,0)$, Class 2 = $(0,1,0)$, and Class 3 = $(0,0,1)$. These are called indicator variables.

Our linear model has three outputs now. How do we interpret the output for a new sample?

Let the output be $\yv = (y_1, y_2, y_3)$. Convert these values to a class by picking the maximum value.

$$ \begin{align*} \text{class} = \argmax{i}\;\; y_i \end{align*} $$
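
For example, here is a quick numpy illustration of this decision rule (the output values are made up):

In [ ]:
import numpy as np

y = np.array([0.8, 2.1, 1.3])   # hypothetical outputs y_1, y_2, y_3 for one sample
np.argmax(y) + 1                # predicted class is 2 (+ 1 because classes are numbered 1, 2, 3)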

We can plot the three output components on three separate graphs. What linear functions will each one learn?

Overlay them to see which one is the maximum for each $x$ value.

See any potential problems?

What if the green line is too low?

What could cause this?

The problem was that the green line for Class 2 was too low. In fact, all lines are too low in the middle of the $x$ range. Maybe we can reduce this masking effect by

  • requiring the function values to be between 0 and 1, and
  • requiring them to sum to 1 for every value of x.

We can satisfy those two requirements by directly representing $p(C=k|\xv)$ as

$$ \begin{align*} p(C=k|\xv) = \frac{f(\xv;\wv_k)}{\sum_{m=1}^K f(\xv;\wv_m)} \end{align*} $$

with $f(\xv;\wv) \ge 0$. We haven't discussed the form of $f$ yet, but $\wv$ represents the parameters of $f$ that we will tune to fit the training data (later).

This is certainly an expression that is between 0 and 1 for any $\xv$.

Let's give the above expression another name

$$ \begin{align*} g_k(\xv) = p(C=k|\xv) = \frac{f(\xv;\wv_k)}{\sum_{m=1}^K f(\xv;\wv_m)} \end{align*} $$
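
As a quick check of these two requirements, here is a small numpy sketch of the normalization, with made-up nonnegative values standing in for $f(\xv;\wv_k)$:

In [ ]:
import numpy as np

f = np.array([2.0, 5.0, 1.0])   # hypothetical values of f(x; w_k) for K = 3 classes
g = f / f.sum()                 # g_k values: each is between 0 and 1
g, g.sum()                      # and they sum to 1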

Whatever we choose for $f$, we must make a plan for optimizing its parameters $\wv$. How?

Let's maximize the likelihood of the data. So, what is the likelihood of training data consisting of samples $\{\xv_1, \xv_2, \ldots, \xv_N\}$ and class indicator variables

$$ \begin{align*} \begin{pmatrix} t_{1,1} & t_{1,2} & \ldots & t_{1,K}\\ t_{2,1} & t_{2,2} & \ldots & t_{2,K}\\ \vdots\\ t_{N,1} & t_{N,2} & \ldots & t_{N,K} \end{pmatrix} \end{align*} $$

with every value $t_{n,k}$ being 0 or 1, and each row of this matrix containing a single 1? (We can also express $\{\xv_1, \xv_2, \ldots, \xv_N\}$ as an $N \times D$ matrix, but we will be using single samples $\xv_n$ more often in the following.)

The likelihood is just the product over all samples $n$ of the values $p(C=\text{class of } n^\text{th}\text{ sample}\,|\,\xv_n)$. A common way to express this product, using those handy indicator variables, is

$$ \begin{align*} L(\Wv) = \prod_{n=1}^N \prod_{k=1}^K p(C=k\,|\, \xv_n)^{t_{n,k}} \end{align*} $$

Say we have three classes ($K=3$) and training sample $n$ is from Class 2, then the product is

$$ \begin{align*} p(C=1\,|\,\xv_n)^{t_{n,1}} p(C=2\,|\,\xv_n)^{t_{n,2}} p(C=3\,|\,\xv_n)^{t_{n,3}} & = p(C=1\,|\,\xv_n)^0 p(C=2\,|\,\xv_n)^1 p(C=3\,|\,\xv_n)^0 \\ & = 1\; p(C=2\,|\,\xv_n)^1 \; 1 \\ & = p(C=2\,|\,\xv_n) \end{align*} $$

This shows how the indicator variables as exponents select the correct terms to be included in the product.
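
Here is the same selection trick in numpy, with made-up probabilities for a sample from Class 2:

In [ ]:
import numpy as np

p = np.array([0.2, 0.7, 0.1])   # hypothetical p(C=k | x_n) for k = 1, 2, 3
t = np.array([0, 1, 0])         # indicator variables for a sample from Class 2
np.prod(p ** t)                 # 0.7, the probability of the correct class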

Linear function approximator as a neural network.

What must we add to do logistic regression?

Just add the softmax calculation to the output layer.

Any thoughts on how to do nonlinear logistic regression?

Derivation

Back to maximizing the data likelihood. As always, we will want to take gradients of this with respect to our weights. But the gradient of a big product is problematic. So, instead, we will work with the log likelihood of the training data, because

  1. The gradient of a log of a product of things is a sum of gradients of those things,
  2. The weight values that maximize the likelihood are the same weight values that maximize the log likelihood.
$$ \begin{align*} L(\Wv) & = \left ( \prod_{n=1}^N \prod_{k=1}^K P(C=k|\xv_n)^{t_{n,k}} \right ) \\ \log L(\Wv) & = LL(\Wv) = \sum_{n=1}^N \sum_{k=1}^K t_{n,k} \log P(C=k|\xv_n)\\ & = \sum_{n=1}^N \sum_{k=1}^K t_{n,k} \log g_{n,k} \end{align*} $$
In [3]:
import sympy
In [4]:
x = sympy.Symbol('x')
x
Out[4]:
$\displaystyle x$
In [5]:
f = sympy.log(x)
sympy.diff(f, x)
Out[5]:
$\displaystyle \frac{1}{x}$
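
We can also use sympy to check the key fact used in the next section: the derivative of the log of a softmax component with respect to its own input has an indicator-minus-probability form. Here is a small check for $K=2$ (the symbols y1 and y2 are introduced just for this check):

In [ ]:
y1, y2 = sympy.symbols('y1 y2')
g1 = sympy.exp(y1) / (sympy.exp(y1) + sympy.exp(y2))
sympy.simplify(sympy.diff(sympy.log(g1), y1) - (1 - g1))   # simplifies to 0, so d log(g1)/dy1 = 1 - g1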

Gradient of the Log Likelihood

$$ \begin{align*} LL(\Wv) & = \sum_{n=1}^N \sum_{k=1}^K t_{n,k} \log g_{n,k}\;\;\;\;\; \text{ where } g_{n,k} = \frac{\eby{k}}{\sum_{m=1}^K \eby{m}}; \\ \frac{\partial LL(\Wv)}{\partial \Wv_{d,j}} & = \sum_{n=1}^N \sum_{k=1}^K \frac{t_{n,k}}{g_{n,k}} \frac{\partial g_{n,k}}{\partial \Wv_{d,j}}\\ \end{align*} $$

Linear Version

General gradient

$$ \begin{align*} \frac{\partial LL(\Wv)}{\partial \Wv_{d,j}} & = \sum_{n=1}^N \sum_{k=1}^{K} (t_{n,k} - g_{n,k}) \frac{\partial y_{n,k}}{\partial \Wv_{d,j}} \end{align*} $$

For linear logistic regression, $y_{n,j} = \xv_n^T \Wv_{*,j}$, so $\frac{\partial y_{n,k}}{\partial \Wv_{d,j}}$ is nonzero only when $k = j$, and $\frac{\partial y_{n,j}}{\partial \Wv_{d,j}} = x_{n,d}$, so

$$ \begin{align*} \frac{\partial LL(\Wv)}{\partial \Wv_{d,j}} & = \sum_{n=1}^N (t_{n,j} - g_{n,j}) \frac{\partial y_{n,j}}{\partial \Wv_{d,j}}\\ & = \sum_{n=1}^N \left ( t_{n,j} - g_{n,j} \right ) x_{n,d} \end{align*} $$
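
In matrix form, this gradient is $\tilde{\Xv}^T (\Tv - \Gv)$, where $\tilde{\Xv}$ is $\Xv$ with a prepended column of ones, $\Tv$ here holds the indicator variables $t_{n,k}$, and $\Gv$ holds the $g_{n,k}$ values. Here is a small numpy sketch with random data, just to show the shapes involved (all names and sizes here are hypothetical):

In [ ]:
import numpy as np

N, D, K = 5, 2, 3
X = np.random.uniform(-1, 1, (N, D))
X1 = np.insert(X, 0, 1, axis=1)                        # X with a column of ones, shape N x (D+1)
W = np.zeros((D + 1, K))                               # linear logistic regression weights
Y = X1 @ W                                             # y_{n,k}
G = np.exp(Y) / np.exp(Y).sum(axis=1, keepdims=True)   # g_{n,k}, softmax of each row
T_ind = np.eye(K)[np.random.randint(0, K, N)]          # random indicator variables
grad_LL = X1.T @ (T_ind - G)                           # (D+1) x K gradient of the log likelihood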

Nonlinear Version

First the general form again.

$$ \begin{align*} \frac{\partial LL(\Wv)}{\partial \Wv_{d,j}} & = \sum_{n=1}^N \sum_{k=1}^{K} (t_{n,k} - g_{n,k}) \frac{\partial y_{n,k}}{\partial \Wv_{d,j}} \end{align*} $$

Now $y_{n,j}$ depends on $\Vv$ and $\Wv$, so

$$ \begin{align*} \frac{\partial LL(\Vv,\Wv)}{\partial \Vv_{d,m}} & = \sum_{n=1}^N \sum_{k=1}^{K} \left ( t_{n,k} - g_{n,k} \right ) \frac{\partial y_{n,k}}{\partial \Vv_{d,m}}\\ \frac{\partial LL(\Vv,\Wv)}{\partial \Wv_{m,j}} & = \sum_{n=1}^N \sum_{k=1}^{K} \left ( t_{n,k} - g_{n,k} \right ) \frac{\partial y_{n,k}}{\partial \Wv_{m,j}}\\ \frac{\partial LL(\Vv,\Wv)}{\partial \Wv_{m,j}} & = \sum_{n=1}^N \left ( t_{n,j} - g_{n,j} \right ) \frac{\partial y_{n,j}}{\partial \Wv_{m,j}} \end{align*} $$

But, thank goodness, we have already calculated $\frac{\partial y_{n,k}}{\partial \Vv_{d,m}}$ and $\frac{\partial y_{n,k}}{\partial \Wv_{m,j}}$ in our previous neural network lectures. This becomes clearer when we compare the above with the derivatives of mean squared error with respect to the weights of a neural network for regression problems.

$$ \begin{align*} E &= \frac{1}{NK} \frac{1}{2} \sum_{n=1}^N \sum_{k=1}^K (t_{n,k} - y_{n,k})^2\\ \frac{\partial E}{\partial \Vv_{d,m}} & = - \frac{1}{NK} \sum_{n=1}^N \sum_{k=1}^K (t_{n,k} - y_{n,k}) \frac{\partial y_{n,k}}{\partial \Vv_{d,m}}\\ \frac{\partial E}{\partial \Wv_{m,j}} & = - \frac{1}{NK} \sum_{n=1}^N \sum_{k=1}^K (t_{n,k} - y_{n,k}) \frac{\partial y_{n,k}}{\partial \Wv_{m,j}}\\ \frac{\partial E}{\partial \Wv_{m,j}} & = - \frac{1}{NK} \sum_{n=1}^N (t_{n,j} - y_{n,j}) \frac{\partial y_{n,j}}{\partial \Wv_{m,j}} \end{align*} $$

Compare to gradients for likelihood

$$ \begin{align*} LL(\Vv,\Wv) & = \sum_{n=1}^N \sum_{k=1}^K t_{n,k} \log g_{n,k} \text{ where } g_{n,k} = \frac{\eby{k}}{\sum_{m=1}^{K} \eby{m}}\\ \frac{\partial LL(\Vv,\Wv)}{\partial \Vv_{d,m}} & = \sum_{n=1}^N \sum_{k=1}^{K} \left ( t_{n,k} - g_{n,k} \right ) \frac{\partial y_{n,k}}{\partial \Vv_{d,m}}\\ \frac{\partial LL(\Vv,\Wv)}{\partial \Wv_{m,j}} & = \sum_{n=1}^N \sum_{k=1}^{K} \left ( t_{n,k} - g_{n,k} \right ) \frac{\partial y_{n,k}}{\partial \Wv_{m,j}}\\ \frac{\partial LL(\Vv,\Wv)}{\partial \Wv_{m,j}} & = \sum_{n=1}^N \left ( t_{n,j} - g_{n,j} \right ) \frac{\partial y_{n,j}}{\partial \Wv_{m,j}} \end{align*} $$

So, our previously derived matrix expressions for neural networks can be used if we modify the output calculation. Here are the expressions we used for minimizing mean squared error:

$$ \begin{align*} \Zv &= h(\tilde{\Xv} \Vv)\\ \Yv &= \tilde{\Zv} \Wv\\ E &= \frac{1}{NK} \frac{1}{2} \sum (\Tv - \Yv)^2\\ \grad_\Vv E &= - \frac{1}{NK} \tilde{\Xv}^T \left ( (\Tv - \Yv) \hat{\Wv}^T \cdot (1-\Zv^2) \right )\\ \grad_\Wv E &= - \frac{1}{NK} \tilde{\Zv}^T (\Tv - \Yv) \end{align*} $$

Here are the changes needed for nonlinear logistic regression, where $\Tiv$ is the matrix of indicator variables corresponding to $\Tv$.

$$ \begin{align*} \Zv &= h(\tilde{\Xv} \Vv)\\ \Yv &= \tilde{\Zv} \Wv\\ \Fv &= e^{\Yv}\\ \Sv &= \Fv \ones{K}\;\;\;\;\;\;\;\;\;\;\;\;\; \text{ sum across columns}\\ \Gv &= \Fv / \left [ \Sv, \Sv,\ldots,\Sv \right ] \;\;\; \Sv \text{ repeated as columns}\\ LL &= \sum \Tiv \log \Gv\\ \grad_\Vv LL &= \tilde{\Xv}^T \left ( (\Tiv - \Gv) \hat{\Wv}^T \cdot (1-\Zv^2) \right )\\ \grad_\Wv LL &= \tilde{\Zv}^T (\Tiv - \Gv) \end{align*} $$
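
Here is a minimal numpy sketch of these matrix expressions, assuming tanh hidden units. X1 and Z1 stand for $\tilde{\Xv}$ and $\tilde{\Zv}$, Ti for $\Tiv$, and the shift by each row's maximum before exponentiating is an added numerical-stability step that does not change the softmax values:

In [ ]:
import numpy as np

def ll_and_gradients(X1, Ti, V, W):
    '''X1 is X with a column of ones; Ti is the matrix of indicator variables.'''
    Z = np.tanh(X1 @ V)
    Z1 = np.insert(Z, 0, 1, axis=1)
    Y = Z1 @ W
    F = np.exp(Y - Y.max(axis=1, keepdims=True))   # shifted exponentials
    G = F / F.sum(axis=1, keepdims=True)           # softmax of each row
    LL = np.sum(Ti * np.log(G))
    grad_V = X1.T @ (((Ti - G) @ W[1:, :].T) * (1 - Z**2))   # W[1:, :] skips the bias row
    grad_W = Z1.T @ (Ti - G)
    return LL, grad_V, grad_W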

Now in Python

Since so many aspects of classification using neural networks are exactly like regression using neural networks, such as all of the hidden layer calculations, we can simply define a new class, NeuralNetworkClassifier, that inherits from NeuralNetwork as its base class. Then we must redefine several of the class functions.

First, let's look at our A3 implementation. Here is my solution to A3.

In [5]:
import numpy as np
import optimizers as opt  # assumed name: the course's optimizers.py providing the Optimizers class used in train() below

class NeuralNetwork():

    def __init__(self, n_inputs, n_hiddens_list, n_outputs):

        self.n_inputs = n_inputs
        self.n_hiddens_list = n_hiddens_list
        self.n_outputs = n_outputs
        self.n_layers = len(n_hiddens_list) + 1

        self.all_weights, self.Ws = self.make_weights()
        self.initialize_weights()
        self.all_gradients, self.Gs = self.make_weights()

        self.stand_params = None
        self.error_trace = []

    def __repr__(self):
        return f'NeuralNetwork({self.n_inputs}, {self.n_hiddens_list}, {self.n_outputs})'

    def make_weights(self):
        # Create list of matrix shapes, one per layer
        n_in = self.n_inputs
        units = self.n_hiddens_list + [self.n_outputs]
        shapes = []
        for n_units in units:
            shapes.append((n_in + 1, n_units))
            n_in = n_units

        # Allocate contiguous memory to hold all weights
        n_weights = sum([ni * nu for (ni, nu) in shapes])
        all_weights = np.zeros(n_weights)

        # Create numpy views into this memory with the appropriate
        # shapes to be our weight matrices.
        Ws = []
        first = 0
        for (ni, nu) in shapes:
            n_w = ni * nu
            W = all_weights[first:first + n_w].reshape(ni, nu)
            Ws.append(W)
            first += n_w

        return all_weights, Ws

    def initialize_weights(self):
        for W in self.Ws:
            ni, nu = W.shape
            W[:] = np.random.uniform(-1, 1, size=(ni, nu)) / np.sqrt(ni)
        self.error_trace = []

    def forward(self, Xst):
        Ys = [Xst]
        for layer_i, W in enumerate(self.Ws):
            X = Ys[-1]
            Y = self.add_ones(X) @ W
            if layer_i < self.n_layers - 1:
                Y = np.tanh(Y)
            Ys.append(Y)
        return Ys[1:]  # remove X from Ys

    def backward(self, Xst, Tst):
        n_samples = Xst.shape[0]

        Ys = self.forward(Xst)
        Ys = [Xst] + Ys  # Ys now has n_layers + 1 elements.  Ws still has n_layers elements.
        delta = - (Tst - Ys[-1]) / (n_samples * self.n_outputs)
        for layeri in range(self.n_layers)[::-1]:
            X = Ys[layeri]
            self.Gs[layeri][:] = self.add_ones(X).T @ delta
            if layeri > 0:
                delta = delta @ self.Ws[layeri][1:, :].T
                delta *= 1 - Ys[layeri] ** 2  # back-propagate through tanh by multiplying by its derivative
        return self.all_gradients

    def mse(self, Xst, Tst):
        Yst = self.forward(Xst)
        return 0.5 * np.mean((Tst - Yst[-1])**2)

    def train(self, X, T, n_epochs, learning_rate=0.01, method='scg', verbose=True):

        self.stand_params = self.calc_standardize_parameters(X, T)
        Xst = self.standardize_X(X)
        Tst = self.standardize_T(T)

        if method == 'sgd':
            optimizer = opt.Optimizers(self.all_weights).sgd
        elif method == 'adam':
            optimizer = opt.Optimizers(self.all_weights).adam
        elif method == 'scg':
            optimizer = opt.Optimizers(self.all_weights).scg
        else:
            raise ValueError("train: method must be 'sgd', 'adam', or 'scg'.")

        def error_convert(err):
            if T.shape[1] == 1:
                return np.sqrt(err) * self.stand_params['Tstds']
            else:
                # Can't unstandardize err if more than one network output
                return np.sqrt(err)
            
        error_trace = optimizer(self.mse, self.backward, [Xst, Tst],
                                n_epochs, learning_rate,
                                error_convert_f=error_convert,
                                verbose=verbose)
        self.error_trace += error_trace
        return self

    def use(self, X, return_hidden_layer_outputs=False):
        Xst = self.standardize_X(X)
        Ys = self.forward(Xst)
        Y = Ys[-1]
        Y = self.unstandardize_T(Y)
        Zs = Ys[:-1]
        return (Y, Zs) if return_hidden_layer_outputs else Y

    def get_error_trace(self):
        return self.error_trace

    def add_ones(self, X):
        return np.insert(X, 0, 1, axis=1)

    def calc_standardize_parameters(self, X, T):
        Xmeans = X.mean(axis=0)
        Xstds = X.std(axis=0)
        Xstds[Xstds == 0] = np.mean(Xstds[Xstds > 0])
        Tmeans = T.mean(axis=0)
        Tstds = T.std(axis=0)
        return {'Xmeans': Xmeans, 'Xstds': Xstds, 'Tmeans': Tmeans, 'Tstds': Tstds}

    def standardize_X(self, X):
        return (X - self.stand_params['Xmeans']) / self.stand_params['Xstds']

    def unstandardize_X(self, Xst):
        return Xst * self.stand_params['Xstds'] + self.stand_params['Xmeans']

    def standardize_T(self, T):
        return (T - self.stand_params['Tmeans']) / self.stand_params['Tstds']

    def unstandardize_T(self, Tst):
        return Tst * self.stand_params['Tstds'] + self.stand_params['Tmeans']
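
As a quick sanity check of the shapes produced by forward (the network sizes and data here are arbitrary):

In [ ]:
np.random.seed(0)
nnet = NeuralNetwork(2, [5, 5], 1)
Xst = np.random.normal(0, 1, size=(4, 2))   # pretend these inputs are already standardized
[Y.shape for Y in nnet.forward(Xst)]        # [(4, 5), (4, 5), (4, 1)]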

Now, which functions will we have to override in our subclass? The constructor and the functions for defining and initializing weights and gradients can be reused unchanged, as can several other functions. We will have to redefine train, use, and forward, and define a new function, neg_log_likelihood.

Let's consider train first. Here is a new version for our new class that we will discuss in lecture.

In [ ]:
class NeuralNetworkClassifier(NeuralNetwork):
    
    def train(self, X, T, n_epochs, learning_rate=0.01, method='scg', verbose=True):

        self.stand_params = self.calc_standardize_parameters(X, T)
        Xst = self.standardize_X(X)
        # Tst = self.standardize_T(T)                                  ## CHANGED

        # NEW PART FROM HERE ...
        
        self.classes, counts = np.unique(T, return_counts=True)
        self.most_common_class = self.classes[np.argmax(counts)]

        if self.n_outputs != len(self.classes):
            raise ValueError(f'''In NeuralNetworkClassifier, the number of outputs must equal
the number of classes in the training data. The given number of outputs
is {self.n_outputs} and number of classes is {len(self.classes)}. Try changing
the number of outputs in the call to NeuralNetworkClassifier().''')

        T_ind_vars = self.make_indicator_variables(T)
  
        #   ... TO HERE
        
        if method == 'sgd':
            optimizer = opt.Optimizers(self.all_weights).sgd
        elif method == 'adam':
            optimizer = opt.Optimizers(self.all_weights).adam
        elif method == 'scg':
            optimizer = opt.Optimizers(self.all_weights).scg
        else:
            raise ValueError("train: method must be 'sgd', 'adam', or 'scg'.")

        def error_convert(negLL):
            return np.exp(-negLL)                                                    ## CHANGED
            
        error_trace = optimizer(self.neg_log_likelihood, self.backward, [Xst, T_ind_vars],   ## CHANGED
                                n_epochs, learning_rate,
                                error_convert_f=error_convert,
                                verbose=verbose)
        
        self.error_trace += error_trace
        return self

We will need the make_indicator_variables function.

In [6]:
T = np.array([1, 1, 1, 2, 2, 3, 3, 2, 1, 2, 2]).reshape(-1, 1)
T
Out[6]:
array([[1],
       [1],
       [1],
       [2],
       [2],
       [3],
       [3],
       [2],
       [1],
       [2],
       [2]])
In [7]:
(T == np.unique(T)).astype(int)
Out[7]:
array([[1, 0, 0],
       [1, 0, 0],
       [1, 0, 0],
       [0, 1, 0],
       [0, 1, 0],
       [0, 0, 1],
       [0, 0, 1],
       [0, 1, 0],
       [1, 0, 0],
       [0, 1, 0],
       [0, 1, 0]])
In [8]:
    def make_indicator_variables(self, T):
        '''Assumes argument is N x 1, N samples each being integer class label.'''
        # Make sure T is two-dimensional. Should be n_samples x 1.
        if T.ndim == 1:
            T = T.reshape((-1, 1))    
        return (T == np.unique(T)).astype(int)

Okay. What else? We need a softmax function to apply to the output of the neural network to calculate the class probabilities.

In [9]:
Y = np.array([[-10, -101, -20]])
Y
Out[9]:
array([[ -10, -101,  -20]])
In [10]:
np.exp(Y) / np.sum(np.exp(Y))
Out[10]:
array([[9.99954602e-01, 3.01427194e-40, 4.53978687e-05]])
In [11]:
import sys
sys.float_info.epsilon
Out[11]:
2.220446049250313e-16
In [12]:
    def softmax(self, Y):
        '''Apply to final layer weighted sum outputs'''
        # Trick to avoid overflow
        maxY = max(0, np.max(Y))
        expY = np.exp(Y - maxY)
        denom = np.sum(expY, axis=1).reshape((-1, 1)) 
        Y = expY / (denom + sys.float_info.epsilon)             # requires an    import sys
        return Y

Now that we have softmax, we can use it in a new version of forward.

In [ ]:
    def forward(self, Xst):
        .
        .
        .
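
Here is one possible sketch of this override, assuming we simply reuse the base class's forward and convert the final layer's weighted sums to probabilities with softmax:

In [ ]:
    def forward(self, Xst):
        # One possible sketch: reuse the base class's forward pass, then
        # replace the final layer's linear outputs with softmax probabilities.
        Ys = super().forward(Xst)
        Ys[-1] = self.softmax(Ys[-1])
        return Ys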

backward can be reused, so no changes needed.

We do need to define the new function that is to be optimized. As you see in the above new version of train, we are calling it neg_log_likelihood.

In [ ]:
    def neg_log_likelihood(self, Xst, T):  # T must be indicator variables
        .
        .
        .
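
One possible sketch, using the mean (rather than the sum) of the negative log likelihood so its gradient matches the $\frac{1}{NK}$ scaling already built into the base class's backward, and adding a small epsilon to guard against log(0):

In [ ]:
    def neg_log_likelihood(self, Xst, T):  # T must be indicator variables
        Ys = self.forward(Xst)
        G = Ys[-1]                         # class probabilities from softmax
        return -np.mean(T * np.log(G + sys.float_info.epsilon))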

Now, just one more function needs overriding, the use function. Let's make it return the predicted classes and class probabilities for each sample, plus the hidden layer outputs if requested.

In [ ]:
    def use(self, X, return_hidden_layer_outputs=False):
        Xst = self.standardize_X(X)
        Ys = self.forward(Xst)
        Y = Ys[-1]
        classes = . . .  # complete this line
        Zs = Ys[:-1]
        return (classes, Y, Zs) if return_hidden_layer_outputs else (classes, Y)
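
One possible way to complete the missing line, assuming self.classes was saved in train and Y holds the class probabilities:

In [ ]:
        # One possible completion: the predicted class is the one with the largest probability.
        classes = self.classes[np.argmax(Y, axis=1)].reshape(-1, 1)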

That's it. Now you should be able to use your new class.

But first, a bit of candy. Add this function to your base NeuralNetwork class, and you can draw a visualization of your weights. You will have to add

import matplotlib.pyplot as plt
import matplotlib.patches as pltpatch  # for Arc
import matplotlib.collections as pltcoll
In [ ]:
    def draw(self, input_names=None, output_names=None, scale='by layer', gray=False): 
        
        plt.title('{} weights'.format(sum([Wi.size for Wi in self.Ws])))

        def isOdd(x):
            return x % 2 != 0

        n_layers = len(self.Ws)

        Wmax_overall = np.max(np.abs(np.hstack([w.reshape(-1) for w in self.Ws])))

        # calculate xlim and ylim for whole network plot
        #  Assume 4 characters fit between each wire
        #  -0.5 is to leave 0.5 spacing before first wire
        xlim = max(map(len, input_names)) / 4.0 if input_names else 1
        ylim = 0

        for li in range(n_layers):
            ni, no = self.Ws[li].shape  #no means number outputs this layer
            if not isOdd(li):
                ylim += ni + 0.5
            else:
                xlim += ni + 0.5

        ni, no = self.Ws[n_layers-1].shape  #no means number outputs this layer
        if isOdd(n_layers):
            xlim += no + 0.5
        else:
            ylim += no + 0.5

        # Add space for output names
        if output_names:
            if isOdd(n_layers):
                ylim += 0.25
            else:
                xlim += round(max(map(len, output_names)) / 4.0)

        ax = plt.gca()

        # changes from Jim Jazwiecki ([email protected]) CS480 student
        character_width_factor = 0.07
        padding = 2
        if input_names:
            x0 = max([1, max(map(len, input_names)) * (character_width_factor * 3.5)])
        else:
            x0 = 1
        y0 = 0 # to allow for constant input to first layer
        # First Layer
        if input_names:
            y = 0.55
            for n in input_names:
                y += 1
                ax.text(x0 - (character_width_factor * padding), y, n, horizontalalignment="right", fontsize=20)

        patches = []
        for li in range(n_layers):
            thisW = self.Ws[li]
            if scale == 'by layer':
                maxW = np.max(np.abs(thisW))
            else:
                maxW = Wmax_overall
            ni, no = thisW.shape
            if not isOdd(li):
                # Even layer index. Vertical layer. Origin is upper left.
                # Constant input
                ax.text(x0 - 0.2, y0 + 0.5, '1', fontsize=20)
                for i in range(ni):
                    ax.plot((x0, x0 + no - 0.5), (y0 + i + 0.5, y0 + i + 0.5), color='gray')
                # output lines
                for i in range(no):
                    ax.plot((x0 + 1 + i - 0.5, x0 + 1 + i - 0.5), (y0, y0 + ni + 1), color='gray')
                # cell "bodies"
                xs = x0 + np.arange(no) + 0.5
                ys = np.array([y0 + ni + 0.5] * no)
                for x, y in zip(xs, ys):
                    patches.append(pltpatch.RegularPolygon((x, y - 0.4), 3, 0.3, 0, color ='#555555'))
                # weights
                if gray:
                    colors = np.array(['black', 'gray'])[(thisW.flat >= 0) + 0]
                else:
                    colors = np.array(['red', 'green'])[(thisW.flat >= 0) + 0]
                xs = np.arange(no) + x0 + 0.5
                ys = np.arange(ni) + y0 + 0.5
                coords = np.meshgrid(xs, ys)
                for x, y, w, c in zip(coords[0].flat, coords[1].flat, 
                                      np.abs(thisW / maxW).flat, colors):
                    patches.append(pltpatch.Rectangle((x - w / 2, y - w / 2), w, w, color=c))
                y0 += ni + 1
                x0 += -1 ## shift for next layer's constant input
            else:
                # Odd layer index. Horizontal layer. Origin is upper left.
                # Constant input
                ax.text(x0 + 0.5, y0 - 0.2, '1', fontsize=20)
                # input lines
                for i in range(ni):
                    ax.plot((x0 + i + 0.5,  x0 + i + 0.5), (y0, y0 + no - 0.5), color='gray')
                # output lines
                for i in range(no):
                    ax.plot((x0, x0 + ni + 1), (y0 + i+ 0.5, y0 + i + 0.5), color='gray')
                # cell 'bodies'
                xs = np.array([x0 + ni + 0.5] * no)
                ys = y0 + 0.5 + np.arange(no)
                for x, y in zip(xs, ys):
                    patches.append(pltpatch.RegularPolygon((x - 0.4, y), 3, 0.3, -np.pi / 2, color ='#555555'))
                # weights
                if gray:
                    colors = np.array(['black', 'gray'])[(thisW.flat >= 0) + 0]
                else:
                    colors = np.array(['red', 'green'])[(thisW.flat >= 0) + 0]
                xs = np.arange(ni) + x0 + 0.5
                ys = np.arange(no) + y0 + 0.5
                coords = np.meshgrid(xs, ys)
                for x, y, w, c in zip(coords[0].flat, coords[1].flat,
                                   np.abs(thisW / maxW).flat, colors):
                    patches.append(pltpatch.Rectangle((x - w / 2, y - w / 2), w, w, color=c))
                x0 += ni + 1
                y0 -= 1 ##shift to allow for next layer's constant input

        collection = pltcoll.PatchCollection(patches, match_original=True)
        ax.add_collection(collection)

        # Last layer output labels
        if output_names:
            if isOdd(n_layers):
                x = x0 + 1.5
                for n in output_names:
                    x += 1
                    ax.text(x, y0 + 0.5, n, fontsize=20)
            else:
                y = y0 + 0.6
                for n in output_names:
                    y += 1
                    ax.text(x0 + 0.2, y, n, fontsize=20)
        ax.axis([0, xlim, ylim, 0])
        ax.axis('off')

Example Classification using Two-Dimensional Data

In [13]:
import numpy as np
import matplotlib.pyplot as plt

Let's make some interesting two-dimensional data samples to classify. How about two bananas and a pineapple ring?

In [14]:
n = 500
x1 = np.linspace(5, 20, n) + np.random.uniform(-2, 2, n)
y1 = ((20 - 12.5)**2 - (x1 - 12.5)**2) / (20 - 12.5)**2 * 10 + 14 + np.random.uniform(-2, 2, n)
x2 = np.linspace(10, 25, n) + np.random.uniform(-2, 2, n)
y2 = ((x2 - 17.5)**2) / (25 - 17.5)**2 * 10 + 5.5 + np.random.uniform(-2, 2, n)
angles = np.linspace(0, 2 * np.pi, n)
x3 = np.cos(angles) * 15 + 15 + np.random.uniform(-2, 2, n)
y3 = np.sin(angles) * 15 + 15 + np.random.uniform(-2, 2, n)
X =  np.vstack((np.hstack((x1, x2, x3)), np.hstack((y1, y2, y3)))).T
T = np.repeat(range(1, 4),n).reshape((-1, 1))
colors = ['red', 'green', 'blue']
plt.figure(figsize=(6, 6))
for c in range(1, 4):
    mask = (T == c).flatten()
    plt.plot(X[mask, 0], X[mask, 1], 'o', markersize=6, alpha=0.5, color=colors[c - 1])
In [15]:
np.unique(T)
Out[15]:
array([1, 2, 3])

Assume you have completed this and have saved your new class in the A4mysolution.py Python source file, which now contains the classes NeuralNetwork and NeuralNetworkClassifier.

In [16]:
from A4mysolution import *
In [19]:
import mpl_toolkits.mplot3d as plt3
from matplotlib import cm

## if you edit A4mysolution.py, force ipython to reload it by doing this.
# import importlib, A4mysolution
# importlib.reload(A4mysolution)

n_hiddens_list = [10]
nnet = NeuralNetworkClassifier(X.shape[1], n_hiddens_list, len(np.unique(T)))
nnet.train(X, T, n_epochs=100, verbose=True)

xs = np.linspace(0, 30, 40)
x,y = np.meshgrid(xs, xs)
Xtest = np.vstack((x.flat, y.flat)).T
classes, probs = nnet.use(Xtest)

plt.figure(figsize=(10, 10))

plt.subplot(2, 2, 1)
plt.plot(nnet.get_error_trace())
plt.xlabel("Epochs")
plt.ylabel("Likelihood")

plt.subplot(2, 2, 3)
nnet.draw()

colors = ['red', 'green', 'blue']
plt.subplot(2, 2, 2)

for c in range(1, 4):
    mask = (T == c).flatten()
    plt.plot(X[mask, 0], X[mask, 1], 'o', markersize=6, alpha=0.5, color=colors[c - 1])

plt.subplot(2, 2, 4)
plt.contourf(Xtest[:, 0].reshape((40, 40)), Xtest[:, 1].reshape((40, 40)), classes.reshape((40, 40)),
             levels = [0.5, 1.99, 2.01, 3.5], #    levels=(0.5,1.5,2.5,3.5),
             colors=('red', 'green', 'blue'))
plt.xlim(-5, 35)
plt.ylim(-5, 35);
SCG: Epoch 10 Error=0.74992
SCG: Epoch 20 Error=0.84456
SCG: Epoch 30 Error=0.94059
SCG: Epoch 40 Error=0.97287
SCG: Epoch 50 Error=0.98572
SCG: Epoch 60 Error=0.99091
SCG: Epoch 70 Error=0.99241
SCG: Epoch 80 Error=0.99276
SCG: Epoch 90 Error=0.99313
SCG: Epoch 100 Error=0.99349

Since we have two-dimensional input, we can plot the outputs of our neural network as heights above the base plane. We have three outputs, representing the probability of each class for each sample, or coordinate, in the base plane. Here is how these can be displayed as surfaces!

In [20]:
from matplotlib.colors import LightSource
In [21]:
fig = plt.figure(figsize=(10, 30))
ls = LightSource(azdeg=30, altdeg=60)
white = np.ones((x.shape[0], x.shape[1], 3))
red = white * np.array([1, 0, 0])
green = white * np.array([0, 1, 0])
blue = white * np.array([0, 0, 1])
colors = [red, green, blue]

for c in range(3):
    ax = fig.add_subplot(3, 1, c + 1, projection='3d')
    ax.view_init(azim=180 + 40, elev=50)
    Z = probs[:, c].reshape(x.shape)
    rgb = ls.shade_rgb(colors[c], Z, vert_exag=0.1)
    ax.plot_surface(x, y, Z,
                    rstride=1, cstride=1, linewidth=0, antialiased=False,
                    shade=True, facecolors=rgb)
    ax.set_zlabel(r"$p(C="+str(c+1)+"|x)$", fontsize=20)

How would you plot the outputs of the hidden units?
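
One possible sketch, assuming use was completed to return (classes, probs, Zs) when return_hidden_layer_outputs=True, so that Zs[0] is n_samples x n_hiddens:

In [ ]:
classes, probs, Zs = nnet.use(Xtest, return_hidden_layer_outputs=True)
fig = plt.figure(figsize=(10, 10))
n_hiddens = Zs[0].shape[1]
for h in range(n_hiddens):
    ax = fig.add_subplot(4, 3, h + 1, projection='3d')
    ax.view_init(azim=180 + 40, elev=50)
    ax.plot_surface(x, y, Zs[0][:, h].reshape(x.shape), cmap=cm.coolwarm, linewidth=0)
    ax.set_title(f'Hidden Unit {h + 1}')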