Supervised Learning: Neural Networks

In [ ]:
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='ticks')

import tensorflow as tf
from scipy import optimize
from ipywidgets import interact
from IPython.display import SVG
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale

McCulloch and Pitts Neuron

In 1943, McCulloch and Pitts introduced a mathematical model of a neuron. It consisted of three components:

  1. A set of weights $w_i$ corresponding to synapses (inputs)
  2. An adder for summing input signals; analogous to cell membrane that collects charge
  3. An activation function for determining when the neuron fires, based on accumulated input

The neuron model is shown schematically below. On the left are input nodes $\{x_i\}$, usually expressed as a vector. The strength with which the inputs are able to deliver the signal along the synapse is determined by their corresponding weights $\{w_i\}$. The adder then sums the inputs from all the synapses:

$$h = \sum_i w_i x_i$$

The parameter $\theta$ determines whether or not the neuron fires given a weighted input of $h$. If it fires, it returns a value $y=1$, otherwise $y=0$. For example, a simple activation function is using $\theta$ as a simple fixed threshold:

$$y = g(h) = \left\{ \begin{array}{l} 1, \text{if } h \gt \theta \\ 0, \text{if } h \le \theta \end{array} \right.$$

this activation function may take any of several forms, such as a logistic function.

neuron

A single neuron is not interesting, nor useful, from a learning perspective. It cannot learn; it simply receives inputs and either fires or not. Only when neurons are joined as a network can they perform useful work.

Learning takes place by changing the weights of the connections in a neural network, and by changing the parameters of the activation functions of neurons.

Perceptron

A collection of McCullough and Pitts neurons, along with a set of input nodes connected to the inputs via weighted edges, is a perceptron, the simplest neural network.

Each neuron is independent of the others in the perceptron, in the sense that its behavior and performance depends only on its own weights and threshold values, and not of those for the other neurons. Though they share inputs, they operate independently.

The number of inputs and outputs are determined by the data. Weights are stored as a N x K matrix, with N observations and K neurons, with $w_{ij}$ specifying the weight on the ith observation on the jth neuron.

perceptron

In order to use the perceptron for statistical learning, we compare the outputs $y_j$ from each neuron to the obervation target $t_j$, and adjust the input weights when they do not correspond (e.g. if a neuron fires when it should not have).

$$t_j - y_j$$

We use this difference to update the weight $w_{ij}$, based on the input and a desired learning rate. This results in an update rule:

$$w_{ij} \leftarrow w_{ij} + \eta (t_j - y_j) x_i$$

After an incremental improvement, the perceptron is shown the training data again, resulting in another update. This is repeated until the performance no longer improves. Having a learning rate less than one results in a more stable learning rate, though this stability is traded off against having to expose the network to the data multiple times. Typical learning rates are in the 0.1-0.4 range.

An additional input node is typically added to the perceptron model, which is a constant value (usually -1, 0, or 1) that acts analogously to an intercept in a regression model. This establishes a baseline input for the case when all inputs are zero.

bias

Learning with Perceptrons

  1. Initialize weights $w_{ij}$ to small, random numbers.
  2. For each t in T iterations
    • compute activation for each neuron j connected to each input vector i $$y_j = g\left( h=\sum_i w_{ij} x_i \right) = \left\{ \begin{array}{l} 1, \text{if } h \gt 0 \\ 0, \text{if } h \le 0 \end{array} \right.$$
    • update weights $$w_{ij} \leftarrow w_{ij} + \eta (t_j - y_j) x_i$$

This algorithm is $\mathcal{O}(Tmn)$

Example: Logical functions

Let's see how the perceptron learns by training it on a couple of of logical functions, AND and OR. For two variables x1 and x2, the AND function returns 1 if both are true, or zero otherwise; the OR function returns 1 if either variable is true, or both. These functions can be expressed as simple lookup tables.

In [ ]:
AND = pd.DataFrame({'x1': (0,0,1,1), 'x2': (0,1,0,1), 'y': (0,0,0,1)})
AND

First, we need to initialize weights to small, random values (can be positive and negative).

In [ ]:
w = np.random.randn(3)*1e-4

Then, a simple activation function for calculating $g(h)$:

In [ ]:
g = lambda inputs, weights: np.where(np.dot(inputs, weights)>0, 1, 0)

Finally, a training function that iterates the learning algorithm, returning the adapted weights.

In [ ]:
def train(inputs, targets, weights, eta, n_iterations):

    # Add the inputs that match the bias node
    inputs = np.c_[inputs, -np.ones((len(inputs), 1))]

    for n in range(n_iterations):

        activations = g(inputs, weights);
        weights -= eta*np.dot(np.transpose(inputs), activations - targets)
        
    return(weights)

Let's test it first on the AND function.

In [ ]:
inputs = AND[['x1','x2']]
target = AND['y']

w = train(inputs, target, w, 0.25, 10)

Checking the performance:

In [ ]:
g(np.c_[inputs, -np.ones((len(inputs), 1))], w)

Thus, it has learned the function perfectly. Now for OR:

In [ ]:
OR = pd.DataFrame({'x1': (0,0,1,1), 'x2': (0,1,0,1), 'y': (0,1,1,1)})
OR
In [ ]:
w = np.random.randn(3)*1e-4
In [ ]:
inputs = OR[['x1','x2']]
target = OR['y']

w = train(inputs, target, w, 0.25, 20)
In [ ]:
g(np.c_[inputs, -np.ones((len(inputs), 1))], w)

Also 100% correct.

Exercise: XOR

Now try running the model on the XOR function, where a one is returned for either x1 or x2 being true, but not both. What happens here?

In [ ]:
# Write your answer here

Let's explore the problem graphically:

In [ ]:
AND.plot(kind='scatter', x='x1', y='x2', c='y', s=50, colormap='winter')
plt.plot(np.linspace(0,1.4), 1.5 - 1*np.linspace(0,1.4), 'k--');
In [ ]:
OR.plot(kind='scatter', x='x1', y='x2', c='y', s=50, colormap='winter')
plt.plot(np.linspace(-.4,1), .5 - 1*np.linspace(-.4,1), 'k--');
In [ ]:
XOR = pd.DataFrame({'x1': (0,0,1,1), 'x2': (0,1,0,1), 'y': (0,1,1,0)})

XOR.plot(kind='scatter', x='x1', y='x2', c='y', s=50, colormap='winter');

The perceptron tries to find a separating hyperplane for the two response classes. Namely, a set of weights that satisfies:

$$\mathbf{x_1}\mathbf{w}^T=0$$

and:

$$\mathbf{x_2}\mathbf{w}^T=0$$

Hence,

$$\begin{aligned} \mathbf{x}_1\mathbf{w}^T &= \mathbf{x}_2\mathbf{w}^T \\ \Rightarrow (\mathbf{x}_1 - \mathbf{x}_2) \mathbf{w}^T &= 0 \end{aligned}$$

This means that either the norms of $\mathbf{x}_1 - \mathbf{x}_2$ or $\mathbf{w}$ are zero, or the cosine of the angle between them is equal to zero, due to the identity:

$$\mathbf{a}\mathbf{b} = \|a\| \|b\| \cos \theta$$

Since there is no reason for the norms to be zero in general, we need the two vectors to be at right angles to one another. So, we need a weight vector that is perpendicular to the decision boundary.

Clearly, for the XOR function, the output classes are not linearly separable. So, the algorithm does not converge on an answer, but simply cycles through two incorrect solutions.

Multi-layer Perceptron

The solution to fitting more complex (i.e. non-linear) models with neural networks is to use a more complex network that consists of more than just a single perceptron. The take-home message from the perceptron is that all of the learning happens by adapting the synapse weights until prediction is satisfactory. Hence, a reasonable guess at how to make a perceptron more complex is to simply add more weights.

There are two ways to add complexity:

  1. Add backward connections, so that output neurons feed back to input nodes, resulting in a recurrent network
  2. Add neurons between the input nodes and the outputs, creating an additional ("hidden") layer to the network, resulting in a multi-layer perceptron

The latter approach is more common in applications of neural networks.

multilayer

How to train a multilayer network is not intuitive. Propagating the inputs forward over two layers is straightforward, since the outputs from the hidden layer can be used as inputs for the output layer. However, the process for updating the weights based on the prediction error is less clear, since it is difficult to know whether to change the weights on the input layer or on the hidden layer in order to improve the prediction.

Updating a multi-layer perceptron (MLP) is a matter of:

  1. moving forward through the network, calculating outputs given inputs and current weight estimates
  2. moving backward updating weights according to the resulting error from forward propagation.

In this sense, it is similar to a single-layer perceptron, except it has to be done twice, once for each layer (in principle, we can add additional hidden layers, but without sacrificing generality, I will keep it simple).

Error back-propagation

We update the weights in a MLP using back-propagation of the prediction errors, which is essentially a form of gradient descent, as we have used previously for optimization.

First, for the multi-layer perceptron we need to modify the error function, which in the single-layer case was a simple difference between the predicted and observed outputs. Because we will be summing errors, we have to avoid having errors in different directions cancelling each other out, so a sum of squares error is more appropriate:

$$E(t,y) = \frac{1}{2} \sum_i (t_i - y_i)^2$$

It is on this function that we will perform gradient descent, since the goal is to minimize the error. Specificially, we will differentiate with respect to the weights, since it is the weights that we are manipulating in order to get better predictions.

Recall that the error is a function of the threshold function

$$E(\mathbf{w}) = \frac{1}{2} \sum_i (t_i - y_i)^2 = \frac{1}{2} \sum_i \left(t_i - g\left[ \sum_j w_{ij} a_j \right]\right)^2$$

So, we will also need to differentiate that. However, the threshold function we used in the single-layer perceptron was discontinuous, making it non-differentiable. Thus, we need to modify it as well. An alternative is to employ some type of sigmoid function, such as the logistic, which can be parameterized to resemble a threshold function, but varies smoothly across its range.

$$g(h) = \frac{1}{1 + \exp(-\beta h)}$$
In [ ]:
logistic = lambda h, beta: 1./(1 + np.exp(-beta * h))

@interact(beta=(-1, 25))
def logistic_plot(beta=5):
    hvals = np.linspace(-2, 2)
    plt.plot(hvals, logistic(hvals, beta))

This has the advantage of having a simple derivative:

$$\frac{dg}{dh} = \beta g(h)(1 - g(h))$$

Alternatively, the hyperbolic tangent function is also sigmoid:

$$g(h) = \tanh(h) = \frac{\exp(h) - \exp(-h)}{\exp(h) + \exp(-h)}$$
In [ ]:
hyperbolic_tangent = lambda h: (np.exp(h) - np.exp(-h)) / (np.exp(h) + np.exp(-h))

@interact(theta=(-1, 25))
def tanh_plot(theta=5):
    hvals = np.linspace(-2, 2)
    h = hvals*theta
    plt.plot(hvals, hyperbolic_tangent(h))

Notice that the hyperbolic tangent function asymptotes at -1 and 1, rather than 0 and 1, which is sometimes beneficial, and its derivative is simple:

$$\frac{d \tanh(x)}{dx} = 1 - \tanh^2(x)$$

Performing gradient descent will allow us to change the weights in the direction that optimially reduces the error. The next trick will be to employ the chain rule to decompose how the error changes as a function of the input weights into the change in error as a function of changes in the inputs to the weights, mutliplied by the changes in input values as a function of changes in the weights.

$$\frac{\partial E}{\partial w} = \frac{\partial E}{\partial h}\frac{\partial h}{\partial w}$$

This will allow us to write a function describing the activations of the output weights as a function of the activations of the hidden layer nodes and the output weights, which will allow us to propagate error backwards through the network.

The second term in the chain rule simplifies to:

$$\begin{align} \frac{\partial h_k}{\partial w_{jk}} &= \frac{\partial \sum_l w_{lk} a_l}{\partial w_{jk}} \\ &= \sum_l \frac{\partial w_{lk} a_l}{\partial w_{jk}} \\ & = a_j \end{align}$$

where $a_j$ is the activation of the jth hidden layer neuron.

For the first term in the chain rule above, we decompose it as well:

$$\frac{\partial E}{\partial h_k} = \frac{\partial E}{\partial y_k}\frac{\partial y_k}{\partial h_k} = \frac{\partial E}{\partial g(h_k)}\frac{\partial g(h_k)}{\partial h_k}$$

The second term of this chain rule is just the derivative of the activation function, which we have chosen to have a conveneint form, while the first term simplifies to:

$$\frac{\partial E}{\partial g(h_k)} = \frac{\partial}{\partial g(h_k)}\left[\frac{1}{2} \sum_k (t_k - y_k)^2 \right] = t_k - y_k$$

Combining these, and assuming (for illustration) a logistic activiation function, we have the gradient:

$$\frac{\partial E}{\partial w} = (t_k - y_k) y_k (1-y_k) a_j$$

Which ends up getting plugged into the weight update formula that we saw in the single-layer perceptron:

$$w_{jk} \leftarrow w_{jk} - \eta (t_k - y_k) y_k (1-y_k) a_j$$

Note that here we are subtracting the second term, rather than adding, since we are doing gradient descent.

We can now outline the MLP learning algorithm:

  1. Initialize all $w_{jk}$ to small random values
  2. For each input vector, conduct forward propagation:
    • compute activation of each neuron $j$ in hidden layer (here, sigmoid): $$h_j = \sum_i x_i v_{ij}$$ $$a_j = g(h_j) = \frac{1}{1 + \exp(-\beta h_j)}$$
    • when the output layer is reached, calculate outputs similarly: $$h_k = \sum_k a_j w_{jk}$$ $$y_k = g(h_k) = \frac{1}{1 + \exp(-\beta h_k)}$$
  3. Calculate loss for resulting predictions:
    • compute error at output: $$\delta_k = (t_k - y_k) y_k (1-y_k)$$
  4. Conduct backpropagation to get partial derivatives of cost with respect to weights, and use these to update weights:
    • compute error of the hidden layers: $$\delta_{hj} = \left[\sum_k w_{jk} \delta_k \right] a_j(1-a_j)$$
    • update output layer weights: $$w_{jk} \leftarrow w_{jk} - \eta \delta_k a_j$$
    • update hidden layer weights: $$v_{ij} \leftarrow v_{ij} - \eta \delta_{hj} x_i$$

Return to (2) and iterate until learning completes. Best practice is to shuffle input vectors to avoid training in the same order.

Its important to be aware that because gradient descent is a hill-climbing (or descending) algorithm, it is liable to be caught in local minima with respect to starting values. Therefore, it is worthwhile training several networks using a range of starting values for the weights, so that you have a better chance of discovering a globally-competitive solution.

One useful performance enhancement for the MLP learning algorithm is the addition of momentum to the weight updates. This is just a coefficient on the previous weight update that increases the correlation between the current weight and the weight after the next update. This is particularly useful for complex models, where falling into local mimima is an issue; adding momentum will give some weight to the previous direction, making the resulting weights essentially a weighted average of the two directions. Adding momentum, along with a smaller learning rate, usually results in a more stable algorithm with quicker convergence. When we use momentum, we lose this guarantee, but this is generally seen as a small price to pay for the improvement momentum usually gives.

A weight update with momentum looks like this:

$$w_{jk} \leftarrow w_{jk} - \eta \delta_k a_j + \alpha \Delta w_{jk}^{t-1}$$

where $\alpha$ is the momentum (regularization) parameter and $\Delta w_{jk}^{t-1}$ the update from the previous iteration.

The multi-layer pereptron is implemented below in the MLP class. The implementation uses the scikit-learn interface, so it is uses in the same way as other supervised learning algorithms in that package.

In [ ]:
softmax = lambda a: a / np.sum(a, axis=1, keepdims=True)

class MLP:
    
    def __init__(self, alpha=0.01, eta=0.01, n_hidden_dim=25):
        
        self.alpha = alpha
        self.eta = eta
        self.n_hidden_dim = n_hidden_dim
        
    # Helper function to evaluate the total loss on the dataset
    def calculate_loss(self, X, y):
        
        # Forward propagation to calculate our predictions
        z1 = X.dot(self.w1) + self.b1
        a1 = np.tanh(z1)
        z2 = a1.dot(self.w2) + self.b2
        exp_scores = np.exp(z2)
        probs = softmax(exp_scores)
        
        # Calculating the loss
        data_loss = -np.log(probs[range(num_examples), y]).sum()
        
        # Add regulatization term to loss (optional)
        data_loss += self.alpha/2 * np.square(self.w1).sum() + np.square(self.w2).sum()
        
        return 1./num_examples * data_loss
        
    # Helper function to predict an output (0 or 1)
    def predict(self, x):

        # Forward propagation
        z1 = x.dot(self.w1) + self.b1
        a1 = np.tanh(z1)
        z2 = a1.dot(self.w2) + self.b2
        exp_scores = np.exp(z2)
        probs = softmax(exp_scores)
        
        return np.argmax(probs, axis=1)
        
        
    def fit(self, X, y, num_passes=20000, print_loss=False, seed=42):
        
        num_examples, nn_input_dim = X.shape
        nn_output_dim = len(set(y))
    
        # Initialize the parameters to random values. We need to learn these.
        np.random.seed(seed)
        self.w1 = np.random.randn(nn_input_dim, self.n_hidden_dim) / np.sqrt(nn_input_dim)
        self.b1 = np.zeros((1, self.n_hidden_dim))
        self.w2 = np.random.randn(self.n_hidden_dim, nn_output_dim) / np.sqrt(self.n_hidden_dim)
        self.b2 = np.zeros((1, nn_output_dim))


        # Gradient descent. For each batch...
        for i in range(num_passes):

            # Forward propagation
            z1 = X.dot(self.w1) + self.b1
            a1 = np.tanh(z1)
            z2 = a1.dot(self.w2) + self.b2
            exp_scores = np.exp(z2)

            # Backpropagation
            delta3 = softmax(exp_scores)
            delta3[range(num_examples), y] -= 1
            dw2 = (a1.T).dot(delta3)
            db2 = np.sum(delta3, axis=0, keepdims=True)
            delta2 = delta3.dot(self.w2.T) * (1 - np.power(a1, 2))
            dw1 = np.dot(X.T, delta2)
            db1 = np.sum(delta2, axis=0)

            # Add regularization terms (b1 and b2 don't have regularization terms)
            dw2 += self.alpha * self.w2
            dw1 += self.alpha * self.w1

            # Gradient descent parameter update
            self.w1 += -self.eta * dw1
            self.b1 += -self.eta * db1
            self.w2 += -self.eta * dw2
            self.b2 += -self.eta * db2

            # Optionally print the loss.
            # This is expensive because it uses the whole dataset, so we don't want to do it too often.
            if print_loss and i % 1000 == 0:
              print("Loss after iteration %i: %f" %(i, calculate_loss(X, y)))

Let's initialize a MLP classifier, specifying the conjugate gradient minimization method.

In [ ]:
mlp = MLP()

Now we can confirm that it solves a non-linear classification, using the simple XOR example

In [ ]:
X = XOR[['x1','x2']].values
y = XOR['y'].values
In [ ]:
mlp.fit(X, y, num_passes=100)
In [ ]:
mlp.predict(X).reshape([2,2])

For a somewhat more sophisiticated example, we can use scikit-learn to simulate some data with a non-linear boundary.

In [ ]:
# Generate a dataset and plot it
np.random.seed(0)
X, y = datasets.make_moons(200, noise=0.20)
X = X.astype(np.float32)
y = y.astype(np.int32)
plt.scatter(X[:,0], X[:,1], s=40, c=y, cmap=plt.cm.Set1)
In [ ]:
clf = MLP(n_hidden_dim=3)
clf.fit(X, y)
In [ ]:
def plot_decision_boundary(pred_func, X=X, y=y):
    
    # Set min and max values and give it some padding
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    h = 0.01
    
    # Generate a grid of points with distance h between them
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    
    # Predict the function value for the whole gid
    Z = pred_func(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    # Plot the contour and training examples
    plt.contourf(xx, yy, Z, cmap=plt.cm.summer_r)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Greens)
In [ ]:
plot_decision_boundary(lambda x: clf.predict(x))
plt.title("Decision Boundary for hidden layer size 3")

Since we used the scikit-learn interface, its easy to take advantage of the metrics module to evaluate the MLP's performance.

In [ ]:
X, y = datasets.make_moons(50, noise=0.20)
X = X.astype(np.float32)
y = y.astype(np.int32)
In [ ]:
from sklearn.metrics import accuracy_score, confusion_matrix

accuracy_score(y, clf.predict(X))
In [ ]:
confusion_matrix(y, clf.predict(X))

Varying the hidden layer size

In the example above we picked a hidden layer size of 3. Let's now get a sense of how varying the hidden layer size affects the result.

In [ ]:
@interact(n_hidden_dim=[1, 2, 3, 4, 5, 20, 50])
def hidden_layer_size(n_hidden_dim=5):
    model = MLP(n_hidden_dim=n_hidden_dim)
    model.fit(X, y)
    plot_decision_boundary(lambda x: model.predict(x))
    plt.title('Hidden Layer size %d' % n_hidden_dim)

Neural network specification

The MLP implemented above uses a single hidden layer, though it allows a user-specified number of hidden layer nodes (defaults to 25). It is worth considering whether it is useful having multiple hidden layers, and whether more hidden nodes is desirable.

Unfortunately, there is no theory to guide the choice of hidden node number. As a result, we are left to experiment with this parameter, perhaps in some systematic fashion such as cross-validation.

Adding additional layers presents only additional "bookkeeping" overhead to the user, with the weight updating becoming more complicated as layers are added. So, we don't want to add more hidden layers if it does not pay off in performance. It turns out that two or three layers (including the output layer) can be shown to approximate almost any smooth function. Combining 3 sigmoid functions allows local responses to be approximated with arbitrary accuracy. This is sufficient for determining any decision boundary.

Neural network validation

Just as with other supervised learning algorithms, neural networks can be under- or over-fit to a training dataset. The degree to which a network is trained to a particular dataset depends on how long we train it on that dataset. Every time we run the MLP learning algorithm over a dataset (an epoch), it reduces the prediction error for that dataset. Thus, the number of epochs should be tuned as a hyperparameter, stopping when the testing-training error gap begins to widen.

Note that though we can also use cross-validation to tune the number of hidden layers in the network, there is no risk of overfitting by having too many layers.

Exercise: Epoch tuning for MLPs

The dataset pima-indians-diabetes.data.txt in your data folder contains eight measurements taken from a group of Pima Native Americans in Arizona, along with an indicator of the onset of diabetes. Use the MLP class to fit a neural network classifier to this dataset, and use cross-validation to examine the optimal number of epochs to use in training.

  1. Number of times pregnant
  2. Plasma glucose concentration a 2 hours in an oral glucose tolerance test
  3. Diastolic blood pressure (mm Hg)
  4. Triceps skin fold thickness (mm)
  5. 2-Hour serum insulin (mu U/ml)
  6. Body mass index (weight in kg/(height in m)^2)
  7. Diabetes pedigree function
  8. Age (years)
  9. Class variable (0 or 1)
In [ ]:
pima = pd.read_csv('../data/pima-indians-diabetes.data.txt', header=None)
pima.head()
In [ ]:
# Write your answer here

Example: Mulitilayer perceptron in TensorFlow

TensorFlow is designed to evaluate expressions efficiently, and one of its key features is that it automatically differentiates expressions. This saves us from having to code a gradient function by hand.

Here we will construct a multilayer perceptron, and use batch updating to estimate the model by hand. This is designed to show how TensorFlow works. There is a more user-friendly API for fitting neural networks that we will discover later.

For this, we will use the MNIST dataset, which we will introduce more formally later.

In [ ]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)
In [ ]:
# Parameters
learning_rate = 0.001
training_epochs = 15
batch_size = 100
display_step = 1

# Network Parameters
nn_hdim = 256 # hidden layer number of neurons
nn_input_dim = 784 # MNIST data input (img shape: 28*28)
nn_output_dim = 10 # MNIST total classes (0-9 digits)

Defining the Computation Graph in TensorFlow

The first thing we need to is define our computations using TensorFlow. We start by defining our input data matrix X and our training labels y:

In [ ]:
# Our data vectors
X = tf.placeholder('float', [None, nn_input_dim], name='X')
Y = tf.placeholder('float', [None, nn_output_dim], name='y')

Remember, we have not assigned any values to X or y. All we have done is defined mathematical expressions for them. We can use these expressions in subsequent calculations. If we want to evaluate an expression we can call its eval method. For example, to evaluate the expression X * 2 for a given value of X we could do the following:

In [ ]:
tf.InteractiveSession()

This creates a TensorFlow Session that can be used interactively. More on this later.

In [ ]:
(Y * 2).eval(feed_dict={Y : np.random.randn(10)[None, :]})

While a placeholder represents a tensor for feeding data into the model, another type represents shared, persistent states that are manipulated by a model. The Variable class is used for model values, such as network weights, which are iteratively updated through learning.

Our network parameters $W_1, b_1, W_2, b_2$ are updated using gradient descent, so they can be represented by Variables:

In [ ]:
# Store layers weight & bias
weights = {
    'W1': tf.Variable(tf.random_normal([nn_input_dim, nn_hdim])),
    'W2': tf.Variable(tf.random_normal([nn_hdim, nn_hdim])),
    'out': tf.Variable(tf.random_normal([nn_hdim, nn_output_dim]))
}

biases = {
    'b1': tf.Variable(tf.random_normal([nn_hdim])),
    'b2': tf.Variable(tf.random_normal([nn_hdim])),
    'out': tf.Variable(tf.random_normal([nn_output_dim]))
}

We can now build our forward propagation step for the perceptron.

Note that we don't need to explicitly do a forward propagation here. TensorFlow knows that our gradients depend on our predictions from the forward propagation and it will handle all the necessary calculations for us. It does everything it needs to update the values.

In [ ]:
def multilayer_perceptron(x):
    # Hidden fully connected layer with 256 neurons
    layer_1 = tf.add(tf.matmul(x, weights['W1']), biases['b1'])
    # Hidden fully connected layer with 256 neurons
    layer_2 = tf.add(tf.matmul(layer_1, weights['W2']), biases['b2'])
    # Output fully connected layer with a neuron for each class
    out_layer = tf.matmul(layer_2, weights['out']) + biases['out']
    return out_layer

Rather than calling the eval method to evaluate our TF expressions, we can instead define operators for expressions we want to evaluate.

For example, to calculate the loss, we need to know the values for $X$ and $y$, and pass them to our loss function.

In [ ]:
# Construct model
logits = multilayer_perceptron(X)

# Define loss and optimizer
loss_op = tf.reduce_mean(tf.losses.softmax_cross_entropy(
    logits=logits, onehot_labels=Y))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
train_op = optimizer.minimize(loss_op)
# Initializing the variables
init = tf.global_variables_initializer()
In [ ]:
with tf.Session() as sess:
    sess.run(init)

    # Training cycle
    for epoch in range(training_epochs):
        avg_cost = 0.
        total_batch = int(mnist.train.num_examples/batch_size)
        # Loop over all batches
        for i in range(total_batch):
            batch_x, batch_y = mnist.train.next_batch(batch_size)
            # Run optimization op (backprop) and cost op (to get loss value)
            _, c = sess.run([train_op, loss_op], feed_dict={X: batch_x,
                                                            Y: batch_y})
            # Compute average loss
            avg_cost += c / total_batch
        # Display logs per epoch step
        if epoch % display_step == 0:
            print("Epoch:", '%04d' % (epoch+1), "cost={:.9f}".format(avg_cost))
    print("Optimization Finished!")

    # Test model
    pred = tf.nn.softmax(logits)  # Apply softmax to logits
    correct_prediction = tf.equal(tf.argmax(pred, 1), tf.argmax(Y, 1))
    
    # Calculate accuracy
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
    print("Accuracy:", accuracy.eval({X: mnist.test.images, Y: mnist.test.labels}))

References