Linear Classification

  • Algorithms that learn linear decision boundaries for classification tasks.
  • Note that the model itself can be non-linear (e.g., logistic regression or SVM), but the decision boundary is linear.
  • The goal is to learn a hyperplane $\mathbf{x}^T \mathbf{w} + b = 0$ that separates the data, as in the small sketch below.
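A minimal NumPy sketch of such a linear decision rule (the weights and bias here are made up purely for illustration): a point is assigned to class 1 if it lies on the positive side of the hyperplane and to class 0 otherwise.

In [ ]:
import numpy as np

# Hypothetical 2-D weights and bias defining the hyperplane x^T w + b = 0
w = np.array([1.0, -2.0])
b = 0.5

def predict(x):
  # Classify by which side of the hyperplane the point falls on
  return 1 if x.dot(w) + b > 0 else 0

print(predict(np.array([3.0, 0.0])))   # positive side of the hyperplane -> 1
print(predict(np.array([0.0, 3.0])))   # negative side of the hyperplane -> 0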

Least-squares classification

In assignment 1, we used linear regression for classification: $$y(\mathbf{x}, \mathbf{w}) = \mathbf{x}^T \mathbf{w} + b$$
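For reference, a minimal NumPy sketch of this approach (illustrative only, assuming binary targets in $\{0, 1\}$): fit $\mathbf{w}$ and $b$ by ordinary least squares, then threshold the real-valued output at 0.5.

In [ ]:
import numpy as np

def fit_least_squares(X, t):
  # Append a constant column so the bias b is learned along with w
  Xb = np.hstack([X, np.ones((X.shape[0], 1))])
  # Solve min_w ||Xb w - t||^2 with ordinary least squares
  w, _, _, _ = np.linalg.lstsq(Xb, t)
  return w

def predict_least_squares(w, X):
  Xb = np.hstack([X, np.ones((X.shape[0], 1))])
  # Threshold the real-valued regression output at 0.5 to get a class label
  return (Xb.dot(w) > 0.5).astype("float32")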

Logistic Regression Model

We will consider a linear model for classification. Note that the model is linear in the parameters.

$$y(\mathbf{x}, \mathbf{w}) = \sigma (\mathbf{x}^T \mathbf{w} + b)$$

where

$$ \sigma(x) = {1 \over {1 + e^{-x}}}$$
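A quick sketch of the logistic sigmoid (the imports and plotting code here are only for illustration): it squashes any real-valued logit into the interval $(0, 1)$, so the output can be read as a class probability.

In [ ]:
import numpy as np
import matplotlib.pyplot as plt

z = np.arange(-10, 10, 0.1)
plt.plot(z, 1.0 / (1.0 + np.exp(-z)))   # the logistic sigmoid
plt.grid(); plt.xlabel("logit"); plt.ylabel("sigmoid(logit)")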

Logistic Regression Example

In [1]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
%pylab inline

import warnings
warnings.filterwarnings('ignore')
Populating the interactive namespace from numpy and matplotlib

Loading the dataset

In [2]:
with np.load("TINY_MNIST.npz") as data:
  x_train, t_train = data["x"], data["t"]
  x_eval, t_eval = data["x_eval"], data["t_eval"]
In [3]:
import nn_utils as nn
nn.show_images(x_train[:200], (10, 20), scale=1)
nn.show()

Placeholders and Variables

In [4]:
#Placeholders
X = tf.placeholder("float", shape=(None, 64))
Y = tf.placeholder("float", shape=(None, 1))

#Variables
W = tf.Variable(np.random.randn(64, 1).astype("float32"), name="weight")
b = tf.Variable(np.random.randn(1).astype("float32"), name="bias")
In [5]:
X.get_shape()
Out[5]:
TensorShape([Dimension(None), Dimension(64)])

Logistic Regression Model

We will consider a linear model for classification. Note that the model is linear in the parameters.

$$y(\mathbf{x}, \mathbf{w}) = \sigma (\mathbf{x}^T \mathbf{w} + b)$$
In [6]:
logits = tf.add(tf.matmul(X, W), b)
output = tf.nn.sigmoid(logits)

print output.get_shape()
TensorShape([Dimension(None), Dimension(1)])

Cross-Entropy Cost

Cross-Entropy cost = $-t \, \text{log}(y) - (1 - t) \, \text{log}(1 - y)$

Cross-Entropy cost = $-t \, \text{log}(\sigma(x)) - (1 - t) \, \text{log}(1 - \sigma(x))$

where $ \sigma(x) = {1 \over {1 + e^{-x}}}$

Problem: a naive implementation of this cost produces NaN or inf as $x \rightarrow -\infty$ or $x \rightarrow \infty$, as demonstrated below.

In [7]:
def sigmoid(x):
  return 1.0 / (1.0 + np.exp(-x))
sigmoid(100)
Out[7]:
1.0

How not to do Cross Entropy

In [27]:
#This doesn't work!
def xentropy(x, t):
  return t*-np.log(sigmoid(x)) + (1-t)*-np.log(1.0 - sigmoid(x))
print xentropy(10, 1)
print xentropy(-1000, 0)
print xentropy(1000, 0)
print xentropy(-1000, 1)
4.53988992168e-05
nan
inf
inf
In [12]:
#This kind of works!
def hacky_xentropy(x, t):
  return t*-np.log(1e-15 + sigmoid(x)) + (1-t)*-np.log(1e-15 + 1.0 - sigmoid(x))
print hacky_xentropy(1000, 1)
print hacky_xentropy(-1000, 0)
print hacky_xentropy(1000, 0)
print hacky_xentropy(-1000, 1)
-1.11022302463e-15
-1.11022302463e-15
34.4342154767
34.5387763949
In [13]:
#This kind of works!
def another_hacky_xentropy(x, t):
  return -np.log(t*sigmoid(x) + (1-t)*(1-sigmoid(x)))
print another_hacky_xentropy(1000, 1)
print another_hacky_xentropy(-1000, 0)
print another_hacky_xentropy(1000, 0)
print another_hacky_xentropy(-1000, 1)
-0.0
-0.0
inf
inf

How to do Cross Entropy

Cross-Entropy = $x - x \, t + \text{log}(1 + e^{-x}) = \text{max}(x, 0) - x \, t + \text{log}(1 + e^{-|x|})$
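This follows by plugging $y = \sigma(x)$ into the cross-entropy and simplifying:

$$-\text{log}\,\sigma(x) = \text{log}(1 + e^{-x}), \qquad -\text{log}(1 - \sigma(x)) = x + \text{log}(1 + e^{-x})$$

so the cost becomes $t \, \text{log}(1 + e^{-x}) + (1 - t)\left(x + \text{log}(1 + e^{-x})\right) = x - x \, t + \text{log}(1 + e^{-x})$. The second form is the same quantity rewritten so that the exponent is never positive, which is why $e^{-|x|}$ can never overflow.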

In [14]:
def good_xentropy(x, t):
  return np.maximum(x, 0) - x * t + np.log(1 + np.exp(-np.abs(x)))
print good_xentropy(1000, 1)
print good_xentropy(-1000, 0)
print good_xentropy(1000, 0)
print good_xentropy(-1000, 1)
0.0
0.0
1000.0
1000.0
In [15]:
x = np.arange(-10, 10, 0.1)
y = [good_xentropy(i, 1) for i in x]
plt.plot(x, y)
plt.grid(); plt.xlabel("logit"); plt.ylabel("Cross-Entropy")
Out[15]:
<matplotlib.text.Text at 0x1093eecd0>
  1. Logistic regression penalizes you roughly linearly when you are on the wrong side of the hyperplane.
  2. Logistic regression doesn't penalize you when you are on the right side of the hyperplane but far away (it is not sensitive to outliers); see the comparison sketched below.
  3. This is why we should use logistic regression instead of linear regression for classification (it comes at the cost of not having a closed-form solution).
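To make point 2 concrete, here is a small sketch (reusing good_xentropy from above) comparing the squared-error cost a linear-regression classifier would pay against the cross-entropy cost, both for a target of $t = 1$: the squared error keeps growing even for points that are correctly classified by a large margin, while the cross-entropy goes to zero.

In [ ]:
x = np.arange(-10, 10, 0.1)
# Squared-error cost of a linear-regression classifier against target t = 1
squared_error = [(i - 1) ** 2 for i in x]
# Numerically stable cross-entropy against target t = 1 (defined above)
xent = [good_xentropy(i, 1) for i in x]
plt.plot(x, squared_error, label="squared error (t=1)")
plt.plot(x, xent, label="cross-entropy (t=1)")
plt.grid(); plt.xlabel("model output"); plt.ylabel("cost"); plt.legend()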

Support Vector Machines (SVM):

  • The logistic regression cost function is very similar to the SVM cost function. The SVM's hinge cost only penalizes points that are close to the hyperplane or on the wrong side of it (the support vectors) and ignores the rest:
In [16]:
def svm_cost(x):
  # Hinge loss: max(0, 1 - x)
  return 1 - x if x < 1 else 0
x = np.arange(-10, 10, 0.1)
y = [svm_cost(i) for i in x]
plt.plot(x, y)
plt.grid(); plt.xlabel("logit"); plt.ylabel("Hinge Loss")
Out[16]:
<matplotlib.text.Text at 0x109374050>

Cross Entropy in TensorFlow

In [17]:
cost_batch = tf.nn.sigmoid_cross_entropy_with_logits(logits=logits, targets=Y)
cost = tf.reduce_mean(cost_batch)
In [18]:
print logits.get_shape()
print cost.get_shape()
TensorShape([Dimension(None), Dimension(1)])
TensorShape([])
In [19]:
norm_w = tf.nn.l2_loss(W)

Momentum Optimizer

"This is logistic regression on noisy moons dataset from sklearn which shows the smoothing effects of momentum based techniques (which also results in over shooting and correction). The error surface is visualized as an average over the whole dataset empirically, but the trajectories show the dynamics of minibatches on noisy data. The bottom chart is an accuracy plot." (Image by Alec Radford)

Momentum

In [21]:
optimizer = tf.train.MomentumOptimizer(learning_rate=1.0, momentum=0.99)
# optimizer = tf.train.GradientDescentOptimizer(learning_rate=1.0)
train_op = optimizer.minimize(cost)
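For intuition, the classical momentum update that this optimizer family implements can be written in a few lines of NumPy. This is only a sketch of one common formulation (the gradient function and hyperparameters below are made up for illustration), not TensorFlow's internal code.

In [ ]:
# One common formulation of the momentum update:
#   v <- momentum * v + grad(w)
#   w <- w - learning_rate * v
def momentum_step(w, v, grad, learning_rate, momentum):
  v = momentum * v + grad(w)        # accumulate a velocity from past gradients
  w = w - learning_rate * v         # step along the smoothed direction
  return w, v

# Example: minimize f(w) = w^2, whose gradient is 2w; w should approach 0.
w, v = 5.0, 0.0
for _ in range(100):
  w, v = momentum_step(w, v, lambda u: 2 * u, learning_rate=0.1, momentum=0.9)
print(w)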

Compute Accuracy

In [22]:
#a hack for binary thresholding
pred = tf.greater(output, 0.5)
pred_float = tf.cast(pred, "float")

#accuracy
correct_prediction = tf.equal(pred_float, Y)
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))

Creating a session

In [23]:
sess = tf.InteractiveSession()

Initializing Variables

In [24]:
init = tf.initialize_all_variables()
sess.run(init)

Training

In [25]:
for epoch in range(2000):
  for i in xrange(8):
    x_batch = x_train[i * 100: (i + 1) * 100]
    y_batch = t_train[i * 100: (i + 1) * 100]
    cost_np, _ = sess.run([cost, train_op],
                          feed_dict={X: x_batch, Y: y_batch})
    #Display logs per epoch step
  if epoch % 50 == 0:
    cost_train, accuracy_train = sess.run([cost, accuracy],
                                          feed_dict={X: x_train, Y: t_train})
    cost_eval, accuracy_eval, norm_w_np = sess.run([cost, accuracy, norm_w],
                                                   feed_dict={X: x_eval, Y: t_eval})    
    print ("Epoch:%04d, cost=%0.9f, Train Accuracy=%0.4f, Eval Accuracy=%0.4f,    Norm of Weights=%0.4f" %
           (epoch+1, cost_train, accuracy_train, accuracy_eval, norm_w_np))
Epoch:0001, cost=1.221206784, Train Accuracy=0.6900, Eval Accuracy=0.6650,    Norm of Weights=78.0369
Epoch:0051, cost=0.108576626, Train Accuracy=0.9925, Eval Accuracy=0.9525,    Norm of Weights=21902.0879
Epoch:0101, cost=0.039827708, Train Accuracy=0.9962, Eval Accuracy=0.9400,    Norm of Weights=21526.6953
Epoch:0151, cost=0.016990982, Train Accuracy=0.9975, Eval Accuracy=0.9400,    Norm of Weights=21287.8965
Epoch:0201, cost=0.008619667, Train Accuracy=0.9987, Eval Accuracy=0.9375,    Norm of Weights=21630.7266
Epoch:0251, cost=0.006221656, Train Accuracy=1.0000, Eval Accuracy=0.9375,    Norm of Weights=22097.6465
Epoch:0301, cost=0.005005265, Train Accuracy=1.0000, Eval Accuracy=0.9375,    Norm of Weights=22532.3867
Epoch:0351, cost=0.004219066, Train Accuracy=1.0000, Eval Accuracy=0.9375,    Norm of Weights=22931.1543
Epoch:0401, cost=0.003665265, Train Accuracy=1.0000, Eval Accuracy=0.9375,    Norm of Weights=23299.3828
Epoch:0451, cost=0.003253295, Train Accuracy=1.0000, Eval Accuracy=0.9350,    Norm of Weights=23642.1973
Epoch:0501, cost=0.002933940, Train Accuracy=1.0000, Eval Accuracy=0.9325,    Norm of Weights=23963.8164
Epoch:0551, cost=0.002679895, Train Accuracy=1.0000, Eval Accuracy=0.9350,    Norm of Weights=24267.5586
Epoch:0601, cost=0.002475345, Train Accuracy=1.0000, Eval Accuracy=0.9350,    Norm of Weights=24556.2324
Epoch:0651, cost=0.002309947, Train Accuracy=1.0000, Eval Accuracy=0.9350,    Norm of Weights=24832.2344
Epoch:0701, cost=0.002175402, Train Accuracy=1.0000, Eval Accuracy=0.9350,    Norm of Weights=25097.5879
Epoch:0751, cost=0.002064377, Train Accuracy=1.0000, Eval Accuracy=0.9350,    Norm of Weights=25353.9160
Epoch:0801, cost=0.001970768, Train Accuracy=1.0000, Eval Accuracy=0.9350,    Norm of Weights=25602.5332
Epoch:0851, cost=0.001890046, Train Accuracy=1.0000, Eval Accuracy=0.9350,    Norm of Weights=25844.4199
Epoch:0901, cost=0.001819121, Train Accuracy=1.0000, Eval Accuracy=0.9350,    Norm of Weights=26080.3594
Epoch:0951, cost=0.001755936, Train Accuracy=1.0000, Eval Accuracy=0.9350,    Norm of Weights=26310.9219
Epoch:1001, cost=0.001699073, Train Accuracy=1.0000, Eval Accuracy=0.9325,    Norm of Weights=26536.5977
Epoch:1051, cost=0.001647520, Train Accuracy=1.0000, Eval Accuracy=0.9325,    Norm of Weights=26757.8105
Epoch:1101, cost=0.001600507, Train Accuracy=1.0000, Eval Accuracy=0.9325,    Norm of Weights=26974.9258
Epoch:1151, cost=0.001557441, Train Accuracy=1.0000, Eval Accuracy=0.9325,    Norm of Weights=27188.2422
Epoch:1201, cost=0.001517841, Train Accuracy=1.0000, Eval Accuracy=0.9325,    Norm of Weights=27398.0293
Epoch:1251, cost=0.001481280, Train Accuracy=1.0000, Eval Accuracy=0.9325,    Norm of Weights=27604.5820
Epoch:1301, cost=0.001447390, Train Accuracy=1.0000, Eval Accuracy=0.9325,    Norm of Weights=27808.0020
Epoch:1351, cost=0.001415835, Train Accuracy=1.0000, Eval Accuracy=0.9325,    Norm of Weights=28008.5566
Epoch:1401, cost=0.001386318, Train Accuracy=1.0000, Eval Accuracy=0.9325,    Norm of Weights=28206.3594
Epoch:1451, cost=0.001358589, Train Accuracy=1.0000, Eval Accuracy=0.9300,    Norm of Weights=28401.5312
Epoch:1501, cost=0.001332433, Train Accuracy=1.0000, Eval Accuracy=0.9300,    Norm of Weights=28594.2266
Epoch:1551, cost=0.001307678, Train Accuracy=1.0000, Eval Accuracy=0.9300,    Norm of Weights=28784.4531
Epoch:1601, cost=0.001284176, Train Accuracy=1.0000, Eval Accuracy=0.9300,    Norm of Weights=28972.3613
Epoch:1651, cost=0.001261803, Train Accuracy=1.0000, Eval Accuracy=0.9300,    Norm of Weights=29158.0625
Epoch:1701, cost=0.001240452, Train Accuracy=1.0000, Eval Accuracy=0.9300,    Norm of Weights=29341.5488
Epoch:1751, cost=0.001220033, Train Accuracy=1.0000, Eval Accuracy=0.9300,    Norm of Weights=29522.9199
Epoch:1801, cost=0.001200472, Train Accuracy=1.0000, Eval Accuracy=0.9300,    Norm of Weights=29702.2598
Epoch:1851, cost=0.001181695, Train Accuracy=1.0000, Eval Accuracy=0.9300,    Norm of Weights=29879.6602
Epoch:1901, cost=0.001163654, Train Accuracy=1.0000, Eval Accuracy=0.9300,    Norm of Weights=30055.0801
Epoch:1951, cost=0.001146289, Train Accuracy=1.0000, Eval Accuracy=0.9300,    Norm of Weights=30228.6738

$L_2$ Regularization

As you can see, when the data is linearly separable the norm of $\mathbf{W}$ keeps growing without bound! (Can you explain why?)

Add $L_2$ regularization to the code above to prevent this from happening (only one line of code, thanks to TensorFlow!); one possible sketch follows.
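A possible sketch (not the only solution; weight_decay is a made-up hyperparameter): reuse the norm_w node defined earlier and add it to the cost.

In [ ]:
# Hypothetical L2-regularized cost: penalizes large weights so ||W|| stays bounded.
weight_decay = 0.01   # made-up regularization strength
cost = tf.reduce_mean(cost_batch) + weight_decay * norm_w

Note that train_op would then have to be re-created from this new cost before training again.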

Non-Linear Feature Space