In this lesson you are going to learn how to train your NN in batches.
What do you mean by batches?
So far when training our model, we have been feeding it all of our training data on every iteration. Sometimes it makes sense to feed the model small batches of, say, 10 or 100 samples at a time instead. This lets the model update its weights more often and can give you better results. In addition, you may run into a situation where you simply do not have enough memory to feed in the entire training set at once. Training in batches solves that problem too.
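For example, here is a quick NumPy sketch (with made-up numbers) of what splitting a dataset into batches looks like:

```python
import numpy as np

# A toy dataset of 1000 samples (hypothetical values)
data = np.arange(1000)
batch_size = 100

# Instead of feeding all 1000 samples at once, feed 100 at a time;
# the model then gets 10 weight updates per pass over the data instead of 1
batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]
print(len(batches))      # 10 batches
print(batches[0].shape)  # each batch holds 100 samples
```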
# import libraries
import tensorflow as tf
import pandas as pd
import numpy as np
import sys
import datetime
import matplotlib.pyplot as plt
plt.style.use('ggplot') # use this plot style
%matplotlib inline
print('Python version ' + sys.version)
print('Tensorflow version ' + tf.VERSION)
print('Pandas version ' + pd.__version__)
print('Numpy version ' + np.__version__)
Python version 3.5.1 |Anaconda custom (64-bit)| (default, Feb 16 2016, 09:49:46) [MSC v.1900 64 bit (AMD64)]
Tensorflow version 0.12.0-rc0
Pandas version 0.19.0
Numpy version 1.11.0
y = a * x^4 + b
TIP: Recommended percentages for splitting your data: 70% training, 15% validation, 15% test.
# Let's generate 1000 random samples
pool = np.random.rand(1000,1).astype(np.float32)
# Shuffle the samples
np.random.shuffle(pool)
# sample size of 15%
sample = int(1000 * 0.15)
# 15% test
test_x = pool[0:sample]
# 15% validation
valid_x = pool[sample:sample*2]
# 70% training
train_x = pool[sample*2:]
print('Testing data points: ' + str(test_x.shape))
print('Validation data points: ' + str(valid_x.shape))
print('Training data points: ' + str(train_x.shape))
# Let's compute the output using 2 for a and 5 for b
test_y = 2.0 * test_x**4 + 5
valid_y = 2.0 * valid_x**4 + 5
train_y = 2.0 * train_x**4 + 5
Testing data points: (150, 1)
Validation data points: (150, 1)
Training data points: (700, 1)
df = pd.DataFrame({'x':train_x[:,0],
'y':train_y[:,0]})
df.head()
| | x | y |
|---|---|---|
| 0 | 0.072982 | 5.000057 |
| 1 | 0.627874 | 5.310827 |
| 2 | 0.751243 | 5.637018 |
| 3 | 0.291485 | 5.014438 |
| 4 | 0.559812 | 5.196426 |
df.describe()
| | x | y |
|---|---|---|
| count | 700.000000 | 700.000000 |
| mean | 0.475430 | 5.353024 |
| std | 0.286284 | 0.491342 |
| min | 0.000471 | 5.000000 |
| 25% | 0.228471 | 5.005450 |
| 50% | 0.482200 | 5.108128 |
| 75% | 0.718817 | 5.533954 |
| max | 0.999141 | 6.993135 |
df.plot.scatter(x='x', y='y', figsize=(15,5));
Make a function that will help you create layers easily
def add_layer(inputs, in_size, out_size, activation_function=None):
    # Weights shape: [size of input layer, size of output layer]
    Weights = tf.Variable(tf.truncated_normal([in_size, out_size], mean=0.1, stddev=0.1))
    # biases shape: [size of output layer]
    biases = tf.Variable(tf.truncated_normal([out_size], mean=0.1, stddev=0.1))
    # shape of pred = [size of your batches, size of output layer]
    pred = tf.matmul(inputs, Weights) + biases
    if activation_function is None:
        outputs = pred
    else:
        outputs = activation_function(pred)
    return outputs
Start to use *Weights* (W) and *biases* (b) when setting up your variables. Aside from adding your ReLU activation function, it is a good idea to use Tensorflow's *matrix multiplication function (matmul)* as shown above.
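To see what each layer actually computes, here is the same matmul-plus-bias step sketched in plain NumPy (the sample values and sizes are made up):

```python
import numpy as np

# A toy batch of 3 samples with 1 feature each (hypothetical values)
inputs = np.array([[0.1], [0.5], [0.9]], dtype=np.float32)

# Weights have shape [in_size, out_size]; biases have shape [out_size]
W = np.full((1, 4), 0.1, dtype=np.float32)  # in_size=1, out_size=4
b = np.full((4,), 0.1, dtype=np.float32)

# pred = inputs @ W + b, with shape [size of your batch, out_size]
pred = inputs @ W + b
relu = np.maximum(pred, 0)  # ReLU keeps positives, zeroes out negatives
print(pred.shape)  # (3, 4)
```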
The ? in the shape output just means that dimension can be of any size; it is determined by how many samples you feed in.
# larger batch sizes give smoother gradient estimates per step, at the cost of more memory and compute
# The strategy is to use batch_size when you cannot fit the entire dataset into memory
# In practice, small to moderate mini-batch sizes (10-500) are generally used
batch_size = 10
# you can adjust the number of neurons in the hidden layers here
hidden_size = 10
# placeholders
# shape=[number of samples (None means any number), number of input neurons]
x = tf.placeholder(tf.float32, shape=[None, 1], name="01_x")
y = tf.placeholder(tf.float32, shape=[None, 1], name="01_y")
print("shape of x and y:")
print(x.get_shape(),y.get_shape())
shape of x and y: (?, 1) (?, 1)
We will be feeding in the fraction of neurons to keep (the dropout keep probability) on every training step.
# drop out
keep_prob = tf.placeholder(tf.float32)
Note that the output of one layer becomes the input of the next layer.
# create your hidden layers!
h1 = add_layer(x, 1, hidden_size, tf.nn.relu)
# here is where we shoot down some of the neurons
h1_drop = tf.nn.dropout(h1, keep_prob)
# add a second layer
h2 = add_layer(h1_drop, hidden_size, hidden_size, tf.nn.relu)
h2_drop = tf.nn.dropout(h2, keep_prob)
# add a third layer
h3 = add_layer(h2_drop, hidden_size, hidden_size, tf.nn.relu)
h3_drop = tf.nn.dropout(h3, keep_prob)
# add a fourth layer
h4 = add_layer(h3_drop, hidden_size, hidden_size, tf.nn.relu)
h4_drop = tf.nn.dropout(h4, keep_prob)
print("shape of hidden layers:")
print(h1_drop.get_shape(), h2_drop.get_shape(), h3_drop.get_shape(), h4_drop.get_shape())
shape of hidden layers: (?, 10) (?, 10) (?, 10) (?, 10)
# Output Layers
pred = add_layer(h4_drop, hidden_size, 1)
print("shape of output layer:")
print(pred.get_shape())
shape of output layer: (?, 1)
# minimize the mean squared errors.
loss = tf.reduce_mean(tf.square(pred - y))
# pick optimizer
optimizer = tf.train.GradientDescentOptimizer(0.0099)
train = optimizer.minimize(loss)
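The loss above is the mean squared error. As a quick sanity check, here is the same computation in plain NumPy with made-up predictions and targets:

```python
import numpy as np

# Hypothetical predictions and targets
pred = np.array([5.1, 5.5, 6.0])
y = np.array([5.0, 5.6, 6.2])

# mean squared error: the average of the squared differences
loss = np.mean(np.square(pred - y))
print(loss)  # (0.01 + 0.01 + 0.04) / 3 = 0.02
```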
Set up the following variables to calculate the accuracy rate of your model. You will do that shortly.
# check accuracy of model
correct_prediction = tf.equal(tf.round(pred), tf.round(y))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
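To see how this rounding-based accuracy behaves, here is a small NumPy sketch with hypothetical predictions and targets:

```python
import numpy as np

# Hypothetical model outputs and true targets
pred = np.array([5.04, 5.62, 6.91, 5.49])
y = np.array([5.00, 5.60, 6.99, 5.51])

# A prediction counts as "correct" when it rounds
# to the same integer as its target
correct = np.round(pred) == np.round(y)  # [True, True, True, False]
accuracy = np.mean(correct.astype(np.float32))
print(accuracy)  # 0.75
```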
Code borrowed from this great Tensorflow Jupyter Notebook.
# Best validation accuracy seen so far.
best_valid_acc = 0.0
# Iteration-number for last improvement to validation accuracy.
last_improvement = 0
# Stop optimization if no improvement found in this many iterations.
require_improvement = 1500
On every training iteration, variable *i* will hold a random sample of *batch_size* indices. Take a look at how the variable *train_data* was modified.
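Here is a standalone NumPy sketch of that batching trick (the seed and data are made up for the demo):

```python
import numpy as np

np.random.seed(0)  # hypothetical seed, just to make the demo repeatable
train_x = np.random.rand(700, 1).astype(np.float32)
train_y = 2.0 * train_x**4 + 5
batch_size = 10

# pick batch_size random row indices, then slice
# both x and y with the SAME indices so pairs stay aligned
i = np.random.permutation(train_x.shape[0])[:batch_size]
batch_x, batch_y = train_x[i, :], train_y[i, :]
print(batch_x.shape, batch_y.shape)  # (10, 1) (10, 1)
```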
# initialize the variables
init = tf.global_variables_initializer()
# hold step and error values
t = []
# Run your graph
with tf.Session() as sess:
    # initialize variables
    sess.run(init)
    # Fit the function.
    for step in range(6000):
        # pull batches at random
        i = np.random.permutation(train_x.shape[0])[:batch_size]
        # get your data
        train_data = {x: train_x[i, :], y: train_y[i, :], keep_prob: 0.975}
        valid_data = {x: valid_x, y: valid_y, keep_prob: 1.0}
        test_data = {x: test_x, y: test_y, keep_prob: 1.0}
        # training in progress... (the train op returns None, so discard that result)
        train_loss, _ = sess.run([loss, train], feed_dict=train_data)
        # print every n iterations
        if step % 100 == 0:
            # capture the step and error for analysis
            valid_loss = sess.run(loss, feed_dict=valid_data)
            t.append((step, train_loss, valid_loss))
            # get snapshot of current training and validation accuracy
            train_acc = accuracy.eval(train_data)
            valid_acc = accuracy.eval(valid_data)
            # If validation accuracy is an improvement over best-known.
            if valid_acc > best_valid_acc:
                # Update the best-known validation accuracy.
                best_valid_acc = valid_acc
                # Set the iteration for the last improvement to current.
                last_improvement = step
                # Flag whenever an improvement is found
                improved_str = '*'
            else:
                # An empty string to be printed below.
                # Shows that no improvement was found.
                improved_str = ''
            print("Training loss at step %d: %f %s" % (step, train_loss, improved_str))
            print("Validation %f" % (valid_loss))
            # If no improvement found in the required number of iterations.
            if step - last_improvement > require_improvement:
                print("No improvement found in a while, stopping optimization.")
                # Break out from the for-loop.
                break
    # here is where you see how good of a Data Scientist you are
    print("Accuracy on the Training Set:", accuracy.eval(train_data))
    print("Accuracy on the Validation Set:", accuracy.eval(valid_data))
    print("Accuracy on the Test Set:", accuracy.eval(test_data))
    # capture predictions on test data
    test_results = sess.run(pred, feed_dict={x: test_x, keep_prob: 1.0})

df_final = pd.DataFrame({'test_x': test_x[:, 0],
                         'pred': test_results[:, 0]})
# capture training and validation loss
df_loss = pd.DataFrame(t, columns=['step', 'train_loss', 'valid_loss'])
Training loss at step 0: 23.402122  Validation 20.980297
Training loss at step 100: 0.187115 *  Validation 0.143108
Training loss at step 200: 0.250679  Validation 0.077954
Training loss at step 300: 0.152195 *  Validation 0.087788
Training loss at step 400: 0.225003 *  Validation 0.100714
Training loss at step 500: 0.161775  Validation 0.093067
Training loss at step 600: 0.254302  Validation 0.102409
Training loss at step 700: 0.103872 *  Validation 0.109399
Training loss at step 800: 0.274327  Validation 0.077394
Training loss at step 900: 0.148338  Validation 0.103547
Training loss at step 1000: 0.048606  Validation 0.067885
Training loss at step 1100: 0.096235  Validation 0.063774
Training loss at step 1200: 0.072514  Validation 0.064017
Training loss at step 1300: 0.211790  Validation 0.053174
Training loss at step 1400: 0.091291  Validation 0.044657
Training loss at step 1500: 0.081252  Validation 0.037878
Training loss at step 1600: 0.190677  Validation 0.030551
Training loss at step 1700: 0.025660  Validation 0.031784
Training loss at step 1800: 0.077246  Validation 0.021971
Training loss at step 1900: 0.196296  Validation 0.018666
Training loss at step 2000: 0.056428  Validation 0.020353
Training loss at step 2100: 0.043981 *  Validation 0.018860
Training loss at step 2200: 0.043882  Validation 0.015705
Training loss at step 2300: 0.033422  Validation 0.010563
Training loss at step 2400: 0.018767 *  Validation 0.009835
Training loss at step 2500: 0.040599 *  Validation 0.006505
Training loss at step 2600: 0.053013  Validation 0.005265
Training loss at step 2700: 0.040391  Validation 0.004556
Training loss at step 2800: 0.091411 *  Validation 0.005383
Training loss at step 2900: 0.104919 *  Validation 0.004556
Training loss at step 3000: 0.031472  Validation 0.006872
Training loss at step 3100: 0.019333  Validation 0.004266
Training loss at step 3200: 0.098784  Validation 0.009901
Training loss at step 3300: 0.113221  Validation 0.005738
Training loss at step 3400: 0.050050  Validation 0.003017
Training loss at step 3500: 0.063836  Validation 0.002980
Training loss at step 3600: 0.037302  Validation 0.005684
Training loss at step 3700: 0.027242  Validation 0.003440
Training loss at step 3800: 0.045567  Validation 0.001505
Training loss at step 3900: 0.030083  Validation 0.001552
Training loss at step 4000: 0.019632 *  Validation 0.001199
Training loss at step 4100: 0.015194  Validation 0.001870
Training loss at step 4200: 0.047915  Validation 0.002910
Training loss at step 4300: 0.021621  Validation 0.001246
Training loss at step 4400: 0.033902  Validation 0.003228
Training loss at step 4500: 0.014637  Validation 0.004395
Training loss at step 4600: 0.053357  Validation 0.001634
Training loss at step 4700: 0.039563  Validation 0.001245
Training loss at step 4800: 0.317875  Validation 0.007926
Training loss at step 4900: 0.060057  Validation 0.001299
Training loss at step 5000: 0.008178  Validation 0.004327
Training loss at step 5100: 0.038080  Validation 0.001113
Training loss at step 5200: 0.028573  Validation 0.002215
Training loss at step 5300: 0.019867  Validation 0.000795
Training loss at step 5400: 0.021241  Validation 0.005972
Training loss at step 5500: 0.010574  Validation 0.003042
Training loss at step 5600: 0.020798  Validation 0.004027
No improvement found in a while, stopping optimization.
Accuracy on the Training Set: 0.9
Accuracy on the Validation Set: 0.953333
Accuracy on the Test Set: 0.98
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(15, 5))
# Chart 1 - Shows the line we are trying to model
df.plot.scatter(x='x', y='y', ax=axes, color='red')
# Chart 2 - Shows the line our trained model came up with
df_final.plot.scatter(x='test_x', y='pred', ax=axes, alpha=0.3)
# add a little sugar
axes.set_title('target vs pred', fontsize=20)
axes.set_ylabel('y', fontsize=15)
axes.set_xlabel('x', fontsize=15)
axes.legend(["target", "pred"], loc='best');
If the *valid_loss* is increasing while your *train_loss* is decreasing, your model is overfitting. Since you have implemented early stopping, your model will stop training before this issue gets out of control.
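The early-stopping bookkeeping from the training loop can be sketched on its own in plain Python, driven here by a made-up list of validation accuracies:

```python
# Hypothetical validation accuracies, one per check-in step
valid_accs = [0.50, 0.62, 0.71, 0.70, 0.69, 0.68, 0.67]

best_valid_acc = 0.0
last_improvement = 0
require_improvement = 3  # stop after this many steps with no improvement

for step, valid_acc in enumerate(valid_accs):
    if valid_acc > best_valid_acc:
        best_valid_acc = valid_acc  # new best: record it
        last_improvement = step     # ...and remember when it happened
    if step - last_improvement > require_improvement:
        print("No improvement found in a while, stopping optimization.")
        break

print(best_valid_acc)  # 0.71
```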
df_loss.set_index('step').plot(logy=True, figsize=(15,5));
Experiment with the batch size and the size of each layer.
This tutorial was created by HEDARO