Eager Execution and Keras

The architecture of TensorFlow is designed to make it easy to fit machine learning models across a variety of architectures, automatically optimizing how computations are allocated to the available resources.

The TensorFlow user is responsible for specifying their model in the Python client layer. When this is done well, it results in a very performant model. However, as we have seen, constructing a static model is not particularly intuitive (and certainly not Pythonic). To help alleviate this, some changes to the API have been implemented in newer versions of TensorFlow. Most important among them are:

  1. Eager execution model
  2. Integration of the Keras library

We will introduce both of these here.

Eager Execution

TensorFlow's eager execution is an imperative programming environment that evaluates operations immediately, without building graphs. Operations return concrete values instead of constructing a computational graph to run later. This brings the programming model closer to what users expect from Python, making TensorFlow easier to learn and apply.

Eager execution is a flexible machine learning platform for research and experimentation, providing:

  • An intuitive interface—Structure your code naturally and use Python data structures. Quickly iterate on small models and small data.
  • Easier debugging—Call ops directly to inspect running models and test changes. Use standard Python debugging tools for immediate error reporting.
  • Natural control flow—Use Python control flow instead of graph control flow, simplifying the specification of dynamic models.

The tradeoff inherent in eager execution is that models run with increased overhead, typically resulting in slower performance (though this is continually being improved).

In an interactive computing environment like Jupyter, eager execution must be specified before TensorFlow is used. This is done by calling tf.enable_eager_execution() near the top of the notebook.

In [ ]:
import tensorflow as tf

tf.enable_eager_execution()

Enabling eager execution changes how TensorFlow operations behave—now they immediately evaluate and return their values to Python. tf.Tensor objects reference concrete values instead of symbolic handles to nodes in a computational graph. Since there isn't a computational graph to build and run later in a session, it's easy to inspect results using print() or a debugger. Evaluating, printing, and checking tensor values does not break the flow for computing gradients.

In [ ]:
import numpy as np

x = np.random.random((2,2))
m = tf.matmul(x, x)

print("matrix multiplication result: {}".format(m))
In [ ]:
a = tf.constant([[1, 2],
                 [3, 4]])

# Broadcasting support
b = tf.add(a, 1)
In [ ]:
# Operator overloading is supported
print(a * b)

Eager execution works nicely with NumPy. NumPy operations accept tf.Tensor arguments. TensorFlow math operations convert Python objects and NumPy arrays to tf.Tensor objects. The tf.Tensor.numpy method returns the object's value as a NumPy ndarray.

In [ ]:

The tf.contrib.eager module contains symbols available to both eager and graph execution environments and is useful for writing code to work with graphs:

In [ ]:
tfe = tf.contrib.eager

Dynamic control flow

A major benefit of eager execution is that all the functionality of the host language is available while your model is executing. For example, we can use Python's control flow statements like for loops or conditionals:

In [ ]:
def fibonacci(n):
    n = tf.convert_to_tensor(n)
    if n < 2:
        return n
    a, b = tf.constant(0), tf.constant(1)
    for _ in range(n.numpy() - 1):  # n - 1 updates take (a, b) from (F(0), F(1)) to (F(n-1), F(n))
        a, b = b, a + b
    return b
In [ ]:

Eager training

Computing gradients

Automatic differentiation is useful for implementing machine learning algorithms such as backpropagation for training neural networks. During eager execution, use tf.GradientTape to trace operations for computing gradients later.

tf.GradientTape is an opt-in feature to provide maximal performance when not tracing. Since different operations can occur during each call, all forward-pass operations get recorded to a "tape". To compute the gradient, play the tape backwards and then discard it. By default, a particular tf.GradientTape can only compute one gradient; subsequent calls throw a runtime error unless the tape was created with persistent=True.

In [ ]:
w = tf.Variable([[1.0]])

with tf.GradientTape() as tape:
    loss = w * w

tape.gradient(loss, w)
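As a sanity check on what the tape returns, the derivative of the loss w * w with respect to w is 2w, so the gradient at w = 1.0 should be 2.0. The sketch below approximates the same derivative with a central finite difference in plain Python. The helper name numerical_gradient and the step size are illustrative assumptions, and this is not how tf.GradientTape computes gradients internally.

```python
def loss(w):
    """The same toy loss as in the cell above: w squared."""
    return w * w


def numerical_gradient(f, w, eps=1e-6):
    """Central finite-difference approximation of df/dw (illustrative helper)."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)


# Evaluate at w = 1.0, the initial value of the tf.Variable above.
grad_at_one = numerical_gradient(loss, 1.0)
print(round(grad_at_one, 4))
```

A finite difference needs two extra evaluations of the loss per parameter, which is one reason automatic differentiation is preferred for models with many weights.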

Fitting Neural Networks with Keras

In much the same way that PyMC3 allows Bayesian models to be specified in Theano at a high level, Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow. Keras is a modular, extensible library that allows for easy construction of deep learning models. It includes classes for building both convolutional and recurrent networks, and supports CPU and GPU computation.

Keras is used for fast prototyping, advanced research, and production, with three key advantages:

  1. User friendly: Keras has a simple, consistent interface optimized for common use cases. It provides clear and actionable feedback for user errors.
  2. Modular and composable: Keras models are made by connecting configurable building blocks together, with few restrictions.
  3. Easy to extend: Write custom building blocks to express new ideas for research. Create new layers, loss functions, and develop state-of-the-art models.

Keras was recently integrated into the TensorFlow project, so it does not have to be separately downloaded, but is available as a sub-module.

In [ ]:
%matplotlib inline
import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In [ ]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation

To learn how deep neural networks are constructed in Keras, we will use a famous benchmarking dataset, MNIST. The MNIST database of handwritten digits includes a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST.

The original black and white images from NIST were size normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. The images were then centered in a 28x28 image by computing the center of mass of the pixels and translating the image so as to position this point at the center of the 28x28 field. Flattening each image results in a vector of 784 values.

In [ ]:
from tensorflow.keras.datasets import mnist

# Fetch and format the mnist data
(mnist_images, mnist_labels), _ = mnist.load_data()
In [ ]:
plt.imshow(mnist_images[1].reshape(28,28), cmap='gray');

We can convert the raw data for use in Keras.

The Dataset API

A more elegant way to feed data into your model (compared to feeding NumPy arrays into a session) is to set up an input pipeline, using the TensorFlow Dataset API. A Dataset can be used to represent an input pipeline as a nested structure of tensors and an associated set of transformations that act on those tensors.

This is what it looks like. Consider some arbitrary input data, in the form of a NumPy array:

In [ ]:
fake_data = np.random.normal(size=(100, 5))  # 100 rows of 5 standard normal draws

The from_tensor_slices function creates a Dataset whose elements are slices of the given tensors:

In [ ]:
a_dataset = tf.data.Dataset.from_tensor_slices(fake_data)

The make_one_shot_iterator method creates an Iterator for enumerating the elements of this dataset. As we will see, this facilitates mini-batch processing.

In [ ]:
data = a_dataset.make_one_shot_iterator().get_next()

In the case of our MNIST image data, we first flatten the image data, convert it to floats, and scale before feeding it into a Dataset.

In [ ]:
dataset = tf.data.Dataset.from_tensor_slices(
  (tf.cast(mnist_images.reshape(mnist_images.shape[0], 784)/255, tf.float32),
   tf.cast(mnist_labels, tf.int64)))  # pair each flattened image with its label

Finally, the dataset is shuffled and split into mini-batches of 32 examples.

In [ ]:
dataset = dataset.shuffle(1000).batch(32)
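To make the effect of shuffling and batching concrete, here is a minimal plain-Python sketch that shuffles a collection and cuts it into mini-batches. The function name shuffled_batches and the seed are hypothetical. Unlike Dataset.shuffle(1000), which shuffles through a buffer of 1000 elements, this sketch shuffles the whole collection at once.

```python
import random


def shuffled_batches(examples, batch_size, seed=0):
    """Yield lists of at most batch_size examples in a random order."""
    order = list(examples)
    random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]


batches = list(shuffled_batches(range(100), batch_size=32))
batch_sizes = [len(b) for b in batches]
print(batch_sizes)  # [32, 32, 32, 4]
```

The last batch is smaller because 100 is not a multiple of 32; Dataset.batch keeps such a partial batch by default as well.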

Building the model

The simplest model class in Keras is the Sequential class object. It allows networks to be constructed layer by layer, beginning with the data input and terminating with an output layer. Only the input layer requires explicit dimensions to be passed (via the keyword argument input_shape); the rest are inferred based on the size of the preceding layer.

Between layers, we also define an activation function for the outputs from the previous layer.

Here is a simple network with two hidden layers. The output layer will be of size 10, corresponding to the number of classes in the dataset.

In [ ]:
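mnist_model = Sequential()
mnist_model.add(Dense(512, input_shape=(784,)))
mnist_model.add(Activation('relu'))
mnist_model.add(Dense(512))
mnist_model.add(Activation('relu'))
mnist_model.add(Dense(10))
mnist_model.add(Activation('softmax'))
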
Train a model

The following example creates a multi-layer model that classifies the standard MNIST handwritten digits. It demonstrates the optimizer and layer APIs to build trainable graphs in an eager execution environment.

Activations can either be used through an Activation layer, as we have done here, or through the activation argument supported by all forward layers:

model.add(Dense(64))
model.add(Activation('tanh'))

This is equivalent to:

model.add(Dense(64, activation='tanh'))

Thus, a more concise way of specifying the same model is:

In [ ]:
mnist_model = Sequential([
    Dense(512, activation='relu', input_shape=(784,)),
    Dense(512, activation='relu'),
    Dense(10, activation='softmax')
])


For the hidden layers, we have used a rectified linear unit (ReLU). This is the simple function:

$$f(x) = \max(0, x)$$

This activation has been shown to perform well in the training of deep neural networks for supervised learning. It is a sparse activation, and has efficient gradient propagation.

We use the softmax activation for the output layer because, like the logistic, it transforms inputs to the unit interval. In addition, it normalizes the outputs so that they sum to one across classes, which allows them to be read as class probabilities.
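The two activations can be written out in a few lines of plain Python. This is an illustrative sketch of the formulas only, not the Keras implementation, and the input values are made up.

```python
import math


def relu(x):
    """Rectified linear unit: f(x) = max(0, x)."""
    return max(0.0, x)


def softmax(values):
    """Exponentiate and normalize so the outputs are positive and sum to one."""
    peak = max(values)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


hidden = [relu(v) for v in [-2.0, 0.0, 3.0]]
probs = softmax([1.0, 2.0, 3.0])

print(hidden)  # [0.0, 0.0, 3.0]
print(sum(probs))
```

Negative inputs to relu are set to zero, which is what makes the activation sparse, and the largest input to softmax receives the largest probability.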

Configure the layers

There are many tf.keras.layers available with some common constructor parameters:

  • activation: Set the activation function for the layer. This parameter is specified by the name of a built-in function or as a callable object. By default, no activation is applied.
  • kernel_initializer and bias_initializer: The initialization schemes that create the layer's weights (kernel and bias). This parameter is a name or a callable object. The kernel defaults to the "Glorot uniform" initializer and the bias to zeros.
  • kernel_regularizer and bias_regularizer: The regularization schemes that apply to the layer's weights (kernel and bias), such as L1 or L2 regularization. By default, no regularization is applied.

The following instantiates tf.keras.layers.Dense layers using constructor arguments:

In [ ]:
from tensorflow.keras import regularizers, initializers

# Create a sigmoid layer:
Dense(64, activation='sigmoid')
# Or:
Dense(64, activation=tf.sigmoid)

# A linear layer with L1 regularization of factor 0.01 applied to the kernel matrix:
Dense(64, kernel_regularizer=regularizers.l1(0.01))
# A linear layer with L2 regularization of factor 0.01 applied to the bias vector:
Dense(64, bias_regularizer=regularizers.l2(0.01))

# A linear layer with a kernel initialized to a random orthogonal matrix:
Dense(64, kernel_initializer='orthogonal')
# A linear layer with a bias vector initialized to 2.0s:
Dense(64, bias_initializer=initializers.constant(2.0))
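The regularizers add a penalty term to the loss. The sketch below computes that term by hand, assuming the usual definitions (the factor times the sum of absolute values for L1, and the factor times the sum of squares for L2). The function names and weight values are hypothetical and stand in for a layer's kernel.

```python
def l1_penalty(weights, factor):
    """L1 penalty: factor times the sum of absolute weight values."""
    return factor * sum(abs(w) for w in weights)


def l2_penalty(weights, factor):
    """L2 penalty: factor times the sum of squared weight values."""
    return factor * sum(w * w for w in weights)


weights = [0.5, -1.0, 2.0]      # made-up weights for illustration
l1 = l1_penalty(weights, 0.01)  # 0.01 * (0.5 + 1.0 + 2.0)
l2 = l2_penalty(weights, 0.01)  # 0.01 * (0.25 + 1.0 + 4.0)
print(l1, l2)
```

The L2 penalty grows quadratically with weight size, so it discourages a few large weights more strongly than the L1 penalty does.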

Fitting the model

Fitting the model first requires a compilation step, for which we specify three arguments:

  • an optimizer. This could be the string identifier of an existing optimizer (such as rmsprop or adagrad), or an instance of the Optimizer class. See: optimizers.
  • a loss function. This is the objective that the model will try to minimize. It can be the string identifier of an existing loss function (such as categorical_crossentropy or mse), or it can be an objective function. See: loss functions.
  • a list of metrics. For any classification problem you will want to set this to metrics=['accuracy']. A metric could be the string identifier of an existing metric (only accuracy is supported at this point), or a custom metric function. See: metrics.

Here, we will use the sparse_softmax_cross_entropy loss function, which computes sparse softmax cross entropy between logits and labels by measuring the probability error in discrete classification tasks in which the classes are mutually exclusive.

Even without training, call the model and inspect the output in eager execution:

In [ ]:
for images,labels in dataset.take(1):
  print("Logits: ", mnist_model(images[0:1]).numpy())

While Keras models have a built-in training loop (using the fit method), sometimes you need more customization. Here's an example of a training loop implemented with eager execution:

In [ ]:
optimizer = tf.train.AdamOptimizer()

loss_history = []
In [ ]:
for (batch, (images, labels)) in enumerate(dataset.take(400)):
    if batch % 80 == 0:
        print('.', end='')  # progress marker every 80 batches
    with tf.GradientTape() as tape:
        logits = mnist_model(images, training=True)
        loss_value = tf.losses.sparse_softmax_cross_entropy(labels, logits)

    loss_history.append(loss_value.numpy())  # record the loss for plotting
    grads = tape.gradient(loss_value, mnist_model.variables)
    optimizer.apply_gradients(zip(grads, mnist_model.variables),
                              global_step=tf.train.get_or_create_global_step())
In [ ]:
import matplotlib.pyplot as plt

plt.plot(loss_history)
plt.xlabel('Batch #')
plt.ylabel('Loss [entropy]')

Build a model

Many machine learning models are represented by composing layers. When using TensorFlow with eager execution you can either write your own layers or use a layer provided in the tf.keras.layers package.

As we have seen, when composing layers into models you can use tf.keras.Sequential to represent models which are a linear stack of layers. It is easy to use for basic models:

In [ ]:
model = Sequential([
    Dense(512, activation='relu', input_shape=(784,)),
    Dense(512, activation='relu'),
    Dense(10, activation='softmax')
])

Alternatively, organize models in classes by inheriting from tf.keras.Model. This is a container for layers that is a layer itself, allowing tf.keras.Model objects to contain other tf.keras.Model objects.

In [ ]:
class MNISTModel(tf.keras.Model):
    def __init__(self):
        super(MNISTModel, self).__init__()
        self.dense1 = tf.keras.layers.Dense(units=512, activation='relu')
        self.dense2 = tf.keras.layers.Dense(units=512, activation='relu')
        self.dense_out = tf.keras.layers.Dense(units=10, activation='softmax')

    def call(self, input):
        """Run the model."""
        result = self.dense1(input)
        result = self.dense2(result)
        result = self.dense_out(result)
        return result

model = MNISTModel()

It's not required to set an input shape for the tf.keras.Model class since the parameters are set the first time input is passed to the layer.

tf.keras.layers classes create and contain their own model variables that are tied to the lifetime of their layer objects. To share layer variables, share their objects.

In [ ]:
optimizer = tf.train.AdamOptimizer()

loss_history = []

for (batch, (images, labels)) in enumerate(dataset.take(400)):
    if batch % 80 == 0:
        print('.', end='')  # progress marker every 80 batches
    with tf.GradientTape() as tape:
        logits = model(images)
        loss_value = tf.losses.sparse_softmax_cross_entropy(labels, logits)

    loss_history.append(loss_value.numpy())  # record the loss for plotting
    grads = tape.gradient(loss_value, model.variables)
    optimizer.apply_gradients(zip(grads, model.variables),
                              global_step=tf.train.get_or_create_global_step())

Since tf.gradients does not work under eager execution, we use the tf.GradientTape class, which records the operations executed within its context manager so that gradients can be computed from them afterwards. The gradients are then applied to the model variables with apply_gradients.

In [ ]:
plt.plot(loss_history)
plt.xlabel('Batch #')
plt.ylabel('Loss [entropy]')

Example: Iris classification problem

Recall the iris morphometric dataset, which includes measurements from three species:

  • Iris setosa
  • Iris virginica
  • Iris versicolor

Figure 1. Petal geometry compared for three iris species: Iris setosa (by Radomil, CC BY-SA 3.0), Iris versicolor (by Dlanglois, CC BY-SA 3.0), and Iris virginica (by Frank Mayfield, CC BY-SA 2.0).

Let's create a custom neural network classifier using Keras in eager execution mode.

Import and parse the training dataset

Download the dataset file using the tf.keras.utils.get_file function. This returns the file path of the downloaded file.

In [ ]:
train_dataset_url = "http://download.tensorflow.org/data/iris_training.csv"

train_dataset_fp = tf.keras.utils.get_file(fname=os.path.basename(train_dataset_url),
                                           origin=train_dataset_url)

print("Local copy of the dataset file: {}".format(train_dataset_fp))
In [ ]:
In [ ]:
# column order in CSV file
column_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

feature_names = column_names[:-1]
label_name = column_names[-1]

print("Features: {}".format(feature_names))
print("Label: {}".format(label_name))
In [ ]:
class_names = ['Iris setosa', 'Iris versicolor', 'Iris virginica']

Create a Dataset

Since the dataset is a CSV-formatted text file, we can use the make_csv_dataset function to parse the data into a Dataset. Since this function generates data for training models, the default behavior is to shuffle the data (shuffle=True, shuffle_buffer_size=10000), and repeat the dataset forever (num_epochs=None). We can also set the batch_size parameter.

In [ ]:
batch_size = 32

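train_dataset = tf.data.experimental.make_csv_dataset(
    train_dataset_fp, batch_size, column_names=column_names,
    label_name=label_name, num_epochs=1)  # one pass per iteration over the dataset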

The make_csv_dataset function returns a tf.data.Dataset of (features, label) pairs, where features is a dictionary: {'feature_name': value}

With eager execution enabled, these Dataset objects are iterable. Let's look at a batch of features:

In [ ]:
features, labels = next(iter(train_dataset))


To simplify the model building step, create a function to repackage the features dictionary into a single array with shape: (batch_size, num_features).

This function uses the tf.stack function which takes values from a list of tensors and creates a combined tensor at the specified dimension.

In [ ]:
def pack_features_vector(features, labels):
    """Pack the features into a single array."""
    features = tf.stack(list(features.values()), axis=1)
    return features, labels
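The repackaging that tf.stack performs with axis=1 can be pictured with ordinary Python lists: a dictionary of per-feature columns becomes a list of per-example rows. The measurements below are made-up values for two flowers, used only for illustration.

```python
# A batch of two examples stored column-wise, one entry per feature name.
features = {
    'sepal_length': [5.1, 6.4],
    'sepal_width': [3.5, 3.2],
    'petal_length': [1.4, 4.5],
    'petal_width': [0.2, 1.5],
}

# Transpose the columns into rows: the result has shape (batch_size, num_features).
packed = [list(row) for row in zip(*features.values())]

print(len(packed), len(packed[0]))  # 2 4
print(packed[0])                    # [5.1, 3.5, 1.4, 0.2]
```

The column order in each row follows the insertion order of the dictionary keys, so the order of the feature names determines the layout of the packed array.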

Then use the Dataset.map method to pack the features of each (features,label) pair into the training dataset:

In [ ]:
train_dataset = train_dataset.map(pack_features_vector)

The features elements of the Dataset are now arrays with shape (batch_size, num_features). Let's look at the first few examples:

In [ ]:
features, labels = next(iter(train_dataset))


Create a model using Keras

We will construct a simple network of two Dense layers with 10 nodes each, and an output layer with 3 nodes representing our label predictions. The first layer's input_shape parameter corresponds to the number of features from the dataset, and is required.

In [ ]:
model = tf.keras.Sequential([
  tf.keras.layers.Dense(10, activation=tf.nn.relu, input_shape=(4,)),  # input shape required
  tf.keras.layers.Dense(10, activation=tf.nn.relu),
  tf.keras.layers.Dense(3)  # output layer: one logit per class
])

Using the model

Let's have a quick look at what this model does to a batch of features:

In [ ]:
predictions = model(features)

Here, each example returns a logit for each class.

The softmax function transforms these logits to a probability for each class.

In [ ]:

Taking the tf.argmax across classes gives us the predicted class index. But, the model hasn't been trained yet, so these aren't good predictions.

In [ ]:
print("Prediction: {}".format(tf.argmax(predictions, axis=1)))
print("    Labels: {}".format(labels))

Define the loss and gradient function

We will use the sparse_softmax_cross_entropy loss function, which takes the model's logits and the desired integer class labels, and returns the average loss across the examples.

In [ ]:
def loss(model, x, y):
    y_ = model(x)
    return tf.losses.sparse_softmax_cross_entropy(labels=y, logits=y_)

l = loss(model, features, labels)
print("Loss test: {}".format(l))
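To show what this loss measures, the following plain-Python sketch computes the mean of -log softmax(logits)[label] over a batch. It illustrates the formula under that assumption and is not the TensorFlow implementation. The logits are made-up values.

```python
import math


def sparse_softmax_cross_entropy(labels, logits):
    """Mean negative log-probability of the true class, computed from logits."""
    losses = []
    for label, row in zip(labels, logits):
        peak = max(row)  # log-sum-exp with the maximum subtracted for stability
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        losses.append(log_norm - row[label])
    return sum(losses) / len(losses)


# Equal logits give each of the 3 classes probability 1/3, so the loss is log(3).
uniform = sparse_softmax_cross_entropy([0, 2], [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
# A large logit on the correct class gives a loss close to zero.
confident = sparse_softmax_cross_entropy([0], [[10.0, 0.0, 0.0]])

print(uniform, confident)
```

An untrained three-class model typically starts with a loss near log(3), roughly 1.1, which is a useful reference point when reading the loss printed above.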

Since we are operating in eager mode, we will use a GradientTape context to record the loss computation, from which the gradients with respect to the model's trainable variables are calculated:

In [ ]:
def grad(model, inputs, targets):
    with tf.GradientTape() as tape:
        loss_value = loss(model, inputs, targets)
    return loss_value, tape.gradient(loss_value, model.trainable_variables)

Create an optimizer

We will use the GradientDescentOptimizer that implements the stochastic gradient descent (SGD) algorithm. The learning_rate sets the step size to take for each iteration down the hill.

Let's set up the optimizer and the global_step counter:

In [ ]:
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)

global_step = tf.train.get_or_create_global_step()

We'll use this to calculate a single optimization step:

In [ ]:
loss_value, grads = grad(model, features, labels)

print("Step: {}, Initial Loss: {}".format(global_step.numpy(),
                                          loss_value.numpy()))

optimizer.apply_gradients(zip(grads, model.variables), global_step)

print("Step: {},         Loss: {}".format(global_step.numpy(),
                                          loss(model, features, labels).numpy()))
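The update performed by a single gradient descent step can be sketched in plain Python: each variable moves against its gradient by the learning rate. The function name sgd_step and the toy loss w * w are illustrative assumptions, not part of the TensorFlow API.

```python
def sgd_step(variables, gradients, learning_rate=0.01):
    """Return the variables after one plain gradient descent update."""
    return [v - learning_rate * g for v, g in zip(variables, gradients)]


w = [1.0]
grads = [2.0 * w[0]]        # gradient of the toy loss w * w at w = 1.0
w_new = sgd_step(w, grads)  # 1.0 - 0.01 * 2.0

print(w_new)
```

With a learning rate of 0.01 each step is small, which is why the loss printed after one optimization step is usually only slightly lower than the initial loss.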

Training loop

With all the pieces in place, the model is ready for training! A training loop feeds the dataset examples into the model to help it make better predictions. The following code block sets up these training steps:

  1. Iterate each epoch. An epoch is one pass through the dataset.
  2. Within an epoch, iterate over each example in the training Dataset grabbing its features (x) and label (y).
  3. Using the example's features, make a prediction and compare it with the label. Measure the inaccuracy of the prediction and use that to calculate the model's loss and gradients.
  4. Use an optimizer to update the model's variables.
  5. Keep track of some stats for visualization.
  6. Repeat for each epoch.

The num_epochs variable is the number of times to loop over the dataset collection. Counter-intuitively, training a model longer does not guarantee a better model. Choosing the right number usually requires both experience and experimentation.

In [ ]:
# keep results for plotting
train_loss_results = []
train_accuracy_results = []

num_epochs = 201

for epoch in range(num_epochs):
  epoch_loss_avg = tfe.metrics.Mean()
  epoch_accuracy = tfe.metrics.Accuracy()

  # Training loop - using batches of 32
  for x, y in train_dataset:
    # Optimize the model
    loss_value, grads = grad(model, x, y)
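    optimizer.apply_gradients(zip(grads, model.variables),
                              global_step)

    # Track progress
    epoch_loss_avg(loss_value)  # add current batch loss
    # compare predicted label to actual label
    epoch_accuracy(tf.argmax(model(x), axis=1, output_type=tf.int32), y)

  # end epoch: store the metrics for plotting
  train_loss_results.append(epoch_loss_avg.result())
  train_accuracy_results.append(epoch_accuracy.result())

  if epoch % 50 == 0:
    print("Epoch {:03d}: Loss: {:.3f}, Accuracy: {:.3%}".format(epoch,
                                                                epoch_loss_avg.result(),
                                                                epoch_accuracy.result()))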
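The same loop structure can be seen in miniature without TensorFlow. The sketch below fits a single slope w to made-up data generated from y = 2x, following the numbered steps listed earlier: iterate over epochs, iterate over batches, compute the loss and its gradient, update the variable, and keep per-epoch statistics. All names and values are illustrative.

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x for x in xs]  # targets generated with a true slope of 2
batches = [(xs[i:i + 2], ys[i:i + 2]) for i in range(0, len(xs), 2)]

w = 0.0              # the single trainable variable
learning_rate = 0.01
epoch_losses = []    # statistics kept for plotting

for epoch in range(200):                  # iterate over epochs
    batch_losses = []
    for x_batch, y_batch in batches:      # iterate over batches
        # Predict, then compute the mean squared error and its gradient.
        errors = [w * x - y for x, y in zip(x_batch, y_batch)]
        loss = sum(e * e for e in errors) / len(errors)
        grad = sum(2 * e * x for e, x in zip(errors, x_batch)) / len(errors)
        w -= learning_rate * grad         # gradient descent update
        batch_losses.append(loss)
    epoch_losses.append(sum(batch_losses) / len(batch_losses))

print(round(w, 4))
```

Here the gradient is derived by hand; in the TensorFlow loop above, tf.GradientTape plays that role for an arbitrary model.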

Visualize the loss function over time

While it's helpful to print out the model's training progress, it's often more helpful to visualize it. We want to ensure that the loss goes down and the accuracy goes up.

In [ ]:
fig, axes = plt.subplots(2, sharex=True, figsize=(12, 8))
fig.suptitle('Training Metrics')

axes[0].set_ylabel("Loss", fontsize=14)
axes[0].plot(train_loss_results)

axes[1].set_ylabel("Accuracy", fontsize=14)
axes[1].set_xlabel("Epoch", fontsize=14)
axes[1].plot(train_accuracy_results)

Model testing

The setup for the test Dataset is similar to the setup for the training Dataset. Download the CSV text file and parse the values, then give it a little shuffle:

In [ ]:
test_url = "http://download.tensorflow.org/data/iris_test.csv"

test_fp = tf.keras.utils.get_file(fname=os.path.basename(test_url),
                                  origin=test_url)
In [ ]:
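test_dataset = tf.data.experimental.make_csv_dataset(
    test_fp, batch_size, column_names=column_names,
    label_name=label_name, num_epochs=1)  # single pass; shuffles by default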

test_dataset = test_dataset.map(pack_features_vector)

Evaluate the model on the test dataset

Unlike the training stage, the model only evaluates a single epoch of the test data. In the following code cell, we iterate over each example in the test set and compare the model's prediction against the actual label.

In [ ]:
test_accuracy = tfe.metrics.Accuracy()

for (x, y) in test_dataset:
  logits = model(x)
  prediction = tf.argmax(logits, axis=1, output_type=tf.int32)
  test_accuracy(prediction, y)

print("Test set accuracy: {:.3%}".format(test_accuracy.result()))
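The accuracy metric itself is simple to compute by hand: take the index of the largest logit in each row and count how often it matches the label. A minimal plain-Python sketch with made-up logits and labels:

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy(logits, labels):
    """Fraction of examples whose highest-scoring class equals the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


logits = [[2.0, 0.1, -1.0], [0.3, 0.2, 1.5], [0.0, 3.0, 0.5], [1.0, 0.9, 0.8]]
labels = [0, 2, 1, 2]
acc = accuracy(logits, labels)

print(acc)  # 0.75
```

Because softmax preserves the ordering of its inputs, taking the argmax of the logits gives the same prediction as taking the argmax of the class probabilities.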

We can see on the last batch, for example, the model is usually correct:

In [ ]:


Build a multi-layer neural network to predict wine varietals using the wine chemistry dataset.

In [ ]:
import pandas as pd

wine = pd.read_table("../data/wine.dat", sep=r'\s+')

attributes = ['Alcohol',
            'Malic acid',
            'Alcalinity of ash',
            'Total phenols',
            'Nonflavanoid phenols',
            'Color intensity',
            'OD280/OD315 of diluted wines',
            ]

grape = wine.pop('region')
y = grape.values-1
X = wine.values
In [ ]:
# Write your answer here