Generative Adversarial Networks

Generative adversarial networks (GANs) are a powerful approach for probabilistic modeling (I. Goodfellow et al., 2014; I. Goodfellow, 2016). They posit a deep generative model and they enable fast and accurate inferences.

We demonstrate with an example in Edward. A webpage version is available at http://edwardlib.org/tutorials/gan.

In [2]:
from __future__ import absolute_import
from __future__ import print_function
from __future__ import division

import edward as ed
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import numpy as np
import os
import tensorflow as tf

from edward.models import Uniform
from tensorflow.contrib import slim
from tensorflow.examples.tutorials.mnist import input_data
In [3]:
def plot(samples):
  fig = plt.figure(figsize=(4, 4))
  gs = gridspec.GridSpec(4, 4)
  gs.update(wspace=0.05, hspace=0.05)

  for i, sample in enumerate(samples):
    ax = plt.subplot(gs[i])
    plt.axis('off')
    ax.set_xticklabels([])
    ax.set_yticklabels([])
    ax.set_aspect('equal')
    plt.imshow(sample.reshape(28, 28), cmap='Greys_r')

  return fig


ed.set_seed(42)

M = 128  # batch size during training
d = 100  # latent dimension

DATA_DIR = "data/mnist"
IMG_DIR = "img"

if not os.path.exists(DATA_DIR):
  os.makedirs(DATA_DIR)
if not os.path.exists(IMG_DIR):
  os.makedirs(IMG_DIR)

Data

We use training data from MNIST, which consists of 55,000 $28\times 28$ pixel images (LeCun, Bottou, Bengio, & Haffner, 1998). Each image is represented as a flattened vector of 784 elements, and each element is a pixel intensity between 0 and 1.

GAN Fig 0
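
As a small illustration of this representation, the sketch below flattens a 28×28 image into a vector of 784 intensities in $[0, 1]$ and reshapes it back. The image here is synthetic random data rather than an MNIST digit, and the variable names are illustrative only.

```python
import numpy as np

# Hypothetical 28x28 grayscale image with 8-bit intensities (0..255);
# random values stand in for a real MNIST digit.
rng = np.random.RandomState(0)
image_uint8 = rng.randint(0, 256, size=(28, 28)).astype(np.uint8)

# Scale to [0, 1] and flatten row by row into a vector of 784 elements.
flat = image_uint8.astype(np.float32).reshape(-1) / 255.0

# Reshaping recovers the 2-D layout, as plot() does before calling imshow.
restored = flat.reshape(28, 28)

print(flat.shape, restored.shape)
```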

The goal is to build and infer a model that can generate high quality images of handwritten digits.

During training we will feed batches of MNIST digits. We instantiate a TensorFlow placeholder with a fixed batch size of $M$ images.

In [4]:
mnist = input_data.read_data_sets(DATA_DIR, one_hot=True)
x_ph = tf.placeholder(tf.float32, [M, 784])
Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting data/mnist/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting data/mnist/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting data/mnist/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting data/mnist/t10k-labels-idx1-ubyte.gz

Model

GANs posit generative models using an implicit mechanism. Given some random noise, the data is assumed to be generated by a deterministic function of that noise.

Formally, the generative process is

\begin{align*} \mathbf{\epsilon} &\sim p(\mathbf{\epsilon}), \\ \mathbf{x} &= G(\mathbf{\epsilon}; \theta), \end{align*}

where $G(\cdot; \theta)$ is a neural network that takes the samples $\mathbf{\epsilon}$ as input. The distribution $p(\mathbf{\epsilon})$ is interpreted as random noise injected to produce stochasticity in a physical system; it is typically a fixed uniform or normal distribution with some latent dimensionality.

In Edward, we build the model as follows, using TensorFlow Slim to specify the neural network. It defines a 2-layer fully connected neural network and outputs a vector of length $28\times28$ with values in $[0,1]$.

In [5]:
def generative_network(eps):
  h1 = slim.fully_connected(eps, 128, activation_fn=tf.nn.relu)
  x = slim.fully_connected(h1, 784, activation_fn=tf.sigmoid)
  return x

with tf.variable_scope("Gen"):
  eps = Uniform(a=tf.zeros([M, d]) - 1.0, b=tf.ones([M, d]))
  x = generative_network(eps)

We aim to estimate parameters of the generative network such that the model best captures the data. (Note that in GANs, we are interested only in parameter estimation and not inference about any latent variables.)

Unfortunately, the probability models described above do not admit a tractable likelihood. This poses a problem for most inference algorithms, as they usually require evaluating the model's density. Thus we are motivated to use "likelihood-free" algorithms (Marin, Pudlo, Robert, & Ryder, 2012), a class of methods which assume one can only sample from the model.
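
The "sample-only" access that likelihood-free methods assume can be sketched in plain NumPy. This is not the Edward model itself: the weights below are hypothetical, untrained stand-ins for $\theta$, and the layer sizes simply mirror the generative network above. Drawing samples is a single forward pass, but no function is available for evaluating the density of a given $\mathbf{x}$.

```python
import numpy as np

M, d = 128, 100  # batch size and latent dimension, as in the tutorial

rng = np.random.RandomState(42)
# Hypothetical, untrained weights standing in for the parameters theta.
W1, b1 = rng.normal(0.0, 0.1, size=(d, 128)), np.zeros(128)
W2, b2 = rng.normal(0.0, 0.1, size=(128, 784)), np.zeros(784)


def G(eps):
    """Deterministic map from noise to data: ReLU layer, then sigmoid layer."""
    h1 = np.maximum(eps.dot(W1) + b1, 0.0)
    return 1.0 / (1.0 + np.exp(-(h1.dot(W2) + b2)))


# Sampling from the implicit model: draw noise, push it through G.
eps = rng.uniform(-1.0, 1.0, size=(M, d))
x_samples = G(eps)

print(x_samples.shape)
```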

Inference

A key idea in likelihood-free methods is to learn by comparison (e.g., Rubin, 1984; Gretton, Borgwardt, Rasch, Schölkopf, & Smola, 2012): by analyzing the discrepancy between samples from the model and samples from the true data distribution, we have information on where the model can be improved in order to generate better samples.

In GANs, a neural network $D(\cdot;\phi)$, known as the discriminator, makes this comparison. $D(\cdot;\phi)$ takes data $\mathbf{x}$ as input (either generations from the model or data points from the data set), and it calculates the probability that $\mathbf{x}$ came from the true data.

In Edward, we use the following discriminative network. It is simply a feedforward network with one ReLU hidden layer. It returns the probability in the logit (unconstrained) scale.

In [6]:
def discriminative_network(x):
  """Outputs probability in logits."""
  h1 = slim.fully_connected(x, 128, activation_fn=tf.nn.relu)
  logit = slim.fully_connected(h1, 1, activation_fn=None)
  return logit

Let $p^*(\mathbf{x})$ represent the true data distribution. The optimization problem used in GANs is

\begin{equation*} \min_\theta \max_\phi~ \mathbb{E}_{p^*(\mathbf{x})} [ \log D(\mathbf{x}; \phi) ] + \mathbb{E}_{p(\mathbf{x}; \theta)} [ \log (1 - D(\mathbf{x}; \phi)) ]. \end{equation*}

This optimization problem is bilevel: it requires a min-max solution. In practice, the algorithm proceeds by iterating between these two optimizations, alternating gradient updates. An additional heuristic modifies the objective function for the generative model in order to avoid saturation of gradients: rather than minimizing $\log (1 - D(\mathbf{x}; \phi))$ over generated $\mathbf{x}$, the generator maximizes $\log D(\mathbf{x}; \phi)$ (I. J. Goodfellow, 2014).
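
To make the two objectives concrete, here is a minimal NumPy sketch that evaluates them from discriminator logits. It assumes the losses are written as sigmoid cross-entropies, which is a common formulation; whether GANInference computes exactly these quantities is an assumption here. The function names are illustrative and not part of Edward.

```python
import numpy as np


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return np.logaddexp(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def discriminator_loss(logits_real, logits_fake):
    """-E[log D(x_real)] - E[log(1 - D(x_fake))], with D = sigmoid(logit)."""
    return np.mean(softplus(-logits_real)) + np.mean(softplus(logits_fake))


def generator_loss_minimax(logits_fake):
    """E[log(1 - D(x_fake))]: the generator's term in the min-max objective."""
    return -np.mean(softplus(logits_fake))


def generator_loss_nonsaturating(logits_fake):
    """-E[log D(x_fake)]: the heuristic objective that avoids saturation."""
    return np.mean(softplus(-logits_fake))


# A discriminator that cannot tell real from fake outputs logit 0 (D = 0.5).
zeros = np.zeros(4)
print(discriminator_loss(zeros, zeros))      # 2 * log(2)
print(generator_loss_nonsaturating(zeros))   # log(2)

# When the discriminator confidently rejects a fake sample (logit << 0),
# compare the gradient of each generator loss with respect to the logit.
logit = -10.0
grad_minimax = -sigmoid(logit)                # d/dlogit of log(1 - sigmoid)
grad_nonsaturating = -(1.0 - sigmoid(logit))  # d/dlogit of -log(sigmoid)
print(abs(grad_minimax), abs(grad_nonsaturating))
```

With all logits at zero, the discriminator loss is $2\log 2 \approx 1.386$ and the non-saturating generator loss is $\log 2 \approx 0.693$. For a confidently rejected sample, the min-max gradient is close to zero while the non-saturating gradient has magnitude close to one, which is the saturation the heuristic is meant to avoid.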

Many sources of intuition exist behind GAN-style training. One, which is the original motivation, is based on the idea that the two neural networks are playing a game. The discriminator tries to best distinguish real data from samples produced by the generator. The generator tries to produce samples that the discriminator cannot distinguish from real data. The goal of training is to reach a Nash equilibrium.
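
The game can be sketched in a toy one-dimensional setting with hand-written gradients. Everything here is an illustrative assumption rather than the tutorial's model: the data are drawn from a normal distribution with mean 3, the generator only learns a shift `theta`, the discriminator is a logistic regression with parameters `w` and `b`, and plain gradient descent replaces the optimizers used elsewhere. The generator uses the non-saturating loss.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def disc_grads(w, b, x_real, x_fake):
    """Gradients of the discriminator loss with respect to (w, b).

    Loss: mean(-log D(x_real)) + mean(-log(1 - D(x_fake))),
    with D(x) = sigmoid(w * x + b).
    """
    s_real = sigmoid(w * x_real + b)
    s_fake = sigmoid(w * x_fake + b)
    grad_w = np.mean(-(1.0 - s_real) * x_real) + np.mean(s_fake * x_fake)
    grad_b = np.mean(-(1.0 - s_real)) + np.mean(s_fake)
    return grad_w, grad_b


def gen_grad(theta, w, b, eps):
    """Gradient of the non-saturating generator loss with respect to theta.

    Loss: mean(-log D(G(eps))), with G(eps) = eps + theta.
    """
    s_fake = sigmoid(w * (eps + theta) + b)
    return np.mean(-(1.0 - s_fake) * w)


def train_toy_gan(n_iter=2000, batch=128, lr=0.05, seed=0):
    rng = np.random.RandomState(seed)
    theta, w, b = 0.0, 0.0, 0.0
    for _ in range(n_iter):
        # Discriminator move: one gradient step on its own loss.
        x_real = rng.normal(3.0, 1.0, size=batch)
        eps = rng.normal(0.0, 1.0, size=batch)
        gw, gb = disc_grads(w, b, x_real, eps + theta)
        w, b = w - lr * gw, b - lr * gb
        # Generator move: one gradient step against the updated discriminator.
        eps = rng.normal(0.0, 1.0, size=batch)
        theta = theta - lr * gen_grad(theta, w, b, eps)
    return theta, w, b


theta, w, b = train_toy_gan()
print("learned shift theta = %.3f (data mean is 3.0)" % theta)
```

In this toy setting the shift is pushed toward the data mean whenever the discriminator assigns higher scores to larger values, but as with any GAN the two players may oscillate around the equilibrium rather than settle exactly on it.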

Another source is the idea of casting unsupervised learning as supervised learning (M. U. Gutmann, Dutta, Kaski, & Corander, 2014; M. Gutmann & Hyvärinen, 2010). This allows one to leverage the power of classification, a problem that in recent years has become (relatively speaking) very easy.

A third comes from classical statistics, where the discriminator is interpreted as a proxy of the density ratio between the true data distribution and the model (Mohamed & Lakshminarayanan, 2016; Sugiyama, Suzuki, & Kanamori, 2012). By augmenting an original problem that may require the model's density (such as maximum likelihood) with a discriminator, one can recover the original problem when the discriminator is optimal. Furthermore, this approximation is very fast, and it justifies GANs from the perspective of approximate inference.
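
The density-ratio reading can be made concrete with a standard result (Goodfellow et al., 2014), stated here under the assumption that the discriminator is flexible enough to attain the pointwise maximum of the objective for a fixed $\theta$:

\begin{align*} D^*(\mathbf{x}) &= \frac{p^*(\mathbf{x})}{p^*(\mathbf{x}) + p(\mathbf{x}; \theta)}, \\ \frac{p^*(\mathbf{x})}{p(\mathbf{x}; \theta)} &= \frac{D^*(\mathbf{x})}{1 - D^*(\mathbf{x})}. \end{align*}

Taking logarithms of the second line, the logit of an optimal discriminator equals the log density ratio $\log p^*(\mathbf{x}) - \log p(\mathbf{x}; \theta)$. A trained discriminator is only an approximation to $D^*$, so its logit should be read as an estimate of this ratio.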

In Edward, the GAN algorithm (GANInference) simply takes the implicit density model on x as input, bound to its realizations x_ph. In addition, a parameterized function discriminator is provided to distinguish model samples from data.

In [7]:
inference = ed.GANInference(
    data={x: x_ph}, discriminator=discriminative_network)

We'll use Adam as the optimizer for both the generator and the discriminator. We'll run the algorithm for 15,000 iterations and print progress every 1,000 iterations.

In [9]:
optimizer = tf.train.AdamOptimizer()
optimizer_d = tf.train.AdamOptimizer()

inference = ed.GANInference(
    data={x: x_ph}, discriminator=discriminative_network)
inference.initialize(
    optimizer=optimizer, optimizer_d=optimizer_d,
    n_iter=15000, n_print=1000)

We now form the main loop which trains the GAN. At each iteration, it takes a minibatch and updates the parameters according to the algorithm. Every 1,000 iterations, it prints progress and saves a figure of generated samples from the model.

In [10]:
sess = ed.get_session()
tf.global_variables_initializer().run()

idx = np.random.randint(M, size=16)
i = 0
for t in range(inference.n_iter):
  if t % inference.n_print == 0:
    samples = sess.run(x)
    samples = samples[idx, ]

    fig = plot(samples)
    plt.savefig(os.path.join(IMG_DIR, '{}.png').format(
        str(i).zfill(3)), bbox_inches='tight')
    plt.close(fig)
    i += 1

  x_batch, _ = mnist.train.next_batch(M)
  info_dict = inference.update(feed_dict={x_ph: x_batch})
  inference.print_progress(info_dict)
Iteration     1 [  0%]: Gen Loss = 0.667: Disc Loss = 1.303
Iteration  1000 [  6%]: Gen Loss = 14.898: Disc Loss = 0.026
Iteration  2000 [ 13%]: Gen Loss = 4.941: Disc Loss = 0.034
Iteration  3000 [ 20%]: Gen Loss = 5.195: Disc Loss = 0.054
Iteration  4000 [ 26%]: Gen Loss = 4.918: Disc Loss = 0.107
Iteration  5000 [ 33%]: Gen Loss = 4.584: Disc Loss = 0.165
Iteration  6000 [ 40%]: Gen Loss = 4.029: Disc Loss = 0.221
Iteration  7000 [ 46%]: Gen Loss = 3.286: Disc Loss = 0.594
Iteration  8000 [ 53%]: Gen Loss = 3.374: Disc Loss = 0.312
Iteration  9000 [ 60%]: Gen Loss = 2.803: Disc Loss = 0.611
Iteration 10000 [ 66%]: Gen Loss = 2.362: Disc Loss = 0.745
Iteration 11000 [ 73%]: Gen Loss = 2.974: Disc Loss = 0.526
Iteration 12000 [ 80%]: Gen Loss = 2.787: Disc Loss = 0.514
Iteration 13000 [ 86%]: Gen Loss = 2.413: Disc Loss = 0.852
Iteration 14000 [ 93%]: Gen Loss = 1.990: Disc Loss = 0.696
Iteration 15000 [100%]: Gen Loss = 1.929: Disc Loss = 0.781

Examining convergence of the GAN objective can be meaningless in practice. The algorithm is usually run until some other criterion is satisfied, such as the samples looking visually plausible or the GAN capturing meaningful parts of the data.

Criticism

Evaluation of GANs remains an open problem, both in criticizing their fit to data and in assessing convergence. Recent advances have considered alternative objectives and heuristics to stabilize training (see also Soumith Chintala's GAN hacks repo).

As one approach to criticize the model, we simply look at generated images during training. Below we show generations after 14,000 iterations (that is, 14,000 gradient updates of both the generator and the discriminator).
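
To locate the saved figure for a given iteration, the file naming used in the training loop can be reproduced as follows. This assumes the loop was run unchanged (a snapshot every 1,000 iterations, numbered from zero and zero-padded to three digits); `snapshot_path` is a hypothetical helper, not part of Edward.

```python
import os

IMG_DIR = "img"   # same directory name as in the tutorial
N_PRINT = 1000    # snapshot interval used in the training loop


def snapshot_path(t, img_dir=IMG_DIR, n_print=N_PRINT):
    """Path of the figure saved at iteration t (hypothetical helper)."""
    if t % n_print != 0:
        raise ValueError("no snapshot is saved at iteration %d" % t)
    index = t // n_print  # the counter i in the training loop
    return os.path.join(img_dir, "{}.png".format(str(index).zfill(3)))


print(snapshot_path(14000))
```

Under these assumptions, the snapshot taken after 14,000 iterations is the file named 014.png in the image directory.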

GAN Fig 1

The images are meaningful albeit a little blurry. Suggestions for further improvements would be to tune the hyperparameters in the optimization, to increase the capacity of the discriminative and generative networks, and to leverage more prior information (such as convolutional architectures).