The goal of this tutorial is to break down all the parts of the variational autoencoder and discuss them from fundamentals.
Suppose we have samples from an (unknown) underlying distribution $w_i \sim p_D(w)$, and we want to learn what $p_D(w)$ is.
This problem is extremely general, and many learning problems fall under this viewpoint. Take $w = y|x$, then this would be a supervised learning problem of learning $p_D(y|x)$.
We want to approximate the true distribution $p_D(w)$ with a parametric approximation $q_\theta(w)$. Any objective between distributions can be used; a notable one is the (forward) KL divergence $KL(p_D(w)\;||\; q_\theta(w))$.
In the special case of a classification problem with $K$ classes, this objective is the cross entropy (CE) loss.
$$ \min_\theta KL(p_D(y|x)\;||\; q_\theta(y|x)) = \underbrace{\mathbb{E}_{t \sim p_D(y|x)}[\log p_D(t|x) - \log q_\theta(t|x)]}_{\text{definition}} = \underbrace{\mathbb{E}_{t \sim p_D(y|x)}[\log p_D(t|x)]}_{\text{no effect on optimization of $\theta$}} - \mathbb{E}_{t \sim p_D(y|x)} [\log q_\theta(t|x)] $$The first term does not depend on $\theta$, and we evaluate the second term with a single sample to obtain the CE loss:
$$ -\mathbb{E}_{t \sim p_D(y|x)} [\log q_\theta(t|x)] \approx - \log q_\theta(t|x) $$More generally, if we just want to learn an approximate distribution $q_\theta(w)$ to $p_D(w)$, many problems arise.
Model constraint. $q_\theta$ needs to be normalized, i.e. $\int_w q_\theta(w)\, dw = 1$. In classification, $w = y|x$ is a discrete distribution with only $K$ possible values, so we can just enforce the sum to be 1 (e.g. by using a softmax; a concrete sketch of this case appears right after this list).
Problem knowledge. (Partial) information about $p_D(w)$ can come in many forms. Typical scenarios are (i) i.i.d. samples can be obtained from $p_D$, (ii) $p_D$ can be evaluated at any value of $w$, perhaps up to a normalizing constant. It may be that you only have (i) but not (ii), or vice versa. The training objective for $q_\theta$ needs to take into account what information is available.
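To make the classification case concrete, here is a minimal sketch (the toy network and data below are made up purely for illustration): a softmax keeps $q_\theta(y|x)$ normalized over the $K$ classes, and the single-sample estimate of the forward KL is exactly the cross-entropy loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

K = 3  # number of classes
classifier = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, K))

x = torch.randn(16, 4)          # inputs, treated as samples from p_D
t = torch.randint(0, K, (16,))  # one observed label per input: a single sample of y|x

# log_softmax enforces the normalization constraint: the K probabilities sum to 1 for each x.
log_q = F.log_softmax(classifier(x), dim=1)

# Single-sample estimate of -E_{t ~ p_D(y|x)}[log q_theta(t|x)], i.e. the CE loss.
ce_loss = -log_q[torch.arange(16), t].mean()
# This matches F.cross_entropy(classifier(x), t).
In practice you would just call F.cross_entropy, but spelling it out makes the connection to the forward KL explicit.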
One way to tackle the normalization problem is to output the statistics of a known distribution instead of the probability density function itself. For example, if we take
$$q_\theta(w) = \mathcal{N}(\mu_\theta, \sigma^2_\theta)$$then $q_\theta(w)$ is normalized by construction. In a conditional problem such as $w = y|x$, we can model $q_\theta(y|x)$ with a neural network that takes $x$ as input and outputs $\mu_\theta(x)$ and $\sigma_\theta(x)$.
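As a sketch of this idea in the conditional case (the network size and data below are hypothetical, and this helper is not used in the rest of the tutorial):
import torch
import torch.nn as nn

# A network that maps x to the mean and log-std of a Gaussian q_theta(y|x).
stats_net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 2))

def q_logprob_conditional(y, x):
    mu, log_std = stats_net(x).chunk(2, dim=1)
    # The Gaussian density is normalized by construction, for any mu and log_std.
    return torch.distributions.Normal(mu, log_std.exp()).log_prob(y)

x, y = torch.randn(8, 1), torch.randn(8, 1)
print(q_logprob_conditional(y, x).shape)  # one log-density per example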
Variational inference can be used to approximate a distribution $p(z) = \frac{\tilde{p}(z)}{C}$, for which we can only evaluate the unnormalized density $\tilde{p}(z)$, with a learned distribution $q_\theta(z)$.
The main application (and why it's referred to as inference) is to learn an approximation to the conditional distribution $p(z|x)$ when we only have access to the joint distribution $p(z,x)=p(z)p(x|z)$.
Bayes rule: $$ \underbrace{p(z|x)}_{\text{posterior}} = \frac{p(x|z)p(z)}{p(x)} \propto \underbrace{p(x|z)}_{\text{likelihood}}\underbrace{p(z)}_{\text{prior}} $$
Trying to figure out what the posterior should be given the likelihood and prior is an inference problem.
This is in the same form as $$ p(z) = \frac{\tilde{p}(z)}{C} $$ once we fix the variable $x$: the unnormalized density is $\tilde{p}(z) = p(x|z)p(z)$, the constant is $C = p(x)$, and the distribution we are trying to learn is the posterior $p(z|x)$.
The "variational" part of the name comes from approximating the true distribution $p(z|x)$ with an approximate (a.k.a. "variational") distribution $q_\theta(z)$.
We want an objective whose minimizer is $q_\theta(z) = p(z)$.
Since we only have an unnormalized density $\tilde{p}(z)$ (that we can evaluate at any value of $z$), a natural objective is the reverse KL
$$ L(\theta) := KL(q_\theta\;||\;p) = \mathbb{E}_{z\sim q(z)}[\log q(z) - \log p(z)] = \mathbb{E}_{z\sim q(z)}[\log q(z) - \log \tilde{p}(z) + \log C] $$The constant $C$ does not affect the optimization of $L(\theta)$.
%matplotlib inline
import math
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from tqdm import tqdm
def normal_logprob(z, mean, log_std):
    # Elementwise log-density of a diagonal Gaussian with the given mean and log-std.
    # Adding a zero tensor promotes plain Python/NumPy numbers to tensors.
    mean = mean + torch.tensor(0.)
    log_std = log_std + torch.tensor(0.)
    c = torch.tensor([math.log(2 * math.pi)]).to(z)
    inv_sigma = torch.exp(-log_std)
    tmp = (z - mean) * inv_sigma
    return -0.5 * (tmp * tmp + 2 * log_std + c)
def log_density_funnel(x):
    # Funnel-shaped target: x2 ~ N(0, 1.35^2) and x1 | x2 ~ N(0, exp(x2)^2),
    # so the density narrows sharply as x2 decreases.
    x1, x2 = x[:, 0], x[:, 1]
    x2_logdensity = normal_logprob(x2, 0, np.log(1.35))
    x1_logdensity = normal_logprob(x1, 0, x2)
    return x2_logdensity + x1_logdensity
def plot_isocontours(ax, func, xlimits=[-3, 3], ylimits=[-5, 3], numticks=101, alpha=1.0, cmap=plt.get_cmap('viridis')):
x = np.linspace(*xlimits, num=numticks)
y = np.linspace(*ylimits, num=numticks)
X, Y = np.meshgrid(x, y)
pts = torch.from_numpy(np.concatenate([np.atleast_2d(X.ravel()), np.atleast_2d(Y.ravel())]).T)
zs = torch.exp(func(pts)).detach().cpu().numpy()
Z = zs.reshape(X.shape)
ax.contour(X, Y, Z, alpha=alpha, cmap=cmap)
ax.set_yticks([])
ax.set_xticks([])
ax.set_xlim(xlimits)
ax.set_ylim(ylimits)
fig = plt.figure(figsize=(6, 6), facecolor='white')
ax = fig.add_subplot(111, frameon=False)
plot_isocontours(ax, log_density_funnel)
Here we try to fit this funnel-like distribution with a Gaussian distribution $q$.
# Model parameters.
mean = nn.Parameter(torch.randn(1, 2))
log_std = nn.Parameter(torch.randn(1, 2))
def q_logprob(x):
return normal_logprob(x.float(), mean, log_std).sum(1)
# Objective is the reverse KL: E_q[log q - log p].
def calculate_kl(logprob_func, n_samples):
###########################################
# Note: Sample from Q, evaluate logQ - logP
###########################################
samples = torch.randn(n_samples, 2).to(mean) * torch.exp(log_std) + mean
logq = q_logprob(samples)
logp = logprob_func(samples)
return torch.mean(logq - logp)
fig = plt.figure(figsize=(8, 4), facecolor='white')
ax = fig.add_subplot(121, frameon=False)
plot_isocontours(ax, log_density_funnel)
plot_isocontours(ax, q_logprob)
ax.set_title('Initial')
# Optimizer.
optimizer = torch.optim.Adam([mean, log_std], lr=1e-2)
# Training loop.
for i in range(3000):
optimizer.zero_grad()
kl = calculate_kl(log_density_funnel, 100)
kl.backward()
optimizer.step()
ax = fig.add_subplot(122, frameon=False)
plot_isocontours(ax, log_density_funnel)
plot_isocontours(ax, q_logprob)
ax.set_title('After Training')
KL divergence is asymmetric, implying $KL(q_\theta\;||\;p_D) \ne KL(p_D\;||\;q_\theta)$. We may want to use either one depending on what we know about $p_D$.
If we only have samples from $p_D$, use forward KL: $KL(p_D\;||\;q_\theta)$. This objective is equivalent to performing "maximum likelihood" (see the sketch below this list).
If we can only evaluate $p_D$ up to a normalizing constant, use reverse KL: $KL(q_\theta\;||\;p_D)$.
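To contrast with the reverse-KL fit above, here is a minimal sketch of the forward-KL route (the sample_funnel helper and the mle_* names are introduced here just for illustration; the sketch reuses normal_logprob from earlier). Since the funnel's generative process is known, we can draw samples from it and fit a Gaussian $q_\theta$ by maximum likelihood, which minimizes the forward KL up to the constant entropy of $p_D$.
# Samples from the funnel: x2 ~ N(0, 1.35^2), then x1 | x2 ~ N(0, exp(x2)^2).
def sample_funnel(n):
    x2 = torch.randn(n) * 1.35
    x1 = torch.randn(n) * torch.exp(x2)
    return torch.stack([x1, x2], dim=1)

mle_mean = nn.Parameter(torch.zeros(1, 2))
mle_log_std = nn.Parameter(torch.zeros(1, 2))
mle_optimizer = torch.optim.Adam([mle_mean, mle_log_std], lr=1e-2)

for _ in range(3000):
    mle_optimizer.zero_grad()
    w = sample_funnel(100)
    # Monte Carlo estimate of -E_{p_D}[log q_theta(w)]: the negative log-likelihood.
    nll = -normal_logprob(w, mle_mean, mle_log_std).sum(1).mean()
    nll.backward()
    mle_optimizer.step()
Note that only samples from the target are used here; the unnormalized density never appears, whereas the reverse-KL objective above never needed samples from $p_D$.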
Now instead of having a fixed $p(x|z)$, suppose we also want to learn this generative model $p_\theta(x|z)$.
Instead of directly learning $p(x)$, we may want to learn $p(x|z)$ because a simple conditional distribution $p_\theta(x|z)$, combined with a prior $p(z)$, can express a much richer marginal $p(x)$, and the latent variable $z$ gives us a representation of each data point.
Why not always do this? The marginal $p_\theta(x)$ now involves an integral over $z$, which makes training harder, as we see next.
Let's go back to Bayes rule:
$$ p_\theta(z|x) = \frac{p_\theta(x|z)p(z)}{p_\theta(x)} $$Here we use $\theta$ to denote that the distribution depends on $\theta$ even if we're only directly parameterizing $p_\theta(x|z)$. We can no longer treat $p_\theta(x) = \int p_\theta(x|z)p(z) dz $ as a constant because it depends on the parameters $\theta$.
Ideally, we directly optimize based on $p_\theta(x)$, but this is hidden behind an expensive integral. Instead, the ELBO is typically used to train latent variable models by introducing a variational distribution $q(z)$.
The ELBO can be derived in many different ways. Here's a short one: start from Bayes' rule and, on both sides, (i) take the log, (ii) subtract $\log q(z)$, (iii) take the expectation w.r.t. $q(z)$.
$$ \log p(z|x) = \log p(x|z) + \log p(z) - \log p(x) $$$$ \log p(z|x) - \log q(z) = \log p(x|z) + \log p(z) - \log p(x) - \log q(z) $$$$ \underbrace{\mathbb{E}_{z\sim q(z)} [ \log p(z|x) - \log q(z) ]}_{- \text{reverse KL}} = \underbrace{\mathbb{E}_{z\sim q(z)} [ \log p(x|z) + \log p(z) - \log q(z) ]}_{\text{ELBO}} - \log p(x) $$Re-arrange this, and we get
$$ \log p(x) = \text{ELBO}(x) + \underbrace{KL(q(z)\;||\;p(z|x))}_{\geq 0} $$$$ \text{ELBO}(x) := \mathbb{E}_{z\sim q(z)} [ \log p(x|z) + \log p(z) - \log q(z)] $$When we want to train a generative model $p_\theta(x|z)$, maximizing the ELBO simultaneously (i) maximizes a lower bound on $\log p_\theta(x)$, and (ii) pushes $q(z)$ toward the true posterior $p_\theta(z|x)$, since for a fixed $\theta$ the gap between $\log p_\theta(x)$ and the ELBO is exactly the KL term above.
The ELBO is a function of $x$, so if we have a dataset of samples $x_i$, we can maximize
$$ \sum_{i=1}^n \text{ELBO}(x_i) $$Aside: This is equivalent to minimizing an upper bound on the forward KL divergence $KL(p_D(x)\;||\;p_\theta(x))$ with $n$ samples from $p_D$. Can you derive this equivalence?
class VAE(nn.Module):
def __init__(self):
super(VAE, self).__init__()
self.encoder = nn.Sequential(nn.Linear(2, 100), nn.ReLU(), nn.Linear(100, 4))
self.decoder = nn.Sequential(nn.Linear(2, 100), nn.ReLU(), nn.Linear(100, 2))
def elbo(self, x):
# Get q(z|x). Get p(x|z). Kind of like an autoencoder.
q_params = self.encoder(x)
q_mean, q_logstd = q_params[:, :2], q_params[:, 2:]
q_samples = torch.randn(x.shape[0], 2).to(q_mean) * torch.exp(q_logstd) + q_mean
px_mean = self.decoder(q_samples)
# Compute ELBO = logp(x|z) + logp(z) - logq(z)
logqz = normal_logprob(q_samples, q_mean, q_logstd).sum(1)
logpz = normal_logprob(q_samples, 0., 0.).sum(1)
logpxz = normal_logprob(x, px_mean, 0.).sum(1)
elbo = torch.mean(logpxz + logpz - logqz)
return elbo
# We train on a dataset of just four points.
data = torch.tensor([
[5, 5],
[0, 5],
[5, 0],
[-5, -5],
]).float()
# VAE model.
model = VAE()
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
# Just plotting code.
fig = plt.figure(figsize=(12, 3), facecolor='white')
ax = fig.add_subplot(141, frameon=False)
plot_isocontours(ax, lambda x: normal_logprob(x, 0., 0.).sum(1))
ax.set_title('Prior')
ax.set_ylim([-3, 3])
ax.set_xlim([-3, 3])
ax = fig.add_subplot(142, frameon=False)
q_params = model.encoder(data)
q_mean, q_logstd = q_params[:, :2], q_params[:, 2:]
for i in range(4):
plot_isocontours(ax, lambda x: normal_logprob(x.float(), q_mean[i], q_logstd[i]).sum(1))
ax.set_title('q(z|x) (Init)')
ax.set_ylim([-3, 3])
ax.set_xlim([-3, 3])
# Training loop.
for _ in tqdm(range(10000)):
optimizer.zero_grad()
elbo = model.elbo(data)
elbo.mul(-1).backward()
optimizer.step()
# More plotting code.
ax = fig.add_subplot(143, frameon=False)
q_params = model.encoder(data)
q_mean, q_logstd = q_params[:, :2], q_params[:, 2:]
for i in range(q_mean.shape[0]):
plot_isocontours(ax, lambda x: normal_logprob(x.float(), q_mean[i], q_logstd[i]).sum(1))
ax.set_title('q(z|x) (After Training)')
ax.set_ylim([-3, 3])
ax.set_xlim([-3, 3])
ax = fig.add_subplot(144, frameon=False)
for i in range(q_mean.shape[0]):
samples = torch.randn(1000, 2).to(q_mean) * torch.exp(q_logstd[i][None]) + q_mean[i][None]
ax.scatter(samples[:, 0].detach().numpy(), samples[:, 1].detach().numpy())
ax.set_title('q(z|x) (Samples)')
ax.set_ylim([-3, 3])
ax.set_xlim([-3, 3])
ax.set_yticks([])
ax.set_xticks([])
print('Final ELBO: {:.3f}'.format(elbo.item()))
100%|██████████| 10000/10000 [00:16<00:00, 619.29it/s]
Final ELBO: -4.607
Previously, we stuck to modeling using Gaussians because we know the normalization constant of a Gaussian distribution. Normalizing flows offer a more flexible way to tackle the normalization constant.
The change-of-variables formula, applied to random variables:
$$ \log p(x) = \underbrace{\log p(x_0)}_{\text{simple distribution}} - \underbrace{\log \left| \det \frac{\partial T(x_0)}{\partial x_0} \right|}_{\text{automatically normalizes $p(x)$}} $$where $T$ is an invertible transformation applied to samples from a simple distribution $p(x_0)$ into samples from $p(x)$. Here $p_\theta(x)$ is defined by the transformation $T_\theta$ itself.
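As a quick sanity check of this formula (a standard one-dimensional example, not specific to this tutorial): take $x_0 \sim \mathcal{N}(0, 1)$ and scale it, $x = T(x_0) = \sigma x_0$. Then $\left|\det \frac{\partial T}{\partial x_0}\right| = \sigma$ and
$$ \log p(x) = \log \mathcal{N}(x_0;\, 0, 1) - \log \sigma = -\frac{x^2}{2\sigma^2} - \log\sigma - \tfrac{1}{2}\log 2\pi = \log \mathcal{N}(x;\, 0, \sigma^2), $$which is exactly the correctly normalized density of a Gaussian with standard deviation $\sigma$.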
An example of an invertible transformation (in two dimensions, writing $z = (z_0, z_1)$) is:
$$T(z)_0 = z_0$$$$T(z)_1 = z_1 + f(z_0)$$Then the Jacobian is $$ \begin{bmatrix} 1 & 0 \\ \frac{\partial f}{\partial z_0} & 1\\ \end{bmatrix} $$
Since this is a triangular matrix, the determinant is the product of the diagonal elements, which is 1. The log determinant is 0. So for this type of transformation $p(x) = p(x_0)$ where $x = T(x_0)$. A transformation that doesn't change the probability density of the transformed sample is called volume preserving.
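We can check this numerically with automatic differentiation (a quick illustration; the particular coupling function below is an arbitrary choice, not something used later in the tutorial):
import torch

def coupling(z):
    # Additive coupling in 2D: the first coordinate is unchanged,
    # the second is shifted by a function of the first.
    z0, z1 = z[0], z[1]
    return torch.stack([z0, z1 + torch.tanh(3.0 * z0)])

z = torch.randn(2)
J = torch.autograd.functional.jacobian(coupling, z)
print(torch.logdet(J))  # ~0: the log-determinant vanishes, so the transform preserves volume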
Previously, we made the Gaussian assumption in two places: the variational posterior $q(z|x)$ and the generative model $p_\theta(x|z)$.
We could replace either with a normalizing flow (NF), but we'll focus on using an NF to express a more complex variational posterior, since this also makes training the generative model easier: it can now have any posterior, not just ones that are easily approximated by a Gaussian.
To model $q(z|x)$ using a normalizing flow, we first define a simple distribution on $z_0$ then apply an invertible transformation to obtain:
$$ \log q(z|x) = \log q(z_0|x) - \log \left| \det \frac{\partial T(z_0)}{\partial z_0} \right| $$The second term is zero when using the volume preserving transformation from before.
##!! The new parts of the code are commented with two hash symbols
##!! and two exclamation marks.
class VAE(nn.Module):
def __init__(self):
super(VAE, self).__init__()
self.encoder = nn.Sequential(nn.Linear(2, 100), nn.Tanh(), nn.Linear(100, 4))
self.decoder = nn.Sequential(nn.Linear(2, 100), nn.Tanh(), nn.Linear(100, 2))
##!! We use 10 series of invertible transformations.
self.transforms = nn.ModuleList([
nn.Sequential(nn.Linear(1, 20), nn.Tanh(), nn.Linear(20, 1)) for _ in range(10)
])
def transform(self, z):
##!! Alternate dimensions for transformation.
for i, transform in enumerate(self.transforms):
if i % 2 == 0:
z = torch.cat([z[:, 0:1], z[:, 1:2] + transform(z[:, 0:1])], dim=1)
else:
z = torch.cat([z[:, 0:1] + transform(z[:, 1:2]), z[:, 1:2]], dim=1)
return z
def elbo(self, x):
# Get q(z|x). Get p(x|z). Kind of like an autoencoder.
q_params = self.encoder(x)
q_mean, q_logstd = q_params[:, :2], q_params[:, 2:]
q_samples = torch.randn(x.shape[0], 2).to(q_mean) * torch.exp(q_logstd) + q_mean
##!! Compute log q(z_0) based on pre-transform samples.
logqz = normal_logprob(q_samples, q_mean, q_logstd).sum(1)
##!! Transformed samples.
q_samples = self.transform(q_samples)
px_mean = self.decoder(q_samples)
# Compute ELBO = logp(x|z) + logp(z) - logq(z)
logpz = normal_logprob(q_samples, 0., 0.).sum(1)
logpxz = normal_logprob(x, px_mean, 0.).sum(1)
elbo = torch.mean(logpxz + logpz - logqz)
return elbo
# We train on a dataset of just four points.
data = torch.tensor([
[5, 5],
[0, 5],
[5, 0],
[-5, -5],
]).float()
# VAE model.
model = VAE()
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
# Just plotting code.
fig = plt.figure(figsize=(6, 3), facecolor='white')
ax = fig.add_subplot(121, frameon=False)
plot_isocontours(ax, lambda x: normal_logprob(x, 0., 0.).sum(1))
ax.set_title('Prior')
ax.set_ylim([-3, 3])
ax.set_xlim([-3, 3])
# Training loop.
for _ in tqdm(range(10000)):
optimizer.zero_grad()
elbo = model.elbo(data)
elbo.mul(-1).backward()
optimizer.step()
# More plotting code.
ax = fig.add_subplot(122, frameon=False)
q_params = model.encoder(data)  # Recompute q(z|x) with the newly trained encoder.
q_mean, q_logstd = q_params[:, :2], q_params[:, 2:]
for i in range(q_mean.shape[0]):
samples = torch.randn(1000, 2).to(q_mean) * torch.exp(q_logstd[i][None]) + q_mean[i][None]
samples = model.transform(samples)
ax.scatter(samples[:, 0].detach().numpy(), samples[:, 1].detach().numpy())
ax.set_title('q(z|x) (Samples)')
ax.set_ylim([-3, 3])
ax.set_xlim([-3, 3])
ax.set_yticks([])
ax.set_xticks([])
print('Final ELBO: {:.3f}'.format(elbo.item()))
100%|██████████| 10000/10000 [00:59<00:00, 166.69it/s]
Final ELBO: -4.083