This numerical tour explains denoising diffusion models on a simple 1-D example. The density to sample from is a mixture of Gaussians, so that the diffused densities can be computed in closed form.
To learn more about diffusion models, you can have a look at Valentin de Bortoli's course slides. I have written a short mathematical memo here.
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import numpy as np
In order to be able to have closed-form formulas for the flows simulated below, we consider 1-D mixtures of Gaussians $\sum_i a_i \mathcal{N}(\mu_i,\sigma_i^2)$ parameterized by $(a,\mu,\sigma)$. We define their density functions and a sampling operator.
def gauss(x, m, s):
    """Density of the 1-D Gaussian N(m, s^2) evaluated (elementwise) at x."""
    z = (x - m) / s
    normalization = torch.sqrt(2 * torch.tensor(torch.pi)) * s
    return torch.exp(-z ** 2 / 2) / normalization
def gaus_mixture(x, mu, sigma, a):
    """Density at x of the Gaussian mixture sum_k a[k] * N(mu[k], sigma[k]^2)."""
    # accumulator with the same shape/dtype as x; kept out-of-place so autograd
    # can differentiate through the sum (callers backprop through this)
    total = x * 0
    for k in range(len(mu)):
        total = total + a[k] * gauss(x, mu[k], sigma[k])
    return total
def sample_mixture(P, mu, sigma, a):
    """Draw P i.i.d. samples from the mixture sum_k a[k] N(mu[k], sigma[k]^2)."""
    # normalize the weights so np.random.choice receives a probability vector
    probs = np.array(a / torch.sum(a))
    # pick a mixture component for each sample, then draw from that Gaussian
    comp = np.random.choice(np.arange(0, len(a)), size=(P, 1), p=probs)
    samples = mu[comp] + torch.randn((P, 1)) * sigma[comp]
    return samples.flatten()
Define also helpers for the display of histograms and trajectories.
fs = (6, 3) # size of figures for display
xmax = 3 # histograms and densities are displayed on [-xmax, xmax]
nbins = 200 # number of histogram bins
def display_histogram(X0):
    """Plot a density-normalized histogram of the samples X0 as red bars on [-xmax, xmax]."""
    n_samples = len(X0)
    counts, _ = torch.histogram(X0.flatten(), bins=nbins, range=[-xmax, xmax])
    # rescale counts so the bar heights approximate a probability density
    heights = counts / n_samples * nbins / (2 * xmax)
    centers = torch.linspace(-xmax, xmax, nbins)
    plt.bar(centers, heights, width=2 * xmax / (nbins - 1), align='center', color='red')
def display_trajectories(Y, m = 1000, alpha=.3, linewidth=.2): # m=number of trajectories to display
    """Plot m of the rows of Y (one trajectory per particle), colored by final position."""
    # evenly spread m particle indices over the P available rows (P, T, N are globals)
    selected = torch.round(torch.linspace(0, P - 1, m)).type(torch.int32)
    # sort trajectories by their end point so the blue-to-red gradient is ordered
    order = torch.argsort(Y[:, -1])
    Y = Y[order, :]
    times = torch.linspace(0, T, N)
    for j in range(m):
        k = selected[j]
        s = k / (P - 1)  # color interpolation parameter in [0, 1]
        plt.plot(times, Y[k, :], color=[s.item(), 0, 1 - s.item()], alpha=alpha, linewidth=linewidth)
Define a measure $\rho_0(x)$ defined as a mixture and display.
# Parameters of the target measure rho0 = sum_i a_i N(mu_i, sigma_i^2).
mu = torch.tensor([-.7, 0, .5])*3 # means of the components
sigma = torch.tensor([.02, .05, .03])*9 # standard deviations of the components
a = torch.tensor([2, 3, 2]) # un-normalized weights
a = a/torch.sum(a) # normalize the weights so they sum to 1
# Density of the target measure rho0.
def rho0(x): return gaus_mixture(x,mu,sigma,a)
Evaluate the density on a grid and display it.
# Evaluate rho0 on a regular grid and display it as a filled curve.
x = torch.linspace(-xmax,xmax,1000)
plt.figure(figsize=fs)
plt.fill_between(x,rho0(x));
To sample from a density $\rho_0(x)$, Langevin diffusion performs a gradient descent on the negative log-density, which reads, for a step size $\tau>0$, $$ X_{k+1} := X_k + \tau \eta_0(X_k) + \sqrt{2\tau}W_k $$ where the $W_{k} \sim \mathcal{N}(0,\text{Id})$ are i.i.d. Gaussian white noises, and where the ``non-smoothed'' (we will introduce its smoothed version later) score reads $$ \eta_0(x) := \nabla \log \rho_0(x) = \frac{\nabla \rho_0(x)}{\rho_0(x)}. $$
When $\tau \rightarrow 0$, this corresponds to the following Langevin SDE $$ d X_t = \eta_0(X_t) \, d t + \sqrt{2} \, d W_t $$ where $W_t$ is a Wiener process.
One can show that the law of $X_t$ converges toward $\rho_0$ as $t \to +\infty$.
We first compute and display this score.
def eta0(x):
    """Score of rho0: eta0(x) = grad log rho0(x) = rho0'(x) / rho0(x), via autograd."""
    xg = x.detach().clone()
    xg.requires_grad = True
    # gradient of the sum gives the pointwise derivative of rho0 at each entry of x
    torch.sum(rho0(xg)).backward()
    return xg.grad / rho0(x)
# Display the density rho0 together with its (rescaled) score eta0.
plt.figure(figsize=fs)
plt.plot(x,rho0(x).detach().numpy(), label='$\\rho_0$' )
plt.plot(x,eta0(x).detach().numpy() /40, label='$\\eta_0$') # divided by 40 for readability
plt.legend();
Sample $Z_0$ from the initial density at time $t=0$, which is uniform on $[-w/2,w/2]$.
P = 50000 # number of particles
w = 6 # support of the indicator
Z0 = w*torch.rand((P,))-w/2 # initial particles, uniform on [-w/2, w/2]
Run the recursion of Langevin.
T = 10 # final time
N = 2500 # number of steps
tau = T / N # step size
# Euler-Maruyama discretization of the Langevin SDE:
# Z_{k+1} = Z_k + tau * eta0(Z_k) + sqrt(2*tau) * W_k, with W_k standard Gaussian.
Z = torch.zeros((P,N))
Z[:,0] = Z0
for i in range(N-1):
    Z[:,i+1] = Z[:,i] + tau * eta0(Z[:,i]) + torch.sqrt(2 * torch.tensor(tau)) * torch.randn((P,))
Display evolving empirical densities by showing the histograms of the samples.
# Show snapshot histograms of the particles at a few selected iterations,
# overlaid with the target density rho0.
disp_ind = ( torch.tensor([0, .001, .005, .01, .99])*N ).type(torch.int32)
for i in range(len(disp_ind)):
    k = disp_ind[i] # iteration index of the snapshot
    plt.subplot(len(disp_ind),1,i+1)
    display_histogram(Z[:,k])
    plt.fill_between(x, rho0(x).detach().numpy(), color=[0,0,1,.25] )
Display sample path $t \mapsto X_t$.
# Display a subset of the Langevin sample paths t -> Z_t.
plt.figure(figsize=fs)
display_trajectories(Z, m=100)
The two major issues with Langevin are that it converges slowly for singular $\eta_0$ (in particular with highly non-concave log-density) and that it needs $\eta_0$ to be known in closed form or estimated. For applications to generative models, this is unacceptable, because $\rho_0$ is only known approximately through samples. It is very hard to estimate $\eta_0$ from samples, but as we will see, it is possible to instead estimate smoothed versions of it. Diffusion models leverage this by replacing Langevin sampling by another sampling method that relies on such smoothed scores.
This sampling method is achieved by inverting a forward ``noising process''. We consider the following noising model $$ d X_t = -X_t \, d t + \sqrt{2} \, dW_t $$ where $W_t$ is a Wiener process. From the knowledge of $X_0$, the law $\rho_t$ of $X_t$ can be computed exactly as the law of $$ X_t \sim e^{-t} X_0 + \sqrt{1-e^{-2t}} Z $$ where $Z \sim \mathcal{N}(0,1)$.
In the case of a mixture of Gaussian considered here, $$ X_0 \sim \rho_0 \triangleq \sum_i a_i \mathcal{N}(\mu_i,\sigma_i^2) $$ the law of $X_t$ can thus be computed exactly in closed form as $$ X_t \sim \rho_t \triangleq \sum_i a_i \mathcal{N}(e^{-t} \mu_i, e^{-2t} \sigma_i^2 + 1-e^{-2t}) $$
def rho(x, t):
    """Closed-form density rho_t of the noised mixture at time t.

    Each component N(mu_i, sigma_i^2) becomes N(e^{-t} mu_i, e^{-2t} sigma_i^2 + 1 - e^{-2t}).
    """
    e2t = torch.exp(-2 * t)
    sigma_t = torch.sqrt(e2t * sigma ** 2 + 1 - e2t)
    return gaus_mixture(x, torch.exp(-t) * mu, sigma_t, a)
Display evolution with $t$ of the density $\rho_t(x)$
# Display the evolution of the diffused density rho_t for several times t.
plt.figure(figsize=fs)
t_list = torch.tensor([0, .1, .5, 2, 100])/10
for i in range(len(t_list)):
    s = i/len(t_list) # color interpolation parameter, from blue to red
    t = t_list[i]
    plt.fill_between(x,rho(x,t), color=[s,0,1-s, .25], label=f't={t:.1f}', linewidth=0 )
plt.legend();
We first generate $P$ samples $X_0$ from the mixture of Gaussians at time $t=0$, and compare the associated empirical histogram with the density.
# Sample P points from rho0 and compare the histogram with the density.
X0 = sample_mixture(P, mu,sigma,a)
plt.figure(figsize=fs)
display_histogram(X0)
plt.plot(x, rho0(x).detach().numpy() )
[<matplotlib.lines.Line2D at 0x168900cd0>]
The forward noising process is computed numerically, approximately, using an Euler-Maruyama scheme $$ X_{k+1} = X_k - \tau X_k + \sqrt{2\tau} W_k, $$ where $W_k$ are i.i.d. white noises.
Set the final time $T$, which should be large enough so that the final distribution of $X_T$ is approximately $\mathcal{N}(0,1)$.
T = 2 # final time of the forward noising process
tau = T / N # step size
Now run the recursion.
# Euler-Maruyama recursion for the forward (noising) SDE dX = -X dt + sqrt(2) dW.
X = torch.zeros((P,N))
X[:,0] = X0
for i in range(N-1):
    X[:,i+1] = X[:,i] - tau * X[:,i] + torch.sqrt(2 * torch.tensor(tau)) * torch.randn((P,))
# display evolving density: histograms of the particles vs the closed-form rho_t
t_list = torch.tensor([0, .01, .05, .2, T*.99])
for i in range(len(t_list)):
    s = i/len(t_list) # (unused here, kept for symmetry with the other display loops)
    t = t_list[i]
    k = torch.round( t/tau ).type(torch.int32) # iteration index corresponding to time t
    plt.subplot(len(t_list),1,i+1)
    plt.plot(x,rho(x,t), color='blue' )
    display_histogram(X[:,k])
The score is the vector field $$ \eta_t(x) := \nabla \log(\rho(x,t)) = \frac{\nabla \rho(x,t)}{\rho(x,t)}. $$ Informally, it points toward regions of high densities.
Since here $\rho(x,t)$ can be computed in closed form, the score can itself be computed in closed form. Of course, in real situations, this is impossible, and we will see in the next section how to approximate it numerically from samples of $\rho(x,t)$.
def eta(x, t):
    """Score of rho_t: eta(x, t) = grad_x log rho(x, t), computed by autograd."""
    xg = x.detach().clone()
    xg.requires_grad = True
    # the gradient of the sum is the pointwise spatial derivative of rho at each entry
    torch.sum(rho(xg, t)).backward()
    return xg.grad / rho(x, t)
# Display rho_t and the (rescaled) score eta_t for the selected times.
for i in range(len(t_list)):
    s = i/len(t_list) # (unused here, kept for symmetry with the other display loops)
    t = t_list[i]
    plt.subplot(len(t_list),1,i+1)
    plt.plot( x, rho(x, t).detach().numpy() )
    plt.plot( x, eta(x, t).detach().numpy() / 20 ) # divided by 20 for readability
The actual sampling (the generative process) is now done by reverting this process in time, i.e., for a large enough $T \gg 0$, one seeks to approximate $X_{T-t}$.
The main result of diffusion models is that the law of $X_{T-t}$ is exactly equal to the law of $Y_t$ satisfying the following Langevin SDE, initialized with $Y_0=X_T$ $$ \mathrm{d} Y_t = [ Y_t + (1+\alpha) \nabla \log(\rho_{T-t})(Y_t) ] \mathrm{d} t + \sqrt{2 \alpha} \mathrm{d} W_t $$ where $\alpha \geq 0$ is a parameter controlling the randomness introduced during the backward flow. This can be discretized using an Euler-Maruyama scheme, starting from $Y_0=X_{T/\tau}$ $$ Y_{k+1} = Y_k + \tau [ Y_k + (1+\alpha) \nabla \log(\rho_{T-k\tau})(Y_k) ] + \sqrt{2\tau\alpha} W_k. $$
The case $\alpha=1$ is the one used in diffusion models. Setting a different value of $\alpha$ is equivalent to changing the time of the diffusion. The case $\alpha=0$ corresponds to a singular limit which is only random at $Y_0$ but afterwards follows an advection equation $$ \frac{\mathrm{d} Y_t}{\mathrm{d} t} = Y_t + \nabla \log(\rho_{T-t})(Y_t) $$
Sample $Y_0$ from the stationary distribution $\mathcal{N}(0,1)$. Since we are in 1-D, instead of using random sample, we use deterministic "optimal" (in quantization error aka Optimal Transport) sample location, given by the uniform sampling of the inverse cumulative distribution (quantile) of the Gaussian.
import scipy as sp
# Deterministic "optimal" sampling of N(0,1): evenly spaced quantiles of the Gaussian,
# obtained through the inverse cumulative distribution function (ppf).
Y0 = torch.tensor( sp.stats.norm.ppf( np.arange(0,P)/P + 1/(2*P) ) )
plt.figure(figsize=fs)
display_histogram(Y0)
plt.plot( x, rho(x, torch.tensor(10)) ); # rho_t at large t is approximately N(0,1)
Implement the backward recursion.
alpha = 0 # pure advection
alpha = .01 # original diffusion is alpha=1 (this assignment overrides the previous one)
# Euler-Maruyama discretization of the backward (denoising) SDE, run from time T down to 0.
Y = torch.zeros((P,N))
Y[:,0] = Y0
t = torch.tensor(T) # current time of the backward process, decreased at each step
for i in range(N-1):
    Y[:,i+1] = Y[:,i] + tau * ( Y[:,i] + (1+alpha)*eta(Y[:,i],t) ) + torch.sqrt(2 * torch.tensor(tau * alpha)) * torch.randn((P,))
    t = t-tau
Display the evolution of the density of the law of $Y_{T-t}$. It is close to $\rho_t(x)$, and one can show that it matches exactly when $T \to +\infty$ and $\tau \to 0$.
# display evolving density: histograms of Y at time T-t vs the closed-form rho_{T-t}
t_list = T-torch.tensor([0.01, .01, .05, .2, .99*T])
for i in range(len(t_list)):
    s = i/len(t_list) # (unused here, kept for symmetry with the other display loops)
    t = t_list[i]
    k = torch.round( t/tau ).type(torch.int32) # iteration index corresponding to time t
    plt.subplot(len(t_list),1,i+1)
    plt.plot(x,rho(x,T-t), color='blue' )
    display_histogram(Y[:,k])
Display samples of the trajectories $t \mapsto Y_t$.
# Display a subset of the backward sample paths t -> Y_t, zoomed on the end of the flow.
display_trajectories(Y, m=100, alpha=.5, linewidth=1)
plt.axis([.4*T, T, -3, 3]);
In order to be able to implement the reverse diffusion, one needs to compute the score $\nabla \log(\rho_t)$, where $\rho_t$ is the density of the distribution of $X_t$. The idea is to approximate this score using a function computed from samples of $X_t$.
$$ \nabla \log \rho_t(z) \approx \phi_\theta(z,t) $$$$ \underset{\theta}{\min} \quad \int_t \int_x \int_z \| \frac{e^{-t} x - z}{1-e^{-2t}} - \phi_\theta(z,t) \|^2 \, \text{d} \mathbb{P}(z|x) \text{d} \rho_0(x) \lambda_t \text{d} t $$where $\lambda_t$ is a density of sampled time.
To generate the pairs $(X_0,X_t)$ we first sample $X_0$ and then sample $\mathbb{P}(X_t|X_0)$ which we know in closed form.
N = 20000 # number of training samples (overrides the previous discretization size N)
X0 = sample_mixture(N, mu,sigma,a)
For the sampling density $\lambda_t$ of time, we use a uniform distribution of time within $[T_{\min},T_{\max}]$ (here $[0.1, 0.5]$).
Tmin = 1e-1 # smallest sampled time
Tmax = .5 # largest sampled time
t_list = torch.linspace(Tmin,Tmax,N) # one training time per sample
To sample from $\mathbb{P}(X_t|X_0)$, recall that given $X_0$, then $X_t$ can be sampled as $$ X_t \sim e^{-t} X_0 + \sqrt{1-e^{-2t}} Z $$ where $Z \sim \mathcal{N}(0,1)$.
# Sample X_t | X_0 in closed form: X_t = e^{-t} X_0 + sqrt(1-e^{-2t}) Z, with Z ~ N(0,1).
W = torch.randn( (N,) )
Xt = torch.exp(-t_list) * X0 + torch.sqrt(1-torch.exp(-2*t_list)) * W
Generate for $i \in \{1,\ldots,N\}$ the input data points $U_i := (z_i \sim X_{t_i},t_i) \in \mathbb{R}^2$ and the output data to predict $V_i := \frac{e^{-t_i} x_i - z_i}{1-e^{-2t_i}}$ for $x_i \sim X_0$.
def tMap(t):
    """Affinely map t from [Tmin, Tmax] to [-xmax, xmax] so both network inputs share a range."""
    u = (t - Tmin) / (Tmax - Tmin)  # normalize to [0, 1]
    return (u - 1 / 2) * 2 * xmax
# Training inputs U_i = (X_{t_i}, tMap(t_i)) and regression targets V_i = (e^{-t_i} X0_i - Xt_i)/(1-e^{-2 t_i}).
U = torch.vstack((Xt,tMap(t_list))).T
V = ( torch.exp(-t_list) * X0 - Xt ) / ( 1-torch.exp(-2*t_list) )
Uncomment the following code if you want to use ground truth $\eta_t$ values for training.
# Disabled: replace the regression targets by the exact score eta_t (ground truth).
if 0:
    rho_v = torch.func.vmap(rho) # parallel (vectorized) version of rho
    def eta_v(x,t):
        # Score eta_t(x) = grad_x log rho(x,t), one (x, t) pair per entry, via autograd.
        x1 = x.detach().clone()
        x1.requires_grad = True
        L = torch.sum( rho_v(x1,t) )
        L.backward()
        return x1.grad / rho_v(x,t)
    Data_out = eta_v(Xt,t_list)
For $\phi_\theta(z,t)$, we use a Multilayer Perceptron (MLP) neural network with input dimension $2$ (for $(z,t)$); the output should approximate $\frac{e^{-t} x - z}{1-e^{-2t}}$ and has dimension $1$. The MLP has a varying number of layers and dimensions, encoded in d_hidden.
def weights_init(m):
    """Initializer for model.apply(): Xavier-normal weights and zero biases on Linear layers."""
    if not isinstance(m, nn.Linear):
        return  # leave every non-Linear module untouched
    torch.nn.init.xavier_normal_(m.weight)
    if m.bias is not None:
        torch.nn.init.zeros_(m.bias)
class MLP(nn.Module):
    """Multilayer perceptron with Sigmoid activations and dropout after each hidden layer.

    d_in / d_out are the input / output dimensions; d_hidden lists the hidden-layer
    widths. All Linear layers are initialized through weights_init.
    """
    def __init__(self, d_in, d_hidden, d_out, dropout_probability=0.1):
        super(MLP, self).__init__()
        widths = [d_in] + list(d_hidden)
        layers = []
        # one (Linear, Sigmoid, Dropout) group per hidden layer
        for n_prev, n_cur in zip(widths[:-1], widths[1:]):
            layers += [nn.Linear(n_prev, n_cur), nn.Sigmoid(), nn.Dropout(dropout_probability)]
        layers.append(nn.Linear(widths[-1], d_out))  # final linear output, no activation
        self.mlp = nn.Sequential(*layers)
        self.apply(weights_init)
    def forward(self, x):
        return self.mlp(x)
Create the neural network with 3 hidden layers and initialize its weights.
d_in = 2 # network input: (z, mapped t)
d_hidden = [300, 100, 20] # widths of the 3 hidden layers
d_out = 1 # network output: estimated score value
model = MLP(d_in, d_hidden, d_out, dropout_probability=0)
Train the network by minimizing $$ \text{Loss}(\theta) := \frac{1}{N} \sum_{i=1}^N \| \phi_\theta(U_i)-V_i \|^2 $$ with respect to the neural network weights and biases $\theta$ using the Adam optimizer.
# Train with full-batch Adam on the mean squared regression loss.
optimizer = torch.optim.Adam(model.parameters(), lr=0.001*10)
num_epochs = 600
Loss_list = [] # training loss history, for display
for epoch in range(num_epochs):
    optimizer.zero_grad()
    loss = torch.sum( ( model(U).flatten() - V.flatten() )**2 ) / N # mean squared error
    loss.backward()
    optimizer.step()
    Loss_list.append( loss.item())
plt.plot(Loss_list);
Evaluate $\phi_\theta(z,t)$ on a grid and compare it with $\eta_t(z)$.
# Evaluate the exact score and the learned network on a (t, z) grid and compare.
nT = 100 # number of time samples
nX = 200 # number of space samples
Tlist = torch.linspace(Tmin,Tmax,nT)
Xlist = torch.linspace(-xmax,xmax,nX)
A = torch.zeros((nX,nT)) # exact score eta_t(z)
B = torch.zeros((nX,nT)) # network approximation phi_theta(z, t)
for i in range(nT):
    A[:,i] = eta( Xlist, Tlist[i] ) # ground truth to predict
    Z1 = torch.vstack((Xlist, Xlist*0+tMap(Tlist[i]) )).T
    B[:,i] = model( Z1 ).flatten() # approximation with neural network
plt.subplot(1,2,1)
plt.imshow(A, extent=[Tmin, Tmax, -xmax,xmax], aspect='auto')
plt.title('$\\eta_t(z)$')
plt.ylabel('z')
plt.xlabel('t')
plt.subplot(1,2,2)
plt.imshow(B.detach().numpy(), extent=[Tmin, Tmax, -xmax,xmax], aspect='auto')
plt.title('$\\phi_{\\theta}(z,t)$')
plt.xlabel('t');