Programming Project #5 (proj5)
CS180: Intro to Computer Vision and Computational Photography

Part B: Diffusion Models from Scratch!

The second part of a larger project.

Due: 11/19/24 11:59pm

We recommend using GPUs from Colab to finish this project!

Overview

In part B you will train your own diffusion model on MNIST. Starter code can be found in the provided notebook.

START EARLY!

This project, in many ways, will be the most difficult project this semester.

Part 1: Training a Single-Step Denoising UNet

1.0 Problem Formulation

Let's warm up by building a simple one-step denoiser. Given a noisy image $z$, we aim to train a denoiser $D_\theta$ that maps $z$ to a clean image $x$. To do so, we can optimize over an L2 loss: $$L = \mathbb{E}_{z,x} \|D_{\theta}(z) - x\|^2 \tag{1}$$
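As a quick illustration (not part of the starter code), this objective is just a mean-squared error in PyTorch; the denoiser, x, and z below are hypothetical stand-ins for your model and a training batch:

import torch
import torch.nn.functional as F

# Hypothetical stand-ins: replace the identity "denoiser" with your UNet D_theta.
denoiser = lambda z: z
x = torch.rand(4, 1, 28, 28)          # a batch of clean images in [0, 1]
z = x + 0.5 * torch.randn_like(x)     # a noisy version of the batch
loss = F.mse_loss(denoiser(z), x)     # L = E ||D_theta(z) - x||^2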

1.1 Implementing the UNet

In this project, we implement the denoiser as a UNet. It consists of a few downsampling and upsampling blocks with skip connections.

UNet Architecture

Figure 1: Unconditional UNet

The diagram above uses a number of standard tensor operations defined as follows:

UNet Operations

Figure 2: UNet Operations

Note:

We define composed operations using our simple operations in order to make our network deeper. This doesn't change the tensor's height, width, or number of channels, but simply adds more learnable parameters.

Within the simple operations:

1.2 Using the UNet to Train a Denoiser

Recall from equation 1 that we aim to solve the following denoising problem: given a noisy image $z$, we aim to train a denoiser $D_\theta$ that maps $z$ to a clean image $x$. To do so, we can optimize over an L2 loss $$ L = \mathbb{E}_{z,x} \|D_{\theta}(z) - x\|^2. $$

To train our denoiser, we need to generate training data pairs ($z$, $x$), where each $x$ is a clean MNIST digit. For each training batch, we can generate $z$ from $x$ using the following noising process: $$ z = x + \sigma \epsilon,\quad \text{where }\epsilon \sim N(0, I). \tag{2} $$

Visualize the noising process over $\sigma = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]$, assuming normalized $x \in [0, 1]$. It should be similar to the following plot:
Varying Sigmas

Figure 3. Varying levels of noise on MNIST digits
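A minimal sketch of how such a plot could be produced is shown below, assuming torchvision's MNIST dataset; the number of digits shown and the plotting layout are arbitrary choices, not requirements:

import torch
import matplotlib.pyplot as plt
from torchvision import datasets, transforms

# Load a few clean digits; the data path is an arbitrary choice.
mnist = datasets.MNIST(root="./data", train=True, download=True,
                       transform=transforms.ToTensor())
x = torch.stack([mnist[i][0] for i in range(4)])   # (4, 1, 28, 28), values in [0, 1]

sigmas = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]
fig, axes = plt.subplots(len(x), len(sigmas), figsize=(2 * len(sigmas), 2 * len(x)))
for j, sigma in enumerate(sigmas):
    z = x + sigma * torch.randn_like(x)            # eq. 2: z = x + sigma * eps
    for i in range(len(x)):
        axes[i, j].imshow(z[i, 0], cmap="gray")
        axes[i, j].set_axis_off()
        if i == 0:
            axes[i, j].set_title(f"sigma = {sigma}")
plt.show()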

1.2.1 Training

Now, we will train the model to perform denoising.
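A minimal training-loop sketch follows; UnconditionalUNet is a hypothetical name for your part 1.1 model, and the batch size, learning rate, and epoch count below are placeholders rather than prescribed hyperparameters:

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
train_set = datasets.MNIST(root="./data", train=True, download=True,
                           transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=256, shuffle=True)

model = UnconditionalUNet().to(device)      # hypothetical class: your part 1.1 UNet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
sigma = 0.5                                 # training noise level

for epoch in range(5):
    for x, _ in loader:                     # MNIST labels are unused in part 1
        x = x.to(device)
        z = x + sigma * torch.randn_like(x) # noise the batch on the fly (eq. 2)
        loss = F.mse_loss(model(z), x)      # eq. 1
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()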

Training Loss Curve

Figure 4. Training Loss Curve

You should visualize denoised results on the test set at the end of training. Display sample results after the 1st and 5th epoch.

They should look something like these:

After the first epoch

Figure 5. Results on digits from the test set after 1 epoch of training

After the 5-th epoch

Figure 6. Results on digits from the test set after 5 epochs of training

1.2.2 Out-of-Distribution Testing

Our denoiser was trained on MNIST digits noised with $\sigma = 0.5$. Let's see how the denoiser performs on different $\sigma$'s that it wasn't trained for.

Visualize the denoiser results on test set digits with varying levels of noise $\sigma = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]$.

Varying Sigmas

Figure 7. Results on digits from the test set with varying noise levels.
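A sketch of this evaluation is below, assuming model is the denoiser trained in part 1.2.1 and x is a batch of clean test digits in [0, 1]:

import torch

model.eval()
sigmas = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]
with torch.no_grad():
    for sigma in sigmas:
        z = x + sigma * torch.randn_like(x)   # noise the clean test digits (eq. 2)
        x_hat = model(z)                      # denoised estimate
        # visualize z and x_hat side by side for this sigma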

1.3 Deliverables

Hint

Part 2: Training a DDPM Denoising UNet

Now, we are ready for diffusion, where we can iteratively denoise the image. We will implement DDPM in this part.

One small change is that the UNet from part 1 will now predict the noise added to the image instead of the clean image (as the diffusion model did in part A of the project).

2.0 Problem Formulation

Let's reconsider the problem from part 1, but taken to its extreme:

Given a pure noise image $\epsilon \sim N(0, I)$, we aim to train a denoiser $D_{\theta}$ such that it maps the noise image to a clean image $x$. To do so, we can still apply a simple L2 loss:

$$L = \mathbb{E}_{\epsilon,x} \|D_{\theta}(\epsilon) - x\|^2.$$

The difference here, compared to part 1, is that $\epsilon$ is pure noise. If we can learn to remove pure noise, this will allow us to generate novel images, not just those in our training set.

However, we saw in part A that one-step denoising does not yield good results. Instead, we can iteratively denoise the image for better results.

For iterative denoising, we condition our model on timestep $t$ such that it can learn time-specific denoising. We can equivalently predict the noise added to the image rather than the denoised image itself.

$$L = \mathbb{E}_{\epsilon,x_0,t} \|\epsilon_{\theta}(x_t, t) - \epsilon\|^2. \tag{3}$$

$$\text{where } x_t = a_t x_0 + b_t \epsilon, \quad x_T := \epsilon, \quad t \in \{0, 1, \cdots, T\}, \quad \epsilon \sim N(0, I).$$

For now, $a_t$ and $b_t$ can be thought of as fixed coefficients that depend on $t$; they come from the noise schedule we define in section 2.2.

You can imagine that, with a time-conditioned denoising UNet, we can go from one-step denoising to iterative denoising:

$$x_{T-1} = x_T - \epsilon_\theta(x_T; T)$$ $$x_{T-2} = x_{T-1} - \epsilon_\theta(x_{T-1}; T-1)$$ $$\cdots$$ $$x_0 = x_{1} - \epsilon_\theta(x_{1}; 1).$$

We can therefore perform iterative denoising to get better results than a one-step denoiser as shown in part 1, which is especially useful when our noisy inputs are pure Gaussian noise.

In practice, our model predicts the entire noise added to $x_0$ to get $x_t$, rather than intermediate amounts of noise, but the coefficients $a_t$ and $b_t$ are scaled appropriately so that we recover the intermediate noisy samples $x_i$ for $i \in \{1, \cdots, T-1\}$. Additionally, because of this scaling we cannot directly subtract the predicted noise as shown above; instead we will use a different update rule, shown later.

2.1 Refactoring Your Unconditional UNet for DDPM

In order to do iterative denoising, we first need to add condition $t$ into our model. We will also add a class-label condition $c$ into our model for when we later do class-conditioned denoising with classifier-free guidance.

Let's first define a new operator called FCBlock (fully-connected block):

FCBlock

Figure 8. FCBlock for conditioning

Here L(F_in, F_out) is a linear layer with F_in input features and F_out output features. You can implement it using nn.Linear.
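A minimal sketch of FCBlock is shown below; follow Figure 8 for the exact layout, since the two-linear-layers-with-a-GELU structure here is an assumption:

import torch.nn as nn

class FCBlock(nn.Module):
    """Sketch of a fully-connected block; match the layers to Figure 8."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, out_features),   # L(F_in, F_out)
            nn.GELU(),
            nn.Linear(out_features, out_features),  # L(F_out, F_out)
        )

    def forward(self, x):
        return self.net(x)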

To condition our network on time and class-label, we can apply conditioning after unflattening and the first upsample:



UNet Highlighted

Figure 9. Conditional UNet

You can embed $t$ and $c$ by following this pseudo code:


fc1_t = FCBlock(...)
fc1_c = FCBlock(...)
fc2_t = FCBlock(...)
fc2_c = FCBlock(...)

t1 = fc1_t(t)
c1 = fc1_c(c)
t2 = fc2_t(t)
c2 = fc2_c(c)

# Follow diagram to get unflatten.
# Replace the original unflatten with modulated unflatten.
unflatten = c1 * unflatten + t1
# Follow diagram to get up1.
...
# Replace the original up1 with modulated up1.
up1 = c2 * up1 + t2
# Follow diagram to get the output.
...
    

Note that the class-label $c$ should be encoded as a one-hot vector.

Another modification you need to make is to have the model take in a batch-wise mask vector whose entries are either 0 or 1, indicating whether or not to drop the class condition $c$: drop when the mask is 0, keep it when the mask is 1. The mask can also be null, which means the condition is always used. This is so we can perform line 6 of algorithm 1.
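For example, one way to build the one-hot condition and the drop mask is sketched below; p_uncond is a hypothetical name (and value) for the condition-drop probability:

import torch
import torch.nn.functional as F

p_uncond = 0.1                                              # placeholder drop probability
labels = torch.randint(0, 10, (8,))                         # dummy batch of class labels
c = F.one_hot(labels, num_classes=10).float()               # (N, 10) one-hot condition
mask = (torch.rand(labels.shape[0], 1) > p_uncond).float()  # 1 = keep c, 0 = drop c
# Either zero out c here (c = c * mask) or pass the mask into the model and
# apply it inside the conditioning FCBlocks.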

2.2 Implementing DDPM Forward and Reverse Process

Now that we have some intuition from part 2.0, it's time to implement the forward and reverse process of DDPM.
DDPM considers a very specific noising and denoising process:
DDPM Markov Process

Figure 10: DDPM markov chain. The forward process is denoted by $q(x_t\mid x_{t-1})$ and the reverse process is denoted by $p_\theta(x_{t-1}\mid x_t)$.
(Image source: Ho et al. 2020 with a few additional annotations from Lilian Weng)

Specifically, each forward step adds Gaussian noise in a variance-preserving way for some variance schedule $\{\beta_t\}_{t=1}^T$:

$$ q(\mathbf{x}_{1:T}|\mathbf{x}_0) := \prod_{t=1}^{T} q(\mathbf{x}_t|\mathbf{x}_{t-1}), \quad q(\mathbf{x}_t|\mathbf{x}_{t-1}) := N (\mathbf{x}_t; \sqrt{1 - \beta_t}\mathbf{x}_{t-1}, \beta_t \mathbf{I}). \tag{4} $$

Using the reparameterization trick presented in section 2 of the DDPM paper, we can derive an effective one-step noising function, since a Gaussian convolved with a Gaussian is still Gaussian (see here for more details).

Concretely, let $\alpha_t := 1 - \beta_t$ and $\bar{\alpha}_t := \prod^t_{s=1}\alpha_s$; then we can sample a noisy $x_t$ for an arbitrary $t$:

$$ q(\mathbf{x}_t|\mathbf{x}_0) = N(\mathbf{x}_t; \sqrt{\bar{\alpha}_t} \mathbf{x}_0, (1 - \bar{\alpha}_t)\mathbf{I}). \tag{5} $$

DDPM Scheduler

Let's first implement the DDPM scheduler to fetch all relevant variables. Given $(\beta_0, \beta_T, T)$, follow the doc-string to get all useful values. You will use them in a bit!

TODO: Implement ddpm_schedule()
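A sketch of these quantities under a linear beta schedule is below; the exact argument names and return format should follow the notebook's doc-string, so treat this only as an outline:

import torch

def ddpm_schedule(beta0: float, betaT: float, num_ts: int) -> dict:
    """Sketch: linearly spaced betas plus the alpha terms used in equations 5 and 11."""
    betas = torch.linspace(beta0, betaT, num_ts + 1)   # beta_t for t = 0, ..., T
    alphas = 1.0 - betas                               # alpha_t := 1 - beta_t
    alpha_bars = torch.cumprod(alphas, dim=0)          # alpha_bar_t := prod_{s<=t} alpha_s
    return {"betas": betas, "alphas": alphas, "alpha_bars": alpha_bars}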

DDPM Forward Process

For brevity, we don't show the mathematical details here. If you'd like to see the mathematical details, check out here.

TODO: Implement our ddpm_forward() function by following algorithm 1:

Algorithm Diagram
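For intuition, here is a sketch of that training step with the class condition and mask from part 2.1 omitted; the function signature and the normalized timestep passed to the model are assumptions, not the starter-code API:

import torch
import torch.nn.functional as F

def ddpm_forward(model, x0, alpha_bars, num_ts):
    """Sketch of algorithm 1: noise x0 to a random x_t (eq. 5) and regress the noise (eq. 3)."""
    t = torch.randint(1, num_ts + 1, (x0.shape[0],), device=x0.device)  # t ~ Uniform{1, ..., T}
    eps = torch.randn_like(x0)                                          # eps ~ N(0, I)
    ab = alpha_bars[t].view(-1, 1, 1, 1)                                # broadcast over (N, C, H, W)
    x_t = torch.sqrt(ab) * x0 + torch.sqrt(1 - ab) * eps                # eq. 5
    eps_hat = model(x_t, t.float() / num_ts)                            # assumed: t normalized to [0, 1]
    return F.mse_loss(eps_hat, eps)                                     # eq. 3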


DDPM Reverse Process

Recall that in the reverse process we progressively work backwards to reconstruct the original image $x_0$ from noise $x_T$. We can sample from this process following:
$$p(\mathbf{x}_{t-1} | \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \tilde{\mu}_t(\mathbf{x}_t, \mathbf{x}_0), \tilde{\beta}_t \mathbf{I}),\tag{6}$$
$$\text{where} \quad \tilde{\mu}_t(\mathbf{x}_t, \mathbf{x}_0) := \frac{\sqrt{\bar{\alpha}_{t-1}} \beta_t}{1 - \bar{\alpha}_t} \mathbf{x}_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \mathbf{x}_t \quad \text{and} \quad \tilde{\beta}_t := \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t. \tag{7}$$
We can think of $\tilde{\mu}_t(\mathbf{x}_t, \mathbf{x}_0)$ as a linear combination of $x_0$ and $x_t$:
Linear Combination

Figure 11: Interpolation of $x_0$ and $x_t$
(Image source)

Using the same reparameterization trick from equation 5, we can solve for $x_0$: $$x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}} \left( \mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t} \epsilon_t \right).\tag{8}$$

If we plug this back into equation 7, we get: $$\tilde{\mu}_t(\mathbf{x}_t, \mathbf{x}_0) = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\epsilon_t\right)\tag{9}$$

Thus, by using the reparameterization trick and following equation 6, we can cleanly represent $x_{t-1}$ as: $$\mathbf{x}_{t-1} = \tilde{\mu}_t(\mathbf{x}_t, \mathbf{x}_0) + \sqrt{\tilde{\beta}_t} \mathbf{z}, \quad \mathbf{z} \sim \mathcal{N}(0, \mathbf{I})\tag{10}$$ $$\implies \mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}}(\mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\epsilon_t) + \sigma_t \mathbf{z}, \quad \text{where } \sigma_t = \sqrt{\tilde{\beta}_t}\tag{11}$$

To model this, we could train a network $\mu_\theta(\mathbf{x}_t, t)$ to predict $\tilde{\mu}_t$. We don't need to learn the $\sigma_t$ term, since it is defined by the schedule, so only the mean needs to be modeled.

Better yet, since we already have access to $\mathbf{x}_t$ and the $\alpha$ terms, equation 9 shows that it is enough to train a network $\epsilon_\theta(x_t, t)$ that predicts the noise term.
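As a concrete reading of equation 11, a single reverse update might look like the sketch below, assuming alphas, alpha_bars, and betas come from your scheduler and eps_hat is the UNet's noise prediction at timestep t:

import torch

def reverse_step(x_t, eps_hat, t, alphas, alpha_bars, betas):
    """Sketch of eq. 11: one step from x_t to x_{t-1}."""
    z = torch.randn_like(x_t) if t > 1 else torch.zeros_like(x_t)  # add no noise on the final step
    mean = (x_t - (1 - alphas[t]) / torch.sqrt(1 - alpha_bars[t]) * eps_hat) / torch.sqrt(alphas[t])
    sigma_t = torch.sqrt((1 - alpha_bars[t - 1]) / (1 - alpha_bars[t]) * betas[t])  # sqrt(beta_tilde_t)
    return mean + sigma_t * z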

TODO: Implement the reverse sampling process ddpm_sample():


Algorithm Diagram

Again, if you are interested in the math, check out here.

Some important details:

# Cache intermediate samples so you can visualize the denoising trajectory later.
if t[0] % 20 == 0 or t[0] == num_ts or t[0] < 8:
    caches.append(x)

2.3 Putting It All Together

We have all the pieces; let's now train our diffusion model. Consider the following pseudocode for your training step.

def train_step():
    x0 = sample_from_data()          # sample a batch of clean images

    t = uniform_sample_T()           # sample timesteps uniformly from {1, ..., T}
    loss = diffusion_forward(x0, t)  # noise x0 to x_t and compute the loss in eq. 3

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Use the same network hyperparameters as in part 1. You might need to increase num_epochs to 20 to get good results (the staff solution takes ~26 minutes on a Colab T4 GPU).

2.4 Deliverables

Your deliverables should include the following for this problem:

For reference, here are the staff solution results (without skip connections) for epochs 1, 5, 10, 15, and 20 with guidance scale 5.0. Note: you do not need to generate gifs (this can be done as a Bells & Whistles below).

Epoch 1

Epoch 5

Epoch 10

Epoch 15

Epoch 20

Bells & Whistles

Acknowledgements

This project was a joint effort by Daniel Geng, Hang Gao, and Ryan Tabrizi, advised by Liyue Shen, Andrew Owens, and Alexei Efros.