Programming Project #5 (proj5)
CS180: Intro to Computer Vision and Computational Photography

Part B: Diffusion Models from Scratch!

The second part of a larger project.

Due: 11/19/24 11:59pm

We recommend using GPUs from Colab to finish this project!

Overview

In part B you will train your own diffusion model on MNIST. Starter code can be found in the provided notebook.

START EARLY!

This project, in many ways, will be the most difficult project this semester.

Note: this is an updated, clearer version of the part B instructions. For the old version, please see here.

Part 1: Training a Single-Step Denoising UNet

Let's warmup by building a simple one-step denoiser. Given a noisy image $z$, we aim to train a denoiser $D_\theta$ such that it maps $z$ to a clean image $x$. To do so, we can optimize over an L2 loss: $$L = \mathbb{E}_{z,x} \|D_{\theta}(z) - x\|^2 \tag{B.1}$$

1.1 Implementing the UNet

In this project, we implement the denoiser as a UNet. It consists of a few downsampling and upsampling blocks with skip connections.


Figure 1: Unconditional UNet

The diagram above uses a number of standard tensor operations defined as follows:


Figure 2: Standard UNet Operations

At a high level, these are the standard convolution, downsampling, and upsampling layers of a UNet, together with the flatten, unflatten, and concatenation operations used for the bottleneck and skip connections; Figure 2 defines each precisely.

We define composed operations using our simple operations in order to make our network deeper. A composed operation doesn't change the tensor's height, width, or number of channels; it simply adds more learnable parameters. A sketch of one such block is given below.
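For concreteness, here is a minimal sketch of what a composed block might look like. The internal layout (Conv, BatchNorm, GELU, repeated) is an assumption on our part; follow Figure 2 for the exact definitions used in this project. The point is only that the block preserves the tensor's shape while adding parameters:

import torch.nn as nn

class ConvBlock(nn.Module):
    # Sketch of a composed operation: stacks simple ops so the tensor's
    # height, width, and channel count are preserved. The exact layout
    # is an assumption; see Figure 2.
    def __init__(self, channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.GELU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.GELU(),
        )

    def forward(self, x):
        return self.net(x)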

1.2 Using the UNet to Train a Denoiser

Recall from equation B.1 that we aim to solve the following denoising problem: given a noisy image $z$, we train a denoiser $D_\theta$ that maps $z$ to a clean image $x$ by optimizing the L2 loss $$ L = \mathbb{E}_{z,x} \|D_{\theta}(z) - x\|^2. $$

To train our denoiser, we need to generate training data pairs ($z$, $x$), where each $x$ is a clean MNIST digit. For each training batch, we generate $z$ from $x$ using the following noising process: $$ z = x + \sigma \epsilon,\quad \text{where }\epsilon \sim N(0, I). \tag{B.2} $$

Visualize the noising process over $\sigma = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]$, assuming normalized $x \in [0, 1]$. It should look similar to the following plot:

Figure 3: Varying levels of noise on MNIST digits
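For reference, a minimal sketch of this noising process (the helper name and the visualization loop are illustrative, not part of the starter code):

import torch

def add_noise(x, sigma):
    # Equation B.2: z = x + sigma * eps, with eps ~ N(0, I).
    return x + sigma * torch.randn_like(x)

# e.g., for a row of normalized MNIST digits x in [0, 1]:
# noisy = [add_noise(x, s) for s in [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]]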

1.2.1 Training

Now, we will train the model to perform denoising.


Figure 4: Training Loss Curve
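One way to structure a training step, as a sketch (the optimizer and the denoiser call signature are assumptions; $\sigma = 0.5$ matches the training setup described in section 1.2.2):

import torch
import torch.nn.functional as F

def denoiser_train_step(denoiser, optimizer, x, sigma=0.5):
    # Generate the noisy input fresh each batch (equation B.2),
    # then minimize the L2 loss from equation B.1.
    z = x + sigma * torch.randn_like(x)
    loss = F.mse_loss(denoiser(z), x)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()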

You should visualize denoised results on the test set at the end of training. Display sample results after the 1st and 5th epoch.

They should look something like these:


Figure 5: Results on digits from the test set after 1 epoch of training


Figure 6: Results on digits from the test set after 5 epochs of training

1.2.2 Out-of-Distribution Testing

Our denoiser was trained on MNIST digits noised with $\sigma = 0.5$. Let's see how the denoiser performs on different $\sigma$'s that it wasn't trained for.

Visualize the denoiser results on test set digits with varying levels of noise $\sigma = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]$.


Figure 7: Results on digits from the test set with varying noise levels.

Deliverables

Hint

Part 2: Training a Diffusion Model

Now, we are ready for diffusion, where we will train a UNet model that can iteratively denoise an image. We will implement DDPM in this part.

Let's revisit the problem we solved in equation B.1:

$$L = \mathbb{E}_{z,x} \|D_{\theta}(z) - x\|^2.$$

We will first introduce one small difference: we can change our UNet to predict the added noise $\epsilon$ instead of the clean image $x$ (as in part A of the project). Mathematically, these are equivalent, since $x = z - \sigma \epsilon$ (equation B.2). Therefore, we can turn equation B.1 into the following:

$$L = \mathbb{E}_{\epsilon,z} \|\epsilon_{\theta}(z) - \epsilon\|^2 \tag{B.3}$$

where $\epsilon_\theta$ is a UNet trained to predict noise.

For diffusion, we eventually want to sample a pure noise image $\epsilon \sim N(0, I)$ and generate a realistic image $x$ from the noise. However, we saw in part A that one-step denoising does not yield good results. Instead, we need to iteratively denoise the image for better results.

Recall from part A that we used equation A.2 to generate a noisy image $x_t$ from a clean image $x_0$ for a timestep $t \in \{0, 1, \cdots, T\}$: $$ x_t = \sqrt{\bar\alpha_t} x_0 + \sqrt{1 - \bar\alpha_t} \epsilon, \quad \text{where}~ \epsilon \sim N(0, I).$$ Intuitively, when $t = 0$, $x_t$ is the clean image $x_0$; when $t = T$, $x_t$ is pure noise $\epsilon$; and for $t \in \{1, \cdots, T-1\}$, $x_t$ is a linear combination of the two. The precise derivation of $\bar\alpha$ is beyond the scope of this project (see the DDPM paper for more details). Here, we provide you with the DDPM recipe to build the list $\bar\alpha$ for $t \in \{0, 1, \cdots, T\}$ from lists $\alpha$ and $\beta$:

- Create a list $\beta$ such that $\beta_0 = 0.0001$ and $\beta_T = 0.02$, with all other elements evenly spaced between the two.
- $\alpha_t = 1 - \beta_t$.
- $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$, i.e. the cumulative product of the $\alpha_s$ up to timestep $t$.

Because we are working with simple MNIST digits, we can afford a smaller $T$ of 300 instead of the 1000 used in part A. Observe how $\bar\alpha_t$ is close to 1 for small $t$ and close to 0 for $t$ near $T$. $\beta$ is known as the variance schedule; it controls the amount of noise added at each timestep.
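A sketch of this schedule in code, following the recipe above:

import torch

T = 300
beta = torch.linspace(1e-4, 0.02, T + 1)   # variance schedule for t = 0..T
alpha = 1.0 - beta
alpha_bar = torch.cumprod(alpha, dim=0)    # close to 1 at small t, close to 0 near t = T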

Now, to denoise image $x_t$, we could simply apply our UNet $\epsilon_\theta$ on $x_t$ and get the noise $\epsilon$. However, this won't work very well because the UNet is expecting the noisy image to have a noise variance $\sigma = 0.5$ for best results, but the variance of $x_t$ varies with $t$. One could train $T$ separate UNets, but it is much easier to simply condition a single UNet with timestep $t$, giving us our final objective: $$L = \mathbb{E}_{\epsilon,x_0,t} \|\epsilon_{\theta}(x_t, t) - \epsilon\|^2. \tag{B.4}$$

2.1 Adding Time Conditioning to UNet

We need a way to inject scalar $t$ into our UNet model to condition it. There are many ways to do this. Here is what we suggest:

Figure 8: Conditioned UNet

This uses a new operator called FCBlock (fully-connected block) which we use to inject the conditioning signal into the UNet:


Figure 9: FCBlock for conditioning

Here Linear(F_in, F_out) is a linear layer with F_in input features and F_out output features. You can implement it using nn.Linear.

Since our conditioning signal $t$ is a scalar, F_in should be of size 1. We also recommend that you normalize $t$ to be in the range [0, 1] before embedding it, i.e. pass in $\frac{t}{T}$.
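A minimal FCBlock sketch follows. The internal layout (Linear, then GELU, then another Linear) is an assumption on our part; follow Figure 9 for the exact definition:

import torch.nn as nn

class FCBlock(nn.Module):
    def __init__(self, f_in, f_out):
        super().__init__()
        # Assumed layout: Linear -> GELU -> Linear (see Figure 9).
        self.net = nn.Sequential(
            nn.Linear(f_in, f_out),
            nn.GELU(),
            nn.Linear(f_out, f_out),
        )

    def forward(self, x):
        return self.net(x)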

You can embed $t$ by following this pseudocode:


fc1_t = FCBlock(...)
fc2_t = FCBlock(...)

# The t passed in here should be normalized to the range [0, 1], i.e. t / T.
t1 = fc1_t(t)
t2 = fc2_t(t)

# Follow the diagram to get unflatten, then
# replace the original unflatten with the modulated one.
unflatten = unflatten + t1
# Follow the diagram to get up1, then
# replace the original up1 with the modulated one.
up1 = up1 + t2
# Follow the diagram to get the output.
...

2.2 Training the UNet

Training our time-conditioned UNet $\epsilon_\theta(x_t, t)$ is now pretty easy. Basically, we pick a random image from the training set and a random $t$, and train the denoiser to predict the noise in $x_t$. We repeat this for different images and different $t$ values until the model converges and we are happy.


Algorithm B.1. Training time-conditioned UNet
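As a sketch, one training step of Algorithm B.1 might look like the following. The alpha_bar list comes from the recipe in part 2; the unet call signature and the $t/T$ normalization are assumptions consistent with section 2.1:

import torch
import torch.nn.functional as F

def train_step(unet, optimizer, x0, alpha_bar, T):
    b = x0.shape[0]
    t = torch.randint(1, T + 1, (b,), device=x0.device)   # random timesteps
    eps = torch.randn_like(x0)                            # noise to predict
    ab = alpha_bar[t].view(b, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps        # equation A.2
    pred = unet(x_t, (t.float() / T).view(b, 1))          # t normalized to [0, 1]
    loss = F.mse_loss(pred, eps)                          # equation B.4
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()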


Figure 10: Time-Conditioned UNet training loss curve

2.3 Sampling from the UNet

The sampling process is very similar to part A, except we don't need to predict the variance like in the DeepFloyd model. Instead, we can use our list $\beta$.


Algorithm B.2. Sampling from time-conditioned UNet
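A sketch of the sampling loop follows. The update below is the standard DDPM form; the exact coefficients you should implement are those in Algorithm B.2:

import torch

@torch.no_grad()
def sample(unet, alpha, alpha_bar, beta, T, shape, device):
    x = torch.randn(shape, device=device)                 # start from pure noise
    for t in range(T, 0, -1):
        z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        t_norm = torch.full((shape[0], 1), t / T, device=device)
        eps = unet(x, t_norm)                             # predicted noise
        # Remove the predicted noise, then re-inject variance scaled by beta_t.
        x = (x - (1.0 - alpha[t]) / (1.0 - alpha_bar[t]).sqrt() * eps) / alpha[t].sqrt()
        x = x + beta[t].sqrt() * z
    return x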

Figures: Sampling results after 1, 5, 10, 15, and 20 epochs of training.

Deliverables

2.4 Adding Class-Conditioning to UNet

To make the results better and give us more control over image generation, we can also optionally condition our UNet on the class of the digit, 0 through 9. This will require adding 2 more FCBlocks to our UNet. For the class-conditioning vector $c$, we suggest using a one-hot vector instead of a single scalar. Because we still want our UNet to work without class conditioning, we implement dropout: 10% of the time ($p_{\text{uncond}} = 0.1$) we drop the class-conditioning vector $c$ by setting it to 0. Here is one way to condition our UNet $\epsilon_\theta(x_t, t, c)$ on both time $t$ and class $c$:

fc1_t = FCBlock(...)
fc1_c = FCBlock(...)
fc2_t = FCBlock(...)
fc2_c = FCBlock(...)

# t is normalized to [0, 1]; c is a one-hot class vector,
# zeroed out 10% of the time for conditioning dropout.
t1 = fc1_t(t)
c1 = fc1_c(c)
t2 = fc2_t(t)
c2 = fc2_c(c)

# Follow the diagram to get unflatten, then
# replace the original unflatten with the modulated one.
unflatten = c1 * unflatten + t1
# Follow the diagram to get up1, then
# replace the original up1 with the modulated one.
up1 = c2 * up1 + t2
# Follow the diagram to get the output.
...

Training for this section is the same as the time-only case, with two differences: we additionally pass in the conditioning vector $c$, and we periodically drop it (set it to 0) so the model also learns unconditional generation.

Algorithm B.3. Training class-conditioned UNet
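A sketch of building the conditioning vector with dropout (the function name is illustrative; one-hot encoding and $p_{\text{uncond}} = 0.1$ follow section 2.4):

import torch
import torch.nn.functional as F

def make_class_condition(labels, num_classes=10, p_uncond=0.1):
    c = F.one_hot(labels, num_classes).float()            # (B, 10) one-hot vectors
    # Drop the conditioning for ~10% of samples by zeroing their rows.
    keep = (torch.rand(labels.shape[0], 1, device=labels.device) > p_uncond).float()
    return c * keep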


Figure 11: Class-conditioned UNet training loss curve

2.5 Sampling from the Class-Conditioned UNet

The sampling process is the same as part A, where we saw that conditional results aren't good unless we use classifier-free guidance. Use classifier-free guidance with $\gamma = 5.0$ for this part.

Algorithm B.4. Sampling from class-conditioned UNet
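Inside the sampling loop, the classifier-free guidance estimate can be formed as in this sketch (the unet signature is an assumption; $c = 0$ plays the role of the unconditional input):

import torch

def cfg_eps(unet, x, t_norm, c, gamma=5.0):
    # Blend the unconditional and conditional noise estimates.
    eps_uncond = unet(x, t_norm, torch.zeros_like(c))
    eps_cond = unet(x, t_norm, c)
    return eps_uncond + gamma * (eps_cond - eps_uncond)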

Figures: Class-conditioned sampling results after 1, 5, 10, 15, and 20 epochs of training.

Deliverables

Bells & Whistles

Acknowledgements

This project was a joint effort by Daniel Geng, Ryan Tabrizi, and Hang Gao, advised by Liyue Shen, Andrew Owens, and Alexei Efros.