Programming Project #5 (`proj5`)
CS180: Intro to Computer Vision and Computational Photography
University of California, Berkeley

Flow Matching from Scratch!

Due: 03/15/25 11:59pm

We recommend using GPUs from Colab to finish this project!

Overview

You will train your own flow matching model on MNIST. Starter code can be found in the provided notebook.

START EARLY!

Note: this is an updated version of CS180's Project 5 part B with flow matching instead of DDPM diffusion. For the DDPM version, please see here.

Part 1: Training a Single-Step Denoising UNet

Let's warm up by building a simple one-step denoiser. Given a noisy image $z$, we aim to train a denoiser $D_{θ}$ that maps $z$ to a clean image $x$. To do so, we can optimize an L2 loss: $$L = E_{z, x} ‖ D_{θ} (z) - x ‖^{2} \tag{B.1}$$

1.1 Implementing the UNet

In this project, we implement the denoiser as a UNet. It consists of a few downsampling and upsampling blocks with skip connections.

Figure 1: Unconditional UNet

The diagram above uses a number of standard tensor operations defined as follows:

Figure 2: Standard UNet Operations

where:

Conv2d(kernel_size, stride, padding) is nn.Conv2d()
BN is nn.BatchNorm2d()
GELU is nn.GELU()
ConvTranspose2d(kernel_size, stride, padding) is nn.ConvTranspose2d()
AvgPool(kernel_size) is nn.AvgPool2d()
D is the number of hidden channels and is a hyperparameter that we will set ourselves.

At a high level, the blocks do the following:

(1) Conv is a convolutional layer that doesn't change the image resolution, only the channel dimension.
(2) DownConv is a convolutional layer that downsamples the tensor by 2.
(3) UpConv is a convolutional layer that upsamples the tensor by 2.
(4) Flatten is an average pooling layer that flattens a 7x7 tensor into a 1x1 tensor. 7 is the resulting height and width after the downsampling operations.
(5) Unflatten is a convolutional layer that unflattens/upsamples a 1x1 tensor into a 7x7 tensor.
(6) Concat is a channel-wise concatenation between tensors with the same 2D shape. This is simply torch.cat().

We define composed operations from these simple operations in order to make our network deeper. Each composed operation has the same input and output shape as its simple counterpart; it simply adds more learnable parameters.

(7) ConvBlock is similar to Conv but includes an additional Conv. Note that it has the same input and output shape as (1) Conv.
(8) DownBlock is similar to DownConv but includes an additional ConvBlock. Note that it has the same input and output shape as (2) DownConv.
(9) UpBlock is similar to UpConv but includes an additional ConvBlock. Note that it has the same input and output shape as (3) UpConv.
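
The block descriptions above can be sketched in PyTorch roughly as follows. Figure 2 is not reproduced in this text, so the kernel sizes, strides, and paddings, as well as the placement of BN and GELU after each layer, are assumptions chosen only to reproduce the shape behavior described above. The lowercase helper names are hypothetical. Check the values against the diagram before relying on them.

```python
import torch
import torch.nn as nn


def conv(c_in, c_out):
    # (1) Conv: changes channels, keeps resolution (3x3, stride 1, padding 1 assumed).
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(c_out),
        nn.GELU(),
    )


def down_conv(c_in, c_out):
    # (2) DownConv: halves height and width (3x3, stride 2, padding 1 assumed).
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(c_out),
        nn.GELU(),
    )


def up_conv(c_in, c_out):
    # (3) UpConv: doubles height and width (4x4, stride 2, padding 1 assumed).
    return nn.Sequential(
        nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(c_out),
        nn.GELU(),
    )


def flatten():
    # (4) Flatten: average-pools a 7x7 map down to 1x1.
    return nn.Sequential(nn.AvgPool2d(kernel_size=7), nn.GELU())


def unflatten(c_in, c_out):
    # (5) Unflatten: expands 1x1 back to 7x7 (7x7 kernel, stride 7, padding 0 assumed).
    return nn.Sequential(
        nn.ConvTranspose2d(c_in, c_out, kernel_size=7, stride=7, padding=0),
        nn.BatchNorm2d(c_out),
        nn.GELU(),
    )


def conv_block(c_in, c_out):
    # (7) ConvBlock: Conv followed by one more Conv.
    return nn.Sequential(conv(c_in, c_out), conv(c_out, c_out))


def down_block(c_in, c_out):
    # (8) DownBlock: DownConv followed by a ConvBlock.
    return nn.Sequential(down_conv(c_in, c_out), conv_block(c_out, c_out))


def up_block(c_in, c_out):
    # (9) UpBlock: UpConv followed by a ConvBlock.
    return nn.Sequential(up_conv(c_in, c_out), conv_block(c_out, c_out))


x = torch.randn(2, 1, 28, 28)           # a batch of MNIST-sized inputs
d1 = down_block(1, 8)(x)                # 28x28 -> 14x14
d2 = down_block(8, 16)(d1)              # 14x14 -> 7x7
flat = flatten()(d2)                    # 7x7 -> 1x1
unflat = unflatten(16, 16)(flat)        # 1x1 -> 7x7
skip = torch.cat([unflat, d2], dim=1)   # (6) Concat along the channel dimension
u1 = up_block(32, 8)(skip)              # 7x7 -> 14x14
```

The short walk at the bottom only checks that the shapes line up (28 to 14 to 7 to 1 and back). The actual wiring of the network should follow Figure 1.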

1.2 Using the UNet to Train a Denoiser

Recall from equation B.1 that we aim to solve the following denoising problem: given a noisy image $z$, we aim to train a denoiser $D_{θ}$ that maps $z$ to a clean image $x$. To do so, we can optimize an L2 loss $L = E_{z, x} ‖ D_{θ} (z) - x ‖^{2}$.

To train our denoiser, we need to generate training data pairs $(z, x)$, where each $x$ is a clean MNIST digit. For each training batch, we can generate $z$ from $x$ using the following noising process:

$$z = x + σ ϵ, \quad \text{where } ϵ \sim N (0, I) . \tag{B.2}$$

Visualize the noising process over $σ = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]$, assuming normalized $x \in [0, 1]$. It should be similar to the following plot:

Figure 3: Varying levels of noise on MNIST digits
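
A minimal sketch of the noising process in equation B.2 might look like the following. The helper name `add_noise` is hypothetical, and a random tensor stands in for a normalized MNIST digit so the snippet runs without downloading the dataset.

```python
import torch


def add_noise(x, sigma):
    """Return z = x + sigma * eps with eps ~ N(0, I), as in equation B.2."""
    eps = torch.randn_like(x)
    return x + sigma * eps


sigmas = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]
x = torch.rand(1, 28, 28)  # stand-in for a clean digit normalized to [0, 1]
noisy = [add_noise(x, sigma) for sigma in sigmas]
```

Each entry of `noisy` can then be shown side by side (for example with `matplotlib`) to produce a row like the one in Figure 3. Note that the noisy images are not clipped here, so values can fall outside $[0, 1]$.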

1.2.1 Training

Now, we will train the model to perform denoising.

Objective: Train a denoiser to denoise noisy image $z$ with $σ = 0.5$ applied to a clean image $x$ .
Dataset and dataloader: Use the MNIST dataset via torchvision.datasets.MNIST with flags to access training and test sets. Train only on the training set. Shuffle the dataset before creating the dataloader. Recommended batch size: 256. We'll train over our dataset for 5 epochs.
- You should only noise the image batches when fetched from the dataloader so that in every epoch the network will see new noised images, improving generalization.
Model: Use the UNet architecture defined in section 1.1 with recommended hidden dimension D = 128.
Optimizer: Use Adam optimizer with learning rate of 1e-4.
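
The setup above can be sketched as a short training loop. This is only an illustration under stated assumptions: the function name `train_denoiser` is hypothetical, a single convolution stands in for the UNet, and random tensors stand in for MNIST so the snippet runs offline. With `torchvision.datasets.MNIST`, the dataloader yields `(image, label)` pairs instead of one-element tuples.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def train_denoiser(model, loader, num_epochs=5, sigma=0.5, lr=1e-4):
    """Train `model` to map noisy z back to clean x; returns per-step losses."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    losses = []
    for _ in range(num_epochs):
        for (x,) in loader:
            # Noise is drawn when the batch is fetched, so every epoch
            # sees freshly noised images.
            z = x + sigma * torch.randn_like(x)
            loss = nn.functional.mse_loss(model(z), x)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            losses.append(loss.item())
    return losses


# Stand-ins (assumptions) so the sketch runs without MNIST or the real UNet.
fake_digits = torch.rand(64, 1, 28, 28)
loader = DataLoader(TensorDataset(fake_digits), batch_size=32, shuffle=True)
toy_model = nn.Conv2d(1, 1, kernel_size=3, padding=1)
losses = train_denoiser(toy_model, loader, num_epochs=1)
```

The returned `losses` list is what you would plot for a curve like Figure 4. `mse_loss` averages over pixels, which differs from the squared L2 norm in equation B.1 only by a constant factor.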

Figure 4: Training Loss Curve

You should visualize denoised results on the test set at the end of training. Display sample results after the 1st and 5th epoch.

They should look something like these:

Figure 5: Results on digits from the test set after 1 epoch of training

Figure 6: Results on digits from the test set after 5 epochs of training

1.2.2 Out-of-Distribution Testing

Our denoiser was trained on MNIST digits noised with $σ = 0.5$ . Let's see how the denoiser performs on different $σ$ 's that it wasn't trained for.

Visualize the denoiser results on test set digits with varying levels of noise $σ = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]$ .

Figure 7: Results on digits from the test set with varying noise levels.

1.2.3 Denoising Pure Noise

To make denoising a generative task, we'd like to be able to denoise pure, random Gaussian noise. We can think of this as starting with a blank canvas $z = ϵ$ where $ϵ \sim N (0, I)$ and denoising it to get a clean image $x$ .

Repeat the same training process as in part 1.2.1, but use pure noise $ϵ \sim N (0, I)$ as the input and train for 5 epochs. Display your results after 1 and 5 epochs.

Additionally, compute the average image of the training set. What patterns do you notice when comparing the average image with our attempts to denoise pure noise? Why might this be happening?

We won't provide reference images for this part.
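
One possible way to compute the average training image without holding the whole dataset in memory is to accumulate a running sum over batches. The helper name `average_image` is hypothetical, and the two small tensors below are stand-ins for batches from the training dataloader.

```python
import torch


def average_image(batches):
    """Average all images in an iterable of (N, C, H, W) batches."""
    total, count = None, 0
    for x in batches:
        batch_sum = x.sum(dim=0)
        total = batch_sum if total is None else total + batch_sum
        count += x.shape[0]
    return total / count


# Stand-in batches: two all-white and two all-black images.
batches = [torch.ones(2, 1, 28, 28), torch.zeros(2, 1, 28, 28)]
avg = average_image(batches)
```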

Deliverables

A visualization of the noising process using $σ = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]$ . (figure 3)
A training loss curve, plotted every few iterations over the whole training process (figure 4).
Sample results on the test set after the first and the fifth epoch (staff solution takes ~3 minutes for 5 epochs on a Colab T4 GPU). (figure 5, 6)
Sample results on the test set with out-of-distribution noise levels after the model is trained. Keep the same image and vary $σ = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]$ . (figure 7)
Sample results on the test set with pure noise $ϵ \sim N (0, I)$ .
Average image of the training set along with a brief description comparing it to the denoising results.

Hint

Since training can take a while, we strongly recommend that you checkpoint your model every epoch to your personal Google Drive. Colab notebooks aren't persistent: if you are idle for a while, you will lose your connection and your training progress. Checkpointing consists of:
- Google Drive mounting.
- Epoch-wise model & optimizer checkpointing.
- Model & optimizer resuming from checkpoints.
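
The three steps above could be sketched as follows. The helper names and the checkpoint layout are assumptions, not part of the starter code. In Colab, Drive is typically mounted with `from google.colab import drive; drive.mount("/content/drive")`, after which a path under `/content/drive/MyDrive/` can be used. The snippet below uses a temporary directory instead so it runs anywhere.

```python
import os
import tempfile

import torch


def save_checkpoint(path, model, optimizer, epoch):
    """Store model and optimizer state together with the finished epoch."""
    torch.save(
        {
            "epoch": epoch,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
        },
        path,
    )


def load_checkpoint(path, model, optimizer):
    """Restore state if a checkpoint exists; return the epoch to start from."""
    if not os.path.exists(path):
        return 0
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["epoch"] + 1


# Tiny stand-in model; in the project this would be the UNet.
model = torch.nn.Linear(2, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

with tempfile.TemporaryDirectory() as ckpt_dir:
    ckpt_path = os.path.join(ckpt_dir, "denoiser.pt")  # hypothetical file name
    first_epoch = load_checkpoint(ckpt_path, model, optimizer)    # nothing saved yet
    save_checkpoint(ckpt_path, model, optimizer, epoch=2)
    resume_epoch = load_checkpoint(ckpt_path, model, optimizer)   # continue after epoch 2
```

A training loop would then run `for epoch in range(start_epoch, num_epochs)` and call `save_checkpoint` at the end of every epoch.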

Part 2: Training a Flow Matching Model

We just saw that one-step denoising does not work well for generative tasks. Instead, we need to iteratively denoise the image, and we will do so with flow matching. Here, we will iteratively denoise an image by training a UNet model to predict the "flow" from our noisy data to clean data. In our flow matching setup, we sample a pure noise image $x_{0} \sim N (0, I)$ and generate a realistic image $x_{1}$.

For iterative denoising, we need to define how intermediate noisy samples are constructed. The simplest approach is a linear interpolation between noisy $x_{0}$ and clean $x_{1}$ for some $x_{1}$ in our training data:

$$x_{t} = (1 - t) x_{0} + t x_{1}, \quad \text{where } x_{0} \sim N (0, I),\ t \in [0, 1] . \tag{B.3}$$

This is a vector field describing the position of a point $x_{t}$ at time $t$ relative to the clean data distribution $p_{1} (x_{1})$ and the noisy data distribution $p_{0} (x_{0})$. Intuitively, we see that for small $t$, we remain close to noise, while for larger $t$, we approach the clean distribution.

Flow can be thought of as the velocity (change in position w.r.t. time) of this vector field, describing how to move from $x_{0}$ to $x_{1}$: $$u (x_{t}, t) = \frac{d}{d t} x_{t} = x_{1} - x_{0} . \tag{B.4}$$

Our aim is to learn a UNet $u_{θ} (x_{t}, t)$ which approximates this flow $u (x_{t}, t) = x_{1} - x_{0}$, giving us our learning objective: $$L = E_{x_{0} \sim p_{0} (x_{0}), x_{1} \sim p_{1} (x_{1}), t \sim U [0, 1]} ‖ (x_{1} - x_{0}) - u_{θ} (x_{t}, t) ‖^{2} . \tag{B.5}$$
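
Equations B.3 to B.5 translate fairly directly into code. The sketch below is one possible reading: `interpolate`, `flow_matching_loss`, and the `ToyFlow` stand-in for the UNet are hypothetical names, and `mse_loss` averages over pixels, so it matches B.5 only up to a constant factor.

```python
import torch
import torch.nn as nn


def interpolate(x0, x1, t):
    """Equation B.3: x_t = (1 - t) * x0 + t * x1, with t of shape (N, 1)."""
    t_img = t.view(-1, 1, 1, 1)  # broadcast one time value over each image
    return (1 - t_img) * x0 + t_img * x1


def flow_matching_loss(model, x1):
    """Equation B.5 for one batch of clean images x1 of shape (N, C, H, W)."""
    x0 = torch.randn_like(x1)                            # x0 ~ N(0, I)
    t = torch.rand(x1.shape[0], 1, device=x1.device)     # t ~ U[0, 1]
    xt = interpolate(x0, x1, t)
    target = x1 - x0                                     # the flow from equation B.4
    return nn.functional.mse_loss(model(xt, t), target)


class ToyFlow(nn.Module):
    """Hypothetical stand-in for the time-conditioned UNet."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)

    def forward(self, x, t):
        return self.conv(x) * t.view(-1, 1, 1, 1)


x1 = torch.rand(8, 1, 28, 28)  # stand-in for a batch of clean digits
loss = flow_matching_loss(ToyFlow(), x1)
```

Note that $t = 0$ returns pure noise and $t = 1$ returns the clean image, which matches the intuition given above.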

2.1 Adding Time Conditioning to UNet

We need a way to inject the scalar $t$ into our UNet model to condition it. There are many ways to do this. Here is what we suggest:

Figure 8: Conditioned UNet

Note: It may look like we're predicting the original image in the figure above, but we are not. We're predicting the flow from the noisy $x_{0}$ to clean $x_{1}$ , which will contain both parts of the original image as well as the noise to remove.

This uses a new operator called FCBlock (fully-connected block) which we use to inject the conditioning signal into the UNet:

Figure 9: FCBlock for conditioning

Here Linear(F_in, F_out) is a linear layer with F_in input features and F_out output features. You can implement it using nn.Linear.

Since our conditioning signal $t$ is a scalar, F_in should be of size 1.
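
A minimal FCBlock sketch is shown below. Figure 9 is not reproduced in this text, so the exact layer order (Linear, GELU, Linear) is an assumption based on the operators named above. Check it against the diagram.

```python
import torch
import torch.nn as nn


class FCBlock(nn.Module):
    """Small MLP that embeds a conditioning signal (layer layout assumed)."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, out_features),
            nn.GELU(),
            nn.Linear(out_features, out_features),
        )

    def forward(self, x):
        return self.net(x)


t = torch.rand(4, 1)         # one normalized timestep per batch element
embedding = FCBlock(1, 128)(t)
```

The input is given shape `(N, 1)` rather than `(N,)` because `nn.Linear` expects the feature dimension last. The output width of 128 is only an example value.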

You can embed $t$ by following this pseudo code:


fc1_t = FCBlock(...)
fc2_t = FCBlock(...)

# the t passed in here should be normalized to be in the range [0, 1]
t1 = fc1_t(t)
t2 = fc2_t(t)

# Follow diagram to get unflatten.
# Replace the original unflatten with modulated unflatten.
unflatten = unflatten * t1
# Follow diagram to get up1.
...
# Replace the original up1 with modulated up1.
up1 = up1 * t2
# Follow diagram to get the output.
...
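
One practical detail in the pseudocode above is broadcasting: an FCBlock output has shape `(N, C)`, while `unflatten` and `up1` are feature maps of shape `(N, C, H, W)`. A hedged sketch of the multiplication, using a hypothetical helper `modulate`, is:

```python
import torch


def modulate(feature_map, scale):
    """Scale each channel of an (N, C, H, W) feature map by an (N, C) embedding."""
    n, c = scale.shape
    return feature_map * scale.view(n, c, 1, 1)


features = torch.ones(2, 3, 7, 7)
scale = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
out = modulate(features, scale)
```

Without the reshape to `(N, C, 1, 1)`, PyTorch would try to align the embedding with the trailing height and width dimensions, which either fails or silently scales the wrong axes.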

2.2 Training the UNet

Training our time-conditioned UNet $u_{θ} (x_{t}, t)$ is now pretty easy. Basically, we pick a random image from the training set, a random timestep $t$, and train the denoiser to predict the flow at $x_{t}$. We repeat this for different images and different timesteps until the model converges and we are happy.

Algorithm B.1. Training time-conditioned UNet

Objective: Train a time-conditioned UNet $u_{θ} (x_{t}, t)$ to predict the flow at $x_{t}$ given a noisy image $x_{t}$ and a timestep $t$ .
Dataset and dataloader: Use the MNIST dataset via torchvision.datasets.MNIST with flags to access training and test sets. Train only on the training set. Shuffle the dataset before creating the dataloader. Recommended batch size: 64.
- As shown in algorithm B.1, you should only noise the image batches when fetched from the dataloader.
Model: Use the time-conditioned UNet architecture defined in section 2.1 with recommended hidden dimension D = 64. Follow the diagram and pseudocode for how to inject the conditioning signal $t$ into the UNet. Remember to normalize $t$ before embedding it.
Optimizer: Use Adam optimizer with an initial learning rate of 1e-2. We will be using an exponential learning rate decay scheduler with a gamma of ${0.1}^{(1.0 / num_epochs)}$ . This can be implemented using scheduler = torch.optim.lr_scheduler.ExponentialLR(...). You should call scheduler.step() after every epoch.
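
The learning rate schedule described above can be sketched as follows. The value of `num_epochs` and the tiny stand-in model are assumptions for illustration. With this choice of gamma, the learning rate decays by a factor of 10 over the whole run, from 1e-2 to about 1e-3.

```python
import torch

num_epochs = 10  # assumed value for illustration
model = torch.nn.Linear(1, 1)  # stand-in for the time-conditioned UNet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
gamma = 0.1 ** (1.0 / num_epochs)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)

lrs = []
for epoch in range(num_epochs):
    lrs.append(scheduler.get_last_lr()[0])
    # ... one full pass over the training set goes here ...
    optimizer.step()   # placeholder for the per-batch optimizer steps
    scheduler.step()   # decay once per epoch, after the optimizer has stepped

final_lr = scheduler.get_last_lr()[0]
```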

Figure 10: Time-Conditioned UNet training loss curve

2.3 Sampling from the UNet

We can now use our UNet for iterative denoising using the algorithm below!

Algorithm B.2. Sampling from time-conditioned UNet
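
Algorithm B.2 is not reproduced in this text. A common way to sample from a learned flow, and the assumption made in this sketch, is to start from noise at $t = 0$ and take fixed-size Euler steps along the predicted flow until $t = 1$. The function name `sample` and the default `num_steps` are hypothetical. Defer to the algorithm figure where it differs.

```python
import torch


@torch.no_grad()
def sample(model, x0, num_steps=50):
    """Integrate dx/dt = model(x, t) from t = 0 to t = 1 with Euler steps."""
    x = x0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        # One time value per batch element, shape (N, 1), already in [0, 1].
        t = torch.full((x.shape[0], 1), i * dt, device=x.device)
        x = x + dt * model(x, t)
    return x


def constant_flow(x, t):
    # Toy velocity field: a constant flow of 1 moves every pixel by exactly 1.
    return torch.ones_like(x)


start = torch.zeros(2, 1, 4, 4)
end = sample(constant_flow, start, num_steps=20)
```

For real sampling, `x0` would be `torch.randn(n, 1, 28, 28)` and `model` the trained UNet in `eval()` mode.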

Sampling results after Epoch 1, Epoch 5, Epoch 10, Epoch 15, and Epoch 20

Deliverables

A training loss curve plot for the time-conditioned UNet over the whole training process (figure 10). Note that if you trained for fewer than 20 epochs, you will have fewer steps.
Sampling results for the time-conditioned UNet for 5 and 10 epochs. We provide up to 20 epochs just for reference.
- Note: providing a gif is optional and can be done as a bells and whistles below.

2.4 Adding Class-Conditioning to UNet

To make the results better and give us more control over image generation, we can also optionally condition our UNet on the class of the digit 0-9. This will require adding 2 more FCBlocks to our UNet. We suggest that you make the class-conditioning vector $c$ a one-hot vector instead of a single scalar. Because we still want our UNet to work without being conditioned on the class, we implement dropout where 10% of the time ($p_{uncond} = 0.1$) we drop the class conditioning vector $c$ by setting it to 0. Here is one way to condition our UNet $u_{θ} (x_{t}, t, c)$ on both time $t$ and class $c$:

fc1_t = FCBlock(...)
fc1_c = FCBlock(...)
fc2_t = FCBlock(...)
fc2_c = FCBlock(...)

t1 = fc1_t(t)
c1 = fc1_c(c)
t2 = fc2_t(t)
c2 = fc2_c(c)

# Follow diagram to get unflatten.
# Replace the original unflatten with modulated unflatten.
unflatten = c1 * unflatten + t1
# Follow diagram to get up1.
...
# Replace the original up1 with modulated up1.
up1 = c2 * up1 + t2
# Follow diagram to get the output.
...

Training for this section will be the same as time-only, with the only differences being the added conditioning vector $c$ and periodically dropping it so that the model also learns unconditional generation.
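
Building the conditioning vector with dropout could look like the sketch below. The helper name `make_class_condition` is hypothetical. It assumes integer labels in 0-9, as returned by the MNIST dataloader, and draws an independent drop decision for each sample in the batch.

```python
import torch
import torch.nn.functional as F


def make_class_condition(labels, num_classes=10, p_uncond=0.1):
    """One-hot encode labels and zero out each row with probability p_uncond."""
    c = F.one_hot(labels, num_classes=num_classes).float()
    # keep is 1 where the class is kept and 0 where it is dropped.
    keep = (torch.rand(labels.shape[0], 1, device=labels.device) >= p_uncond).float()
    return c * keep


labels = torch.arange(10)
c_train = make_class_condition(labels)                 # roughly 10% of rows zeroed
c_always = make_class_condition(labels, p_uncond=0.0)  # never dropped
c_never = make_class_condition(labels, p_uncond=1.0)   # always dropped
```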

Algorithm B.3. Training class-conditioned UNet

Figure 11: Class-conditioned UNet training loss curve

2.5 Sampling from the Class-Conditioned UNet

Now we will sample with class-conditioning and will use classifier-free guidance with $γ = 5.0$.

Algorithm B.4. Sampling from class-conditioned UNet
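
Algorithm B.4 is not reproduced in this text. The sketch below assumes the usual classifier-free guidance rule, $u = u_{uncond} + γ (u_{cond} - u_{uncond})$, combined with fixed-size Euler steps from $t = 0$ to $t = 1$. The function name `sample_cfg` and the default `num_steps` are hypothetical. Defer to the algorithm figure where it differs.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def sample_cfg(model, x0, c, gamma=5.0, num_steps=50):
    """Euler sampling where each step mixes conditional and unconditional flow."""
    x = x0
    dt = 1.0 / num_steps
    null_c = torch.zeros_like(c)  # dropped class vector, as used during training
    for i in range(num_steps):
        t = torch.full((x.shape[0], 1), i * dt, device=x.device)
        u_cond = model(x, t, c)
        u_uncond = model(x, t, null_c)
        u = u_uncond + gamma * (u_cond - u_uncond)
        x = x + dt * u
    return x


def toy_model(x, t, c):
    # Toy flow: 1 everywhere when a class is set, 0 when the class is dropped.
    return c.sum(dim=1).view(-1, 1, 1, 1) * torch.ones_like(x)


c = F.one_hot(torch.tensor([0, 1, 2]), num_classes=10).float()
start = torch.zeros(3, 1, 4, 4)
end = sample_cfg(toy_model, start, c, gamma=5.0, num_steps=10)
```

With the toy model, the guided flow is $0 + 5 \cdot (1 - 0) = 5$ at every step, so each pixel ends at 5. That makes the amplifying effect of $γ$ easy to see.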

Sampling results after Epoch 1, Epoch 5, Epoch 10, Epoch 15, and Epoch 20

Deliverables

A training loss curve plot for the class-conditioned UNet over the whole training process (figure 11). Note that if you trained for fewer than 20 epochs, you will have fewer steps.
Sampling results for the class-conditioned UNet for 5 and 10 epochs. Class-conditioning lets us converge faster, which is why we only train for 10 epochs. Generate 4 instances of each digit as shown above.

Optional Extension: DDPM

This project originally had you implement DDPM instead of flow matching (flow matching can be thought of as a generalization of DDPM). If you'd like to implement DDPM, you can do so by following the original project instructions found here.

Acknowledgements

This project was a joint effort by Ryan Tabrizi, Daniel Geng, and Hang Gao, advised by Liyue Shen, Andrew Owens, Angjoo Kanazawa, and Alexei Efros. We also thank David McAllister and Songwei Yi for their helpful feedback and suggestions.
