Project 5

Fun With Diffusion Models!

Overview

In this project we will use diffusion models to create images!

Part A will deal with using a pretrained diffusion model to implement a variety of sampling and image-editing algorithms.

In Part B we will construct and train our own network for image generation.

Photo 1 Coastline
Photo 2 A man with a hat

Part A: The Power of Diffusion Models!

Part 0: Setup

The model we use is the DeepFloyd IF diffusion model. Due to memory, storage, and time constraints, I use only the first stage, which creates the initial 64 x 64 images, and not the second stage, which upscales them into larger, more detailed images (except for the 3 images below).

I used a random seed of 42. However, my results were still non-deterministic. I believe this is due to GPU non-determinism: floating-point operations are not associative, so the order in which the GPU accumulates parallel work can change the result slightly from run to run.
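For reference, a minimal sketch of the seeding, assuming PyTorch (the determinism flags are optional extras that trade speed for reproducibility; they were not part of my setup):

```python
import torch

torch.manual_seed(42)  # seeds the CPU and all CUDA devices

# Optional: push the GPU toward reproducible kernels, at a speed cost.
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```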

These are 3 example images that use both stages:

Photo 1 An oil painting of a snowy village
Photo 2 A man wearing a hat
Photo 3 A rocket ship

Using both stages yields good results! Increasing num_inference_steps also seems to improve image quality.

Part 1: Sampling Loops

1.1: Implementing the Forward Process

Adding noise to the image via the forward process: t=0 is the clean image and t=1000 is (nearly) pure noise.
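Concretely, the forward process computes x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps with eps ~ N(0, I). A minimal sketch, assuming a precomputed alphas_cumprod tensor (for DeepFloyd, stage_1.scheduler.alphas_cumprod); the helper name is my own:

```python
import torch

def forward(im, t, alphas_cumprod):
    # x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps,  eps ~ N(0, I)
    # Returns the noisy image and the noise that was injected.
    a_bar = alphas_cumprod[t]
    eps = torch.randn_like(im)
    return a_bar.sqrt() * im + (1 - a_bar).sqrt() * eps, eps
```

The original image and three noise levels: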

Photo 1

Original Image

Photo 2

t=250

Photo 3

t=500

Photo 4

t=750

1.2: Classical Denoising

"Denoising" by performing a gaussian blur

Photo 1

t=250, blurred

Photo 2

t=500, blurred

Photo 3

t=750, blurred

1.3: One-Step Denoising

Here we use DeepFloyd's UNet to estimate the noise and denoise the image in a single step.
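Given the UNet's noise estimate eps_hat, the clean-image estimate simply inverts the forward equation: x0_hat = (x_t - sqrt(1 - a_bar_t) * eps_hat) / sqrt(a_bar_t). A sketch, assuming DeepFloyd's stage-1 UNet (whose first 3 output channels are the noise estimate; the rest parameterize variance):

```python
import torch

@torch.no_grad()
def one_step_denoise(x_t, t, unet, prompt_embeds, alphas_cumprod):
    a_bar = alphas_cumprod[t]
    # Keep only the noise-estimate channels of the UNet output.
    eps_hat = unet(x_t, t, encoder_hidden_states=prompt_embeds).sample[:, :3]
    # Invert x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps.
    return (x_t - (1 - a_bar).sqrt() * eps_hat) / a_bar.sqrt()
```

Results: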

Photo 1

t=250, one step denoised

Photo 2

t=500, one step denoised

Photo 3

t=750, one step denoised

1.4: Iterative Denoising

Here we use multiple steps to denoise the image, hopefully achieving better results.
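A sketch of the update I use, with strided timesteps running from high to low (e.g. 990 down to 0 in steps of 30). The sqrt(beta) * noise term at the end is a simplification; a real scheduler computes the per-step variance more carefully:

```python
import torch

@torch.no_grad()
def iterative_denoise(x, timesteps, unet, prompt_embeds, alphas_cumprod):
    for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
        a_bar, a_bar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
        alpha = a_bar / a_bar_prev  # effective alpha for this stride
        beta = 1 - alpha
        eps = unet(x, t, encoder_hidden_states=prompt_embeds).sample[:, :3]
        x0_hat = (x - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()
        # Posterior mean: blend the clean estimate with the current noisy
        # image, then re-inject a little noise (none on the final step).
        noise = torch.randn_like(x) if t_prev > 0 else 0.0
        x = (a_bar_prev.sqrt() * beta / (1 - a_bar)) * x0_hat \
            + (alpha.sqrt() * (1 - a_bar_prev) / (1 - a_bar)) * x \
            + beta.sqrt() * noise
    return x
```

Intermediate estimates during denoising: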

Photo 1

t=690

Photo 1

t=540

Photo 1

t=390

Photo 1

t=240

Photo 1

t=90

Photo 1

t=0 (Denoised)

Summary of 1.1 through 1.4, comparing the denoising methods on an image noised to t=750:

Photo 1

Noised

Photo 1

Classically "Denoised"

Photo 1

One-Step Denoised

Photo 1

Iteratively Denoised

1.5: Diffusion Model Sampling

Using the diffusion model to generate images by denoising pure noise, sampling from the model's learned distribution of high-quality images.
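Sampling reuses the iterative_denoise sketch above, started from pure noise and conditioned on a generic prompt (e.g. "a high quality photo"):

```python
import torch

def sample(n, timesteps, unet, prompt_embeds, alphas_cumprod):
    # Pure Gaussian noise is a valid x_T, so fully denoising it
    # produces a brand-new image.
    x = torch.randn(n, 3, 64, 64)  # DeepFloyd stage-1 resolution
    return iterative_denoise(x, timesteps, unet, prompt_embeds, alphas_cumprod)
```

Five samples: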

Photo 1

Sampled Image 1

Photo 1

Sampled Image 2

Photo 1

Sampled Image 3

Photo 1

Sampled Image 4

Photo 1

Sampled Image 5

1.6: Classifier-Free Guidance (CFG)

In the section above, the model doesn't have a specific "idea" of what it is trying to generate. As a result, the images are often nonsensical.

To combat this, we use classifier-free guidance (CFG), which directs the model toward a more coherent image.
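Concretely, the UNet runs twice per step, once with the text conditioning and once with a null prompt, and we extrapolate past the conditional estimate: eps = eps_u + gamma * (eps_c - eps_u) with gamma > 1. A sketch (the scale of 7 is a typical choice, not a prescribed value):

```python
def cfg_noise_estimate(x, t, unet, cond_embeds, uncond_embeds, scale=7.0):
    # scale = 1 recovers plain conditional sampling; scale > 1 pushes the
    # sample harder toward the prompt at the cost of diversity.
    eps_c = unet(x, t, encoder_hidden_states=cond_embeds).sample[:, :3]
    eps_u = unet(x, t, encoder_hidden_states=uncond_embeds).sample[:, :3]
    return eps_u + scale * (eps_c - eps_u)
```

Five samples with CFG: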

Photo 1

Sampled Image 1

Photo 1

Sampled Image 2

Photo 1

Sampled Image 3

Photo 1

Sampled Image 4

Photo 1

Sampled Image 5

1.7: Image-to-image Translation

I think of this as modulating from a random point on the manifold of high-quality images toward an image we specify; the procedure (SDEdit) is sketched below.
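A sketch, reusing the forward and iterative_denoise helpers from earlier:

```python
def edit(im, i_start, timesteps, unet, prompt_embeds, alphas_cumprod):
    # Noise the input up to timesteps[i_start] (timesteps run high to
    # low, so a small i_start means lots of noise and more creative
    # liberty), then run the usual denoising loop from there.
    x_t, _ = forward(im, timesteps[i_start], alphas_cumprod)
    return iterative_denoise(x_t, timesteps[i_start:], unet,
                             prompt_embeds, alphas_cumprod)
```

The three test images: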

Photo 1

Campanile

Photo 1

Darth Vader

Photo 1

A rat

Campanile Test Image:

Photo 1

i_start=1

Photo 1

i_start=3

Photo 1

i_start=5

Photo 1

i_start=7

Photo 1

i_start=10

Photo 1

i_start=20

Darth Vader:

Photo 1

i_start=1

Photo 1

i_start=3

Photo 1

i_start=5

Photo 1

i_start=7

Photo 1

i_start=10

Photo 1

i_start=20

An ugly rat:

Photo 1

i_start=1

Photo 1

i_start=3

Photo 1

i_start=5

Photo 1

i_start=7

Photo 1

i_start=10

Photo 1

i_start=20

1.7.1: Editing Hand-Drawn and Web Images

I modulate from a random point on the high-quality image manifold toward the following 3 images:

Photo 1

A photo of a minecraft house

Photo 1

A really nice self portrait

Photo 1

An elegant drawing of a house

Minecraft:

Photo 1

i_start=1

Photo 1

i_start=3

Photo 1

i_start=5

Photo 1

i_start=7

Photo 1

i_start=10

Photo 1

i_start=20

Self-portrait:

Photo 1

i_start=1

Photo 1

i_start=3

Photo 1

i_start=5

Photo 1

i_start=7

Photo 1

i_start=10

Photo 1

i_start=20

A beautiful house:

Photo 1

i_start=1

Photo 1

i_start=3

Photo 1

i_start=5

Photo 1

i_start=7

Photo 1

i_start=10

Photo 1

i_start=20

1.7.2: Inpainting

I mask off an area of the following images and regenerate only the masked region.
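The key step, sketched below with the forward helper from 1.1: after every denoising update, pixels outside the mask are reset to the original image noised to the current timestep, so only the masked region is actually generated:

```python
def inpaint_step(x_t, t, mask, orig, alphas_cumprod):
    # mask == 1 where new content should be generated; everywhere else,
    # force the original image (at the matching noise level) back in.
    noised_orig, _ = forward(orig, t, alphas_cumprod)
    return mask * x_t + (1 - mask) * noised_orig
```

The masked inputs: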

Photo 1

The test image (top is masked)

Photo 1

A tiger (entire head is masked)

Photo 1

Steve from minecraft (face is masked)

Photo 1 A lighthouse?
Photo 1 Zombie tiger? (Zoom in, this one is subtle)
Photo 1 Steve + a minion from Despicable Me?

1.7.3: Text-Conditional Image-to-image Translation

Similar to the section above, except we now use a text prompt to specify where on the image manifold the denoising starts. In this case we move from the manifold matching each prompt (e.g. rocket ships) toward the test images; code-wise, the only change from the edit sketch in 1.7 is passing the new prompt's embeddings.

A rocket ship -> Campanile

Photo 1

i_start=1

Photo 1

i_start=3

Photo 1

i_start=5

Photo 1

i_start=7

Photo 1

i_start=10

Photo 1

i_start=20

A pencil -> Campanile

Photo 1

i_start=1

Photo 1

i_start=3

Photo 1

i_start=5

Photo 1

i_start=7

Photo 1

i_start=10

Photo 1

i_start=20

A man wearing a hat -> Campanile

Photo 1

i_start=1

Photo 1

i_start=3

Photo 1

i_start=5

Photo 1

i_start=7

Photo 1

i_start=10

Photo 1

i_start=20

1.8: Visual Anagrams

These are images that look like different things depending on whether you view them upright or rotated 180 degrees.
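At every denoising step, one noise estimate is computed for the upright image under the first prompt and another for the rotated image under the second prompt; averaging the two (after un-rotating) serves both readings at once. A sketch:

```python
import torch

def anagram_noise_estimate(x, t, unet, embeds_up, embeds_down):
    eps1 = unet(x, t, encoder_hidden_states=embeds_up).sample[:, :3]
    x_rot = torch.rot90(x, k=2, dims=(-2, -1))    # rotate 180 degrees
    eps2 = unet(x_rot, t, encoder_hidden_states=embeds_down).sample[:, :3]
    eps2 = torch.rot90(eps2, k=2, dims=(-2, -1))  # un-rotate the estimate
    return (eps1 + eps2) / 2
```

Results: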

Photo 1

An oil painting of an old man

Photo 1

An oil painting of people around a campfire

Photo 1

A hipster barista

Photo 1

A dog

Photo 1

coastline

Photo 1

A man wearing a hat

1.9: Hybrid Images

These are images that look like different things depending on whether you zoom in or zoom out.
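The trick is to combine the low frequencies of one prompt's noise estimate with the high frequencies of another's, since low frequencies dominate from far away and high frequencies up close. A sketch (the kernel size and sigma are my own choices, tunable):

```python
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(x, t, unet, far_embeds, near_embeds):
    eps_far = unet(x, t, encoder_hidden_states=far_embeds).sample[:, :3]
    eps_near = unet(x, t, encoder_hidden_states=near_embeds).sample[:, :3]
    low = TF.gaussian_blur(eps_far, kernel_size=33, sigma=2.0)
    high = eps_near - TF.gaussian_blur(eps_near, kernel_size=33, sigma=2.0)
    return low + high
```

Results: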

Photo 1

Zoomed in, a waterfall. Zoomed out, a skull

Photo 1

Zoomed in, a waterfall. Zoomed out, an oil painting of an old man.

Photo 1

Zoomed in, a pencil. Zoomed out, a rocket.

Part B: Diffusion Models from Scratch!

Part 1: Training a Single-Step Denoising UNet

1.1-1.2: Implementing the UNet and Using the UNet to Train a Denoiser

Photo 1 Plot of different levels of noise on various digits
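The denoiser is trained with an L2 objective: noise a clean digit x to z = x + sigma * eps and regress the UNet's output back to x. A sketch of one training step (sigma = 0.5 and the optimizer handling are assumptions about the setup):

```python
import torch
import torch.nn.functional as F

def train_step(unet, opt, x, sigma=0.5):
    # z = x + sigma * eps; the UNet maps the noisy digit back to x.
    eps = torch.randn_like(x)
    z = x + sigma * eps
    loss = F.mse_loss(unet(z), x)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```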

Training Loss (L2 Loss):

Photo 1

Noisy Images:

Photo 1

Results after 1st epoch:

Photo 1

Results after 5th epoch:

Photo 1

Part 2: Training a Diffusion Model

2.1 - 2.3: Adding Time Conditioning, Training, and Sampling the New UNet:
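Time conditioning feeds the timestep into the UNet alongside the noisy image. Training draws a random t per image, noises to that level with the forward equation, and regresses the injected noise. A sketch of one step (the t/T normalization and T = 300 are assumptions about the setup):

```python
import torch
import torch.nn.functional as F

def train_step_tc(unet, opt, x0, alphas_cumprod, T=300):
    t = torch.randint(0, T, (x0.shape[0],))             # one t per image
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps  # forward process
    loss = F.mse_loss(unet(x_t, t.float() / T), eps)    # predict the noise
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```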

Loss Curve (L2):

Photo 1

Results after 5th epoch:

Photo 1

Results after 20th epoch:

Photo 1

2.4 - 2.5: Adding Class Conditioning, Training, and Sampling the New UNet:
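Class conditioning additionally feeds a one-hot digit class into the UNet, randomly zeroed some fraction of the time so the network also learns the unconditional model; that is what makes classifier-free guidance possible at sampling time. A sketch of building the conditioning vector (the 10% drop rate is an assumption):

```python
import torch
import torch.nn.functional as F

def class_condition(labels, num_classes=10, p_uncond=0.1):
    # One-hot encode the class; with probability p_uncond, zero the
    # vector so the same UNet learns the unconditional model too.
    c = F.one_hot(labels, num_classes).float()
    drop = (torch.rand(labels.shape[0], 1) < p_uncond).float()
    return c * (1 - drop)
```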

Loss Curve (L2):

Photo 1

Results after 5th epoch:

Photo 1

Results after 20th epoch:

Photo 1