Using the diffusion model DeepFloyd IF, I generated and edited images and created optical illusions!
In this part of the project, we are using the DeepFloyd IF diffusion model. I sampled images from the model with the following prompts, using a seed of 180.
The following is for 10 inference steps:
The following is for 20 inference steps:
The following is for 100 inference steps:
As you can see, the images become more realistic and natural-looking as the number of inference steps increases.
The forward process takes a clean image and adds Gaussian noise to it. Timesteps run from 0 to 999, with t = 0 corresponding to the clean image and larger t corresponding to more noise. Here is the Campanile at the following timesteps.
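Concretely, the forward process computes x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon, with epsilon ~ N(0, I). Here is a minimal sketch, assuming `alphas_cumprod` holds the model's precomputed cumulative products alpha_bar_t:

```python
import torch

def forward_noise(x0, t, alphas_cumprod):
    """Noise a clean image x0 to timestep t.

    alphas_cumprod: 1-D tensor of alpha_bar_t values from the model's
    noise schedule (assumed precomputed).
    """
    abar = alphas_cumprod[t]
    eps = torch.randn_like(x0)  # epsilon ~ N(0, I)
    xt = abar.sqrt() * x0 + (1 - abar).sqrt() * eps
    return xt, eps
```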
With just a simple Gaussian blur filter to remove high frequencies, we get the following. Not so great...
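For comparison, this classical baseline is just a low-pass filter. A sketch using torchvision (the kernel size and sigma here are arbitrary choices, not values from the project):

```python
import torchvision.transforms.functional as TF

def blur_denoise(xt, kernel_size=7, sigma=2.0):
    """Naive denoising: suppress high frequencies with a Gaussian blur.
    This removes some noise but also smears away image detail."""
    return TF.gaussian_blur(xt, kernel_size=kernel_size, sigma=sigma)
```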
Using the UNet, we can estimate the noise in a given image in one step, and use that estimate to recover the clean image.
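Since the forward equation is linear in x_0, the one-step estimate just inverts it. A sketch, assuming `eps_hat` is the UNet's noise prediction:

```python
def one_step_denoise(xt, t, eps_hat, alphas_cumprod):
    """Estimate the clean image from noisy xt in a single step by
    solving x_t = sqrt(abar)*x0 + sqrt(1-abar)*eps for x0."""
    abar = alphas_cumprod[t]
    return (xt - (1 - abar).sqrt() * eps_hat) / abar.sqrt()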
When there is a lot of noise, one-step denoising doesn't do very well. But if we implement iterative denoising (at strided timesteps to keep it computationally inexpensive), we can do much better. In iterative denoising, we repeatedly estimate the noise and step the image to the previous, less noisy timestep until we reach a clean image.
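A sketch of the strided update, assuming `unet(x, t)` returns the noise estimate and `strided_timesteps` runs from the noisiest timestep down to 0 (the random-variance term added at each step is omitted here for brevity):

```python
def iterative_denoise(xt, strided_timesteps, unet, alphas_cumprod):
    """Denoise over a strided subset of timesteps: estimate x0 at each
    step, then interpolate toward the less-noisy timestep t_prev."""
    for t, t_prev in zip(strided_timesteps[:-1], strided_timesteps[1:]):
        abar_t, abar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
        alpha = abar_t / abar_prev          # alpha for this stride
        beta = 1 - alpha
        eps_hat = unet(xt, t)               # predicted noise
        x0_hat = (xt - (1 - abar_t).sqrt() * eps_hat) / abar_t.sqrt()
        xt = (abar_prev.sqrt() * beta / (1 - abar_t)) * x0_hat \
           + (alpha.sqrt() * (1 - abar_prev) / (1 - abar_t)) * xt
    return xt
```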
Passing random noise into the diffusion model, we can generate images with the text prompt "a high quality photo".
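Sampling is then just the same loop started from pure noise. A sketch reusing `iterative_denoise` from above (the 64x64 shape matches DeepFloyd's stage-1 output; the unet is assumed to be conditioned on the prompt embedding):

```python
import torch

x = torch.randn(1, 3, 64, 64)  # pure noise at the noisiest timestep
sample = iterative_denoise(x, strided_timesteps, unet, alphas_cumprod)
```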
Classifier-free guidance (CFG) produces much more natural images by combining a conditional noise estimate (based on a prompt) with an unconditional noise estimate (from a null prompt). Here are some samples using CFG.
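The guided estimate extrapolates from the unconditional estimate toward the conditional one. A minimal sketch (the guidance scale gamma is a free parameter; gamma = 1 recovers the plain conditional estimate, and gamma > 1 strengthens the prompt):

```python
def cfg_noise_estimate(eps_cond, eps_uncond, gamma=7.0):
    """Classifier-free guidance: extrapolate past the unconditional
    estimate in the direction of the conditional one."""
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```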
The SDEdit algorithm "edits" photos by adding noise and then projecting back onto the natural image manifold. Here are some Berkeley-themed images produced with this algorithm at different noise levels.
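A sketch of SDEdit, reusing `forward_noise` and `iterative_denoise` from the sketches above; `i_start` indexes into the strided timesteps, so a larger `i_start` means less added noise and edits that stay closer to the original:

```python
def sdedit(x_orig, i_start, strided_timesteps, unet, alphas_cumprod):
    """Noise an image partway down the schedule, then denoise it back
    onto the natural image manifold."""
    t_start = strided_timesteps[i_start]
    xt, _ = forward_noise(x_orig, t_start, alphas_cumprod)
    return iterative_denoise(xt, strided_timesteps[i_start:],
                             unet, alphas_cumprod)
```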
We can also use this SDEdit algorithm to edit images that are not "natural" looking (e.g., drawings).
Implementing the RePaint paper, we can edit specific areas of an image using a mask. The following shows the mask, the region to be edited, and the edits. I think it's kind of cool what the model hallucinates: for Sather Gate, you can see it added a person walking by, and for the Doe Library photo, it added a chandelier, both of which are pretty realistic edits.
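The key step is that after every denoising update, the pixels outside the mask are reset to a correspondingly noised copy of the original image, so only the masked region is free to change. A sketch of that forcing step, reusing `forward_noise` from above (mask == 1 marks the editable region):

```python
def inpaint_force(xt, t, x_orig, mask, alphas_cumprod):
    """After each denoising step to timestep t, overwrite everything
    outside the mask with the original image noised to level t."""
    x_forced, _ = forward_noise(x_orig, t, alphas_cumprod)
    return mask * xt + (1 - mask) * x_forced
```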
Earlier, the image-to-image translation just projected onto the natural image manifold. Here, we guide it with more specific prompts: "a rocket ship", "an oil painting of a snowy mountain village", and "a lithograph of waterfalls". Sather Gate in a snowy village is my personal favorite (i_start = 20).
If you take a noise estimate from the diffusion model for one prompt (epsilon_1), then flip the current image, take another noise estimate with a second prompt, and flip that estimate back (epsilon_2), you can average the two to get the noise estimate for a visual anagram (i.e., an image that looks like one thing in one orientation and another when flipped upside down). This makes a lot more sense when looking at the examples below.
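A sketch of the anagram noise estimate for one denoising step, assuming `unet(x, t, prompt)` returns a (CFG'd) noise estimate:

```python
import torch

def anagram_noise_estimate(xt, t, prompt1, prompt2, unet):
    """Average the estimate for prompt1 with the un-flipped estimate
    for prompt2 computed on the upside-down image."""
    eps1 = unet(xt, t, prompt1)
    eps2 = torch.flip(unet(torch.flip(xt, dims=[-2]), t, prompt2),
                      dims=[-2])
    return (eps1 + eps2) / 2
```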
If you pass the noise estimate for one prompt through a low-pass filter and the noise estimate for another prompt through a high-pass filter, then add them and iteratively denoise, you can get a hybrid image! The prompt associated with the low-pass component can be seen from afar, while the one associated with the high-pass component can be seen up close.
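A sketch of the hybrid noise estimate, using a Gaussian blur as the low-pass filter and (original minus blurred) as the high-pass (the kernel size and sigma here are assumptions, not tuned values):

```python
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(xt, t, prompt_far, prompt_near, unet,
                          kernel_size=33, sigma=2.0):
    """Low frequencies from the prompt seen from afar, high
    frequencies from the prompt seen up close."""
    eps_far = unet(xt, t, prompt_far)
    eps_near = unet(xt, t, prompt_near)
    eps_low = TF.gaussian_blur(eps_far, kernel_size, sigma)
    eps_high = eps_near - TF.gaussian_blur(eps_near, kernel_size, sigma)
    return eps_low + eps_high
```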