Image Generation Using Stable Diffusion
Generating new, high-quality images has become a popular topic in the field of Machine Learning. The goal is to train a model that can generate images that look similar to the training data. The best-known approach to this problem is Generative Adversarial Networks (GANs), which are trained using a game-theoretic approach. However, training GANs can be challenging due to their unstable nature, and this is where image generation using Stable Diffusion comes into play. In this article, we’ll cover the following topics:
- What is Stable Diffusion?
- How does Stable Diffusion work?
- Image Generation Code Sample using KerasCV
- References & Resources
What is Stable Diffusion?
Stable diffusion is a new approach to image generation that aims to overcome the instability of GANs. It is based on a diffusion process, which is a mathematical method for simulating the random movement of particles. In the context of image generation, the diffusion process is used to generate new images by slowly modifying the values of pixels. The goal is to generate images that are similar to the training data but not identical.
The main idea behind stable diffusion is to use a stability criterion to control the diffusion process. This criterion ensures that the generated images remain similar to the training data, even as the diffusion process progresses. The stability criterion can be thought of as a kind of regularization that helps to prevent the generated images from deviating too far from the training data.
The key to stable diffusion is a denoising diffusion process: a model is trained to predict the noise present in an image so that the noise can be gradually removed. Each step combines learned gradient (score) information with fresh random noise to guide the process. The gradient information keeps the generated images close to the training data distribution, while the random noise ensures that the generated images are not mere copies of the training examples.
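To make the forward half of this process concrete, here is a minimal NumPy sketch of how an image is gradually pushed toward noise. The linear variance schedule (betas) and the number of steps are illustrative assumptions, not values taken from any particular paper:

import numpy as np

# Illustrative linear variance schedule; the exact values are an assumption.
num_steps = 1000
betas = np.linspace(1e-4, 0.02, num_steps)
alphas_cumprod = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)

def forward_diffuse(x0, t):
    """Add t steps' worth of Gaussian noise to a clean image x0 in a single shot."""
    noise = rng.standard_normal(x0.shape)
    a_bar = alphas_cumprod[t]
    # The noisy sample is a weighted mix of the clean image and pure noise.
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise

# Example: a toy 8x8 "image" pushed most of the way toward pure noise.
x_t = forward_diffuse(np.ones((8, 8)), t=900)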
How does Stable Diffusion work?
The Stable Diffusion method builds on the well-known idea of super-resolution: a deep learning model can be trained to denoise a blurry, low-resolution input image, producing a higher-resolution version of the same image. However, the model doesn’t achieve this by magically restoring the information missing from the noisy, low-resolution input. Instead, it uses its training data distribution to imagine the visual details that are most probable given the input.
In the realm of stable diffusion models, an intriguing notion arises: what if we apply the model to pure noise? The result would be a process of “denoising the noise” which could then be leveraged to generate a completely novel image. By repeating this procedure multiple times, a small patch of noise can transform into an increasingly clear and high-resolution artificial image.
This key concept of latent diffusion was introduced in the 2022 paper “High-Resolution Image Synthesis with Latent Diffusion Models” [2].
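The “denoise the noise” loop can be sketched in a few lines. Here, denoiser is a hypothetical placeholder for a trained noise-prediction network, and the update rule inside the loop is deliberately simplified rather than the exact sampler Stable Diffusion uses:

import numpy as np

def generate_from_noise(denoiser, shape, num_steps=50):
    """Start from pure noise and repeatedly remove a little of the predicted noise."""
    rng = np.random.default_rng(0)
    x = rng.standard_normal(shape)           # pure Gaussian noise
    for step in reversed(range(num_steps)):
        predicted_noise = denoiser(x, step)  # hypothetical trained denoising model
        x = x - predicted_noise / num_steps  # deliberately simplified update rule
    return x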
Text-to-image systems using stable diffusion models
To create a text-to-image system using stable diffusion models, we require an essential feature: the ability to govern the generated visual content using specific prompt keywords. We achieve this capability through “conditioning,” a fundamental deep learning technique that involves concatenating a vector representing a text prompt to a noise patch. Then, we train the model on a dataset of {image: caption} pairs.
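As a rough illustration of conditioning, the toy Keras model below combines a prompt embedding with a noisy latent patch before predicting the noise. The layer sizes, the use of plain Dense and Conv2D layers, and the broadcasting trick are assumptions made for illustration only; the real architecture is considerably more sophisticated:

from tensorflow import keras

latent = keras.Input(shape=(64, 64, 4))        # noisy 64x64 latent patch
text_embedding = keras.Input(shape=(77, 768))  # encoded prompt (sizes are illustrative)

# Summarize the prompt embedding and broadcast it over the latent's spatial grid.
ctx = keras.layers.GlobalAveragePooling1D()(text_embedding)
ctx = keras.layers.Dense(4)(ctx)
ctx = keras.layers.Reshape((1, 1, 4))(ctx)
ctx = keras.layers.UpSampling2D(size=(64, 64))(ctx)

# Concatenate the condition with the noisy latent, then predict the noise to remove.
x = keras.layers.Concatenate()([latent, ctx])
x = keras.layers.Conv2D(64, 3, padding="same", activation="relu")(x)
predicted_noise = keras.layers.Conv2D(4, 3, padding="same")(x)

toy_denoiser = keras.Model([latent, text_embedding], predicted_noise)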
This technique leads to the Stable Diffusion architecture, which consists of three primary components:
- A text encoder: converts the prompt into a latent vector.
- A diffusion model: repeatedly “denoises” a 64×64 latent image patch.
- A decoder: transforms the final 64×64 latent patch into a higher-resolution 512×512 image.
The text encoder is a pre-trained and frozen language model that maps the text prompt to a latent vector. The resulting latent vector is then concatenated with a randomly generated noise patch, which is “denoised” by the diffusion model over a series of “steps.” The number of steps determines the image’s clarity and quality, with the default value being 50 steps.
Finally, the 64×64 latent image is fed into the decoder to produce a high-resolution version of the generated image.
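Putting the three components together, the text-to-image pipeline looks roughly like the sketch below. The names text_encoder, diffusion_model, and decoder are hypothetical stand-ins for the three trained networks, and the denoising update is deliberately simplified:

import numpy as np

def text_to_image(prompt, text_encoder, diffusion_model, decoder, num_steps=50):
    # 1. Text encoder: prompt -> latent (context) vector.
    context = text_encoder(prompt)

    # 2. Diffusion model: repeatedly "denoise" a random 64x64 latent patch.
    rng = np.random.default_rng(0)
    latent = rng.standard_normal((1, 64, 64, 4))
    for step in reversed(range(num_steps)):
        predicted_noise = diffusion_model(latent, context, step)
        latent = latent - predicted_noise / num_steps  # simplified update

    # 3. Decoder: 64x64 latent patch -> 512x512 image.
    return decoder(latent)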
However, this seemingly straightforward system begins to appear magical when trained on billions of images and their corresponding captions.
A deeper look into Image Generation
- Diffusion is an iterative process. Having the text embeddings (or any other condition such as an image), and a random starting latent vector, the process produces an information array that the image decoder uses to paint the final image.
- This process happens in a step-by-step fashion. Each step enhances the generated image.
- Usually, the image decoder is used once, after n iterations (50 steps in practice), to generate the final image. However, we can inspect the intermediate results step by step to see how the image gradually emerges from the noise, guided by the condition.
- To speed up the image generation process, the Stable Diffusion paper runs the diffusion on a compressed version of the image (aka latent space).
- An autoencoder performs this compression, as well as the corresponding decompression (the “painting” of the final image).
- The autoencoder compresses the image into the latent space using its encoder, then reconstructs it from that compressed representation using its decoder.
- We perform the forward diffusion process in the latent space: noise is applied to the latent representation, not to the pixel image. The noise predictor is therefore trained to predict noise in the latent space, as sketched below.
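Here is a minimal sketch of that last point, assuming a hypothetical encoder function for the autoencoder’s encoder and an illustrative noise schedule alphas_cumprod; the noise is added to the compressed latent rather than to the pixels:

import numpy as np

def noisy_latent(image, encoder, t, alphas_cumprod):
    """Forward diffusion in latent space: compress the image first, then add noise."""
    latent = encoder(image)                  # e.g. 512x512x3 pixels -> 64x64x4 latent
    rng = np.random.default_rng(0)
    noise = rng.standard_normal(latent.shape)
    a_bar = alphas_cumprod[t]
    # Noise is applied to the compressed latent, not to the pixel image.
    return np.sqrt(a_bar) * latent + np.sqrt(1.0 - a_bar) * noise, noise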
Benefits of using Stable Diffusion over GANs
One of the key benefits of stable diffusion is that it can generate high-quality images with a wide range of styles. One can control the diffusion process in a variety of ways depending on the desired output, for example by making it more or less aggressive depending on the level of detail desired in the generated images.
Another benefit of stable diffusion is that it is relatively easy to train. Unlike GANs, which require careful tuning of various hyperparameters, stable diffusion can be trained using simple gradient descent. This makes it much easier for researchers to experiment with different settings and see what works best.
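To illustrate what “simple gradient descent” means here, a single training step reduces to an ordinary regression problem: predict the noise that was added and minimize the mean squared error. The sketch below assumes a hypothetical Keras model noise_predictor that takes noisy latents and prompt contexts; it is not the actual Stable Diffusion training code:

import tensorflow as tf

def train_step(noise_predictor, optimizer, noisy_latents, contexts, true_noise):
    """One plain gradient-descent step on the noise-prediction objective."""
    with tf.GradientTape() as tape:
        predicted_noise = noise_predictor([noisy_latents, contexts], training=True)
        loss = tf.reduce_mean(tf.square(predicted_noise - true_noise))  # simple MSE
    grads = tape.gradient(loss, noise_predictor.trainable_variables)
    optimizer.apply_gradients(zip(grads, noise_predictor.trainable_variables))
    return loss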
Image Generation Code Sample using KerasCV
To get started, let’s import our libraries:
import time
import keras_cv
from tensorflow import keras
import matplotlib.pyplot as plt
Then, we construct the model with keras_cv.models.StableDiffusion():
model = keras_cv.models.StableDiffusion(img_width=512, img_height=512)
Next, we give it a prompt:
images = model.text_to_image("temple in ruines, forest, stairs, columns, cinematic, detailed, atmospheric, epic, concept art, Matte painting, background, mist, photo-realistic, concept art, volumetric light, cinematic epic + rule of thirds octane render, 8k, corona render, movie concept art", batch_size=3)
def plot_images(images):
    plt.figure(figsize=(20, 20))
    for i in range(len(images)):
        ax = plt.subplot(1, len(images), i + 1)
        plt.imshow(images[i])
        plt.axis("off")

plot_images(images)
Here’s what the model outputs:
In conclusion, stable diffusion is a promising new approach to image generation that offers a number of benefits over traditional GANs. Its stability criterion helps to ensure that the generated images remain similar to the training data, even as the diffusion process progresses. Its ease of training also makes it an attractive option for researchers who are looking for a more straightforward way to generate high-quality images. As this field continues to evolve, we can expect to see even more exciting developments in the near future. The possibilities are endless.
Source Code
- The source code can be found on our GitHub.