If you haven't been living under a rock, you have likely heard of the text-to-image AI models that can generate high-quality images from just text. Among the most well-known models in this field are DALL-E 2 and GLIDE by OpenAI, Imagen by Google, Midjourney, Stable Diffusion, and many more.
These text-to-image models all build on diffusion models; Stable Diffusion in particular is a latent diffusion model. Hence, this in-depth tutorial will dive into the details of how latent diffusion models work and explore their inner workings. We will also write Python code to run a text-to-image model by dreamlike.art ourselves, followed by manually implementing a diffusion framework. Finally, we will see how to use the different components involved to generate images similar to a given image.
Latent diffusion models are a super popular family of models that can be used to create a wide variety of images, like drawings and paintings, portraits, landscapes, animals, the solar system, and basically anything you can imagine!
These models can generate amazing images from prompts such as these:
“Cute Rabbit, Ultra HD, realistic, futuristic, sharp, octane render, photoshopped, photorealistic, soft, pastel, Aesthetic, Magical background”
“Anime style aesthetic landscape, 90's vintage style, digital art, ultra HD, 8k, photoshopped, sharp focus, surrealism, akira style, detailed line art”
“Beautiful, abstract art of a human mind, 3D, highly detailed, 8K, aesthetic”
Who would have thought AI could generate images that are so beautiful and aesthetic!? It's hard to tell if an AI generated the image of the rabbit or if it's a real photograph. Similarly, the anime-style landscape feels like an artwork by a professional artist! And the abstract art appears to have so many hidden meanings inside of it!
As we just saw, these generative models are really powerful. In fact, the more detailed a prompt is, the better and more relevant the image the model can make.
Let us now learn how these models generate images from nothing but text.
A latent diffusion model is a generative model that belongs to the family of Denoising Diffusion Probabilistic Models (DDPMs).
Simply put, the model works by decomposing the image generation process into a series of denoising steps.
In essence, the process starts by generating some random noise and the model begins removing this noise during each denoising step, eventually resulting in an image.
To understand how the model learns to denoise the image, we need to learn the forward and backward diffusion processes.
The forward process aims to start with an image from our training set and turns it into pure noise. This is achieved by adding some noise at each step of the process for a set number of timesteps. While mathematically, it would take infinite steps to reach pure noise, adding noise for a few hundred steps provides a good approximation.
The backward process, also known as the reverse diffusion process or the generative model's sampling process, aims to recover the original image. To do this, we train a neural network that takes a noisy image as input and returns a slightly denoised image as its output. During each step of the process, the output of the forward process is given as input to the model, and the model tries to denoise this input, i.e., it tries to predict the corresponding input of the forward process.
Note that the backward diffusion process is not the same as backpropagation.
Now that we have some idea about the diffusion model framework, let us get into a bit more depth and see the different components involved in the forward and backward process.
The Stable Diffusion framework consists of 4 components: an autoencoder, a text encoder and its tokenizer, a UNet model, and a scheduler:
The U-Net model architecture from the original paper.
An essential aspect of these text-to-image models is conditioning on "something" to guide the diffusion process to favor the generation of specific images or classes. If there isn't any conditioning, there is no way for a model to know which image to generate, so it would generate any arbitrary random image.
There are mainly two families of methods for guided diffusion: classifier guidance and classifier-free guidance.
In the first family, classifier guidance, we use another model, a classifier trained on noisy images, to predict the classes of its inputs. Using the gradients of this classifier, we can guide the diffusion process towards our desired image! Check out this paper for more details.
In the second family, classifier-free guidance, we achieve guidance without using a classifier model. Here, we perform a single step of the diffusion process twice: once conditioned on the text prompt, and once without any conditioning. We take the noise predictions from both cases and combine them with the following formula to get our overall noise prediction:
ϵ = ϵ(x|0) + s · (ϵ(x|y) − ϵ(x|0))

Here, ϵ(x|0) is the unconditioned noise prediction, ϵ(x|y) is the noise prediction conditioned on the prompt y, s is the guidance scale, and ϵ is the final noise term that we remove from our intermediate image.

Intuitively, the difference between the conditioned and unconditioned noise terms gives us the direction for guidance. By scaling this difference with s, our scaling factor, we move further towards the noise term that, when removed, leads to a conditioned image, in other words, a guided reverse diffusion. Since this approach doesn't require a second model, it is used more often.
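As a quick sketch, the classifier-free guidance step looks something like this in code (variable names are illustrative):

```python
import torch

def classifier_free_guidance(noise_uncond: torch.Tensor,
                             noise_cond: torch.Tensor,
                             guidance_scale: float = 7.5) -> torch.Tensor:
    # Move from the unconditioned prediction towards the conditioned one,
    # scaled by the guidance factor s (guidance_scale).
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)
```

We will use exactly this combination later when we implement the denoising loop ourselves.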
Finally, here's what our training pipeline looks like:
The training pipeline in a nutshell from the Latent Diffusion Models paper
During inference, we are only concerned with the denoising, so we ignore the forward diffusion process shown in the image. Roughly, the steps for inference are:

1. Encode the text prompt into text embeddings using the tokenizer and text encoder.
2. Sample a random latent tensor (pure noise) as the starting point.
3. At each timestep, predict the noise with the U-Net (conditioned on the text embeddings), apply guidance, and let the scheduler remove a portion of the predicted noise from the latents.
4. Decode the final latents with the autoencoder's decoder to obtain the image in pixel space.
Now that we have a pretty good idea about how diffusion models work, let's see how to generate our own images, like the ones at the beginning of this article!
Note: To get a more detailed insight into the diffusion process, check out the last section of this article, which discusses the forward and backward processes from a mathematical perspective.
Let's get a bit formal and define some notation.
T = number of timesteps for the diffusion process
x0 = Input image from the training set
q(x) = Actual distribution of the training set
xT = Pure noise from which we would like to generate the original image
xt = intermediate latent space image at time step t
The entire diffusion process is formulated as a Markov Chain of T steps. A Markov chain is a stochastic model where we have a series of events with any event's probability distribution depending only on the immediate previous event. This means any denoising step in our backward diffusion depends only on the output of the previous timestep.
The forward diffusion can be described as:

q(xt | xt−1) = N(xt ; √(1 − βt) · xt−1, βt · I)

This tells us that at each step we add some Gaussian noise to xt−1 to get the next, noisier image xt. The Gaussian we sample from has mean √(1 − βt) · xt−1 and variance βt · I. Here βt is a hyperparameter that is generally controlled by the "scheduler". It is also called the scale parameter and, intuitively, it controls how spread out the pixel distribution becomes: a large βt means higher variance and more noise, and hence a wider pixel distribution.
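As a small illustration (not code from the original article), a single forward step can be written directly from the formula above:

```python
import torch

def forward_diffusion_step(x_prev: torch.Tensor, beta_t: float) -> torch.Tensor:
    """One step of q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    noise = torch.randn_like(x_prev)  # epsilon ~ N(0, I)
    return (1 - beta_t) ** 0.5 * x_prev + beta_t ** 0.5 * noise
```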
The backward diffusion can be described as finding the probability distribution q(xt−1 | xt). Since this is intractable to compute, we choose a variational distribution pθ, taken to be Gaussian and parameterized by its mean and variance:

pθ(xt−1 | xt) = N(xt−1 ; μθ(xt, t), Σθ(xt, t))

Repeatedly applying the above formula, we get the distribution for the entire trajectory, i.e., the entire reverse process:

pθ(x0:T) = pθ(xT) · ∏_{t=1}^{T} pθ(xt−1 | xt)
These parameters, θ, are exactly what the neural network learns during training. In practice, a plain feed-forward network doesn't work too well for this, so a U-Net architecture is used instead.
Given our previous formulation, our goal now is to find the parameters θ that best approximate q. To do so, we formulate this by minimizing the KL-divergence between the two distributions which is equivalent to optimizing the Evidence Lower Bound aka ELBo.
If you have some experience with Bayesian models, this might seem quite familiar. If not, you can read more about variational inference (or variational autoencoders) in the popular book Pattern Recognition and Machine Learning by Christopher Bishop.
After lots of calculations (whose details we won't go into here), we get the simplified loss function:

L = E_{t, x0, ϵ} [ ‖ϵ − ϵθ(xt, t)‖² ]

i.e., the mean squared error between the actual noise ϵ added in the forward process and the noise ϵθ(xt, t) predicted by the network.
This is the loss that our neural network computes and backpropagates to optimize its parameters! I invite you to check this blog post to learn more.
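To make this concrete, here is a rough sketch of what one training step could look like with a Hugging Face diffusers-style scheduler (this is illustrative; the article itself does not train a model):

```python
import torch
import torch.nn.functional as F

def training_step(unet, scheduler, latents, text_embeddings):
    """One denoising-training step: predict the added noise and compute the MSE loss."""
    batch_size = latents.shape[0]
    # Sample a random timestep and random Gaussian noise for each example.
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (batch_size,), device=latents.device)
    noise = torch.randn_like(latents)
    # Produce the noisy latents x_t using the closed-form forward process.
    noisy_latents = scheduler.add_noise(latents, noise, t)
    # The U-Net predicts the noise that was added, conditioned on the text embeddings.
    noise_pred = unet(noisy_latents, t, encoder_hidden_states=text_embeddings).sample
    # Simplified DDPM objective: MSE between the true and the predicted noise.
    return F.mse_loss(noise_pred, noise)
```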
Related: How to Perform Image-to-Image Generation with Stable Diffusion in Python.
Since Stable Diffusion models are open source, you can run them yourself! If you have decent computational resources at hand, you can even try the bigger models, although running these models is computationally expensive. You can check this link to see the inference benchmark on well-known GPUs.

Nevertheless, we are going to use only the GPUs provided by Google Colab to generate images. Make sure you have a GPU enabled: if not, go to Runtime, click on Change runtime type, and select GPU as the hardware accelerator.
We are first going to install the necessary libraries as follows:
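In a Colab notebook, the install step looks something like this (the exact package list and versions may differ from the original article):

```python
!pip install --upgrade diffusers transformers accelerate scipy
```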
Next, we import the StableDiffusionPipeline class and the PyTorch library:
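```python
import torch
from diffusers import StableDiffusionPipeline
```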
We will be using the dreamlike-art/dreamlike-photoreal-2.0 model, hosted on the Hugging Face Hub, to generate images from our prompts. We can load the entire pipeline in one go as follows:
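Something along these lines (half precision is optional here, but it saves GPU memory):

```python
model_id = "dreamlike-art/dreamlike-photoreal-2.0"

# Load the full text-to-image pipeline in half precision and move it to the GPU.
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")
```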
For the images above, we make a list of our prompts and a list of images that can store all the images generated:
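For example, reusing the three prompts from the beginning of the article:

```python
prompts = [
    "Cute Rabbit, Ultra HD, realistic, futuristic, sharp, octane render, photoshopped, photorealistic, soft, pastel, Aesthetic, Magical background",
    "Anime style aesthetic landscape, 90's vintage style, digital art, ultra HD, 8k, photoshopped, sharp focus, surrealism, akira style, detailed line art",
    "Beautiful, abstract art of a human mind, 3D, highly detailed, 8K, aesthetic",
]
images = []  # will store the generated PIL images
```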
Finally we can generate our images by passing the prompts one at a time to the instance of our Stable Diffusion pipeline.
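A simple loop does the job; in a notebook you can display each image with display(img) or save it to disk:

```python
for prompt in prompts:
    # Each call returns a pipeline output whose .images attribute is a list of PIL images.
    img = pipe(prompt).images[0]
    images.append(img)
    img.save(f"{prompt[:20]}.png")
```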
The generated images from these prompts are the same ones that were presented at the beginning of this article!
Let us write the code to manually use the four major components we discussed before to generate images.
Note: The code for working with the four components has been adapted from this Colab notebook.
We can now import all the libraries we need:
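Based on the components described above, the imports are roughly the following (adapted freely; the original notebook may differ slightly):

```python
import numpy as np
import torch
from PIL import Image
from tqdm import tqdm

from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import (AutoencoderKL, UNet2DConditionModel,
                       LMSDiscreteScheduler, DDIMScheduler)
```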
Let us define a Python class for the diffusion model we will code ourselves:
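Here is a sketch of how such a class could start; the constructor simply stores the pre-trained components (the exact signature is an assumption):

```python
class ImageDiffusionModel:
    def __init__(self, vae, tokenizer, text_encoder, unet,
                 scheduler_LMS, scheduler_DDIM):
        self.vae = vae                        # autoencoder: pixel space <-> latent space
        self.tokenizer = tokenizer            # turns text into token ids
        self.text_encoder = text_encoder      # turns token ids into text embeddings
        self.unet = unet                      # predicts the noise at each denoising step
        self.scheduler_LMS = scheduler_LMS    # used for text-to-image generation
        self.scheduler_DDIM = scheduler_DDIM  # used later for image-to-image generation
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
```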
Here we have defined 2 schedulers. We will be working with the DDIM scheduler later on.
Now we need several helper methods to generate an image. In the pipeline we discussed above, we first need to convert text from the prompt by the user into text embeddings:
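A possible implementation of these helpers, added to the same class (the class header is repeated here just for readability; in the notebook everything lives in one class definition):

```python
class ImageDiffusionModel:
    # ... __init__ as defined above ...

    def get_text_embeddings(self, text):
        # Tokenize the text and encode it with the CLIP text encoder.
        text_input = self.tokenizer(text, padding="max_length",
                                    max_length=self.tokenizer.model_max_length,
                                    truncation=True, return_tensors="pt")
        with torch.no_grad():
            text_embeddings = self.text_encoder(text_input.input_ids.to(self.device))[0]
        return text_embeddings

    def get_prompt_embeddings(self, prompt):
        # For classifier-free guidance we need embeddings for the prompt
        # and for an unconditioned (empty) prompt, concatenated together.
        cond_embeddings = self.get_text_embeddings(prompt)
        uncond_embeddings = self.get_text_embeddings([""] * len(prompt))
        return torch.cat([uncond_embeddings, cond_embeddings])
```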
Here we have first defined a helper method to return text embeddings as PyTorch tensors for a given text. Since we will be doing classifier-free guidance, we need embeddings for both the user's prompt and an unconditioned (empty) prompt; hence we generate text embeddings for both the prompt and an empty string and then concatenate them.
Now our diffusion model class needs a method to generate the latent image embeddings after doing the reverse diffusion process, which, when decoded, gives us the actual image:
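A sketch of this method, using the LMS scheduler and classifier-free guidance (argument names and defaults are assumptions based on the description that follows):

```python
class ImageDiffusionModel:
    # ... methods from above ...

    def get_img_latents(self, text_embeddings, height=512, width=512,
                        num_inference_steps=50, guidance_scale=7.5,
                        img_latents=None):
        # Start from random noise in latent space (1/8th of the pixel resolution).
        if img_latents is None:
            batch_size = text_embeddings.shape[0] // 2
            img_latents = torch.randn((batch_size, self.unet.config.in_channels,
                                       height // 8, width // 8), device=self.device)
        # Set the number of denoising steps and scale the initial noise.
        self.scheduler_LMS.set_timesteps(num_inference_steps)
        img_latents = img_latents * self.scheduler_LMS.init_noise_sigma

        for t in tqdm(self.scheduler_LMS.timesteps):
            # Duplicate the latents so the conditioned and unconditioned
            # passes run in a single batch.
            latent_model_input = torch.cat([img_latents] * 2)
            latent_model_input = self.scheduler_LMS.scale_model_input(latent_model_input, t)

            # Predict the noise with the U-Net, using float16 autocast for speed.
            with torch.no_grad(), torch.autocast("cuda"):
                noise_pred = self.unet(latent_model_input, t,
                                       encoder_hidden_states=text_embeddings).sample

            # Classifier-free guidance: combine the two noise predictions.
            noise_pred_uncond, noise_pred_cond = noise_pred.chunk(2)
            noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_cond - noise_pred_uncond)

            # Let the scheduler remove a portion of the predicted noise.
            img_latents = self.scheduler_LMS.step(noise_pred, t, img_latents).prev_sample

        return img_latents
```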
In the method above, note that the U-Net runs under autocast, i.e., in mixed precision (float16 instead of float64 or float32) for certain model parts, which saves memory and time without sacrificing accuracy.

Now we need a method to decode the image from the latent space into the pixel space and transform it into a suitable PIL image format:
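These two methods could look roughly like this:

```python
class ImageDiffusionModel:
    # ... methods from above ...

    def decode_img_latents(self, img_latents):
        # Undo the scaling factor of 0.18215 used by the authors, then decode with the VAE.
        img_latents = img_latents / 0.18215
        with torch.no_grad():
            imgs = self.vae.decode(img_latents).sample
        return imgs

    def transform_imgs(self, imgs):
        # Map pixel values from [-1, 1] to [0, 1], then to 8-bit integers in [0, 255].
        imgs = (imgs / 2 + 0.5).clamp(0, 1)
        # Move the channel dimension last: (B, C, H, W) -> (B, H, W, C).
        imgs = imgs.detach().cpu().permute(0, 2, 3, 1).numpy()
        imgs = (imgs * 255).round().astype("uint8")
        # Wrap each array in a PIL image.
        return [Image.fromarray(img) for img in imgs]
```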
In the decode_img_latents() method, we first divide the image latents by 0.18215, the scaling factor the authors proposed in their paper, and then use the Variational Autoencoder (VAE) to decode the latent embeddings.
Next, the transform_imgs()
method takes the image we received from the previous function and scales it appropriately so that the pixels range from 0 to 255 instead of -1 to 1. The color channels are also swapped, and pixels are rounded to integers to represent the images in the PIL image format.
We now have all the pieces, so we're ready to code our prompt_to_img() method:
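Tying the helpers together, a sketch of prompt_to_img() might look like this:

```python
class ImageDiffusionModel:
    # ... methods from above ...

    def prompt_to_img(self, prompts, height=512, width=512,
                      num_inference_steps=50, guidance_scale=7.5,
                      img_latents=None):
        # Accept a single prompt string as well as a list of prompts.
        if isinstance(prompts, str):
            prompts = [prompts]
        # Text -> embeddings (unconditioned + conditioned, concatenated).
        text_embeddings = self.get_prompt_embeddings(prompts)
        # Reverse diffusion in latent space.
        img_latents = self.get_img_latents(text_embeddings, height, width,
                                           num_inference_steps, guidance_scale,
                                           img_latents)
        # Latents -> pixels -> PIL images.
        imgs = self.decode_img_latents(img_latents)
        return self.transform_imgs(imgs)
```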
In this code, we first convert the user's prompt to a list, convert this list into text embeddings, perform the reverse diffusion process using these embeddings, decode the latent image embeddings, and finally transform the decoded output into the PIL image format. That's all the code we need to generate images ourselves!
Before generating images, let us load the necessary pre-trained components from the Hugging Face Transformers and Diffusers libraries:
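For example, loading the components from a Stable Diffusion 1.x checkpoint on the Hub (the checkpoint name and the scheduler hyperparameters below are assumptions; any Stable Diffusion 1.x repository with the standard subfolders should work):

```python
device = "cuda"
model_id = "CompVis/stable-diffusion-v1-4"  # assumed checkpoint

# Autoencoder: encodes/decodes between pixel space and latent space.
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae").to(device)
# Tokenizer and CLIP text encoder.
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder").to(device)
# U-Net that predicts the noise at each denoising step.
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet").to(device)

# Two schedulers: LMS for text-to-image, DDIM for image-to-image later on.
scheduler_LMS = LMSDiscreteScheduler(beta_start=0.00085, beta_end=0.012,
                                     beta_schedule="scaled_linear",
                                     num_train_timesteps=1000)
scheduler_DDIM = DDIMScheduler(beta_start=0.00085, beta_end=0.012,
                               beta_schedule="scaled_linear",
                               num_train_timesteps=1000)
```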
Let us now create an instance of the defined class and generate some images:
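For instance (the prompt here is just an illustrative placeholder, not necessarily the one used in the original article):

```python
model = ImageDiffusionModel(vae, tokenizer, text_encoder, unet,
                            scheduler_LMS, scheduler_DDIM)

imgs = model.prompt_to_img(["A majestic castle on a hill at sunset, digital art"])
imgs[0]  # in a notebook, this displays the generated PIL image
```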
Here we pass a single text prompt to the instance of ImageDiffusionModel
to get the following generated image.
We can generate any image this way. You can try out more examples yourself!
Let us generate images with the same prompts we used in the StableDiffusionPipeline:
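Reusing the prompts list from the pipeline section:

```python
for prompt in prompts:
    imgs = model.prompt_to_img([prompt])
    imgs[0].save(f"manual_{prompt[:20]}.png")  # or display(imgs[0]) in a notebook
```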
These outputs are quite different from the ones we got before. The dreamlike-art/dreamlike-photoreal-2.0 model is definitely better at producing aesthetically pleasing images.
The framework we just implemented is incredibly powerful. Just by modifying a few parts and rearranging some components, we can do a lot of amazing things, such as generating videos, image inpainting, retrieving similar images, and much more. In this tutorial, we will only explore how to generate images similar to a given image.
Please note that we have a separate tutorial on how to do that using depth estimation, check it out here.
Once you understand how to generate an individual image, creating similar images for a given image is quite easy. To do this, we convert our original image into the latent embedding space, add some noise to it, and run the reverse diffusion process starting from an intermediate timestep. We start in the middle because the noisy image embedding provides a good reference for the reverse diffusion process.
To do all this, we will use the DDIM scheduler, which is better suited for image-to-image tasks; the LMSDiscrete scheduler doesn't work too well when the reverse diffusion process begins from an intermediate timestep. Let us now add a slightly modified version of the get_img_latents() method, adapted to this scheduler:
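A sketch of this adapted method, here named get_img_latents_similar() since the article refers to it by that name later (the start_step handling is an assumption based on the description below):

```python
class ImageDiffusionModel:
    # ... methods from above ...

    def get_img_latents_similar(self, img_latents, text_embeddings,
                                num_inference_steps=50, guidance_scale=7.5,
                                start_step=10):
        self.scheduler_DDIM.set_timesteps(num_inference_steps)

        if start_step > 0:
            # Add noise to the encoded image so the reverse process can start
            # from an intermediate timestep instead of pure noise.
            start_timestep = self.scheduler_DDIM.timesteps[start_step]
            start_timesteps = start_timestep.repeat(img_latents.shape[0]).long()
            noise = torch.randn_like(img_latents)
            img_latents = self.scheduler_DDIM.add_noise(img_latents, noise, start_timesteps)

        for t in tqdm(self.scheduler_DDIM.timesteps[start_step:]):
            latent_model_input = torch.cat([img_latents] * 2)
            with torch.no_grad(), torch.autocast("cuda"):
                noise_pred = self.unet(latent_model_input, t,
                                       encoder_hidden_states=text_embeddings).sample

            # Classifier-free guidance.
            noise_pred_uncond, noise_pred_cond = noise_pred.chunk(2)
            noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_cond - noise_pred_uncond)

            img_latents = self.scheduler_DDIM.step(noise_pred, t, img_latents).prev_sample

        return img_latents
```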
We have a few important distinctions to make here. First, in this method, img_latents
is no longer a default argument. We have also included another argument, start_step
, to specify the timestep from which to begin the reverse diffusion process. Also, notice how we add noise to the image embeddings using the DDIM scheduler and don't scale the image embeddings anymore.
Next, we define the following method to generate similar images:
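A possible implementation, including a small helper to encode a PIL image into the latent space (the method names similar_imgs() and encode_img_latents() are assumptions):

```python
class ImageDiffusionModel:
    # ... methods from above ...

    def encode_img_latents(self, img):
        # PIL image -> tensor in [-1, 1] -> VAE latents, scaled by 0.18215.
        img = torch.tensor(np.array(img), dtype=torch.float32, device=self.device) / 255.0
        img = (img * 2 - 1).permute(2, 0, 1).unsqueeze(0)
        with torch.no_grad():
            img_latents = self.vae.encode(img).latent_dist.sample()
        return img_latents * 0.18215

    def similar_imgs(self, img, prompt, num_inference_steps=50,
                     guidance_scale=7.5, start_step=10):
        # Encode the reference image into the latent space.
        img_latents = self.encode_img_latents(img)
        # Get the (unconditioned + conditioned) text embeddings for the prompt.
        if isinstance(prompt, str):
            prompt = [prompt]
        text_embeddings = self.get_prompt_embeddings(prompt)
        # Run the reverse diffusion starting from an intermediate step.
        img_latents = self.get_img_latents_similar(img_latents, text_embeddings,
                                                   num_inference_steps,
                                                   guidance_scale, start_step)
        # Decode and convert to PIL.
        imgs = self.decode_img_latents(img_latents)
        return self.transform_imgs(imgs)
```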
We first encode the image from the pixel to the latent embedding space. The prompt text is converted into a Python list from which we get the prompt text embeddings using the methods we previously defined. We pass these embeddings to the get_img_latents_similar()
method. Then we decode the final image latents that we get and transform it to the format suitable for the PIL library.
Now to test our new functionality, let’s generate an image using our model and then generate more similar images to this image.
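For illustration, we reuse one of the earlier prompts to produce a reference image (the article's original prompt may have been different):

```python
prompt = "Anime style aesthetic landscape, 90's vintage style, digital art, ultra HD, 8k, photoshopped, sharp focus, surrealism, akira style, detailed line art"
imgs = model.prompt_to_img([prompt])
imgs[0]  # display the reference image
```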
We can save this image to our computer and load it back as follows. We do this extra step to demonstrate that you could also load an image of your own:
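Something along these lines (the filename is just an example):

```python
# Save the generated image to disk, then load it back with PIL.
imgs[0].save("reference.png")
img = Image.open("reference.png").resize((512, 512))

# Generate an image similar to the loaded one, guided by the same prompt.
similar = model.similar_imgs(img, prompt, num_inference_steps=50, start_step=10)
similar[0]
```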
They look quite similar, don't they? Let us try changing the prompt a little bit.
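For example, an illustrative tweak (not necessarily the exact change made in the original article):

```python
new_prompt = ("Anime style aesthetic landscape, 90's vintage style, digital art, "
              "ultra HD, 8k, sharp focus, surrealism, studio ghibli style, detailed line art")
similar = model.similar_imgs(img, new_prompt, num_inference_steps=50, start_step=10)
similar[0]
```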
Hurray, another similar image! Note that we can control how similar the images look by appropriately choosing the number of inference steps and the starting step. We can of course also change the prompts to vary the image.
This was fun! In this article, we learned about how stable diffusion (and other latent diffusion models) work in quite a lot of depth. We also saw how to generate images ourselves from prompts using a model loaded from Hugging Face. We then saw how to work with the different components involved and also learned how to generate similar images from a given image. Now go ahead and create beautiful aesthetic images!
Learn also: Image Captioning using PyTorch and Transformers in Python.
Happy generating ♥