Fine-Tuning and Guidance
In this notebook, we're going to cover two main approaches for adapting existing diffusion models:
- With fine-tuning, we'll re-train existing models on new data to change the type of output they produce
- With guidance, we'll take an existing model and steer the generation process at inference time for additional control
What You Will Learn:
By the end of this notebook, you will know how to:
- Create a sampling loop and generate samples faster using a new scheduler
- Fine-tune an existing diffusion model on new data, including:
- Using gradient accumulation to get around some of the issues with small batches
- Logging samples to Weights and Biases during training to monitor progress (via the accompanying example script)
- Saving the resulting pipeline and uploading it to the hub
- Guide the sampling process with additional loss functions to add control over existing models, including:
- Exploring different guidance approaches with a simple color-based loss
- Using CLIP to guide generation using a text prompt
- Sharing a custom sampling loop using Gradio and 🤗 Spaces
❓If you have any questions, please post them on the #diffusion-models-class channel on the Hugging Face Discord server. If you haven't signed up yet, you can do so here: https://huggingface.co/join/discord
Setup and Imports
To save your fine-tuned models to the Hugging Face Hub, you'll need to login with a token that has write access. The code below will prompt you for this and link to the relevant tokens page of your account. You'll also need a Weights and Biases account if you'd like to use the training script to log samples as the model trains - again, the code should prompt you to sign in where needed.
Apart from that, the only set-up is installing a few dependencies, importing everything we'll need and specifying which device we'll use:
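Here's a minimal sketch of what that setup cell might contain (the exact dependency list and versions are assumptions, not the original cell):

```python
%pip install -qq diffusers datasets accelerate open-clip-torch

import numpy as np
import torch
import torch.nn.functional as F
import torchvision
from datasets import load_dataset
from diffusers import DDIMScheduler, DDPMPipeline
from matplotlib import pyplot as plt
from tqdm.auto import tqdm

# Log in so we can push models to the Hub later
from huggingface_hub import notebook_login

notebook_login()

device = "cuda" if torch.cuda.is_available() else "cpu"
```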
Token is valid.
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.huggingface/token
Login successful
Loading A Pre-Trained Pipeline
To begin this notebook, let's load an existing pipeline and see what we can do with it:
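For example, loading the faces pipeline used in this unit might look like this (the checkpoint name is an assumption based on the fine-tuned model name that appears later in this notebook):

```python
image_pipe = DDPMPipeline.from_pretrained("google/ddpm-celebahq-256")
image_pipe.to(device)
```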
Generating images is as simple as calling the pipeline like a function, which runs its __call__ method:
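For example, using the pipeline we just loaded:

```python
images = image_pipe().images
images[0]
```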
Neat, but SLOW! So, before we get to the main topics of today, let's take a peek at the actual sampling loop and see how we can use a fancier sampler to speed this up:
Faster Sampling with DDIM
At every step, the model is fed a noisy input and asked to predict the noise (and thus an estimate of what the fully denoised image might look like). Initially these predictions are not very good, which is why we break the process down into many steps. However, using 1000+ steps has been found to be unnecessary, and a flurry of recent research has explored how to achieve good samples with as few steps as possible.
In the 🤗 Diffusers library, these sampling methods are handled by a scheduler, which performs each update via the step() function. To generate an image, we begin with random noise x. Then, for every timestep in the scheduler's noise schedule, we feed the noisy input x to the model and pass the resulting prediction to the step() function. This returns an output with a prev_sample attribute - 'previous' because we're going "backwards" in time from high noise to low noise (the opposite of the forward diffusion process).
Let's see this in action! First, we load a scheduler, here a DDIMScheduler based on the paper Denoising Diffusion Implicit Models which can give decent samples in much fewer steps than the original DDPM implementation:
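Something like this (again assuming the same checkpoint):

```python
# Create a new scheduler from the model's config and set a small number of inference steps
scheduler = DDIMScheduler.from_pretrained("google/ddpm-celebahq-256")
scheduler.set_timesteps(num_inference_steps=40)
scheduler.timesteps
```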
You can see that this scheduler takes 40 steps in total, each jumping the equivalent of 25 steps of the original 1000-step schedule:
tensor([975, 950, 925, 900, 875, 850, 825, 800, 775, 750, 725, 700, 675, 650,
        625, 600, 575, 550, 525, 500, 475, 450, 425, 400, 375, 350, 325, 300,
        275, 250, 225, 200, 175, 150, 125, 100,  75,  50,  25,   0])
Let's create 4 random images and run through the sampling loop, viewing both the current and the predicted denoised version as the process progresses:
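A sketch of that loop (the plotting details are just one way to visualize the process):

```python
# The random starting point: a batch of 4 noisy images
x = torch.randn(4, 3, 256, 256).to(device)

# Loop through the sampling timesteps
for i, t in tqdm(enumerate(scheduler.timesteps)):
    model_input = scheduler.scale_model_input(x, t)

    # Predict the noise residual
    with torch.no_grad():
        noise_pred = image_pipe.unet(model_input, t)["sample"]

    # The scheduler output holds both the updated sample and the predicted denoised image
    scheduler_output = scheduler.step(noise_pred, t, x)
    x = scheduler_output.prev_sample

    # Occasionally display both x and the predicted denoised images
    if i % 10 == 0 or i == len(scheduler.timesteps) - 1:
        fig, axs = plt.subplots(1, 2, figsize=(12, 5))
        grid = torchvision.utils.make_grid(x, nrow=4).permute(1, 2, 0)
        axs[0].imshow(grid.cpu().clip(-1, 1) * 0.5 + 0.5)
        axs[0].set_title(f"Current x (step {i})")
        pred_grid = torchvision.utils.make_grid(
            scheduler_output.pred_original_sample, nrow=4
        ).permute(1, 2, 0)
        axs[1].imshow(pred_grid.cpu().clip(-1, 1) * 0.5 + 0.5)
        axs[1].set_title(f"Predicted denoised images (step {i})")
        plt.show()
```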
As you can see, the initial predictions are not great but as the process goes on the predicted outputs get more and more refined. If you're curious what maths is happening inside that step() function, inspect the (well-commented) code with:
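In a Jupyter environment, the double question mark displays the full source of a function:

```python
??scheduler.step
```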
You can also drop in this new scheduler in place of the original one that came with the pipeline, and sample like so:
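For example:

```python
image_pipe.scheduler = scheduler
images = image_pipe(num_inference_steps=40).images
images[0]
```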
Alright - we can get samples in a reasonable time now! This should speed things up as we move through the rest of this notebook :)
Fine-Tuning
Now for the fun bit! Given this pre-trained pipeline, how might we re-train the model to generate images based on new training data?
It turns out that this looks nearly identical to training a model from scratch (as we saw in Unit 1) except that we begin with the existing model. Let's see this in action and talk about a few additional considerations as we go.
First, the dataset: you could try this vintage faces dataset or these anime faces for something closer to the original training data of this faces model, but just for fun let's instead use the same small butterflies dataset we used to train from scratch in Unit 1. Run the code below to download the butterflies dataset and create a dataloader we can sample a batch of images from:
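A sketch of that cell, using the same preprocessing approach as Unit 1:

```python
# Load the butterflies dataset
dataset = load_dataset("huggan/smithsonian_butterflies_subset", split="train")

image_size = 256  # Match the resolution the pipeline was trained at
batch_size = 4

preprocess = torchvision.transforms.Compose(
    [
        torchvision.transforms.Resize((image_size, image_size)),
        torchvision.transforms.RandomHorizontalFlip(),
        torchvision.transforms.ToTensor(),
        torchvision.transforms.Normalize([0.5], [0.5]),  # Map pixels to (-1, 1)
    ]
)

def transform(examples):
    images = [preprocess(image.convert("RGB")) for image in examples["image"]]
    return {"images": images}

dataset.set_transform(transform)
train_dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)

# Preview one batch as a grid
print("Previewing batch:")
batch = next(iter(train_dataloader))
grid = torchvision.utils.make_grid(batch["images"], nrow=4).permute(1, 2, 0)
plt.imshow(grid * 0.5 + 0.5)
```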
Previewing batch:
Consideration 1: our batch size here (4) is pretty small, since we're training at large image size (256px) using a fairly large model and we'll run out of GPU RAM if we push the batch size too high. You can reduce the image size to speed things up and allow for larger batches, but these models were designed and originally trained for 256px generation.
Now for the training loop. We'll update the weights of the pre-trained model by setting the optimization target to image_pipe.unet.parameters(). The rest is nearly identical to the example training loop from Unit 1. This takes about 10 minutes to run on Colab, so now is a good time to grab a coffee or tea while you wait:
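A sketch of what that loop might look like (the hyperparameters here are plausible defaults rather than the exact original values):

```python
num_epochs = 2
lr = 1e-5
grad_accumulation_steps = 2  # See "Consideration 2" below

optimizer = torch.optim.AdamW(image_pipe.unet.parameters(), lr=lr)
losses = []

for epoch in range(num_epochs):
    for step, batch in tqdm(enumerate(train_dataloader), total=len(train_dataloader)):
        clean_images = batch["images"].to(device)

        # Sample noise to add to the images
        noise = torch.randn(clean_images.shape).to(clean_images.device)
        bs = clean_images.shape[0]

        # Sample a random timestep for each image
        timesteps = torch.randint(
            0,
            image_pipe.scheduler.config.num_train_timesteps,
            (bs,),
            device=clean_images.device,
        ).long()

        # Add noise to the clean images according to the noise schedule
        noisy_images = image_pipe.scheduler.add_noise(clean_images, noise, timesteps)

        # Get the model's prediction for the noise
        noise_pred = image_pipe.unet(noisy_images, timesteps, return_dict=False)[0]

        # Compare the prediction with the actual noise
        loss = F.mse_loss(noise_pred, noise)
        losses.append(loss.item())
        loss.backward()

        # Gradient accumulation: only update the weights every few steps
        if (step + 1) % grad_accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()

    print(f"Epoch {epoch} average loss: {sum(losses[-len(train_dataloader):]) / len(train_dataloader)}")

# Plot the loss curve
plt.plot(losses)
```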
Epoch 0 average loss: 0.013324214214226231
Epoch 1 average loss: 0.014018508377484978
(Plot of the training loss over the fine-tuning run.)
Consideration 2: Our loss signal is extremely noisy, since we're only working with four examples at random noise levels for each step. This is not ideal for training. One fix is to use an extremely low learning rate to limit the size of the update each step. It would be even better if we could find some way to get the same benefit we would get from using a larger batch size without the memory requirements skyrocketing...
Enter gradient accumulation. If we call loss.backward() multiple times before running optimizer.step() and optimizer.zero_grad(), then PyTorch accumulates (sums) the gradients, effectively merging the signal from several batches to give a single (better) estimate which is then used to update the parameters. This results in fewer total updates being made, just like we'd see if we used a larger batch size. This is something many frameworks will handle for you (for example, 🤗 Accelerate makes this easy) but it is nice to see it implemented from scratch since this is a useful technique for dealing with training under GPU memory constraints! As you can see from the code above (after the # Gradient accumulation comment) there really isn't much code needed.
Consideration 3: This still takes a lot of time, and printing out a one-line update every epoch is not enough feedback to give us a good idea of what is going on. We should probably:
- Generate some samples occasionally to visually examine the performance qualitatively as the model trains
- Log things like the loss and sample generations during training, perhaps using something like Weights and Biases or TensorBoard.
I created a quick script (finetune_model.py) that takes the training code above and adds minimal logging functionality. You can see the logs from one training run below:
It's fun to see how the generated samples change as training progresses - even though the loss doesn't appear to be improving much, we can see a progression away from the original domain (images of bedrooms) towards the new training data (wikiart). At the end of this notebook is commented-out code for fine-tuning a model using this script as an alternative to running the cell above.
Generating some images with this model, we can see that these faces are already looking mighty strange!
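For example, reusing the DDIM sampling loop from earlier:

```python
# Sample a batch of 8 images from the fine-tuned model
x = torch.randn(8, 3, 256, 256).to(device)
for i, t in tqdm(enumerate(scheduler.timesteps)):
    model_input = scheduler.scale_model_input(x, t)
    with torch.no_grad():
        noise_pred = image_pipe.unet(model_input, t)["sample"]
    x = scheduler.step(noise_pred, t, x).prev_sample

grid = torchvision.utils.make_grid(x, nrow=4).permute(1, 2, 0)
plt.imshow(grid.cpu().clip(-1, 1) * 0.5 + 0.5)
```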
Consideration 4: Fine-tuning can be quite unpredictable! If we trained for a lot longer, we might see some perfect butterflies. But the intermediate steps can be extremely interesting in their own right, especially if your interests are more towards the artistic side! Explore training for very short or very long periods of time, and varying the learning rate to see how this affects the kinds of output the final model produces.
Code for fine-tuning a model using the minimal example script we used on the WikiArt demo model
If you'd like to train a similar model to the one I made on WikiArt, you can uncomment and run the cells below. Since this takes a while and may exhaust your GPU memory, I recommend doing this after working through the rest of this notebook.
Saving and Loading Fine-Tuned Pipelines
Now that we've fine-tuned the U-Net in our diffusion model, let's save it to a local folder by running:
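For example (the folder name is just a placeholder):

```python
image_pipe.save_pretrained("my-finetuned-model")
```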
As we saw in Unit 1, this will save the config, model and scheduler:
model_index.json scheduler unet
Next, you can follow the same steps outlined in Unit 1's Introduction to Diffusers to push the model to the Hub for later use:
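A sketch using the huggingface_hub client (the repo id is a placeholder - swap in your own username and model name; the model card content is just an example):

```python
from huggingface_hub import HfApi, ModelCard, create_repo

# The repo id is a placeholder - use your own username/model name
hub_model_id = "your-username/ddpm-celebahq-finetuned-butterflies-2epochs"
create_repo(hub_model_id, exist_ok=True)
api = HfApi()
api.upload_folder(folder_path="my-finetuned-model", repo_id=hub_model_id)

# Optionally add a model card so the Hub page has a description
content = """
---
license: mit
tags:
- pytorch
- diffusers
- unconditional-image-generation
- diffusion-models-class
---

# Example fine-tuned model for Unit 2 of the diffusion models class
"""
card = ModelCard(content)
card.push_to_hub(hub_model_id)
```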
'https://huggingface.co/lewtun/ddpm-celebahq-finetuned-butterflies-2epochs/blob/main/README.md'
Congratulations, you've now fine-tuned your first diffusion model!
For the rest of this notebook we'll use a model I fine-tuned from a model trained on LSUN bedrooms, trained for approximately one epoch on the WikiArt dataset. If you'd prefer, you can skip this cell and use the faces/butterflies pipeline we fine-tuned in the previous section, or load one from the Hub instead:
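For example, something like the following (if the repo id below doesn't resolve, substitute a fine-tuned model of your choice):

```python
# Load the fine-tuned bedrooms -> WikiArt pipeline
image_pipe = DDPMPipeline.from_pretrained("johnowhitaker/sd-class-wikiart-from-bedrooms")
image_pipe.to(device)

# Create a DDIM scheduler for it so we can sample quickly
scheduler = DDIMScheduler.from_pretrained("johnowhitaker/sd-class-wikiart-from-bedrooms")
scheduler.set_timesteps(num_inference_steps=40)
```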
Consideration 5: It is often hard to tell how well fine-tuning is working, and what 'good performance' means may vary by use-case. For example, if you're fine-tuning a text-conditioned model like Stable Diffusion on a small dataset, you probably want it to retain most of its original training so that it can understand arbitrary prompts not covered by your new dataset, while adapting to better match the style of your new training data. This could mean using a low learning rate alongside something like an exponential moving average of the model weights, as demonstrated in this great blog post about creating a Pokémon version of Stable Diffusion. In a different situation, you may want to completely re-train a model on new data (such as our bedrooms -> WikiArt example), in which case a larger learning rate and more training makes sense. Even though the loss plot is not showing much improvement, the samples clearly show a move away from the original data and towards more 'artsy' outputs, although they remain mostly incoherent.
Which leads us to the next section, as we examine how we might add additional guidance to such a model for better control over the outputs...
Guidance
What do we do if we want some control over the samples generated? For example, say we wanted to bias the generated images to be a specific color. How would we go about that? Enter guidance, a technique for adding additional control to the sampling process.
Step one is to create our conditioning function: some measure (loss) which we'd like to minimize. Here's one for the color example, which compares the pixels of an image to a target color (by default a sort of light teal) and returns the average error:
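Here's a sketch of such a loss (the default target color is an example value):

```python
def color_loss(images, target_color=(0.1, 0.9, 0.5)):
    """Given a batch of images, return the mean absolute error between their
    pixels and a target color (a light teal by default)."""
    # Map the target color to (-1, 1) to match the model's image range
    target = torch.tensor(target_color).to(images.device) * 2 - 1
    # Broadcast to (batch, channels, height, width)
    target = target[None, :, None, None]
    return torch.abs(images - target).mean()
```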
Next, we'll make a modified version of the sampling loop where, at each step, we do the following:
- Create a new version of x that has requires_grad = True
- Calculate the denoised version (x0)
- Feed the predicted x0 through our loss function
- Find the gradient of this loss function with respect to x
- Use this conditioning gradient to modify x before we step with the scheduler, hopefully pushing x in a direction that will lead to lower loss according to our guidance function
There are two variants here that you can explore. In the first, we set requires_grad on x after we get our noise prediction from the UNet, which is more memory efficient (since we don't have to trace gradients back through the diffusion model) but gives a less accurate gradient. In the second we set requires_grad on x first, then feed it through the UNet and calculate the predicted x0.
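Here's a sketch of the first, more memory-efficient variant (guidance_loss_scale = 40 is an example value to experiment with):

```python
guidance_loss_scale = 40  # Explore different values for stronger/weaker guidance

x = torch.randn(8, 3, 256, 256).to(device)

for i, t in tqdm(enumerate(scheduler.timesteps)):
    model_input = scheduler.scale_model_input(x, t)

    # Predict the noise residual without tracking gradients through the UNet
    with torch.no_grad():
        noise_pred = image_pipe.unet(model_input, t)["sample"]

    # Set requires_grad AFTER the UNet forward pass, so we only differentiate
    # the (cheap) scheduler step, not the diffusion model itself
    x = x.detach().requires_grad_()

    # Get the predicted denoised image (x0)
    x0 = scheduler.step(noise_pred, t, x).pred_original_sample

    # Calculate our guidance loss and its gradient with respect to x
    loss = color_loss(x0) * guidance_loss_scale
    if i % 10 == 0:
        print(i, "loss:", loss.item())
    cond_grad = -torch.autograd.grad(loss, x)[0]

    # Nudge x in the direction that lowers the loss, then take the scheduler step
    x = x.detach() + cond_grad
    x = scheduler.step(noise_pred, t, x).prev_sample
```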
0 loss: 27.279136657714844
10 loss: 11.286816596984863
20 loss: 10.683112144470215
30 loss: 10.942476272583008
This second option requires nearly double the GPU RAM to run, even though we only generate a batch of four images instead of eight. See if you can spot the difference, and think through why this way is more 'accurate':
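A sketch of the second variant, where gradients flow back through the UNet itself:

```python
guidance_loss_scale = 40
x = torch.randn(4, 3, 256, 256).to(device)  # Smaller batch to fit in memory

for i, t in tqdm(enumerate(scheduler.timesteps)):
    # Set requires_grad BEFORE the UNet forward pass this time
    x = x.detach().requires_grad_()
    model_input = scheduler.scale_model_input(x, t)

    # This forward pass is now differentiated through, which costs extra memory
    noise_pred = image_pipe.unet(model_input, t)["sample"]

    x0 = scheduler.step(noise_pred, t, x).pred_original_sample

    loss = color_loss(x0) * guidance_loss_scale
    if i % 10 == 0:
        print(i, "loss:", loss.item())
    cond_grad = -torch.autograd.grad(loss, x)[0]

    x = x.detach() + cond_grad
    # Detach the prediction here so we don't keep the graph around for the update
    x = scheduler.step(noise_pred.detach(), t, x).prev_sample
```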
0 loss: 30.750328063964844
10 loss: 18.550724029541016
20 loss: 17.515094757080078
30 loss: 17.55681037902832
In the second variant, the memory requirements are higher and the effect is less pronounced, so you may think that this is inferior. However, the outputs are arguably closer to the types of images the model was trained on, and you can always increase the guidance scale for a stronger effect. Which approach you use will ultimately come down to what works best experimentally.
CLIP Guidance
Guiding towards a color gives us a little bit of control, but what if we could just type some text describing what we want?
CLIP is a model created by OpenAI that allows us to compare images to text captions. This is extremely powerful, since it allows us to quantify how well an image matches a prompt. And since the process is differentiable, we can use this as a loss function to guide our diffusion model!
We won't go too much into the details here. The basic approach is as follows:
- Embed the text prompt to get a 512-dimensional CLIP embedding of the text
- For every step in the diffusion model process:
- Make several variants of the predicted denoised image (having multiple variations gives a cleaner loss signal)
- For each one, embed the image with CLIP and compare this embedding with the text embedding of the prompt (using a measure called 'Great Circle Distance Squared')
- Calculate the gradient of this loss with respect to the current noisy x and use this gradient to modify x before updating it with the scheduler.
For a deeper explanation of CLIP, check out this lesson on the topic or this report on the OpenCLIP project which we're using to load the CLIP model. Run the next cell to load a CLIP model:
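A sketch of loading CLIP via open_clip and defining the loss (the specific augmentations are one reasonable choice; the normalization constants are the standard CLIP preprocessing values):

```python
import open_clip

clip_model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
clip_model.to(device)

# Transforms to crop/augment the image and normalize to match CLIP's training data
tfms = torchvision.transforms.Compose(
    [
        torchvision.transforms.RandomResizedCrop(224),  # A random crop each time
        torchvision.transforms.RandomAffine(5),  # One possible augmentation: slight skew
        torchvision.transforms.RandomHorizontalFlip(),
        torchvision.transforms.Normalize(
            mean=(0.48145466, 0.4578275, 0.40821073),
            std=(0.26862954, 0.26130258, 0.27577711),
        ),
    ]
)

def clip_loss(image, text_features):
    image_features = clip_model.encode_image(tfms(image))  # Applies the transforms above
    input_normed = torch.nn.functional.normalize(image_features.unsqueeze(1), dim=2)
    embed_normed = torch.nn.functional.normalize(text_features.unsqueeze(0), dim=2)
    # Squared Great Circle Distance between the image and text embeddings
    dists = input_normed.sub(embed_normed).norm(dim=2).div(2).arcsin().pow(2).mul(2)
    return dists.mean()
```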
With a loss function defined, our guided sampling loop looks similar to the previous examples, replacing color_loss() with our new CLIP-based loss function:
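A sketch of that loop (the prompt, guidance_scale and n_cuts are example values; scaling the gradient by the square root of alpha_bar is one heuristic for weighting the guidance according to the noise level):

```python
prompt = "Red Rose (still life), red flower painting"  # Try your own!
guidance_scale = 8
n_cuts = 4  # Number of image variants to average the gradient over

scheduler.set_timesteps(50)

# Embed the prompt with CLIP to get our target text features
text = open_clip.tokenize([prompt]).to(device)
with torch.no_grad():
    text_features = clip_model.encode_text(text)

x = torch.randn(4, 3, 256, 256).to(device)

for i, t in tqdm(enumerate(scheduler.timesteps)):
    model_input = scheduler.scale_model_input(x, t)

    with torch.no_grad():
        noise_pred = image_pipe.unet(model_input, t)["sample"]

    cond_grad = 0
    for cut in range(n_cuts):
        # Set requires_grad on x so we can differentiate the loss w.r.t. it
        x = x.detach().requires_grad_()
        x0 = scheduler.step(noise_pred, t, x).pred_original_sample
        # Each call applies different random crops/augmentations (the 'variants')
        loss = clip_loss(x0, text_features) * guidance_scale
        cond_grad -= torch.autograd.grad(loss, x)[0] / n_cuts

    if i % 25 == 0:
        print("Step:", i, ", Guidance loss:", loss.item())

    # Scale the gradient according to the current noise level, then update x
    alpha_bar = scheduler.alphas_cumprod[t]
    x = x.detach() + cond_grad * alpha_bar.sqrt()
    x = scheduler.step(noise_pred, t, x).prev_sample
```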
Step: 0 , Guidance loss: 7.437869548797607
Step: 25 , Guidance loss: 7.174620628356934