Image Captioning Blip

Fine-tune BLIP using Hugging Face transformers and datasets 🤗

This tutorial is largely based on the GiT tutorial on how to fine-tune GiT on a custom image captioning dataset. Here we use a dummy dataset of football players ⚽ that is hosted on the Hub. The images were manually selected and paired with captions. Check the 🤗 documentation on how to create and upload your own image-text dataset.

Set-up environment

[ ]
[ ]
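
A quick way to confirm the environment is ready is to check which packages are installed. The sketch below uses only the standard library; the package list is an assumption about what this tutorial needs, so adjust it to your setup.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed requirements for this tutorial; adjust to your environment.
REQUIRED = ["transformers", "datasets", "torch"]


def missing_packages(names):
    """Return the entries of `names` that are not installed."""
    missing = []
    for name in names:
        try:
            version(name)
        except PackageNotFoundError:
            missing.append(name)
    return missing


missing = missing_packages(REQUIRED)
if missing:
    print("Install with: pip install " + " ".join(missing))
else:
    print("All required packages are installed.")
```

In a notebook, the suggested `pip install` command can be run in a cell by prefixing it with `!`.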

Load the image captioning dataset

Let's load the image captioning dataset. It takes only a few lines of code.

[ ]
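
As a minimal sketch of this step, the helper below wraps `datasets.load_dataset`. The helper name `load_captioning_dataset` is hypothetical, and the repository id is left as a parameter because it is not stated in the text above.

```python
def load_captioning_dataset(repo_id, split="train"):
    """Load one split of an image-text dataset from the Hub."""
    # Imported lazily so the helper can be defined before `datasets` is installed.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split)
```

For the football dataset used here, the id is assumed to be `"ybelkada/football-dataset"`, so the call would be `dataset = load_captioning_dataset("ybelkada/football-dataset")`. Substitute the id of your own dataset if it differs.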

Let's retrieve the caption of the first example:

[ ]

And the corresponding image:

[ ]
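
The two lookups above can be sketched as plain indexing into the dataset. The column names `"text"` and `"image"` are assumptions about how the dataset is laid out; check `dataset.column_names` if yours differ.

```python
def first_example(dataset, text_column="text", image_column="image"):
    """Return the caption and image of the first example."""
    example = dataset[0]
    return example[text_column], example[image_column]
```

In a notebook, leaving the returned image as the last expression of a cell displays it inline when it is a PIL image.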

Create PyTorch Dataset

The lines below are entirely copied from the original notebook!

[ ]
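
A minimal sketch of such a dataset wrapper is shown below; it is an illustration under stated assumptions rather than the copied code itself. It assumes the examples expose `"image"` and `"text"` fields and that the processor returns tensors with a leading batch dimension of size 1. The class is duck-typed: a PyTorch `DataLoader` accepts any object with `__len__` and `__getitem__`, though you can subclass `torch.utils.data.Dataset` if you prefer.

```python
class ImageCaptioningDataset:
    """Map-style dataset that turns (image, caption) pairs into model inputs."""

    def __init__(self, dataset, processor):
        self.dataset = dataset
        self.processor = processor

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        item = self.dataset[idx]
        # Pad captions to a fixed length so that examples can be stacked
        # into batches without a custom collate function.
        encoding = self.processor(
            images=item["image"],
            text=item["text"],
            padding="max_length",
            return_tensors="pt",
        )
        # The processor adds a batch dimension of size 1; remove it so the
        # DataLoader can add the real batch dimension later.
        return {key: value.squeeze(0) for key, value in encoding.items()}
```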

Load model and processor

[ ]
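
One way to load both objects is through `from_pretrained`. The sketch below assumes the `Salesforce/blip-image-captioning-base` checkpoint as the starting point; the helper name `load_blip` is hypothetical.

```python
def load_blip(checkpoint="Salesforce/blip-image-captioning-base"):
    """Load the processor and the BLIP captioning model for a checkpoint."""
    # Imported lazily so the helper can be defined before `transformers` is installed.
    from transformers import AutoProcessor, BlipForConditionalGeneration

    processor = AutoProcessor.from_pretrained(checkpoint)
    model = BlipForConditionalGeneration.from_pretrained(checkpoint)
    return processor, model
```

Calling `processor, model = load_blip()` downloads the weights on first use and reads them from the local cache afterwards.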

Now that we have loaded the processor, let's create the dataset and the dataloader:

[ ]
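
Batching can be sketched with a standard PyTorch `DataLoader`. The batch size of 2 is an assumption chosen for a tiny dataset, and `make_dataloader` is a hypothetical helper name. The default collate function is enough as long as every example has the same tensor shapes.

```python
def make_dataloader(train_dataset, batch_size=2, shuffle=True):
    """Batch a map-style dataset of processor encodings for training."""
    # Imported lazily so the helper can be defined before `torch` is installed.
    from torch.utils.data import DataLoader

    return DataLoader(train_dataset, batch_size=batch_size, shuffle=shuffle)
```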

Train the model

Let's train the model! Simply run the cell below.

[ ]
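
A training loop for this setup can be sketched as follows. It assumes each batch is a dict with `input_ids` and `pixel_values`, and that the model returns an object with a `.loss` attribute when `labels` are passed, as `BlipForConditionalGeneration` does. The caption token ids serve as both inputs and labels. The function name and its defaults are hypothetical.

```python
def train(model, dataloader, optimizer, device="cpu", num_epochs=1):
    """Run a plain training loop and return the per-step loss values."""
    model.to(device)
    model.train()
    losses = []
    for epoch in range(num_epochs):
        for batch in dataloader:
            input_ids = batch["input_ids"].to(device)
            pixel_values = batch["pixel_values"].to(device)
            # The captions are the prediction target, so the token ids are
            # passed both as inputs and as labels.
            outputs = model(
                input_ids=input_ids,
                pixel_values=pixel_values,
                labels=input_ids,
            )
            loss = outputs.loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            losses.append(loss.item())
    return losses
```

A typical call would pass `torch.optim.AdamW(model.parameters(), lr=5e-5)` as the optimizer and `"cuda" if torch.cuda.is_available() else "cpu"` as the device. The learning rate here is an assumed starting value, not a tuned one.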

Inference

Let's check the results on our training dataset:

[ ]
[ ]
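
Captioning a single image can be sketched as below. The helper name `generate_caption` and the `max_length` of 50 tokens are assumptions; `image` is expected to be a PIL image such as the ones stored in the dataset.

```python
def generate_caption(model, processor, image, device="cpu", max_length=50):
    """Generate a caption for one image and decode it to text."""
    # Only the image is encoded; the model generates the caption tokens.
    inputs = processor(images=image, return_tensors="pt").to(device)
    generated_ids = model.generate(
        pixel_values=inputs["pixel_values"],
        max_length=max_length,
    )
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

Since these images were seen during training, good captions here show that the model fit the data, not that it generalizes to new images.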

Load from the Hub

Once trained, you can push the model and processor to the Hub to use them later. Meanwhile, you can play with the model that we have fine-tuned!

[ ]
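
Pushing can be sketched with the `push_to_hub` method that both models and processors provide. The helper name is hypothetical, and `repo_id` stands for a repository under your own account.

```python
def push_to_hub(model, processor, repo_id):
    """Upload the fine-tuned model and its processor to the same Hub repo."""
    model.push_to_hub(repo_id)
    processor.push_to_hub(repo_id)
```

Uploading requires being logged in, for example through `huggingface_hub.notebook_login()` in a notebook. Loading the result back is the usual `BlipForConditionalGeneration.from_pretrained(repo_id)` together with `AutoProcessor.from_pretrained(repo_id)`.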

Let's check the results on our training dataset!

[ ]