Fine-tuning for Video Classification with 🤗 Transformers

This notebook shows how to fine-tune a pre-trained vision model for video classification on a custom dataset. The idea is to add a randomly initialized classification head on top of a pre-trained encoder and fine-tune the whole model on a labeled dataset.

Dataset

This notebook uses a subset of the UCF-101 dataset to keep the runtime of the tutorial short. The subset was prepared using this notebook following this guide.

Model

We'll fine-tune the VideoMAE model, which was pre-trained on the Kinetics 400 dataset. You can find the other variants of VideoMAE available on 🤗 Hub here. You can also extend this notebook to use other video models such as X-CLIP.

Note that for models without a ready-made classification head, you'll have to attach one (randomly initialized) manually. This is not the case for VideoMAE, since we already have a VideoMAEForVideoClassification class.

Data preprocessing

This notebook leverages TorchVision's and PyTorchVideo's transforms for applying data preprocessing transformations including data augmentation.


Depending on the model and the GPU you are using, you might need to adjust the batch size to avoid out-of-memory errors. Set those two parameters (the model checkpoint and the batch size), then the rest of the notebook should run smoothly.

[ ]
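
A minimal sketch of what such a configuration cell might look like. Both values are assumptions for illustration: `MCG-NJU/videomae-base` is one of the VideoMAE checkpoints on the Hub, and 8 is only a starting batch size that you may need to lower on smaller GPUs.

```python
# Assumed values for illustration; adjust both to your model and hardware.
model_ckpt = "MCG-NJU/videomae-base"  # pre-trained VideoMAE checkpoint on the Hub
batch_size = 8  # lower this if you hit out-of-memory errors
```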

Before we start, let's install the pytorchvideo, transformers, and evaluate libraries.

[ ]

If you're opening this notebook locally, make sure your environment has the latest versions of those libraries installed.

To be able to share your model with the community, there are a few more steps to follow.

First you have to store your authentication token from the Hugging Face website (sign up here if you haven't already!) then execute the following cell and input your token:

[ ]

Then you need to install Git-LFS to upload your model checkpoints:

[ ]

Fine-tuning a model on a video classification task

In this notebook, we will see how to fine-tune one of the 🤗 Transformers vision models on a Video Classification dataset.

Given a video, the goal is to predict an appropriate class for it, like "archery".

Loading the dataset

Here we first download the subset archive and un-archive it.

[ ]
[ ]

Now, let's investigate what is inside the archive.

[ ]

Broadly, dataset_root_path is organized like so:

UCF101_subset/
    train/
        BandMarching/
            video_1.mp4
            video_2.mp4
            ...
        Archery/
            video_1.mp4
            video_2.mp4
            ...
        ...
    val/
        BandMarching/
            video_1.mp4
            video_2.mp4
            ...
        Archery/
            video_1.mp4
            video_2.mp4
            ...
        ...
    test/
        BandMarching/
            video_1.mp4
            video_2.mp4
            ...
        Archery/
            video_1.mp4
            video_2.mp4
            ...
        ...

Let's now count the number of total videos we have.

[ ]
[ ]
[ ]
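
As a rough sketch of the counting step, the snippet below globs for video files per split, assuming the `<split>/<class_name>/<file>` layout shown above and the `.avi` extension. The helper name `count_videos` and the throwaway directory tree are hypothetical and only there to make the example runnable on its own.

```python
import pathlib
import tempfile


def count_videos(root, split, pattern="*.avi"):
    """Count video files stored under root/split/<class_name>/."""
    return len(list(pathlib.Path(root).glob(f"{split}/*/{pattern}")))


# Build a tiny throwaway tree that mimics the dataset layout (made-up files).
with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp) / "UCF101_subset"
    for split, num_files in [("train", 3), ("val", 1), ("test", 2)]:
        class_dir = root / split / "Archery"
        class_dir.mkdir(parents=True)
        for i in range(num_files):
            (class_dir / f"video_{i}.avi").touch()

    counts = {split: count_videos(root, split) for split in ("train", "val", "test")}
    total_videos = sum(counts.values())

print(counts, total_videos)
```

On the real dataset you would point `root` at `dataset_root_path` instead of the temporary directory.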

The video paths, when sorted, appear like so:

...
'UCF101_subset/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g07_c04.avi',
'UCF101_subset/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g07_c06.avi',
'UCF101_subset/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c01.avi',
'UCF101_subset/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g09_c02.avi',
'UCF101_subset/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g09_c06.avi'
...

Notice that some video clips belong to the same group / scene, where the group is denoted by g in the file name. For example, v_ApplyEyeMakeup_g07_c04.avi and v_ApplyEyeMakeup_g07_c06.avi both come from group 07.

For the validation and evaluation splits, we don't want video clips from a group / scene that also appears in the training split, since that would leak data. The subset that we're using in this tutorial takes this information into account.
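
If you build your own subset, a check along these lines can flag groups shared between splits. This is a minimal sketch that assumes the `v_<Class>_g<group>_c<clip>.avi` naming seen above. The helper names and the validation paths are hypothetical.

```python
import re
from pathlib import PurePosixPath

# Matches file names such as v_ApplyEyeMakeup_g07_c04.avi
NAME_RE = re.compile(r"v_(?P<label>\w+?)_g(?P<group>\d+)_c(?P<clip>\d+)\.avi")


def group_key(path):
    """Return (class_name, group_id) parsed from a UCF-101 style file name."""
    match = NAME_RE.fullmatch(PurePosixPath(path).name)
    if match is None:
        raise ValueError(f"unexpected file name: {path}")
    return match["label"], match["group"]


def leaked_groups(train_paths, heldout_paths):
    """Groups that occur in both the training and the held-out paths."""
    return {group_key(p) for p in train_paths} & {group_key(p) for p in heldout_paths}


train_paths = [
    "UCF101_subset/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g07_c04.avi",
    "UCF101_subset/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g07_c06.avi",
]
# Made-up validation paths: the first split is clean, the second reuses group 07.
clean = leaked_groups(train_paths, ["UCF101_subset/val/ApplyEyeMakeup/v_ApplyEyeMakeup_g20_c01.avi"])
leaky = leaked_groups(train_paths, ["UCF101_subset/val/ApplyEyeMakeup/v_ApplyEyeMakeup_g07_c01.avi"])
print(clean, leaky)
```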

Next up, we derive the set of labels we have in the dataset. Let's also create two dictionaries that'll be helpful when initializing the model:

  • label2id: maps the class names to integers.
  • id2label: maps the integers to class names.
[ ]
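
One way to derive both dictionaries is to read the class name from each video's parent directory. The sketch below assumes that layout. The `Archery` file name is made up for illustration.

```python
from pathlib import PurePosixPath

video_paths = [
    "UCF101_subset/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g07_c04.avi",
    "UCF101_subset/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c01.avi",
    "UCF101_subset/train/Archery/v_Archery_g01_c01.avi",  # hypothetical file name
]

# The parent directory of each video is its class name.
class_labels = sorted({PurePosixPath(path).parent.name for path in video_paths})
label2id = {label: i for i, label in enumerate(class_labels)}
id2label = {i: label for label, i in label2id.items()}

print(label2id)
print(id2label)
```

Sorting the class names keeps the integer ids stable across runs.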

We've got 10 unique classes. For each class we have 30 videos in the training set.

Loading the model

In the next cell, we initialize a video classification model where the encoder is initialized with the pre-trained parameters and the classification head is randomly initialized. We also initialize the image processor associated with the model. This will come in handy when writing the preprocessing pipeline for our dataset.

[ ]

The warning is telling us we are throwing away some weights (e.g. the weights and bias of the classifier layer) and randomly initializing some other (the weights and bias of a new classifier layer). This is expected in this case, because we are adding a new head for which we don't have pretrained weights, so the library warns us we should fine-tune this model before using it for inference, which is exactly what we are going to do.

Note that starting from a checkpoint that was already fine-tuned on a similar downstream task with considerable domain overlap leads to better performance on this task. You can check out this checkpoint, which was obtained by fine-tuning MCG-NJU/videomae-base-finetuned-kinetics and obtains much better performance.

Constructing the datasets for training

For preprocessing the videos, we'll leverage the PyTorch Video library. We start by importing the dependencies we need.

[ ]

For the training dataset transformations, we use a combination of uniform temporal subsampling, pixel normalization, random cropping, and random horizontal flipping. For the validation and evaluation dataset transformations, we keep the transformation chain the same except for random cropping and horizontal flipping. To learn more about the details of these transformations check out the official documentation of PyTorch Video.
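
To make "uniform temporal subsampling" concrete, here is a pure-Python sketch of the idea: pick a fixed number of frame indices spread evenly over the clip. It only illustrates the concept and is not PyTorch Video's implementation. The function name is hypothetical.

```python
def uniform_subsample_indices(num_frames, num_samples):
    """Pick num_samples frame indices evenly spaced over [0, num_frames - 1]."""
    if num_frames < 1 or num_samples < 1:
        raise ValueError("num_frames and num_samples must be positive")
    if num_samples == 1:
        return [0]
    # Integer arithmetic keeps the first index at 0 and the last at num_frames - 1.
    return [i * (num_frames - 1) // (num_samples - 1) for i in range(num_samples)]


indices = uniform_subsample_indices(num_frames=9, num_samples=3)
print(indices)  # [0, 4, 8]
```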

We'll use the image_processor associated with the pre-trained model to obtain the following information:

  • Image mean and standard deviation with which the video frame pixels will be normalized.
  • Spatial resolution to which the video frames will be resized.
[ ]
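
A small sketch of how the spatial resolution might be read off the processor's settings. It assumes the image processor exposes `image_mean`, `image_std`, and a `size` dictionary keyed either by `shortest_edge` or by `height` and `width`. A plain dictionary with ImageNet-style values stands in for the real processor here, and `resolve_resize_to` is a hypothetical helper.

```python
def resolve_resize_to(size):
    """Turn an image-processor style size dict into a (height, width) tuple."""
    if "shortest_edge" in size:
        height = width = size["shortest_edge"]
    else:
        height, width = size["height"], size["width"]
    return height, width


# Stand-in for the image processor's attributes (assumed values).
processor_settings = {
    "image_mean": [0.485, 0.456, 0.406],
    "image_std": [0.229, 0.224, 0.225],
    "size": {"shortest_edge": 224},
}

mean = processor_settings["image_mean"]
std = processor_settings["image_std"]
resize_to = resolve_resize_to(processor_settings["size"])
print(mean, std, resize_to)
```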

Note: The above dataset pipelines are taken from the official PyTorch Video example. We're using the pytorchvideo.data.Ucf101() function because it's tailored for the UCF-101 dataset. Under the hood, it returns a pytorchvideo.data.labeled_video_dataset.LabeledVideoDataset object. The LabeledVideoDataset class is the base class for all things video in PyTorch Video's data API. So, if you want to use a custom dataset not supported off-the-shelf by PyTorch Video, you can extend the LabeledVideoDataset class accordingly. Refer to the data API documentation to learn more. Also, if your dataset follows a similar structure (as shown above), then using pytorchvideo.data.Ucf101() should work just fine.

[ ]

Let's now take a preprocessed video from the dataset and investigate it.

[ ]
[ ]

We can also visualize the preprocessed videos for easier debugging.

[ ]
[ ]

Training the model

We'll leverage Trainer from 🤗 Transformers for training the model. To instantiate a Trainer, we will need to define the training configuration and an evaluation metric. The most important piece is TrainingArguments, a class that contains all the attributes to configure the training. It requires an output folder name, which will be used to save the checkpoints of the model. It also helps sync all the information in the model repository on 🤗 Hub.

Most of the training arguments are pretty self-explanatory, but one that is quite important here is remove_unused_columns=False. When this argument is True, the Trainer drops any features not used by the model's call function. It's True by default because usually it's ideal to drop unused feature columns, making it easier to unpack inputs into the model's call function. But, in our case, we need the unused features ('video' in particular) in order to create pixel_values (which is a mandatory key our model expects in its inputs), so we set it to False.

[ ]

Usually there's no need to define max_steps when instantiating TrainingArguments. Here, however, the dataset returned by pytorchvideo.data.Ucf101() doesn't implement the __len__() method, so the Trainer can't infer the number of training steps and we have to specify max_steps ourselves.
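
One way to pick max_steps is to derive it from the number of training videos. The sketch below uses the 10 classes with 30 training videos each mentioned earlier. The batch size and number of epochs are assumed values, and `compute_max_steps` is a hypothetical helper.

```python
def compute_max_steps(num_videos, batch_size, num_epochs):
    """Number of optimizer steps for num_epochs passes, dropping partial batches."""
    return (num_videos // batch_size) * num_epochs


num_train_videos = 10 * 30  # 10 classes x 30 training videos each
max_steps = compute_max_steps(num_train_videos, batch_size=8, num_epochs=4)
print(max_steps)  # 148
```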

Next, we need to define a function for how to compute the metrics from the predictions, which will just use the metric we'll load now. The only preprocessing we have to do is to take the argmax of our predicted logits:

[ ]
[ ]
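
The argmax step can be sketched as follows. Instead of loading a metric object, this version computes accuracy directly with NumPy as a stand-in. It assumes the evaluation output exposes `predictions` (the logits) and `label_ids`, and the fake evaluation data is made up.

```python
from types import SimpleNamespace

import numpy as np


def compute_metrics(eval_pred):
    """Accuracy from an object exposing `predictions` (logits) and `label_ids`."""
    predictions = np.argmax(eval_pred.predictions, axis=1)
    labels = np.asarray(eval_pred.label_ids)
    return {"accuracy": float(np.mean(predictions == labels))}


# Made-up logits for three videos and two classes.
fake_eval = SimpleNamespace(
    predictions=np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]),
    label_ids=np.array([1, 1, 1]),
)
metrics = compute_metrics(fake_eval)
print(metrics)
```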

A note on evaluation:

In the VideoMAE paper, the authors use the following evaluation strategy. They evaluate the model on several clips from test videos and apply different crops to those clips and report the aggregate score. However, in the interest of simplicity and brevity, we don't consider that in this tutorial.

We also define a collate_fn, which will be used to batch examples together. Each batch consists of 2 keys, namely pixel_values and labels.

[ ]
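
A minimal sketch of such a collate function. It assumes each example is a dictionary with a `video` tensor shaped (channels, frames, height, width) and an integer `label`, and that the model expects (frames, channels, height, width) per sample, hence the permute.

```python
import torch


def collate_fn(examples):
    """Batch examples into the two keys the model expects."""
    # (C, T, H, W) -> (T, C, H, W) for each video, then stack along a new batch axis.
    pixel_values = torch.stack(
        [example["video"].permute(1, 0, 2, 3) for example in examples]
    )
    labels = torch.tensor([example["label"] for example in examples])
    return {"pixel_values": pixel_values, "labels": labels}


# Two dummy videos: 3 channels, 16 frames, 8x8 pixels.
examples = [{"video": torch.zeros(3, 16, 8, 8), "label": i} for i in range(2)]
batch = collate_fn(examples)
print(batch["pixel_values"].shape)  # torch.Size([2, 16, 3, 8, 8])
```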

Then we just need to pass all of this along with our datasets to the Trainer:

[ ]

You might wonder why we pass along the image_processor as a tokenizer when we already preprocessed our data. This is only to make sure the image processor configuration file (stored as JSON) will also be uploaded to the repo on the hub.

Now we can finetune our model by calling the train method:

[ ]

We can check with the evaluate method that our Trainer did reload the best model properly (if it was not the last one):

[ ]
[ ]

You can now upload the result of the training to the Hub, just execute this instruction (note that the Trainer will automatically create a model card as well as Tensorboard logs - see the "Training metrics" tab - amazing isn't it?):

[ ]

Now that our model is trained, let's use it to run inference on a video from test_dataset.

Inference

Let's load the trained model checkpoint and fetch a video from test_dataset.

[ ]
[ ]

We then prepare the video as a torch.Tensor and run inference.

[ ]
[ ]
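
The inference step might look roughly like the sketch below. `run_inference` is a hypothetical helper that assumes the video tensor is shaped (channels, frames, height, width) and that the model accepts `pixel_values` and returns an object with a `logits` attribute. The tiny stand-in model exists only so the example runs without downloading a checkpoint.

```python
from types import SimpleNamespace

import torch


def run_inference(model, video):
    """Return logits for one (channels, frames, height, width) video tensor."""
    # (C, T, H, W) -> (T, C, H, W), then add a batch dimension.
    pixel_values = video.permute(1, 0, 2, 3).unsqueeze(0)
    device = next(model.parameters()).device
    model.eval()
    with torch.no_grad():
        outputs = model(pixel_values=pixel_values.to(device))
    return outputs.logits


class TinyStandInModel(torch.nn.Module):
    """Hypothetical placeholder with the same call shape as the real model."""

    def __init__(self, num_labels=10):
        super().__init__()
        self.head = torch.nn.Linear(3, num_labels)

    def forward(self, pixel_values):
        pooled = pixel_values.mean(dim=(1, 3, 4))  # average over T, H, W -> (B, C)
        return SimpleNamespace(logits=self.head(pooled))


logits = run_inference(TinyStandInModel(), torch.rand(3, 16, 8, 8))
print(logits.shape)  # torch.Size([1, 10])
```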

We can now check if the model got the prediction right.

[ ]
[ ]
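
Checking the prediction boils down to taking the index of the largest logit and looking it up in id2label. A plain-Python sketch with made-up scores and a hypothetical three-class mapping:

```python
def predicted_label(logits, id2label):
    """Map the highest-scoring class index to its class name."""
    best_id = max(range(len(logits)), key=logits.__getitem__)
    return id2label[best_id]


id2label = {0: "ApplyEyeMakeup", 1: "Archery", 2: "BandMarching"}  # assumed ids
sample_logits = [0.2, 2.5, -1.0]  # made-up scores for a single video
prediction = predicted_label(sample_logits, id2label)
print(prediction)  # Archery
```

With a tensor of logits, `logits.argmax(-1).item()` gives the same index.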

And it looks like it got it right!

You can also use this model to bring in your own videos. Check out this Space to know more. The Space will also show you how to run inference for a single video file.


Next steps

Now that you've learned to train a well-performing video classification model on a custom dataset, here is some homework for you:

  • Increase the dataset size: include more classes and more samples per class.
  • Try out different hyperparameters to study how the model converges.
  • Analyze the classes for which the model fails to perform well.
  • Try out a different video encoder.

Don't forget to share your models with the community =)