Train YOLOv5 Model On A Custom Dataset With Weights & Biases
Train YOLOv5 model on a Custom Dataset with Weights & Biases (as a part of the YOLOv5 Series)
This is a Colab for training a custom YOLOv5 model and using Weights & Biases to track training metrics, checkpoint weights and datasets. This Colab is featured in part 3 of the YOLOv5 Series.
Follow along with YOLOv5 Series →
Setup
We begin by downloading the
YOLOv5 GitHub repo and installing all the requirements for YOLOv5 and wandb.
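As a rough sketch of what this setup step involves, the snippet below lists the two commands: cloning the repo, then installing its requirements together with wandb. The local directory name yolov5 and the helper names setup_commands and run_setup are our own choices for illustration; they are not part of YOLOv5.

```python
import shlex
import subprocess
import sys

YOLOV5_REPO = "https://github.com/ultralytics/yolov5"


def setup_commands(repo_dir="yolov5"):
    """Return the setup steps as argv lists: clone, then install requirements and wandb."""
    return [
        ["git", "clone", YOLOV5_REPO, repo_dir],
        [sys.executable, "-m", "pip", "install",
         "-r", f"{repo_dir}/requirements.txt", "wandb"],
    ]


def run_setup(repo_dir="yolov5"):
    """Run each step in order, stopping at the first one that fails."""
    for cmd in setup_commands(repo_dir):
        subprocess.run(cmd, check=True)


# Only print the commands here; call run_setup() to actually execute them.
for cmd in setup_commands():
    print(shlex.join(cmd))
```

In a Colab cell, the same two commands are usually typed directly with a leading "!".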
Here's an example of a wandb dashboard.
Detect
YOLOv5 provides highly accurate, fast models that are pretrained on the Common Objects in COntext (COCO) dataset.
If your object detection application involves only classes from the COCO dataset, like "Stop Sign" and "Pizza", then these pretrained models may be all you need!
The cell below runs a pretrained model on an example image
using detect.py from the YOLOv5 toolkit.
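For reference, a detect.py call might look like the sketch below. The --weights and --source flags belong to detect.py; the checkpoint name yolov5s.pt and the image path are assumptions chosen for illustration, and detect_command is a hypothetical helper of ours.

```python
import shlex


def detect_command(weights="yolov5s.pt", source="data/images/bus.jpg"):
    """Build the argv for running a pretrained model on one image with detect.py.

    Both default values are assumptions: swap in the checkpoint and image you want.
    """
    return ["python", "detect.py", "--weights", weights, "--source", source]


# Print the command; run it from inside the cloned yolov5 directory.
print(shlex.join(detect_command()))
```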
The next cell generates a .yaml file for training on the bus dataset that's featured in the YOLOv5 Series. You can skip this step when using your own custom dataset.
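To give a feel for what such a file contains, here is a minimal sketch that builds a dataset .yaml by hand. It assumes the common YOLOv5 layout with train, val, nc, and names keys; the exact keys can vary between YOLOv5 versions, and the directory paths below are hypothetical placeholders rather than the ones used in the series.

```python
from pathlib import Path


def dataset_yaml_text(train_dir, val_dir, class_names):
    """Return the text of a YOLOv5-style dataset config."""
    quoted = ", ".join(f"'{name}'" for name in class_names)
    lines = [
        f"train: {train_dir}",        # directory with training images
        f"val: {val_dir}",            # directory with validation images
        f"nc: {len(class_names)}",    # number of classes
        f"names: [{quoted}]",         # class names, in label-index order
    ]
    return "\n".join(lines) + "\n"


def save_dataset_yaml(path, train_dir, val_dir, class_names):
    """Write the config to disk so it can be passed to train.py."""
    Path(path).write_text(dataset_yaml_text(train_dir, val_dir, class_names))


# Hypothetical paths for a single-class "bus" dataset.
print(dataset_yaml_text("bus_dataset/images/train", "bus_dataset/images/val", ["bus"]))
```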
Train
Your custom classes (probably) are not among the objects in COCO,
so our pretrained models don't know how to detect them
and we can't just use detect.py with one of those models.
Instead, we need to train the models to detect our custom classes,
using YOLOv5's train.py.
We don't have to start our models from scratch though!
We can fine-tune the pretrained models on our custom dataset.
This substantially speeds up training.
Model training is a complex process, so we'll want to track the inputs and outputs, log information about model behavior during training, and record system state and metrics.
That's where Weights & Biases
comes in:
the wandb library provides all the tools you need to thoroughly
and effectively log model training experiments.
YOLOv5 comes with wandb already integrated,
so all you need to do is configure the logging
with command line arguments.
- --project sets the W&B project to which we're logging (akin to a GitHub repo).
- --upload_dataset tells wandb to upload the dataset as a dataset-visualization Table. At regular intervals set by --bbox_interval, the model's outputs on the validation set will also be logged to W&B.
- --save_period sets the number of epochs to wait in between logging the model checkpoints. If not set, only the final trained model is logged.
Even without these arguments, basic model metrics and some model outputs will still be saved to W&B.
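Putting those arguments together, a training call could be assembled as in the sketch below. The four W&B flags are spelled as in the text above (newer YOLOv5 releases may spell some of them with hyphens); --data, --weights, and --epochs are standard train.py options. The project name, .yaml file name, and default values are assumptions for illustration, and train_command is a hypothetical helper.

```python
import shlex


def train_command(data, project, weights="yolov5s.pt", epochs=10,
                  upload_dataset=True, bbox_interval=1, save_period=1):
    """Build the argv for fine-tuning with train.py and W&B logging enabled."""
    cmd = [
        "python", "train.py",
        "--data", data,            # dataset .yaml file
        "--weights", weights,      # pretrained checkpoint to fine-tune from
        "--epochs", str(epochs),
        "--project", project,      # W&B project to log to
    ]
    if upload_dataset:
        cmd.append("--upload_dataset")                   # log the dataset as a Table
    if bbox_interval is not None:
        cmd += ["--bbox_interval", str(bbox_interval)]   # epochs between prediction logs
    if save_period is not None:
        cmd += ["--save_period", str(save_period)]       # epochs between checkpoint logs
    return cmd


# Hypothetical file and project names.
print(shlex.join(train_command("bus.yaml", "yolov5-bus")))
```

Passing None for bbox_interval or save_period leaves the flag out, which falls back to the default behavior described above.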
To train on your custom dataset you'll need a special .yaml file. In the YOLOv5 Series we use Weights & Biases to upload our custom dataset to the cloud and generate the required .yaml file.
To learn more you can watch part 2 of the series →
Here's where you can find the uploaded evaluation results in the W&B UI:
Resume Crashed Runs
In addition to making it easier to debug our models, the W&B integration can help rescue crashed or interrupted runs.
Two steps above helped set us up for this:
- By setting a --save_period, we regularly logged the model to W&B, which means we can recreate our model and then resume the run on any device with the dataset available.
- By using --upload_dataset, we logged the data to W&B, which means we can recreate the data as well and so resume runs on any device, whether the dataset is present on disk or not.
To resume a crashed or interrupted run:
- Go to that run's overview section on the W&B dashboard
- Copy the run path
- Pass the run path, prefixed with wandb-artifact://, as the --resume argument. This prefix tells YOLO that the files are located on wandb, rather than locally.
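The steps above can be sketched as follows. The run path shown is a made-up placeholder in the usual entity/project/run-id form; copy the real one from your run's overview page. resume_command is a hypothetical helper, not part of YOLOv5.

```python
import shlex

WANDB_ARTIFACT_PREFIX = "wandb-artifact://"


def resume_command(run_path):
    """Build the argv that resumes a run whose files are stored on W&B."""
    run_path = run_path.strip().strip("/")   # tolerate stray whitespace or slashes
    if not run_path:
        raise ValueError("run_path must not be empty")
    return ["python", "train.py", "--resume", WANDB_ARTIFACT_PREFIX + run_path]


# Placeholder run path; replace it with the one copied from the W&B dashboard.
print(shlex.join(resume_command("my-entity/my-project/abc123")))
```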
End Notes
Distributed Data-Parallel Training
All YOLO+W&B features are DDP-aware and compatible. Train on as many GPUs as you can muster, and we'll keep logging!
Logging Large Datasets
For very large datasets,
the initial dataset upload triggered by --upload_dataset
might be prohibitively expensive.
In that case,
check out the
log_dataset.py script
included in YOLOv5.
Stripped Models
At the end of training, a "stripped" version of the model is saved to W&B. This version of the model file is much smaller, but is missing accumulated data required for resuming training. It's intended for use in downstream inference.
