Visual Object Detection
Finetune Visual Object Detection Models Using Pre-trained SageMaker Models
This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.
This notebook introduces finetuning pretrained object detection (OD) models on a new dataset.
Training a model from scratch is generally time-consuming and requires large compute resources. When the training data is small, we cannot expect to train a very performant model. A better alternative is to finetune a pretrained model on the target dataset. Amazon SageMaker provides high-quality pretrained models that were trained on very large datasets. Finetuning these models on a new dataset takes only a fraction of the training time compared to training from scratch.
In this notebook, we demonstrate how to use two types of Amazon SageMaker built-in OD models to finetune on the Steel Surface Defect dataset, which is used in this solution.
- Type 1 (legacy): uses the built-in legacy Object Detection algorithm, a Single Shot MultiBox Detector (SSD) model with either a VGG or ResNet backbone, pretrained on the ImageNet dataset.
- Type 2 (latest): provides 9 pretrained OD models, including 8 SSD models and 1 Faster R-CNN model. These models use VGG, ResNet, or MobileNet as the backbone and were pretrained on the COCO or VOC dataset (the Faster R-CNN model additionally uses a Feature Pyramid Network, FPN).
For each type of model, besides training with default hyperparameters, we also perform hyperparameter optimization (HPO) using SageMaker Automatic Model Tuning (AMT) to train an even better model.
Running the whole notebook takes about 8 hours. The most time-consuming part is running the HPO jobs for both types of models. If more EC2 instances are available, you can run more HPO jobs in parallel to reduce the running time.
Content
- Data Preparation
- Training: Finetune Type 1 (Legacy) OD Model
- Training: Finetune Type 1 (Legacy) OD Model with HPO
- Training: Finetune Type 2 (Latest) OD Model
- Training: Finetune Type 2 (Latest) OD Model with HPO
- Inference and Model Comparison
- Clean Up the Endpoints
- Conclusion
** ATTENTION **
- Running the notebook end-to-end takes 8~9 hours. We changed some parameter values so that the notebook finishes much faster, at the cost of non-convergent model training.
- Please change them back when you want to train until convergence. These parameters include `num_epochs=100` for training all models, and `max_jobs=20`, `max_parallel_jobs=10` for hyperparameter tuning.
- The results shown in this notebook are for fully converged models.
1. Data Preparation
The two types of OD models require different data formats.
The steel surface dataset used in this solution contains one XML annotation file per image. However, neither model type consumes XML annotations directly. The Type 1 (legacy) OD model requires RecordIO or image format in file mode, or RecordIO format in pipe mode. The Type 2 (latest) OD model requires the input to be a directory with a sub-directory of images and an annotations.json file. Please check Section 3 of this notebook for more explanation.
In this notebook, we split the data into train:val:test = 64:16:20. We allocate 20% of the data as test data to numerically compare all trained models at the end of the notebook. The steel surface dataset has 1,800 images in 6 categories; we randomly allocate 20% of the images from each category to the test data.
We provide a script to convert the remaining 80% of the XML files into a single annotations.json for training the Type 2 (latest) OD model (under the hood, the source code automatically splits the data into train:val = 80:20, equivalent to 64% of all data as train and 16% as val); a sketch of this conversion is shown after the listing below. We provide another script to convert the annotations.json and corresponding images to RecordIO data for the Type 1 (legacy) OD model.
If your dataset follows the required input format for Type 1 (legacy) or Type 2 (latest) OD model, you do not need these conversions.
PRE images/ 2022-09-09 18:51:38 368437 annotations.json
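For reference, here is a minimal sketch of the XML-to-annotations.json conversion, assuming VOC-style XML files. The directory name, category list, and output schema (a COCO-style file with "images" and "annotations" keys) are assumptions; adapt them to your data and to the exact format the Type 2 model expects.

```python
import json
import os
import xml.etree.ElementTree as ET

# Hypothetical inputs: a folder of VOC-style XML files and the 6 defect categories.
XML_DIR = "annotations_xml"
CATEGORIES = ["crazing", "inclusion", "patches", "pitted_surface", "rolled-in_scale", "scratches"]

images, annotations = [], []
ann_id = 0
for img_id, fname in enumerate(sorted(os.listdir(XML_DIR))):
    root = ET.parse(os.path.join(XML_DIR, fname)).getroot()
    size = root.find("size")
    images.append({
        "id": img_id,
        "file_name": root.findtext("filename"),
        "width": int(size.findtext("width")),
        "height": int(size.findtext("height")),
    })
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        xmin, ymin = float(box.findtext("xmin")), float(box.findtext("ymin"))
        xmax, ymax = float(box.findtext("xmax")), float(box.findtext("ymax"))
        annotations.append({
            "id": ann_id,
            "image_id": img_id,
            "category_id": CATEGORIES.index(obj.findtext("name")),
            "bbox": [xmin, ymin, xmax - xmin, ymax - ymin],  # COCO-style [x, y, w, h]
        })
        ann_id += 1

with open("annotations.json", "w") as f:
    json.dump({"images": images, "annotations": annotations}, f)
```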
2. Training: Finetune Type 1 (Legacy) OD Model
We start by finetuning the Type 1 (legacy) OD model: an SSD model with a ResNet backbone, pretrained on ImageNet. A sketch of the training job setup is shown below.
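Below is a minimal sketch of such a finetuning job using the built-in algorithm's container. The role, instance type, and several hyperparameter values are assumptions to adapt; `num_training_samples=1152` corresponds to 64% of the 1,800 images.

```python
# A minimal sketch of finetuning the legacy built-in Object Detection algorithm.
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
bucket = session.default_bucket()
training_image = image_uris.retrieve("object-detection", session.boto_region_name)

od_estimator = Estimator(
    image_uri=training_image,
    role=role,  # assumed to be defined earlier in the notebook
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    output_path=f"s3://{bucket}/od-legacy/output",
    sagemaker_session=session,
)
od_estimator.set_hyperparameters(
    base_network="resnet-50",
    use_pretrained_model=1,     # finetune from ImageNet-pretrained weights
    num_classes=6,              # six steel surface defect categories
    epochs=100,
    mini_batch_size=16,         # illustrative value
    num_training_samples=1152,  # 64% of the 1,800 images
)
# s3_train and s3_validation are assumed to point to the .rec files built below.
od_estimator.fit({"train": s3_train, "validation": s3_validation})
```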
Input data: following the documentation, the legacy OD model supports both RecordIO and image formats for training in file mode, or RecordIO format in pipe mode. In this notebook, we use RecordIO in file mode.
We provide a script for converting the annotations.json to RecordIO format. The document and example provide some context for understanding the script.
This script first splits the data into train:val = 80:20 according to the train-ratio, which is equivalent to using 64% of all data for training and 16% for validation. It then converts each partition, including images and annotations, to a .rec file; a sketch of this conversion is shown below. We use the validation data for selecting the best job during HPO training in the next section, and the test data for numerically comparing all finetuned models.
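As a rough illustration, the sketch below writes an MXNet .lst file following the GluonCV detection convention (header length A=4, label width B=5, corner coordinates normalized by image size) and then packs it with MXNet's im2rec.py tool. File names and the example box are hypothetical.

```python
# Sketch: write an MXNet-style .lst file for detection, then pack it into RecordIO.
# Per line: index \t A \t B \t width \t height \t (class xmin ymin xmax ymax)* \t path

def write_line(img_path, width, height, boxes, idx):
    """boxes: list of (class_id, xmin, ymin, xmax, ymax) in pixels."""
    A, B = 4, 5  # header length and per-object label width
    labels = []
    for cls, xmin, ymin, xmax, ymax in boxes:
        labels += [cls, xmin / width, ymin / height, xmax / width, ymax / height]
    fields = [idx, A, B, width, height] + labels + [img_path]
    return "\t".join(str(x) for x in fields) + "\n"

with open("train.lst", "w") as f:
    f.write(write_line("images/crazing_1.jpg", 200, 200, [(0, 20, 30, 120, 140)], 0))

# Then pack images + labels into train.rec with MXNet's im2rec tool:
#   python im2rec.py train.lst images/ --pack-label
```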
Visualize Training Progress
During training, the loss function is the sum of CrossEntropy loss and SmoothL1 loss. We visualize the two losses on the training data as well as the mean Average Precision (mAP) on the validation data.
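One way to visualize the progress is to pull the metric streams that SageMaker parsed from the training log via the SDK's TrainingJobAnalytics; the sketch below assumes the fitted estimator from above is named od_estimator.

```python
# Sketch: fetch the parsed metric time series and plot them side by side.
import matplotlib.pyplot as plt
from sagemaker import TrainingJobAnalytics

job_name = od_estimator.latest_training_job.name
metrics = ["train:cross_entropy", "train:smooth_l1", "validation:mAP"]
df = TrainingJobAnalytics(job_name, metric_names=metrics).dataframe()

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, name in zip(axes, metrics):
    sub = df[df["metric_name"] == name]
    ax.plot(sub["timestamp"], sub["value"])
    ax.set_title(name)
    ax.set_xlabel("seconds since start")
plt.show()
```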
Deployment
Inference is deferred to the end of the notebook.
3. Training: Finetune Type 1 (Legacy) OD model with HPO
Now we run HPO to find better hyperparameters that lead to a better model. You can find all tunable hyperparameters for the Type 1 (legacy) OD model in its documentation. In this notebook, we only tune the learning rate, momentum, and weight decay.
We use SageMaker Automatic Model Tuning (AMT) to run HPO. We need to provide the hyperparameter ranges and the objective metric. AMT monitors the training log and parses the objective metric from it. For object detection, we use the mean Average Precision (mAP) on the validation dataset as our metric; mAP is the standard evaluation metric used in the COCO Challenge for object detection tasks. Here is a nice blog post explaining mAP for object detection.
We run `max_jobs=20` jobs in this HPO; a sketch of the tuner setup is shown below. You could run more jobs to find even better hyperparameters, at the cost of more compute resources and training time. This HPO job takes about 1 hour on ml.p3.2xlarge instances with `max_parallel_jobs=10` jobs running in parallel.
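The tuner setup could look like the following sketch; the hyperparameter ranges are illustrative assumptions, and `od_estimator`, `s3_train`, and `s3_validation` are assumed from the previous section.

```python
# Sketch: an AMT tuning job over three hyperparameters, maximizing validation:mAP.
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner

hyperparameter_ranges = {
    "learning_rate": ContinuousParameter(1e-5, 1e-2, scaling_type="Logarithmic"),
    "momentum": ContinuousParameter(0.8, 0.99),
    "weight_decay": ContinuousParameter(1e-5, 1e-2, scaling_type="Logarithmic"),
}

tuner = HyperparameterTuner(
    estimator=od_estimator,
    objective_metric_name="validation:mAP",  # emitted by the built-in algorithm
    hyperparameter_ranges=hyperparameter_ranges,
    objective_type="Maximize",
    max_jobs=20,
    max_parallel_jobs=10,
)
tuner.fit({"train": s3_train, "validation": s3_validation})
```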
best job: sagemaker-soln-dfd-c-220805-0846-014-f4010610
best job final validation:mAP = 0.694232
Warning: No metrics called train:throughput found
All metrics: ['train:progress', 'validation:mAP', 'train:smooth_l1', 'ObjectiveMetric', 'train:cross_entropy']. Note that ObjectiveMetric is identical to validation:mAP.
Deploy the best model from HPO
Inference is deferred to the end of the notebook.
4. Training: Finetune Type 2 (Latest) OD Model
For the Type 2 (latest) OD model, we follow Fine-tune a Model and Deploy to a SageMaker Endpoint and use the standard SageMaker APIs.
You can find all finetunable Type 2 (latest) OD models in the Built-in Algorithms with pre-trained Model Table by searching with the keywords "object detection" and setting FineTunable? = True.
Currently there are 9 finetunable OD models:
- mxnet-od-ssd-300-vgg16-atrous-coco
- mxnet-od-ssd-512-vgg16-atrous-voc
- mxnet-od-ssd-512-resnet50-v1-coco
- mxnet-od-ssd-512-mobilenet1-0-coco
- mxnet-od-ssd-300-vgg16-atrous-voc
- mxnet-od-ssd-512-resnet50-v1-voc
- mxnet-od-ssd-512-mobilenet1-0-voc
- mxnet-od-ssd-512-vgg16-atrous-coco
- pytorch-od1-fasterrcnn-resnet50-fpn
There are two major differences between training the two types of OD models:
- The entry point `transfer_learning.py` for finetuning a Type 2 (latest) OD model does not accept a validation data channel. Instead, it splits the input data provided through `estimator.fit({"training": s3_input_train})` into train:val = 80:20, corresponding to using 64% of the total data for training and 16% for validation (see the sketch after this list). Note that this train/val split differs from the one used for training the Type 1 (legacy) OD model.
- The evaluation metrics are different. While the Type 1 (legacy) OD model reports mAP on the validation data, which is standard, the Type 2 (latest) OD model only reports the CrossEntropy loss and SmoothL1 loss on the validation data.
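A sketch of this workflow, following the JumpStart retrieve-and-fit pattern, is shown below. The model id is taken from the list above; `role` and `s3_input_train` are assumed to be defined earlier, and the overridden hyperparameter name is an assumption based on the defaults retrieved for this model.

```python
# Sketch of the JumpStart fine-tuning workflow for a Type 2 (latest) OD model.
from sagemaker import hyperparameters, image_uris, model_uris, script_uris
from sagemaker.estimator import Estimator

model_id, model_version = "mxnet-od-ssd-512-resnet50-v1-coco", "*"
instance_type = "ml.p3.2xlarge"

# Retrieve the training container, training script, and pretrained weights.
train_image_uri = image_uris.retrieve(
    region=None, framework=None, model_id=model_id, model_version=model_version,
    image_scope="training", instance_type=instance_type,
)
train_source_uri = script_uris.retrieve(
    model_id=model_id, model_version=model_version, script_scope="training"
)
train_model_uri = model_uris.retrieve(
    model_id=model_id, model_version=model_version, model_scope="training"
)

# Start from the model's default hyperparameters; "epochs" is an assumed key.
hps = hyperparameters.retrieve_default(model_id=model_id, model_version=model_version)
hps["epochs"] = "100"

od2_estimator = Estimator(
    role=role,  # assumed to be defined earlier in the notebook
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",
    instance_count=1,
    instance_type=instance_type,
    hyperparameters=hps,
)
# Single "training" channel; the script splits it into train:val = 80:20.
od2_estimator.fit({"training": s3_input_train})
```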
Deployment
5. Training: Finetune Type 2 (Latest) OD model with HPO
The Type 2 (latest) OD model training reports the Val_CrossEntropy loss and Val_SmoothL1 loss instead of mAP on the validation dataset. Since we can only specify one objective metric for AMT, we choose to minimize Val_CrossEntropy. This is not the standard practice for evaluating OD models, but it is the best choice available here.
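A possible tuner setup is sketched below. Because the Type 2 model trains through a script-mode estimator, a metric definition must be supplied; the regex is an assumption about the log format, and the hyperparameter name `adam-learning-rate` mirrors the tunable parameter mentioned in the conclusion.

```python
# Sketch: tune the Type 2 model's adam learning rate, minimizing Val_CrossEntropy.
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner

tuner = HyperparameterTuner(
    estimator=od2_estimator,  # the Type 2 estimator from the previous section
    objective_metric_name="Val_CrossEntropy",
    objective_type="Minimize",
    hyperparameter_ranges={
        "adam-learning-rate": ContinuousParameter(1e-5, 1e-2, scaling_type="Logarithmic"),
    },
    # Assumed log format; adjust the regex to the actual training log lines.
    metric_definitions=[{"Name": "Val_CrossEntropy", "Regex": "Val_CrossEntropy=([0-9\\.]+)"}],
    max_jobs=20,
    max_parallel_jobs=10,
)
tuner.fit({"training": s3_input_train})
```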
best job: sagemaker-soln-dfd-c-220805-1125-003-3d4b78cc
best job final Val_CrossEntropy = 2.192000
All metrics: ['SmoothL1', 'Val_CrossEntropy', 'Val_SmoothL1', 'CrossEntropy', 'ObjectiveMetric']. Note that ObjectiveMetric is identical to Val_CrossEntropy.
6. Inference and Model Comparison
We compare model performance both visually and numerically.
- Visually, we sample images from the test data, one image from each category, and show the predicted bounding boxes, their predicted categories, and the confidence scores.
- Numerically, we compute mAP on the pre-allocated test data. This is a fair comparison because we use the same metric and evaluate on the same test data.
Visual comparison
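As a sketch of this comparison, the snippet below sends one test image to a deployed legacy endpoint and draws the returned boxes. The endpoint name and image path are hypothetical; the response layout ({"prediction": [[class, score, xmin, ymin, xmax, ymax], ...]} with normalized coordinates) is the one documented for the built-in Object Detection algorithm.

```python
import json

import boto3
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
from PIL import Image

IMAGE_PATH = "test/crazing_10.jpg"           # hypothetical test image
ENDPOINT = "sagemaker-soln-dfd-od-endpoint"  # hypothetical endpoint name

runtime = boto3.client("sagemaker-runtime")
with open(IMAGE_PATH, "rb") as f:
    resp = runtime.invoke_endpoint(
        EndpointName=ENDPOINT, ContentType="application/x-image", Body=f.read()
    )
preds = json.loads(resp["Body"].read())["prediction"]

img = Image.open(IMAGE_PATH)
w, h = img.size
fig, ax = plt.subplots()
ax.imshow(img)
for cls, score, xmin, ymin, xmax, ymax in preds:
    if score < 0.5:  # arbitrary confidence threshold for display
        continue
    ax.add_patch(mpatches.Rectangle(
        (xmin * w, ymin * h), (xmax - xmin) * w, (ymax - ymin) * h,
        fill=False, edgecolor="red",
    ))
    ax.text(xmin * w, ymin * h, f"{int(cls)}: {score:.2f}", color="red")
plt.show()
```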
Numerical comparison
If you predict all test images using all endpoints, you end up with this table. The pycocotools package reports more metric values. We will focus on row 1: the mAP averaged over all IoU thresholds (0.50:0.95), all recall thresholds, and all object categories, over all region sizes (small, medium, large), with up to 100 predicted boxes per image. It is standard practice to use this metric for evaluating object detection algorithms.
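A sketch of this evaluation with pycocotools is shown below; it assumes the test ground truth and the collected endpoint detections have been written to COCO-format JSON files (hypothetical file names).

```python
# Sketch: compute COCO-style mAP on the held-out test split with pycocotools.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("test_annotations.json")       # COCO-format ground truth
coco_dt = coco_gt.loadRes("detections.json")  # detections: image_id, category_id, bbox, score
evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints the 12-row metric table
print("mAP @ IoU=0.50:0.95:", evaluator.stats[0])  # the row we compare on
```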
7. Clean Up the Endpoints
When you are done with the endpoints, you should clean them up.
All of the training jobs, models and endpoints we created can be viewed through the SageMaker console of your AWS account.
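A minimal sketch, assuming the Predictor objects returned by the various .deploy() calls were collected in a list named predictors:

```python
# Delete every endpoint (and its backing model) created in this notebook.
for predictor in predictors:
    predictor.delete_model()
    predictor.delete_endpoint()
```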
8. Conclusion
Both the visual and numerical comparisons confirm that the Type 2 (latest) OD model, with or without HPO, performs the best.
- Training models from scratch can be very time-consuming and less effective. In this example, the target dataset is very small, consisting of only 1,800 images in 6 categories, and the training data is only 64% of this small dataset.
- The built-in SageMaker OD models were pretrained on large-scale datasets; e.g., the ImageNet dataset includes 14,197,122 images in 21,841 categories, and the PASCAL VOC dataset includes 11,530 images in 20 categories. The pretrained models have learned rich and diverse low-level features, so a finetuned model can transfer this knowledge efficiently and focus on learning high-level semantic features for the target dataset.
- HPO is extremely effective, especially for models with large hyperparameter search spaces. Since we tuned three hyperparameters (learning rate, momentum, and weight decay) for the Type 1 (legacy) OD model and only one (the adam learning rate) for the Type 2 (latest) OD model, there was relatively more room for improvement for the Type 1 (legacy) OD model, and we did observe a larger performance gain there. Of course, we need to trade off model performance against budget (compute resources and training time) when running HPO.
- In terms of training time on the steel surface dataset, the Type 1 (legacy) OD model took 34 minutes to finetune, the Type 2 (latest) OD model took 1 hour, and the model trained from scratch took 8+ hours. This indicates that finetuning a pretrained model is much more efficient.
- In summary, finetuning a pretrained model is both more efficient and more performant; we suggest taking advantage of the pretrained SageMaker built-in models and finetuning them on your target datasets.
Notebook CI Test Results
This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.