
Finetune Visual Object Detection Models Using Pre-trained Sagemaker Models


This notebook is CI-tested in us-west-2; CI test results for other regions can be found at the end of the notebook.


This notebook introduces finetuning pretrained object detection (OD) models on a new dataset.

Training a model from scratch is generally time-consuming and requires large compute resources. When the training data is small, we cannot expect to train a very performant model. A better alternative is to finetune a pretrained model on the target dataset. Amazon SageMaker provides high-quality pretrained models that were trained on very large datasets. Finetuning these models on a new dataset takes only a fraction of the training time compared to training from scratch.

In this notebook, we demonstrate how to finetune two types of Amazon SageMaker built-in OD models on the Steel Surface Defect dataset, which is used in this solution.

  • Type 1 (legacy): uses the built-in legacy Object Detection algorithm, a Single Shot MultiBox Detector (SSD) model with either a VGG or ResNet backbone, pretrained on the ImageNet dataset.
  • Type 2 (latest): provides 9 pretrained OD models, including 8 SSD models and 1 Faster R-CNN model. These models use VGG, ResNet, or MobileNet backbones and were pretrained on the COCO or VOC datasets.

For each type of model, besides training with default hyperparameters, we also perform hyperparameter optimization (HPO) using SageMaker Automatic Model Tuning (AMT) to train an even better model.

Running the whole notebook takes about 8 hours. The most time-consuming part is running the HPO jobs for both types of models. If more EC2 instances are available, you can run more HPO jobs in parallel to reduce the running time.


Content

  1. Data Preparation
  2. Training: Finetune Type 1 (Legacy) OD Model
  3. Training: Finetune Type 1 (Legacy) OD Model with HPO
  4. Training: Finetune Type 2 (Latest) OD Model
  5. Training: Finetune Type 2 (Latest) OD Model with HPO
  6. Inference and Model Comparison
  7. Clean Up the Endpoints
  8. Conclusion

ATTENTION

  • Running the notebook end-to-end takes 8 to 9 hours. We changed some parameter values so that the notebook finishes much faster, at the cost of the models not converging during training.
  • Please change them back when you want to train to convergence. These parameters include num_epochs=100 for training all models, and max_jobs=20 and max_parallel_jobs=10 for hyperparameter tuning.
  • The results shown in this notebook are for fully converged models.


1. Data Preparation

The two types of OD models require different data formats. The steel surface dataset used in this solution provides one XML annotation file per image, but neither model consumes XML annotations directly. The Type 1 (legacy) OD model requires either RecordIO or image format, in either file mode or pipe mode. The Type 2 (latest) OD model requires the input to be a directory containing a sub-directory of images and an annotations.json file. Please check Section 3 of this notebook for more explanation.

In this notebook, we split the data into train:val:test = 64:16:20. We allocate 20% of the data as test data to numerically compare all trained models at the end of the notebook. The steel surface dataset has 1,800 images in 6 categories; we randomly allocate 20% of the images from each category to the test data.
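The per-category (stratified) split described above can be sketched as follows. This is a minimal illustration, not the solution's actual script; the sample filenames and category ids are made up for the example.

```python
import random
from collections import defaultdict

def stratified_split(samples, test_frac=0.20, val_frac=0.16, seed=0):
    """Split (filename, category) pairs into train/val/test per category.

    The fractions mirror the 64:16:20 split described above, applied
    independently within each category so every class is represented."""
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for name, cat in samples:
        by_cat[cat].append(name)
    train, val, test = [], [], []
    for cat, names in by_cat.items():
        rng.shuffle(names)
        n_test = int(len(names) * test_frac)
        n_val = int(len(names) * val_frac)
        test += names[:n_test]
        val += names[n_test:n_test + n_val]
        train += names[n_test + n_val:]
    return train, val, test

# Example: 6 categories x 300 images, matching the steel surface dataset size
samples = [(f"img_{c}_{i}.jpg", c) for c in range(6) for i in range(300)]
train, val, test = stratified_split(samples)
print(len(train), len(val), len(test))  # 1152 288 360
```

With 1,800 images this yields 1,152 train, 288 validation, and 360 test images, i.e. the 64:16:20 ratio.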

We provide a script to convert the remaining 80% of the XML files into a single annotations.json for training the Type 2 (latest) OD model (under the hood, the source code automatically splits the data into train:val = 80:20, equivalent to 64% of all data for training and 16% for validation). We provide another script to convert the annotations.json and the corresponding images to RecordIO data for the Type 1 (legacy) OD model.

If your dataset already follows the required input format for the Type 1 (legacy) or Type 2 (latest) OD model, you do not need these conversions.
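The XML-to-JSON conversion boils down to walking each Pascal-VOC-style annotation file and collecting its bounding boxes. The sketch below shows the idea with Python's standard library; the output schema (field names like "left"/"top") is illustrative only — check the Type 2 model's documentation for the exact annotations.json layout it expects.

```python
import json
import xml.etree.ElementTree as ET

def voc_xml_to_record(xml_string):
    """Parse one Pascal-VOC-style annotation into a dict of labeled boxes.

    The dict layout here is an assumption for illustration; the real
    annotations.json schema is defined by the Type 2 OD model's docs."""
    root = ET.fromstring(xml_string)
    record = {"image": root.findtext("filename"), "boxes": []}
    for obj in root.iter("object"):
        bb = obj.find("bndbox")
        record["boxes"].append({
            "label": obj.findtext("name"),
            "left": int(bb.findtext("xmin")),
            "top": int(bb.findtext("ymin")),
            "right": int(bb.findtext("xmax")),
            "bottom": int(bb.findtext("ymax")),
        })
    return record

# Hypothetical annotation for one steel-surface image
xml_doc = """<annotation>
  <filename>scratches_1.jpg</filename>
  <object><name>scratches</name>
    <bndbox><xmin>14</xmin><ymin>30</ymin><xmax>120</xmax><ymax>200</ymax></bndbox>
  </object>
</annotation>"""
print(json.dumps(voc_xml_to_record(xml_doc)))
```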

                           PRE images/
2022-09-09 18:51:38     368437 annotations.json

2. Training: Finetune Type 1 (Legacy) OD Model

We start by finetuning the Type 1 (legacy) OD model: an SSD model with a ResNet backbone, pretrained on ImageNet.

Input data: following the instructions, the legacy OD model supports both RecordIO and image formats for training in file mode, or RecordIO in pipe mode. In this notebook, we use RecordIO in file mode. We provide a script for converting the annotations.json to RecordIO format. The documentation and example provide some context for understanding the script.

This script first splits the data into train:val = 80:20 according to the train-ratio argument. This is equivalent to using 64% of all data for training and 16% for validation. It then converts each partition, including images and annotations, to a .rec file. We use the validation data to select the best job during HPO training in the next section, and use the test data to numerically compare all finetuned models.
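An intermediate step in the RecordIO conversion is writing an MXNet im2rec .lst file, one tab-separated line per image. The layout sketched below (index, header width 2, label width 5, then class id plus normalized corners per box, then the image path) follows the example in the SageMaker legacy object detection documentation; treat it as an assumption and verify against the docs for your version.

```python
def to_lst_line(index, boxes, width, height, path):
    """Build one object-detection line of an im2rec .lst file:
    index, header width (2), per-object label width (5), then for each
    box a class id and corner coordinates normalized by the image size,
    and finally the relative image path, all tab-separated."""
    fields = [str(index), "2", "5"]
    for cls, xmin, ymin, xmax, ymax in boxes:
        fields += [f"{cls:.4f}", f"{xmin / width:.4f}", f"{ymin / height:.4f}",
                   f"{xmax / width:.4f}", f"{ymax / height:.4f}"]
    fields.append(path)
    return "\t".join(fields)

# Hypothetical 200x200 image with one class-3 defect box
line = to_lst_line(0, [(3, 14, 30, 120, 200)], 200, 200, "images/scratches_1.jpg")
print(line)
```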


Visualize Training Progress

During training, the loss function is the sum of CrossEntropy loss and SmoothL1 loss. We visualize the two losses on the training data as well as the mean Average Precision (mAP) on the validation data.
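One way to visualize these curves is to parse the loss and mAP values out of the training job's logs before plotting them. The log excerpt and regexes below are assumptions about the log format for illustration; adapt them to the lines your training job actually emits.

```python
import re

# Hypothetical log excerpt; the real lines come from the training job's
# CloudWatch logs and may be formatted differently.
log = """\
[0] train cross_entropy=2.913 smooth_l1=1.204
[0] validation mAP=0.112
[1] train cross_entropy=2.102 smooth_l1=0.981
[1] validation mAP=0.241
"""

train_re = re.compile(r"\[(\d+)\] train cross_entropy=([\d.]+) smooth_l1=([\d.]+)")
val_re = re.compile(r"\[(\d+)\] validation mAP=([\d.]+)")

# Epoch -> metric value, ready to plot with e.g. matplotlib
ce = {int(e): float(c) for e, c, _ in train_re.findall(log)}
sl1 = {int(e): float(s) for e, _, s in train_re.findall(log)}
m_ap = {int(e): float(m) for e, m in val_re.findall(log)}
print(ce, sl1, m_ap)
```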


Deployment

Inference is deferred to the end of the notebook.


3. Training: Finetune Type 1 (Legacy) OD Model with HPO

Now we run HPO to find better hyperparameters that lead to a better model. You can find all tunable hyperparameters for the Type 1 (legacy) OD model in its documentation. In this notebook, we only tune the learning rate, momentum, and weight decay.

We use SageMaker Automatic Model Tuning (AMT) to run HPO. We need to provide hyperparameter ranges and an objective metric. AMT monitors the logs and parses the objective metric. For object detection, we use mean Average Precision (mAP) on the validation dataset as our metric; mAP is the standard evaluation metric used in the COCO Challenge for object detection tasks. Here is a nice blog post explaining mAP for object detection.
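For SageMaker built-in algorithms, metrics such as validation:mAP are predefined and AMT already knows how to parse them, so you only name the objective. For a custom training script you would instead pass metric definitions of the shape sketched below; the regex here is an assumption about the log format, shown only to illustrate how AMT extracts a metric from a log line.

```python
import re

# Shape of an AMT metric definition (Name + Regex with one capture group).
# The regex is illustrative; match it to your algorithm's actual log lines.
metric_definitions = [
    {"Name": "validation:mAP", "Regex": r"validation mAP=([-+]?[\d.]+)"},
]

# Hypothetical log line of the kind AMT would scan
sample_log_line = "epoch 12: validation mAP=0.694232"
match = re.search(metric_definitions[0]["Regex"], sample_log_line)
print(float(match.group(1)))  # 0.694232
```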

We run max_jobs=20 training jobs in this HPO. You could run more jobs to find even better hyperparameters, at the cost of more compute resources and training time. This HPO takes about 1 hour on p3.2xlarge EC2 instances with max_parallel_jobs=10 jobs running in parallel.
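Conceptually, each of the 20 jobs trains with one configuration drawn from the declared hyperparameter ranges. The sketch below mimics that sampling in plain Python, with the learning rate and weight decay drawn log-uniformly (analogous to AMT's logarithmic scaling for ContinuousParameter ranges); the ranges themselves are illustrative, not the ones used in this notebook.

```python
import math
import random

def sample_config(rng):
    """Draw one hyperparameter configuration from illustrative ranges,
    sampling learning rate and weight decay on a log scale."""
    return {
        "learning_rate": 10 ** rng.uniform(math.log10(1e-4), math.log10(1e-1)),
        "momentum": rng.uniform(0.8, 0.99),
        "weight_decay": 10 ** rng.uniform(math.log10(1e-5), math.log10(1e-2)),
    }

rng = random.Random(0)
configs = [sample_config(rng) for _ in range(20)]  # analogous to max_jobs=20
print(len(configs))  # 20
```

AMT's Bayesian strategy is smarter than this random draw — it uses earlier jobs' results to pick later configurations — but the search space being explored is the same.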

best job: sagemaker-soln-dfd-c-220805-0846-014-f4010610
best job final validation:mAP = 0.694232
Warning: No metrics called train:throughput found
All metrics: ['train:progress', 'validation:mAP', 'train:smooth_l1', 'ObjectiveMetric', 'train:cross_entropy']
ObjectiveMetric is exactly the same as validation:mAP

Deploy the best model from HPO

Inference is deferred to the end of the notebook.


4. Training: Finetune Type 2 (Latest) OD Model

For the Type 2 (latest) OD model, we follow Fine-tune a Model and Deploy to a SageMaker Endpoint and use the standard SageMaker APIs.

You can find all finetunable Type 2 (latest) OD models in the Built-in Algorithms with pre-trained Model Table by searching with the keyword "object detection" and setting FineTunable?=True. Currently there are 9 finetunable OD models:

  1. mxnet-od-ssd-300-vgg16-atrous-coco
  2. mxnet-od-ssd-512-vgg16-atrous-voc
  3. mxnet-od-ssd-512-resnet50-v1-coco
  4. mxnet-od-ssd-512-mobilenet1-0-coco
  5. mxnet-od-ssd-300-vgg16-atrous-voc
  6. mxnet-od-ssd-512-resnet50-v1-voc
  7. mxnet-od-ssd-512-mobilenet1-0-voc
  8. mxnet-od-ssd-512-vgg16-atrous-coco
  9. pytorch-od1-fasterrcnn-resnet50-fpn

There are two major differences between training the two types of OD models:

  1. The entry point transfer_learning.py for finetuning a Type 2 (latest) OD model does not accept a validation data channel. Instead, it splits the input data provided through estimator.fit({"training": s3_input_train}) into train:val = 80:20, corresponding to using 64% of the total data for training and 16% for validation. Note that this train/val split differs from the one used for training the Type 1 (legacy) OD model.
  2. The evaluation metrics are different. While the Type 1 (legacy) OD model reports mAP on the validation data, which is standard, the Type 2 (latest) OD model only reports the CrossEntropy and SmoothL1 losses on the validation data.

Deployment


5. Training: Finetune Type 2 (Latest) OD Model with HPO

The Type 2 (latest) OD model training reports the Val_CrossEntropy and Val_SmoothL1 losses instead of mAP on the validation dataset. Since we can only specify one objective metric for AMT, we choose to minimize Val_CrossEntropy. This is not the standard practice for evaluating OD models, but it is the best available choice here.

best job: sagemaker-soln-dfd-c-220805-1125-003-3d4b78cc
best job final Val_CrossEntropy = 2.192000
All metrics: ['SmoothL1', 'Val_CrossEntropy', 'Val_SmoothL1', 'CrossEntropy', 'ObjectiveMetric']
ObjectiveMetric is exactly the same as Val_CrossEntropy

6. Inference and Model Comparison

We compare model performance both visually and numerically.

  1. Visually, we sample images from the test data, one image from each category, and show the predicted bounding boxes, their predicted categories, and the confidence scores.
  2. Numerically, we compute mAP on the pre-allocated test data. This is a fair comparison because we use the same metric and evaluate on the same test data.
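The mAP computation rests on matching predicted boxes to ground-truth boxes by their intersection-over-union (IoU). A minimal IoU sketch, independent of pycocotools, makes the matching criterion concrete; boxes are (xmin, ymin, xmax, ymax) tuples.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (xmin, ymin, xmax, ymax).

    IoU = overlap area / union area; mAP averages precision over a range
    of IoU thresholds used to decide whether a prediction counts as a hit."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))  # overlap width
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))  # overlap height
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # overlap 50, union 150 -> 1/3
```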

Visual comparison


Numerical comparison


If you predict all test images using all endpoints, you end up with this table. The pycocotools package reports more metric values. We will focus on row 1: the mAP averaged over all IoU thresholds, all recall thresholds, all region sizes (small, medium, large), all numbers of predicted bounding boxes (1, 10, and 100), and all object categories. It is standard practice to use this metric for evaluating object detection algorithms.

7. Clean Up the Endpoints

When you are done with an endpoint, you should clean it up.

All of the training jobs, models, and endpoints we created can be viewed in the SageMaker console of your AWS account.


8. Conclusion

Both the visual and numerical comparisons confirm that the Type 2 (latest) OD model, with or without HPO, performs the best.

  1. Training models from scratch can be very time-consuming and less effective. In this example, the target dataset is very small, consisting of only 1,800 images in 6 categories, and the training data is only 64% of this small dataset.
  2. The built-in SageMaker OD models were pretrained on large-scale datasets; e.g., the ImageNet dataset includes 14,197,122 images in 21,841 categories, and the PASCAL VOC dataset includes 11,530 images in 20 categories. The pretrained models have learned rich and diverse low-level features, can efficiently transfer that knowledge to the finetuned models, and let finetuning focus on learning high-level semantic features for the target dataset.
  3. HPO is extremely effective, especially for models with large hyperparameter search spaces. Since we tuned three hyperparameters (learning rate, momentum, and weight decay) for the Type 1 (legacy) OD model and only one (the Adam learning rate) for the Type 2 (latest) OD model, there is relatively more room for improvement for the Type 1 (legacy) OD model, and we do observe a larger performance gain there. Of course, we need to trade off model performance against budget (compute resources and training time) when running HPO.
  4. In terms of training time on the steel surface dataset, the Type 1 (legacy) OD model took 34 minutes, the Type 2 (latest) OD model took 1 hour, and the model trained from scratch took 8+ hours. This indicates that finetuning a pretrained model is much more efficient.
  5. In summary, finetuning a pretrained model is both more efficient and more performant; we suggest taking advantage of the pretrained SageMaker built-in models and finetuning them on your target datasets.

Notebook CI Test Results

This notebook was tested in multiple regions in addition to us-west-2, whose result is shown at the top of the notebook.
