How to Train PyTorch Hugging Face Transformers on Cloud TPUs
Over the past several months, the Hugging Face and Google pytorch/xla teams have been collaborating to bring first-class support for training Hugging Face Transformers on Cloud TPUs, with significant speedups.
In this Colab we walk you through fine-tuning RoBERTa on the WikiText-2 dataset with a Masked Language Modeling (MLM) objective, using the free TPUs provided by Colab.
Last Updated: February 8th, 2021
Install and clone dependencies
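The cells below give a rough idea of the setup. This is a sketch only: the exact torch_xla wheel URL and version change over time, so check the pytorch/xla README for the one matching your Colab runtime.

```
# Install PyTorch/XLA for the Colab TPU runtime. The wheel URL/version below
# is an assumption; use the one currently recommended by the pytorch/xla README.
!pip install cloud-tpu-client https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.7-cp37-cp37m-linux_x86_64.whl

# Clone Hugging Face transformers for the example scripts and install it,
# along with the datasets library that run_mlm.py relies on.
!git clone https://github.com/huggingface/transformers.git
!pip install ./transformers datasets
```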
Train the model
All Cloud TPU training functionality is built into the Trainer (trainer.py), so we'll use the run_mlm.py script under examples/language-modeling to fine-tune our RoBERTa model on the WikiText-2 dataset.
Note that in the following command we use xla_spawn.py to spawn 8 processes, one per core of a single v2-8/v3-8 Cloud TPU system (Cloud TPU Pods scale all the way up to 2048 cores). All xla_spawn.py does is call xmp.spawn, which sets up the environment metadata each process needs and then calls torch.multiprocessing.start_processes.
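Under the hood, that amounts to something like this minimal sketch (the `_mp_fn` name and the print are illustrative only):

```python
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # Each spawned process is bound to one TPU core; xm.xla_device()
    # returns that core's device for the current process.
    device = xm.xla_device()
    print(f"process {index} is driving {device}")

# nprocs=8 matches the 8 cores of a single v2-8/v3-8 system.
xmp.spawn(_mp_fn, args=(), nprocs=8, start_method="fork")
```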
The command below spawns 8 processes, each of which drives one TPU core. We've set per_device_train_batch_size=4 and per_device_eval_batch_size=4, which means the global batch size will be 32 (4 examples/device * 8 devices/Colab TPU = 32 examples / Colab TPU). You can also append the --tpu_metrics_debug flag for additional debug metrics (e.g., how long it took to compile, execute one step, etc.).
The following cell should take around 10-15 minutes to run.
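A representative invocation looks like the following. The script paths and hyperparameters are indicative, so adjust them to match your transformers checkout:

```
!python transformers/examples/xla_spawn.py --num_cores 8 \
    transformers/examples/language-modeling/run_mlm.py \
    --model_name_or_path roberta-base \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --do_train \
    --do_eval \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --output_dir /tmp/test-mlm \
    --overwrite_output_dir
```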
Visualize TensorBoard Metrics
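One way to do this in Colab, assuming the Trainer wrote its TensorBoard logs under its default `runs` directory (pass --logging_dir explicitly if you want a fixed location):

```
%load_ext tensorboard
%tensorboard --logdir runs
```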
🎉🎉🎉 Done Training! 🎉🎉🎉
Run inference on the fine-tuned model
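Here is a minimal sketch of masked-token prediction on one TPU core, assuming the checkpoint was saved to /tmp/test-mlm (the --output_dir used above); the example sentence is arbitrary:

```python
import torch
import torch_xla.core.xla_model as xm
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Assumed checkpoint path: wherever --output_dir pointed during training.
checkpoint = "/tmp/test-mlm"

device = xm.xla_device()  # one TPU core
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint).to(device).eval()

text = "The capital of France is <mask>."  # RoBERTa's mask token is <mask>
inputs = tokenizer(text, return_tensors="pt")

# Locate the masked position on CPU before moving tensors to the TPU.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

with torch.no_grad():
    logits = model(**inputs.to(device)).logits.cpu()

# Decode the highest-scoring prediction for the masked slot.
print(tokenizer.decode(logits[0, mask_pos].argmax(dim=-1)))
```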
And just like that, you've used Cloud TPUs to both fine-tune your model and run predictions! 🎉