Distillation
Distillation with Llama 4 and Synthetic Data Kit
Copyright (c) Meta Platforms, Inc. and affiliates. This software may be used and distributed according to the terms of the Llama Community License Agreement.
This notebook will walk you through distilling model knowledge from Llama 4 into a smaller Llama 3.2 model using synthetic training data from Synthetic Data Kit.
The goal
The goal of this notebook is to distill knowledge from a more powerful model (Llama 4 Scout) into a smaller, less powerful model (Llama 3.2 3B).
Smaller models have several advantages over larger ones: they generate text faster, have a lower time to first token, and cost less to host because they need less hardware. Larger models, however, tend to be generalists, able to perform a wide variety of tasks well. On specific or specialized tasks, smaller models can be just as good as the larger generalists. Distillation allows you to take knowledge present in a larger model and transfer it to a smaller model with a minimal drop in quality on narrow tasks.
The data
This notebook uses air traffic control (ATC) data to demonstrate tuning a model toward a specialized field. During distillation, we generate call/response pairs entirely from scratch, because our generalist teacher model already has a strong understanding of ATC phraseology. During evaluation, we evaluate on both synthetic pairs and real ATC data.
We will use the ATCO2 corpus of air traffic data, an MIT-licensed dataset that contains audio, transcriptions, and additional context and metadata for each interaction. For this exercise we will use only the text transcripts, and we will use the small (1h) sample dataset to demonstrate that only a small amount of data is needed to fine-tune the model.
Evaluation
To evaluate our model, we will use perplexity, a standard language-modeling metric that measures per-token surprise. We will also use BLEU (bilingual evaluation understudy) to measure similarity without requiring the model to match the reference word for word. Although originally designed for machine translation, BLEU compares n-gram overlap, so minor differences in word order are penalized far less than with an exact-match metric.
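As a rough illustration (not necessarily the exact scoring code used later in this notebook), sentence-level BLEU can be computed with NLTK. The reference and hypothesis below are taken from the evaluation results further down:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Ground-truth reply and a model reply from the results section below.
reference = "descending flight level one hundred free speed CSA One Delta Zulu".split()
hypothesis = "descend flight level one hundred no speed".split()

# BLEU counts 1- to 4-gram overlap; smoothing avoids zero scores on short replies.
score = sentence_bleu([reference], hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")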
Prerequisites
Hardware Requirements:
- NVIDIA GPUs with at least 80GB VRAM each (H100, A100, or similar)
- 8 GPUs to run Llama 4 Scout and create the dataset
- 1 GPU to distill and fine-tune the model
- 200GB+ disk space
- 64GB+ system RAM
Software Requirements:
- CUDA 12.x
- HuggingFace account and token
- Fast internet connection for downloading models
Preparing your environment
Generate the synthetic dataset
We will use Synthetic Data Kit to produce the synthetic data used to distill our model.
First, set up the vLLM server. You will need to run this in a separate terminal window,
since Jupyter doesn't support long-running tasks such as servers. Make sure vLLM is installed with
pip install vllm
HF_HOME=/workspace/huggingface_cache \
HF_TOKEN=$HF_TOKEN \
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
--port 8000 \
--max-model-len 8192 \
--gpu-memory-utilization 0.95 \
--tensor-parallel-size 8
Then check that the server is working properly.
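As a minimal sketch, you can query the OpenAI-compatible endpoint directly with requests; the notebook's own check (whose output is shown below) goes through synthetic-data-kit's configuration instead:

import requests

# List the models served by the local vLLM instance on port 8000.
resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
print("vLLM server is running at http://localhost:8000/v1")
print("Available models:", [m["id"] for m in resp.json()["data"]])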
Loading config from: /usr/local/lib/python3.10/dist-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Loading config from: config.yaml
Config has LLM provider set to: vllm
Environment variable check: API_ENDPOINT_KEY: Not found
get_llm_provider returning: vllm
Checking vLLM server at http://localhost:8000/v1...
vLLM server is running at http://localhost:8000/v1
Available models: {'object': 'list', 'data': [{'id': 'meta-llama/Llama-4-Scout-17B-16E-Instruct', 'object': 'model', 'created': 1752251909, 'owned_by': 'vllm', 'root': 'meta-llama/Llama-4-Scout-17B-16E-Instruct', 'parent': None, 'max_model_len': 8192, 'permission': [{'id': 'modelperm-3c8eafb867bb4df4b4d65b45a899ae7a', 'object': 'model_permission', 'created': 1752251909, 'allow_create_engine': False, 'allow_sampling': True, 'allow_logprobs': True, 'allow_search_indices': False, 'allow_view': True, 'allow_fine_tuning': False, 'organization': '*', 'group': None, 'is_blocking': False}]}]}
If the server is working correctly, you should see "vLLM server is running" in the output.
Next, we will set up our configuration file for generating the data. We will use the QA task, giving an example set of data and then asking the model to create call/response pairs similar to the examples. This is slightly different from an actual QA dataset, but it demonstrates that different tasks can fit into the general framework that Synthetic Data Kit provides.
We also create a dataset of examples to guide the model toward producing better synthetic data: from 20 examples, Synthetic Data Kit produces 500+ training examples.
We create our synthetic dataset using synthetic-data-kit, running the command in batches in order to create enough examples, since models often struggle to generate a large number of examples in a single request.
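Below is a hypothetical sketch of what such a seed file might look like; the file name and JSON layout are assumptions, and the two pairs shown are taken from ground-truth transcripts used later in the evaluation:

import json

# Hypothetical seed file of call/response pairs in the style we want the teacher
# to imitate; the notebook uses 20 such examples.
seed_examples = [
    {"question": "CSA One Delta Zulu descend flight level one hundred no speed restrictions",
     "answer": "descending flight level one hundred free speed CSA One Delta Zulu"},
    {"question": "Ryanair Nine Two Bravo Quebec turn right heading zero nine zero",
     "answer": "nine zero degrees Ryanair Nine Two Bravo Quebec"},
]

with open("data/atc_seed_examples.json", "w") as f:
    json.dump(seed_examples, f, indent=2)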
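A minimal sketch of that batching loop is shown here, assuming 10 batches of 50 pairs each; the CLI flag names and the input path are assumptions and should be checked against the installed synthetic-data-kit:

import subprocess

BATCH_SIZE = 50   # pairs requested per call
NUM_BATCHES = 10  # 10 x 50 = 500 pairs in total

for _ in range(NUM_BATCHES):
    # Flag names are assumptions; verify against `synthetic-data-kit create --help`.
    subprocess.run(
        ["synthetic-data-kit", "-c", "config.yaml",
         "create", "data/atc_seed_examples.json",
         "--type", "qa", "--num-pairs", str(BATCH_SIZE)],
        check=True,
    )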
500 50
Preparing the eval dataset
Our human-curated eval dataset contains text annotations in the form of XML files. We only need to produce transcripts of the conversations; we do not need any other metadata or the audio.
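A minimal sketch of that extraction follows; the dataset path and the XML tag names ("segment", "text") are assumptions about the ATCO2 annotation layout and should be adjusted to the actual schema:

import xml.etree.ElementTree as ET
from pathlib import Path

transcripts = []
# Walk the (assumed) annotation directory and keep only the transcript text.
for xml_file in Path("ATCO2-ASRdataset-1h-sample").rglob("*.xml"):
    root = ET.parse(xml_file).getroot()
    for segment in root.iter("segment"):   # tag names are assumptions
        text = segment.findtext("text")
        if text:
            transcripts.append(text.strip())

print(f"Parsed {len(transcripts)}")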
Parsed 244
Evaluating the baseline model
To evaluate the baseline model we will use the Hugging Face transformers package and Unsloth for inference. We use two metrics here: perplexity and BLEU. Perplexity captures how "surprised" the model is by the reference text, on a per-token basis. BLEU is typically used for machine translation, but here it captures whether the response gets the gist of the correct answer while tolerating differences in word order.
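For reference, here is a minimal sketch of the perplexity half of this evaluation, assuming the model and tokenizer are already loaded; the notebook's actual implementation may differ, for example in how prompt tokens are masked out of the loss:

import torch

def perplexity(model, tokenizer, prompt, reference):
    # Score the reference reply appended to the ATC request with the causal LM.
    enc = tokenizer(prompt + " " + reference, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # out.loss is the mean per-token cross-entropy; exp(loss) is perplexity.
    return torch.exp(out.loss).item()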
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 07-11 18:16:50 [__init__.py:244] Automatically detected platform cuda.
==((====))== Unsloth 2025.7.3: Fast Llama patching. Transformers: 4.53.2. vLLM: 0.9.2.
NVIDIA H100 80GB HBM3. Num GPUs = 1. Max memory: 79.209 GB. Platform: Linux.
Torch: 2.7.0+cu126. CUDA: 9.0. CUDA Toolkit: 12.6. Triton: 3.3.0
Bfloat16 = TRUE. FA [Xformers = 0.0.30. FA2 = False]
Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
=== Evaluation Results ===
Average Perplexity: 597.31
Average BLEU Score: 0.04
Fine-tuning the model
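The training cell itself is summarized by the log output below. Here is a minimal sketch of the setup that log implies: Unsloth with LoRA adapters on the attention projections, batch size 8, and 4 epochs over the 500 synthetic examples. The model name, LoRA rank, learning rate, and output path are assumptions, it presumes the synthetic dataset is already loaded as train_dataset with a "text" field, and the argument names follow the trl version used in Unsloth's examples:

from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments

# Load the student model (name is an assumption) with Unsloth's fast patches.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=2048,
)

# LoRA adapters on the attention projections only, matching the log below
# (28 QKV layers, 28 O layers, 0 MLP layers; ~9.2M trainable parameters).
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,  # the 500 synthetic pairs, rendered into a "text" field
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=8,
        num_train_epochs=4,
        learning_rate=2e-4,  # assumption
        output_dir="Results",
    ),
)
trainer.train()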
🚀 Starting fine-tuning process...
==((====))== Unsloth 2025.7.3: Fast Llama patching. Transformers: 4.53.2. vLLM: 0.9.2.
NVIDIA H100 80GB HBM3. Num GPUs = 1. Max memory: 79.209 GB. Platform: Linux.
Torch: 2.7.0+cu126. CUDA: 9.0. CUDA Toolkit: 12.6. Triton: 3.3.0
Bfloat16 = TRUE. FA [Xformers = 0.0.30. FA2 = False]
Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Not an error, but Unsloth cannot patch MLP layers with our manual autograd engine since either LoRA adapters are not enabled or a bias term (like in Qwen) is used. Unsloth 2025.7.3 patched 28 layers with 28 QKV layers, 28 O layers and 0 MLP layers.
🏋️ Training started...
==((====))== Unsloth - 2x faster free finetuning | Num GPUs used = 1
Num examples = 500 | Num Epochs = 4 | Total steps = 250
Batch size per device = 8 | Gradient accumulation steps = 1
Data Parallel GPUs = 1 | Total batch size (8 x 1 x 1) = 8
Trainable parameters = 9,175,040 of 3,221,924,864 (0.28% trained)
Unsloth: Will smartly offload gradients to save VRAM!
✅ Training complete! Model saved to Results
Evaluating the fine-tuned model
Once we have a fine-tuned model, we can re-run our evaluation with it. We'll look at the metrics for both models, and also do a "vibe check," manually inspecting a few outputs to confirm the model behaves as we expect. Both matter during evaluation: metrics capture broad patterns, while spot checking makes up for the deficiencies of the metrics.
=== Evaluation Results ===
Average Perplexity: 229.11
Average BLEU Score: 0.20

Original Perplexity: 597.310, Tuned Perplexity: 229.106
Original BLEU: 0.042, Tuned BLEU: 0.203
ATC Request: CSA One Delta Zulu descend flight level one hundred no speed restrictions
GT: descending flight level one hundred free speed CSA One Delta Zulu
Original: Roger that, One Delta Zulu. Descend and maintain level one hundred.
Tuned: Descend flight level one hundred no speed

ATC Request: Oscar Kilo Triple Hotel please confirm one more holding
GT: Oscar Kilo Hotel Hotel Hotel affirm one holding and then it should be possible to follow ILS runway zero six
Original: Roger that, Oscar Kilo Triple Hotel, holding for clearance. What's your planned departure?
Tuned: One more holding, Oscar Kilo Triple Hotel

ATC Request: Ruzyne Tower hello again Eurowings One Tango Kilo
GT: Eurowings One Tango Kilo Ruzyne Tower good afternoon go ahead
Original: This is Ruzyne Tower, Eurowings One Tango Kilo, cleared to the runway. Be advised, there is a departing Boeing 737-800 on the adjacent runway, expect a possible taxi to the north. Climb to 30000 feet, contact Ground Control on 122.8 for departure clearance.
Tuned: Eurowings One Tango Kilo Ruzyne Tower

ATC Request: Ryanair Nine Two Bravo Quebec turn right heading zero nine zero
GT: nine zero degrees Ryanair Nine Two Bravo Quebec
Original: Ryanair Nine Two Bravo Quebec, cleared for departure. Report descent to twenty thousand, then turn left heading two five zero for departure from runway one four.
Tuned: Turn right heading zero nine zero, Ryanair Nine Two Bravo Quebec

ATC Request: Oscar Kilo Charlie Alfa Papa squawk seven thousand good bye
GT: squawk seven thousand good bye Oscar Kilo Charlie Alfa Papa
Original: Roger that, Oscar Kilo Charlie Alfa Papa, this is Center Control. You are cleared for departure, taxi to runway 27L. Good luck on your flight!
Tuned: Seven thousand good bye Oscar Kilo Charlie Alfa Papa
Conclusion
By the end of this guide, you should have:
- ✅ A running vLLM server serving Llama 4 Scout
- ✅ Infrastructure to create synthetic examples for training
- ✅ A 500+ example synthetic dataset created using Llama 4 Scout
- ✅ A distilled Llama 3.2 3B model
- ✅ Test results showing improved metrics and qualitative results
What's next?
- Use an even more powerful model to generate synthetic examples, for example Llama 4 Maverick
- Develop more comprehensive evaluation strategies, including domain-specific metrics
- Extend the dataset to include more data and thus better transfer knowledge
- Examine your dataset using automated tools to understand what's inside and determine gaps