Meta Synthetic Data Llama3.1 (8B)
To run this notebook, press "Runtime" and then "Run all" on a free Tesla T4 Google Colab instance!
Join Discord if you need help + ⭐ Star us on Github ⭐
To install Unsloth on your local device, follow our guide. This notebook is licensed LGPL-3.0.
You will learn how to do data prep, how to train, how to run the model, and how to save it.
News
Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. Blog
You can now train embedding models 1.8-3.3x faster with 20% less VRAM. Blog
Ultra Long-Context Reinforcement Learning is here with 7x more context windows! Blog
3x faster LLM training with 30% less VRAM and 500K context. Blog
New in Reinforcement Learning: FP8 RL • Vision RL • Standby • gpt-oss RL
Visit our docs for all our model uploads and notebooks.
Installation
Unsloth
Synthetic-data-kit
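A minimal install sketch for a Colab cell, assuming the PyPI package names `unsloth`, `vllm`, and `synthetic-data-kit`:

```python
%%capture
# Install Unsloth, vLLM (serves the generator model), and Meta's synthetic-data-kit
!pip install unsloth vllm
!pip install synthetic-data-kit
```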
nohup: redirecting stderr to stdout
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING 04-29 18:42:44 [config.py:2972] Casting torch.bfloat16 to torch.float16.
INFO 04-29 18:43:00 [config.py:2003] Chunked prefill is enabled with max_num_batched_tokens=2048.
E0000 00:00:1745952186.991164 2914 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
INFO 04-29 18:43:16 [weight_utils.py:265] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
INFO 04-29 18:43:35 [gpu_model_runner.py:1347] Model loading took 5.5859 GiB and 20.094828 seconds
INFO 04-29 18:43:49 [backends.py:430] Dynamo bytecode transform time: 13.67 s
INFO 04-29 18:43:53 [backends.py:136] Cache the graph of shape None for later use
INFO 04-29 18:44:31 [backends.py:148] Compiling a graph for general shape takes 41.48 s
INFO 04-29 18:44:51 [kv_cache_utils.py:637] Maximum concurrency for 10,000 tokens per request: 16.71x
Optional: Function to check if vllm server is running. Change False to True and run cell
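A minimal sketch of such a check, assuming the vLLM server listens on localhost:8000 and exposes its standard /health endpoint:

```python
import requests

def is_vllm_running(url: str = "http://localhost:8000/health") -> bool:
    """Return True if a vLLM server answers on its /health endpoint."""
    try:
        return requests.get(url, timeout=5).status_code == 200
    except requests.exceptions.RequestException:
        return False

if False:  # change False to True and run the cell
    print("vLLM server running:", is_vllm_running())
```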
Create data directories
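The directory names below are the ones the later steps write to (data/output, data/generated, data/cleaned, data/final):

```python
import os

# Create the folders that the ingest/create/curate/save-as steps write into
for d in ("data/output", "data/generated", "data/cleaned", "data/final"):
    os.makedirs(d, exist_ok=True)
```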
Ingest source file
Ingest source file "https://ai.meta.com/blog/llama-4-multimodal-intelligence/" . Can also use pdf, docx, ppt and youtube video
Text successfully extracted to data/output/ai_meta_com.txt
Generate QA pairs
Generate QA pairs with the help of vLLM and Llama-3.1-8B-Instruct-unsloth-bnb-4bit. Set num_pairs to the number of pairs you need.
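A hedged sketch of the generation step; the --type and --num-pairs flags are assumptions based on the synthetic-data-kit CLI:

```python
# Ask the vLLM-served model to write QA pairs from the extracted text
!synthetic-data-kit create data/output/ai_meta_com.txt --type qa --num-pairs 10
```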
Proceeding with process_file call...
Processing 1 chunks to generate QA pairs...
Batch processing complete.
Generated 10 QA pairs total
Saving result to data/generated/ai_meta_com_qa_pairs.json
Successfully wrote test file to data/generated/test_write.json
Successfully wrote result to data/generated/ai_meta_com_qa_pairs.json
Content saved to data/generated/ai_meta_com_qa_pairs.json
Generated content (first 500 chars):
{
"summary": "Here is a summary of the document in 3-5 sentences, focusing on the main topic and key concepts:\n\nMeta AI has announced the Llama 4 herd, a new era of natively multimodal AI innovation, with the release of three models: Llama 4 Scout, Llama 4 Maverick, and Llama 4 Behemoth. These models are designed to enable people to build more personalized multimodal experiences, with Llama 4 Scout and Llama 4 Maverick offering industry-leading performance in image and text understanding, an... Curate Data Pairs
Curating generated pairs...
Processing 1 batches of QA pairs...
Batch processing complete.
Rated 10 QA pairs
Retained 10 pairs (threshold: 7.0)
Average score: 8.2
Cleaned content saved to data/cleaned/ai_meta_com_qa_pairs_cleaned.json
Generated content (first 500 chars):
{
"summary": "Here is a summary of the document in 3-5 sentences, focusing on the main topic and key concepts:\n\nMeta AI has announced the Llama 4 herd, a new era of natively multimodal AI innovation, with the release of three models: Llama 4 Scout, Llama 4 Maverick, and Llama 4 Behemoth. These models are designed to enable people to build more personalized multimodal experiences, with Llama 4 Scout and Llama 4 Maverick offering industry-leading performance in image and text understanding, an...
Save to ChatML format
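A hedged sketch of the conversion step; the --format ft flag is an assumption matching the "ft format" in the log below:

```python
# Convert the curated pairs into OpenAI-style messages for fine-tuning
!synthetic-data-kit save-as data/cleaned/ai_meta_com_qa_pairs_cleaned.json --format ft
```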
Converted to ft format and saved to data/final/ai_meta_com_qa_pairs_cleaned_ft.json
Converted content (first 500 chars):
[
{
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is the name of the first open-weight natively multimodal models with unprecedented context length support in the Llama 4 series?"
},
{
"role": "assistant",
"content": "Llama 4 Scout and Llama 4 Maverick"
}
]
},
{
"messages": [
{
"role": "system",
"content": ... Attempting to terminate the VLLM server
Unsloth
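A minimal loading sketch; the checkpoint name follows the generator model used above, and max_seq_length is an assumption:

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.1-8B-Instruct-unsloth-bnb-4bit",
    max_seq_length=2048,  # assumption; raise for longer conversations
    load_in_4bit=True,    # 4-bit weights to fit free-tier VRAM
)
```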
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
Unsloth: Failed to patch SmolVLMForConditionalGeneration forward function.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 04-29 16:46:02 [__init__.py:239] Automatically detected platform cuda.
==((====))==  Unsloth 2025.4.1: Fast Llama patching. Transformers: 4.51.3. vLLM: 0.8.2.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.161 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
model.safetensors: 0%| | 0.00/2.35G [00:00<?, ?B/s]
generation_config.json: 0%| | 0.00/234 [00:00<?, ?B/s]
tokenizer_config.json: 0%| | 0.00/54.7k [00:00<?, ?B/s]
tokenizer.json: 0%| | 0.00/17.2M [00:00<?, ?B/s]
special_tokens_map.json: 0%| | 0.00/454 [00:00<?, ?B/s]
We now add LoRA adapters so we only need to update 1 to 10% of all parameters!
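A sketch of the standard Unsloth LoRA setup; r=16 and lora_alpha=16 are common defaults, not values confirmed by this run:

```python
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank; higher = more trainable parameters
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",  # saves VRAM on long contexts
)
```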
Unsloth 2025.4.1 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
Data Prep
We now use the ChatML format for conversation-style finetunes. We use Open Assistant conversations in ShareGPT style. ChatML renders multi-turn conversations like below:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What's the capital of France?<|im_end|>
<|im_start|>assistant
Paris.
[NOTE] To train only on completions (ignoring the user's input) read our docs here
We use our get_chat_template function to get the correct chat template. We support zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old and our own optimized unsloth template.
Normally one has to train <|im_start|> and <|im_end|>. We instead map <|im_end|> to the EOS token and leave <|im_start|> as is, so no additional tokens need to be trained.
Note that ShareGPT uses {"from": "human", "value": "Hi"} and not {"role": "user", "content": "Hi"}, so we use a mapping to convert it.
For text completions like novel writing, try this notebook.
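A sketch of the data prep, assuming the ft-format JSON from earlier and Unsloth's get_chat_template; for ShareGPT-style "from"/"value" data you would additionally pass a mapping argument:

```python
from unsloth.chat_templates import get_chat_template
from datasets import load_dataset

tokenizer = get_chat_template(tokenizer, chat_template="chatml")

dataset = load_dataset(
    "json",
    data_files="data/final/ai_meta_com_qa_pairs_cleaned_ft.json",
    split="train",
)

def formatting_prompts_func(examples):
    # Render each list of messages into a single training string
    texts = [tokenizer.apply_chat_template(m, tokenize=False)
             for m in examples["messages"]]
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched=True)
```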
Generating train split: 0 examples [00:00, ? examples/s]
Map: 0%| | 0/10 [00:00<?, ? examples/s]
[{'content': 'You are a helpful assistant.', 'role': 'system'},
 {'content': 'What is the context window of Llama 4 Scout?', 'role': 'user'},
 {'content': '10M', 'role': 'assistant'}]
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 29 Apr 2025

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is the context window of Llama 4 Scout?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

10M<|eot_id|>
If you're looking to make your own chat template, that is also possible! You must use the Jinja templating format. We provide our own stripped-down version of the Unsloth template, which we find to be more efficient, and which leverages the ChatML, Zephyr, and Alpaca styles.
More info on chat templates on our wiki page!
Train the model
Now let's train our model. We do 60 steps to speed things up, but you can set num_train_epochs=1 for a full run and set max_steps=None. We also support DPOTrainer and GRPOTrainer for reinforcement learning!
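A training sketch mirroring the settings visible in the log below (batch size 2, gradient accumulation 4, 60 steps); the learning rate and optimizer are assumptions, and the argument names follow the trl version pinned in this notebook era (newer trl moves them into SFTConfig):

```python
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=60,        # set num_train_epochs=1, max_steps=None for a full run
        learning_rate=2e-4,  # assumption; a common QLoRA default
        logging_steps=1,
        optim="adamw_8bit",
        output_dir="outputs",
    ),
)
trainer_stats = trainer.train()
```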
Unsloth: Tokenizing ["text"] (num_proc=2): 0%| | 0/10 [00:00<?, ? examples/s]
GPU = NVIDIA L4. Max memory = 22.161 GB. 3.441 GB of memory reserved.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 10 | Num Epochs = 60 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 24,313,856/3,000,000,000 (0.81% trained)
64.247 seconds used for training.
1.07 minutes used for training.
Peak reserved memory = 3.441 GB.
Peak reserved memory for training = 0.0 GB.
Peak reserved memory % of max memory = 15.527 %.
Peak reserved memory for training % of max memory = 0.0 %.
Unsloth: Will map <|im_end|> to EOS = <|eot_id|>.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
['<|im_start|>user\nContinue the fibonacci sequence: 1, 1, 2, 3, 5, 8,<|im_end|>\n<|im_start|>assistant\n10, 17, 34, 55, 89, 144, 233, 377, 610, 987, 1577, 2584, 4181, 6765, 10946, 17711, 28657, 46367, 75025']
You can also use a TextStreamer for continuous inference - so you can see the generation token by token, instead of waiting the whole time!
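A minimal streaming sketch, assuming the model has been switched into Unsloth's inference mode:

```python
from transformers import TextStreamer

FastLanguageModel.for_inference(model)  # enable Unsloth's fast inference path
messages = [
    {"role": "user", "content": "Continue the fibonacci sequence: 1, 1, 2, 3, 5, 8,"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(input_ids=input_ids, streamer=streamer, max_new_tokens=64)
```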
<|im_start|>user
Continue the fibonacci sequence: 1, 1, 2, 3, 5, 8,<|im_end|>
<|im_start|>assistant
8, 13<|im_end|>
('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.json')
<|im_start|>user
What is a famous tall tower in Paris?<|im_end|>
<|im_start|>assistant
Eiffel <|im_end|>assistant
A tower in Paris is called Eiffel.<|im_end|>
You can also use Hugging Face's AutoPeftModelForCausalLM. Only use this if you do not have unsloth installed. It can be hopelessly slow, since 4bit model downloading is not supported, and Unsloth's inference is 2x faster.
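A hedged sketch of that pure-PEFT path (only if Unsloth is unavailable); "lora_model" is the adapter folder saved above:

```python
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

model = AutoPeftModelForCausalLM.from_pretrained(
    "lora_model",        # folder with the saved LoRA adapters
    load_in_4bit=True,   # assumption; quantized loading to save VRAM
)
tokenizer = AutoTokenizer.from_pretrained("lora_model")
```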
Saving to float16 for VLLM
We also support saving to float16 directly. Select merged_16bit for float16 or merged_4bit for int4. We also allow lora adapters as a fallback. Use push_to_hub_merged to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens. See our docs for more deployment options.
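A saving sketch; the Hub repo name and token are placeholders you must replace:

```python
# Merge the LoRA adapters into the base weights and save locally in float16
model.save_pretrained_merged("model", tokenizer, save_method="merged_16bit")

# Or push the merged model to the Hub (hypothetical repo name / token)
# model.push_to_hub_merged("your_name/lora_model", tokenizer,
#                          save_method="merged_16bit", token="hf_...")
```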
GGUF / llama.cpp Conversion
We now support saving to GGUF / llama.cpp natively! We clone llama.cpp and default to saving as q8_0, but all methods such as q4_k_m are allowed. Use save_pretrained_gguf for local saving and push_to_hub_gguf for uploading to HF; see the sketch after the list below.
Some supported quant methods (full list on our docs page):
- q8_0 - Fast conversion. High resource use, but generally acceptable.
- q4_k_m - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
- q5_k_m - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
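A GGUF export sketch; quantization_method takes the method names listed above:

```python
# Save a q4_k_m GGUF locally (the first run clones and builds llama.cpp)
model.save_pretrained_gguf("model", tokenizer, quantization_method="q4_k_m")

# Or upload straight to the Hub (hypothetical repo name / token)
# model.push_to_hub_gguf("your_name/model", tokenizer,
#                        quantization_method="q4_k_m", token="hf_...")
```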
And we're done! If you have any questions about Unsloth, find any bugs, want to keep up with the latest LLM news, or need help joining projects, feel free to join our Discord!
Some other resources:
- Looking to use Unsloth locally? Read our Installation Guide for details on installing Unsloth on Windows, Docker, AMD, Intel GPUs.
- Learn how to do Reinforcement Learning with our RL Guide and notebooks.
- Read our guides and notebooks for Text-to-speech (TTS) and vision model support.
- Explore our LLM Tutorials Directory to find dedicated guides for each model.
- Need help with Inference? Read our Inference & Deployment page for details on using vLLM, llama.cpp, Ollama etc.
