Fine-tuning LLMs for Function Calling with xLAM Dataset
Authored by: Behrooz Azarkhalili
This notebook demonstrates how to fine-tune language models for function calling capabilities using the xLAM dataset from Salesforce and QLoRA (Quantized Low-Rank Adaptation) technique. We'll work with popular models like Llama 3, Qwen2, Mistral, and others.
What is Function Calling?
Function calling enables language models to interact with external tools and APIs by generating structured function invocations. Instead of just generating text, the model learns to call specific functions with the right parameters based on user requests.
What You'll Learn:
- Data Processing: How to format the xLAM dataset for function calling training
- Model Fine-tuning: Using QLoRA for memory-efficient training on consumer GPUs
- Evaluation: Testing the fine-tuned models with example prompts
- Multi-model Support: Working with different model architectures
Key Benefits:
- Memory Efficient: QLoRA enables training on 16-24GB GPUs
- Production Ready: Modular code with proper error handling
- Flexible Architecture: Easy to adapt for different models and datasets
- Universal Support: Works with Llama, Qwen, Mistral, Gemma, Phi, and more
Hardware Requirements:
- GPU: 16GB+ VRAM (24GB recommended for larger models)
- RAM: 32GB+ system memory
- Storage: 50GB+ free space for models and datasets
Software Dependencies: The notebook will install required packages automatically, including:
`transformers`, `peft`, `bitsandbytes`, `trl`, `datasets`, and `accelerate`.
For detailed methodology and results, see: Function Calling: Fine-tuning Llama 3 and Qwen2 on xLAM
Basic Setup and Imports
Let's start with the essential imports and basic setup for our notebook.
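Below is a minimal sketch of what this setup cell might look like; the original notebook may import additional packages, but the hardware printout matches the output shown after it.

```python
import torch

# Report the environment so runs are reproducible and debuggable.
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"VRAM: {vram_gb:.1f} GB")
```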
PyTorch version: 2.8.0+cu128
CUDA available: True
GPU: NVIDIA H100 NVL
VRAM: 100.0 GB
Hugging Face Authentication Setup
Next, we'll set up authentication with HuggingFace Hub. This allows us to download models and datasets, and optionally upload our fine-tuned models.
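A minimal sketch of the authentication cell, using `huggingface_hub.login()`; reading the token from an `HF_TOKEN` environment variable is one common pattern:

```python
import os
from huggingface_hub import login

# Uses the HF_TOKEN environment variable if set; otherwise login()
# prompts interactively (in a notebook, via an input widget).
login(token=os.environ.get("HF_TOKEN"))
print("✅ Successfully authenticated with HuggingFace!")
```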
✅ Successfully authenticated with HuggingFace!
Model Configuration Classes
We'll create two configuration classes to organize our settings:
- ModelConfig: Stores model-specific settings like tokenizer configuration
- TrainingConfig: Stores training parameters like learning rate and batch size
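A sketch of how these two classes could be defined as dataclasses; the field names and default values here are illustrative, not the notebook's exact ones:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelConfig:
    """Model-specific settings (field names are illustrative)."""
    model_name: str
    pad_token: Optional[str] = None
    eos_token: Optional[str] = None
    output_dir: str = "./outputs"

@dataclass
class TrainingConfig:
    """Training hyperparameters (defaults are illustrative)."""
    learning_rate: float = 2e-4
    per_device_train_batch_size: int = 4
    gradient_accumulation_steps: int = 4
    num_train_epochs: int = 1
    max_seq_length: int = 2048
```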
Automatic Model Configuration
This function automatically detects the model's tokenizer settings and creates a proper configuration. It handles different model architectures (Llama, Qwen, Mistral, etc.) and their specific token requirements.
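One way such a function could work, sketched here with the `ModelConfig` from the previous cell; the fallback logic and output-directory naming are assumptions based on the printout shown below:

```python
from transformers import AutoConfig, AutoTokenizer

def auto_configure_model(model_name: str) -> ModelConfig:
    """Detect special tokens from the tokenizer and build a ModelConfig (sketch)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    hf_config = AutoConfig.from_pretrained(model_name)
    # Chat models such as Llama 3 often ship without a pad token;
    # reusing the EOS token as pad token is a common convention.
    pad_token = tokenizer.pad_token or tokenizer.eos_token
    print(f"Model: {hf_config.model_type}, vocab_size: {len(tokenizer):,}")
    safe_name = model_name.split("/")[-1].replace("-", "_").replace(".", "_")
    return ModelConfig(
        model_name=model_name,
        pad_token=pad_token,
        eos_token=tokenizer.eos_token,
        output_dir=f"./{safe_name}_xLAM",
    )
```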
✅ Configuration system ready!
💡 Supports Llama, Qwen, Mistral, Gemma, Phi, and more
Hardware Detection and Setup
Let's detect our hardware capabilities and configure optimal settings. We'll check for bfloat16 support and set up the best attention mechanism for our GPU.
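A sketch of this detection logic; the import-based check for flash-attn and the `sdpa` fallback are assumptions about how the notebook decides:

```python
import torch

def detect_hardware():
    """Choose compute dtype and attention backend for the current GPU (sketch)."""
    use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
    compute_dtype = torch.bfloat16 if use_bf16 else torch.float16
    try:
        # flash-attn must be installed and needs an Ampere-or-newer GPU.
        import flash_attn  # noqa: F401
        attn_implementation = "flash_attention_2"
    except ImportError:
        # PyTorch's built-in scaled-dot-product attention is a safe fallback.
        attn_implementation = "sdpa"
    return compute_dtype, attn_implementation

compute_dtype, attn_implementation = detect_hardware()
```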
🔧 Hardware Configuration Complete:
• Compute dtype: torch.bfloat16
• Attention implementation: flash_attention_2
• Device: NVIDIA H100 NVL
Tokenizer Setup Function
Now let's create a function to set up our tokenizer with the right configuration from our model settings.
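A minimal sketch of such a function, assuming the `ModelConfig` defined earlier:

```python
from transformers import AutoTokenizer

def setup_tokenizer(config: ModelConfig):
    """Load the tokenizer and apply our configured pad token (sketch)."""
    tokenizer = AutoTokenizer.from_pretrained(config.model_name)
    tokenizer.pad_token = config.pad_token
    tokenizer.padding_side = "right"  # the usual choice for causal-LM training
    return tokenizer
```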
Dataset Processing
Now we'll work with the xLAM dataset from Salesforce. This dataset contains about 60,000 examples of function calling conversations that we'll use to train our model.
Key Functions:
- `process_xlam_sample()`: Converts a single dataset example into the training format with special tags (`<user>`, `<tools>`, `<calls>`) and an EOS token
- `load_and_process_xlam_dataset()`: Loads the complete xLAM dataset (60K samples) from Hugging Face and processes all samples using multiprocessing for efficiency
- `preview_dataset_sample()`: Displays a formatted preview of a processed dataset sample, with statistics, for inspection
Loading and Processing the Dataset
Now let's add functions to load the xLAM dataset and process it into the format our model needs for training.
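A sketch of the first two functions, assuming the `query`/`tools`/`answers` fields of the Salesforce/xlam-function-calling-60k schema (the dataset is gated, so accept its terms on the Hub first); `preview_dataset_sample()` is omitted here for brevity:

```python
from datasets import load_dataset

def process_xlam_sample(sample: dict, eos_token: str) -> dict:
    """Wrap one xLAM record in the training tags (sketch)."""
    # tools and answers are stored as JSON strings in the dataset.
    text = (
        f"<user>{sample['query']}</user>\n"
        f"<tools>{sample['tools']}</tools>\n"
        f"<calls>{sample['answers']}</calls>{eos_token}"
    )
    return {"text": text}

def load_and_process_xlam_dataset(eos_token: str):
    """Load the 60K-sample dataset and format every example (sketch)."""
    dataset = load_dataset("Salesforce/xlam-function-calling-60k", split="train")
    return dataset.map(
        lambda s: process_xlam_sample(s, eos_token),
        num_proc=4,  # multiprocessing for faster preprocessing
        remove_columns=dataset.column_names,  # keep only the "text" column
    )
```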
QLoRA Training Setup
QLoRA (Quantized Low-Rank Adaptation) allows us to fine-tune large language models efficiently. It uses 4-bit quantization to reduce memory usage while maintaining training quality.
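The standard QLoRA recipe combines 4-bit NF4 quantization with double quantization. A sketch of loading the base model this way (the function name is illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

def load_quantized_model(model_name: str, compute_dtype=torch.bfloat16):
    """Load a base model in 4-bit NF4 for QLoRA training (sketch)."""
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",             # NormalFloat4 data type
        bnb_4bit_compute_dtype=compute_dtype,  # dtype used for matmuls
        bnb_4bit_use_double_quant=True,        # quantize the quantization constants
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",
    )
    # Cast norms/embeddings appropriately and prepare for k-bit training.
    return prepare_model_for_kbit_training(model)
```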
LoRA Configuration
LoRA (Low-Rank Adaptation) is the key technique that makes efficient fine-tuning possible. Instead of updating all model parameters, LoRA adds small trainable matrices to specific layers while keeping the base model frozen.
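A typical LoRA configuration for this setup; the rank, alpha, and target modules below are common defaults for Llama-style architectures, not necessarily the exact values used in this run:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,              # rank of the low-rank update matrices
    lora_alpha=32,     # scaling factor applied to the update
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # Attention + MLP projections, the usual targets for Llama-style models.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```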
Training Execution
Now we'll create the main training function that puts everything together. This function configures the training arguments and executes the fine-tuning process using TRL's SFTTrainer.
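A sketch of that training function with TRL's `SFTTrainer`; the hyperparameters are illustrative, and `lora_config` is assumed from the cell above:

```python
from trl import SFTConfig, SFTTrainer

def run_training(model, dataset, output_dir: str):
    """Configure and run supervised fine-tuning (sketch)."""
    args = SFTConfig(
        output_dir=output_dir,
        dataset_text_field="text",   # column created during preprocessing
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        num_train_epochs=1,
        logging_steps=10,
        bf16=True,
    )
    trainer = SFTTrainer(
        model=model,
        args=args,
        train_dataset=dataset,
        peft_config=lora_config,     # defined in the LoRA cell above
    )
    trainer.train()
    trainer.save_model(output_dir)   # saves the LoRA adapters
    return trainer
```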
🎯 Universal Model Selection
Choose any model for fine-tuning! This notebook supports a wide range of popular models. Simply uncomment the model you want to use or specify your own.
🚀 Quick Model Selection
Uncomment one of these popular models or specify your own:
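A sketch of this selection cell; the commented-out checkpoints are examples of Hub models this pipeline supports, and `auto_configure_model()` is the function defined earlier:

```python
# Pick a model by uncommenting one line, or set your own checkpoint:
MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"
# MODEL_NAME = "Qwen/Qwen2-7B-Instruct"
# MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.3"
# MODEL_NAME = "google/gemma-2-9b-it"
# MODEL_NAME = "microsoft/Phi-3-mini-4k-instruct"

model_config = auto_configure_model(MODEL_NAME)
```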
Why Llama 3-8B-Instruct as default?
- Proven Performance: Excellent function calling capabilities and instruction following
- Optimal Size: 8B parameters provide great balance between performance and resource usage
🎯 Selected Model: meta-llama/Meta-Llama-3-8B-Instruct
🔧 Auto-configuring everything for meta-llama/Meta-Llama-3-8B-Instruct...
📋 Loading model configuration: meta-llama/Meta-Llama-3-8B-Instruct
📊 Model: llama, vocab_size: 128,256
✅ Configured - pad: '<|eot_id|>' (ID: 128009), eos: '<|eot_id|>' (ID: 128009)
🚀 Ready to fine-tune! Everything configured automatically:
✅ Model type: llama
✅ Vocabulary: 128,256 tokens
✅ Pad token: '<|eot_id|>' (ID: 128009)
✅ Output dir: ./Meta_Llama_3_8B_Instruct_xLAM
🎉 Configuration complete for meta-llama/Meta-Llama-3-8B-Instruct!
Model Loading for Inference
After training is complete, we need to load the trained model for inference. This function loads the base model with quantization and applies the trained LoRA adapters.
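A sketch of this loading step using `peft.PeftModel`; the function name and adapter-directory argument are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

def load_trained_model(base_model_name: str, adapter_dir: str):
    """Reload the 4-bit base model and attach the trained LoRA adapters (sketch)."""
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    base = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        quantization_config=bnb_config,
        device_map="auto",
    )
    model = PeftModel.from_pretrained(base, adapter_dir)
    model.eval()  # inference mode
    return model
```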
Text Generation for Function Calls
Now let's create the function that generates responses from our fine-tuned model. This handles tokenization, generation parameters, and decoding.
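A minimal sketch of the generation helper; greedy decoding is assumed here since function calls should be deterministic, but the notebook's actual sampling settings may differ:

```python
import torch

def generate_function_call(model, tokenizer, prompt: str,
                           max_new_tokens: int = 256) -> str:
    """Generate a function-call completion for a tagged prompt (sketch)."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # greedy decoding for deterministic calls
            pad_token_id=tokenizer.pad_token_id,
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```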
Testing Function Calling Capabilities
This function provides a comprehensive test suite to evaluate our fine-tuned model with different types of function calling scenarios.
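A hypothetical driver that would produce outputs like those below; the prompts are abridged here (the real prompts include a full tool schema inside `<tools>...</tools>`), and `model`/`tokenizer`/`generate_function_call` come from the earlier cells:

```python
test_prompts = {
    "Mathematical Function":
        "<user>Check if the numbers 8 and 1233 are powers of two.</user>\n<tools>...",
    "Weather Query":
        "<user>What's the weather like in New York today?</user>\n<tools>...",
    "Data Processing":
        "<user>Calculate the average of these numbers: 10, 20, 30, 40, 50</user>\n<tools>...",
}

for i, (name, prompt) in enumerate(test_prompts.items(), start=1):
    print("=" * 60)
    print(f"Test Case {i}: {name}")
    print("=" * 60)
    print(generate_function_call(model, tokenizer, prompt))
```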
🧪 Testing function calling capabilities...
============================================================
Test Case 1: Mathematical Function
============================================================
🎯 Generating response for prompt...
📝 Input: <user>Check if the numbers 8 and 1233 are powers of two.</user>
<tools>
✅ Generation completed!
📊 Generated 90 new tokens
📄 Complete Response:
----------------------------------------
<user>Check if the numbers 8 and 1233 are powers of two.</user>
<tools>{'name': 'is_power_of_two', 'description': 'Checks if a number is a power of two.', 'parameters': {'num': {'description': 'The number to check.', 'type': 'int'}}}</tools>
<calls>{'name': 'is_power_of_two', 'arguments': {'num': 8}}
{'name': 'is_power_of_two', 'arguments': {'num': 1233}}</calls>
----------------------------------------
============================================================
Test Case 2: Weather Query
============================================================
🎯 Generating response for prompt...
📝 Input: <user>What's the weather like in New York today?</user>
<tools>
✅ Generation completed!
📊 Generated 105 new tokens
📄 Complete Response:
----------------------------------------
<user>What's the weather like in New York today?</user>
<tools>{'name':'realtime_weather_api', 'description': 'Fetches current weather information based on the provided query parameter.', 'parameters': {'q': {'description': 'Query parameter used to specify the location for which weather data is required. It can be in various formats such as:', 'type':'str', 'default': '53.1,-0.13'}}}</tools>
<calls>{'name':'realtime_weather_api', 'arguments': {'q': 'New York'}}</calls>
----------------------------------------
============================================================
Test Case 3: Data Processing
============================================================
🎯 Generating response for prompt...
📝 Input: <user>Calculate the average of these numbers: 10, 20, 30, 40, 50</user>
<tools>
✅ Generation completed!
📊 Generated 81 new tokens
📄 Complete Response:
----------------------------------------
<user>Calculate the average of these numbers: 10, 20, 30, 40, 50</user>
<tools>{'name': 'average', 'description': 'Calculates the arithmetic mean of a list of numbers.', 'parameters': {'numbers': {'description': 'The list of numbers.', 'type': 'List[float]'}}}</tools>
<calls>{'name': 'average', 'arguments': {'numbers': [10, 20, 30, 40, 50]}}</calls>
----------------------------------------
✅ All test cases completed!
🎉 Conclusion and Next Steps
📋 Summary
This notebook demonstrated a complete, production-ready, universal pipeline for fine-tuning language models for function calling capabilities using:
- 🎯 Universal Model Support: Works with any model - just change the `MODEL_NAME` variable
- 🔧 Intelligent Configuration: Automatic token detection using `auto_configure_model()`
- ⚡ QLoRA Efficiency: Memory-efficient training on consumer GPUs (16-24GB)
- 📊 Comprehensive Testing: Automated evaluation and interactive testing capabilities
🚀 Key Improvements Made
Universal Compatibility
- ✅ Multi-Model Support: Works with Llama, Qwen, Mistral, Gemma, Phi, DeepSeek, Yi, and more
- ✅ Smart Token Detection: Automatically finds pad/EOS tokens from any model's tokenizer
- ✅ Error Prevention: Validates configurations and provides helpful error messages
- ✅ Flexible Architecture: Easy to add new models without code changes
Code Quality
- ✅ Type Hints: Full type annotations for better IDE support and error catching
- ✅ Docstrings: Comprehensive documentation for all functions
- ✅ Error Handling: Robust error handling with informative messages
- ✅ Modular Design: Clean separation of concerns and reusable components
User Experience
- ✅ One-Line Model Selection: Simply change the `MODEL_NAME` variable
- ✅ Automatic Configuration: Everything extracted from transformers automatically
- ✅ Clear Progress Indicators: Emojis and detailed logging throughout
- ✅ Production Ready: Code suitable for research and deployment
๐ Next Steps and Extensions
Model Improvements
- Try Different Models: Simply change the `MODEL_NAME` variable and re-run
- Hyperparameter Tuning: Experiment with different LoRA ranks and learning rates
- Extended Training: Try multi-epoch training for better convergence
Evaluation Enhancements
- Quantitative Metrics: Add BLEU, ROUGE, or custom function calling accuracy
- Benchmark Datasets: Test on additional function calling benchmarks
- Multi-Model Comparison: Compare performance across different model families
Deployment Options
- Model Serving: Deploy with FastAPI, TensorRT, or vLLM
- Integration: Connect with real APIs and function execution environments
- Optimization: Implement model quantization and pruning for production
Additional Features
- Multi-turn Conversations: Extend to handle conversation context
- Tool Selection: Improve tool selection and reasoning capabilities
- Error Recovery: Add error handling and recovery mechanisms
📚 Resources and References
- xLAM Dataset: Salesforce/xlam-function-calling-60k
- QLoRA Paper: Efficient Finetuning of Quantized LLMs
- Function Calling Guide: Complete methodology article
- PEFT Library: Hugging Face PEFT Documentation
🏆 Achievement Unlocked
🎉 Universal Function Calling Fine-tuning Master!
You now have a production-ready system that can fine-tune virtually any open-source language model for function calling with just a single line change!
Happy Fine-tuning! 🚀 Try different models, share your results, and contribute back to the community!