Large Language Models (LLMs) offer powerful general capabilities, but often require fine-tuning to excel at specific tasks or understand domain-specific language. Fine-tuning adapts a trained model to a smaller, targeted dataset, enhancing its performance for your unique needs.
This notebook provides a step-by-step guide to fine-tuning models using the Together AI platform. We will walk through the entire process, from preparing your data to evaluating your fine-tuned model.
We will cover:
- Dataset Preparation: Loading a standard dataset, transforming it into the required format for supervised fine-tuning on Together AI, and uploading your formatted dataset to Together AI Files.
- Fine-tuning Job Launch: Configuring and initiating a fine-tuning job using the Together AI API.
- Job Monitoring: Checking the status and progress of your fine-tuning job.
- Inference: Using your newly fine-tuned model via the Together AI API for predictions.
- Evaluation: Comparing the performance of the fine-tuned model against the base model on a test set.
By following this guide, you'll gain practical experience in creating specialized LLMs tailored to your specific requirements using Together AI.
Setup and Installation
First, install the necessary Python libraries. We need:
- together: The official Together AI Python client for interacting with the API (fine-tuning, inference, files, etc.).
- datasets: A library from Hugging Face for easily downloading and manipulating datasets.
- transformers: Although we won't be training locally, this can be useful for running evals and other utilities if needed.
- tqdm: To enable interactive elements like progress bars within the notebook.
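As a quick sanity check before running the rest of the notebook, the sketch below reports which of these packages are not yet importable and prints a matching pip command. The helper name `missing_packages` is hypothetical and not part of any library.

```python
import importlib.util

# Packages this notebook relies on (names as listed above).
REQUIRED = ["together", "datasets", "transformers", "tqdm"]


def missing_packages(names=REQUIRED):
    """Return the subset of `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages()
if missing:
    # Run the printed command in a shell, or prefix it with "!" in a notebook cell.
    print("pip install -U " + " ".join(missing))
else:
    print("All required packages are installed.")
```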
1. Dataset Preparation
Fine-tuning requires data formatted in a specific way. We'll use a conversational dataset as an example; here the goal of fine-tuning is to improve the model's performance on multi-turn conversations.
First we need to transform this dataset into the chat format expected by Together AI for supervised fine-tuning.
The required format is a JSON object per line, where each object contains a list of conversation turns under the "messages" key.
Each message must have a "role" (system, user, or assistant) and "content".
Conversation Data Example:
{"messages": [{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello!"},
{"role": "assistant", "content": "Hi! How can I help you?"}]}
Depending on the type of fine-tuning you want to perform, you can also pass in instruction data, preference data, or even plain text data.
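To make the chat format concrete, here is a minimal sketch of a per-line validator based only on the rules stated above (a "messages" list whose items each carry a "role" and "content"). The function name `validate_chat_line` is hypothetical; it is not the platform's own checker and may be stricter or looser than the real one.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_chat_line(line):
    """Return a list of problems found in one JSONL line (empty list means valid)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]

    if not isinstance(record, dict):
        return ["line is not a JSON object"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ['"messages" must be a non-empty list']

    problems = []
    for i, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"message {i} is not an object")
            continue
        if message.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i} has an invalid role: {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            problems.append(f"message {i} is missing string content")
    return problems


example = json.dumps({"messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi! How can I help you?"},
]})
print(validate_chat_line(example))  # prints []
```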
Load Raw Dataset
We use the datasets library to download the CoQA dataset from the Hugging Face Hub.
Let's examine the structure of the raw dataset. CoQA provides a story, a series of questions related to the story, and corresponding answers.
Transform Data to Chat Format
Now, we need to convert each row of the CoQA dataset into the required chat format ([{'role': ..., 'content': ...}, ...]).
We'll create a function map_coqa_to_chat_format that takes a row from the dataset and structures it as a conversation:
- A system message containing the story (context).
- Alternating user (question) and assistant (answer) messages.
We apply this transformation function to the entire dataset using the .map() method. We also remove the original columns as they are no longer needed after transformation.
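The transformation described above could look roughly like the sketch below. It assumes each row stores its answer strings under `row["answers"]["input_text"]` and uses an illustrative system prompt; both are assumptions, so check them against the dataset you actually load. The sample row and the variable name `coqa_train` are hypothetical.

```python
def map_coqa_to_chat_format(row):
    """Turn one CoQA-style row into a {"messages": [...]} record."""
    # The system prompt wording is an assumption; adapt it to your task.
    messages = [
        {
            "role": "system",
            "content": "Read the story and answer the questions.\nStory: " + row["story"],
        }
    ]
    # Assumes answers are stored as a list under row["answers"]["input_text"].
    for question, answer in zip(row["questions"], row["answers"]["input_text"]):
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": answer})
    return {"messages": messages}


# Hypothetical row, used only to show the output shape.
sample_row = {
    "story": "Ada has a red kite.",
    "questions": ["Who has a kite?", "What color is it?"],
    "answers": {"input_text": ["Ada", "red"]},
}
chat = map_coqa_to_chat_format(sample_row)
print([m["role"] for m in chat["messages"]])
# prints ['system', 'user', 'assistant', 'user', 'assistant']

# With a Hugging Face dataset, the same function can be applied row by row:
# train_messages = coqa_train.map(
#     map_coqa_to_chat_format, remove_columns=coqa_train.column_names
# )
```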
Let's check the structure of our transformed dataset. It should now only contain the messages column.
Here's an example of a single processed data point:
Dataset({
    features: ['messages'],
    num_rows: 7199
})
Write the dataset out to a JSONL file:
23777505
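JSONL is simply one JSON object per line. For readers who want to see what the written file contains, here is a minimal standard-library sketch; the helper `write_jsonl` is hypothetical. With a Hugging Face `Dataset`, calling `to_json` with a `.jsonl` path should produce the same layout.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write one JSON object per line and return the file size in bytes."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return os.path.getsize(path)


# Tiny illustrative dataset written to a temporary directory.
records = [
    {"messages": [{"role": "user", "content": "Hello!"},
                  {"role": "assistant", "content": "Hi!"}]},
    {"messages": [{"role": "user", "content": "Thanks"},
                  {"role": "assistant", "content": "You're welcome."}]},
]
path = os.path.join(tempfile.mkdtemp(), "example.jsonl")
n_bytes = write_jsonl(records, path)

# Read the file back to confirm each line parses on its own.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(len(loaded), "records,", n_bytes, "bytes")
```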
Upload Data to Together AI
Now that we have our formatted coqa_prepared_train.jsonl file, we need to check that it meets the format specification and then upload it to Together AI. Fine-tuning jobs read data directly from your uploaded files.
We use the check_file function to validate the file and the files.upload() method to upload it. The upload returns information about the file, including its ID, which we'll need later to start the fine-tuning job.
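A minimal sketch of the check-then-upload step follows. The wrapper `check_and_upload` is hypothetical; it takes the client and the checker as arguments so the logic is easy to read and test. It assumes the checker returns a report with an `is_check_passed` key (as in the output shown in this section) and that `files.upload` accepts `file` and `check` arguments.

```python
def check_and_upload(client, check_file, path):
    """Validate a JSONL file locally, then upload it and return the file ID."""
    report = check_file(path)
    if not report["is_check_passed"]:
        raise ValueError(f"{path} failed validation: {report.get('message')}")
    # check=True asks the client to validate the file again before uploading
    # (assumed keyword; confirm against your SDK version).
    response = client.files.upload(file=path, check=True)
    return response.id


# Typical usage (requires the `together` package and an API key):
# from together import Together
# from together.utils import check_file
# client = Together()  # reads TOGETHER_API_KEY from the environment
# train_file_id = check_and_upload(client, check_file, "coqa_prepared_train.jsonl")
```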
{
"is_check_passed": true,
"message": "Checks passed",
"found": true,
"file_size": 23777505,
"utf8": true,
"line_type": true,
"text_field": true,
"key_value": true,
"has_min_samples": true,
"num_samples": 7199,
"load_json": true,
"filetype": "jsonl"
}
Uploading file coqa_prepared_train.jsonl: 100%|██████████| 23.8M/23.8M [00:01<00:00, 15.7MB/s]
Train file response: file-9554964d-5711-419a-bcc2-c4edaaa07ee3
2. Launch Fine-tuning Job
With our data uploaded, we can now launch the fine-tuning job using client.fine_tuning.create().
Key parameters:
- model: The base model you want to fine-tune (e.g., 'togethercomputer/llama-2-7b-chat'). Choose from the models available for fine-tuning on Together AI.
- training_file: The ID of your uploaded training JSONL file.
- validation_file: The ID of your uploaded validation JSONL file (optional, but highly recommended for monitoring).
- suffix: A custom string added to the base model name to create your unique fine-tuned model name (e.g., my-coqa-ft). Keep it short and descriptive.
- n_epochs: The number of times the model will see the entire training dataset.
- n_checkpoints: Number of checkpoints to save during training (useful for resuming or selecting the best model). Set to 1 if you only need the final model.
- learning_rate: Controls how much the model weights are updated during training. Needs tuning.
- batch_size: Number of training examples processed in one iteration. Depends on model size and available resources.
For an exhaustive list of all the available fine-tuning parameters refer to the Together AI Fine-tuning API Reference docs.
For a list of all the models you can fine-tune on the Together AI platform see docs here.
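Putting the parameters above together, a job launch might be sketched as follows. The wrapper `launch_finetune` is hypothetical, and the model name and hyperparameter values are illustrative assumptions rather than recommendations; only the parameter names come from the list above.

```python
def launch_finetune(client, training_file_id, validation_file_id=None, **overrides):
    """Start a fine-tuning job and return the job response."""
    params = {
        # Assumed model ID; pick one from the list of fine-tunable models.
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct-Reference",
        "training_file": training_file_id,
        "suffix": "my-coqa-ft",
        # Illustrative hyperparameters; tune these for your own data.
        "n_epochs": 3,
        "n_checkpoints": 1,
        "learning_rate": 1e-5,
        "batch_size": 8,
    }
    if validation_file_id is not None:
        params["validation_file"] = validation_file_id
    params.update(overrides)  # caller-supplied values win over the defaults
    return client.fine_tuning.create(**params)


# Typical usage, given a Together client and an uploaded file ID:
# ft_resp = launch_finetune(client, train_file_id)
# print(ft_resp.id)
```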
3. Monitor Fine-tuning Job
Fine-tuning can take time depending on the model size, dataset size, and hyperparameters. You can monitor and manage the job using the following methods:
- List all jobs: client.fine_tuning.list()
- Status of a job: client.fine_tuning.retrieve(id=ft_resp.id)
- List all events for a job: client.fine_tuning.list_events(id=ft_resp.id) (retrieves logs and events generated during the job)
- Cancel a job: client.fine_tuning.cancel(id=ft_resp.id)
- Download the model once the job is done: client.fine_tuning.download(id=ft_resp.id)
Once the job is complete (status == 'completed'), the response from retrieve will contain the name of your newly created fine-tuned model in its output_name field. In this notebook's run the name takes the form <your-account>/<base-model-name>-<suffix>-<id>, as in the output_name printed in the inference section.
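Rather than calling retrieve by hand, you can poll until the job reaches a final state. The sketch below is one way to do that. The helper names are hypothetical, and the set of terminal status strings is an assumption; compare it with the statuses your SDK version reports.

```python
import time

# Assumed terminal states; adjust to match the statuses your SDK reports.
TERMINAL_STATES = {"completed", "error", "user_error", "cancelled"}


def status_name(status):
    """Normalize an enum member or a plain string to a lowercase status string."""
    return str(getattr(status, "value", status)).lower()


def wait_for_job(client, job_id, poll_seconds=30, max_polls=240, sleep=time.sleep):
    """Poll the job until it reaches a terminal state, then return the job object."""
    for _ in range(max_polls):
        job = client.fine_tuning.retrieve(id=job_id)
        if status_name(job.status) in TERMINAL_STATES:
            return job
        sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")


# Typical usage:
# job = wait_for_job(client, ft_resp.id)
# print(job.status, job.output_name)
```

Passing `sleep` as a parameter keeps the loop easy to test without waiting in real time.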
FinetuneJobStatus.STATUS_COMPLETED
Fine tune request created
Job started at Wed Apr 9 19:48:05 UTC 2025
Model data downloaded for togethercomputer/Meta-Llama-3.1-8B-Instruct-Reference__TOG__FT at Wed Apr 9 19:48:07 UTC 2025
Data downloaded for togethercomputer/Meta-Llama-3.1-8B-Instruct-Reference__TOG__FT at $2025-04-09T19:48:14.918488
WandB run initialized.
Training started for model togethercomputer/Meta-Llama-3.1-8B-Instruct-Reference__TOG__FT
Epoch completed, at step 24
Epoch completed, at step 48
Epoch completed, at step 72
Training completed for togethercomputer/Meta-Llama-3.1-8B-Instruct-Reference__TOG__FT at Wed Apr 9 20:02:24 UTC 2025
Uploading adapter model
Compressing output model
Model compression complete
Uploading output model
Model upload complete
Job finished at Wed Apr 9 20:06:33 UTC 2025
You can also navigate to the WandB page linked in your fine-tuning dashboard to see the fine-tuning related loss curves and more.

4. Inference with Fine-tuned Model
Option 1: Serverless LoRA Inference
Now, let's use our fine-tuned model! We can call it just like any other model on the Together AI platform, by providing the unique fine-tuned model output_name we retrieved from our fine-tuning job earlier.
See the list of all models that support LoRA Inference.
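A call to the fine-tuned model can be sketched as below, using the chat completions interface of the client. The helper `ask` is hypothetical, and the question in the usage comment is an assumption inferred from the answer shown in this section.

```python
def ask(client, model, question, system_prompt=None, max_tokens=128):
    """Send a single-turn chat request and return the reply text."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": question})
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=max_tokens,
    )
    return response.choices[0].message.content


# Typical usage, where `job` is the completed fine-tuning job:
# print(ask(client, job.output_name, "What is the capital of France?"))
```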
Fine-tuned model output_name: zainhas/Meta-Llama-3.1-8B-Instruct-Reference-test1_8b-e5a0fb5d
The capital of France is Paris.
You can also prompt the model in our playground, if it supports serverless LoRA Inference, by going to your models dashboard and clicking "OPEN IN PLAYGROUND".

Option 2: Deploy Dedicated Endpoint
Another way to run your fine-tuned model is to deploy it on a custom dedicated endpoint.
Once your fine-tuning job completes, you should see your new model in your models dashboard. You can click the "+ CREATE DEDICATED ENDPOINT" button to deploy the selected model to a dedicated endpoint (DE).
You can then select the hardware configuration for your dedicated endpoint, including the minimum and maximum number of replicas. Raising the maximum increases the peak QPS the deployment can support.

You can also deploy the model to a DE programmatically using the Endpoints API via the SDK:
response = client.endpoints.create(
    display_name="Fine-tuned Meta Llama 3.1 8B Instruct 04-09-25",
    model="zainhas/Meta-Llama-3.1-8B-Instruct-Reference-test1_8b-e5a0fb5d",
    hardware="4x_nvidia_h100_80gb_sxm",
    # Replica bounds for autoscaling; dictionary keys must be quoted strings.
    autoscaling={
        "min_replicas": 1,
        "max_replicas": 1,
    },
)
print(response)
⚠️ If you run this code it will deploy a dedicated endpoint for you. For detailed documentation on how to deploy, delete, and modify endpoints, see the Endpoints API Reference.
Once deployed you'll be able to see the model details under your Endpoints Dashboard:

The capital of France is Paris.
5. Evaluation
To assess the impact of fine-tuning, we can compare the responses of our fine-tuned model with those of the original base model on the same prompts in our test set.
This lets us measure how fine-tuning changes the model's behavior on our specific task (conversational QA based on a story).
Dataset({
    features: ['source', 'story', 'questions', 'answers'],
    num_rows: 50
})
We'll use the code below to generate answers from the baseline and fine-tuned model.
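One possible shape for the answer-generation step is sketched here. It asks each question independently with the story in the system message; the notebook's actual code may instead carry earlier turns as conversation history. The function name, the system prompt wording, and `max_tokens` are assumptions.

```python
def get_model_answers(client, model, rows, max_tokens=50):
    """Return one list of predicted answers per row (one row = one story)."""
    all_answers = []
    for row in rows:
        # Assumed prompt wording; keep it consistent with the training data.
        system = {
            "role": "system",
            "content": "Read the story and answer the questions.\nStory: " + row["story"],
        }
        answers = []
        for question in row["questions"]:
            response = client.chat.completions.create(
                model=model,
                messages=[system, {"role": "user", "content": question}],
                max_tokens=max_tokens,
            )
            answers.append(response.choices[0].message.content)
        all_answers.append(answers)
    return all_answers


# Typical usage, with `coqa_test` as the 50-row test split (hypothetical name):
# base_answers = get_model_answers(client, base_model_name, coqa_test)
# ft_answers = get_model_answers(client, job.output_name, coqa_test)
```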
We'll use the function below to calculate the exact match (EM) and F1 score metrics.
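For reference, a SQuAD-style version of these two metrics is sketched below. It is not necessarily the exact function used in this notebook. Answers are lowercased and stripped of punctuation and articles before comparison, and F1 is computed over token overlap. All function names here are hypothetical.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, otherwise 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction, reference):
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # If either side is empty, score 1.0 only when both are empty.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def get_metrics(predictions, references):
    """Average EM and F1 over paired lists of predictions and references."""
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and equal in length")
    n = len(references)
    em = sum(exact_match(p, r) for p, r in zip(predictions, references)) / n
    f1 = sum(f1_score(p, r) for p, r in zip(predictions, references)) / n
    return {"em": em, "f1": f1}


print(get_metrics(["in Paris, France", "blue"], ["Paris", "blue"]))
```

Under these definitions a correct but wordy answer scores 0 on EM while still earning partial F1 credit, which is one way a model can show a low EM alongside a noticeably higher F1.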
Model: Baseline, EM: 0.0175, F1: 0.18467257739023207
zainhas/Meta-Llama-3.1-8B-Instruct-Reference-test1_8b-e5a0fb5d
Model: zainhas/Meta-Llama-3.1-8B-Instruct-Reference-test1_8b-e5a0fb5d, EM: 0.31, F1: 0.41019649357988347
| Llama 3.1 8B | EM | F1 |
|---|---|---|
| Original | 0.02 | 0.18 |
| Fine-tuned | 0.31 | 0.41 |
We can see that the fine-tuned model more than doubles the F1 score on the test set (0.18 to 0.41), and exact match rises from about 0.02 to 0.31.
For a more detailed guide on fine-tuning, see our docs here.