Evaluating LLMs on SimpleQA Using Batch Inference API

Overview

This notebook demonstrates how to use Together AI's Batch Inference API to evaluate large language models on OpenAI's SimpleQA benchmark. We'll measure DeepSeek-V3-0324's performance on factual question answering and grade its answers with an LLM-as-a-judge.

By the end of this tutorial, you'll understand how to:

  • Format requests for the Batch Inference API
  • Evaluate LLMs on factual QA benchmarks
  • Implement automated grading workflows
  • Calculate evaluation metrics for model performance

What is SimpleQA?

SimpleQA is a benchmark consisting of factual questions with short answers. It tests a model's ability to provide accurate, concise responses to straightforward factual queries. The benchmark includes questions across various domains like science, geography, sports, and arts.

Methodology

Our evaluation follows a two-step process:

  1. Answer Generation: Use DeepSeek-V3-0324 to generate answers to SimpleQA questions
  2. Automated Grading: Use Llama 3.3 70B as a judge to grade the answers as CORRECT, INCORRECT, or NOT_ATTEMPTED

🚀 Setup and Dependencies
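
The setup cell is not reproduced in this rendering. As a minimal sketch, assuming the `together` Python SDK and an API key stored in the `TOGETHER_API_KEY` environment variable, the setup could look like the following. The helper name `get_api_key` is hypothetical and not part of the SDK.

```python
import os

# Assumed install step (run once in the notebook): pip install together


def get_api_key(env=None):
    """Return the Together API key from the environment, failing early if unset."""
    env = os.environ if env is None else env
    key = env.get("TOGETHER_API_KEY", "").strip()
    if not key:
        raise RuntimeError("Set TOGETHER_API_KEY before running the notebook.")
    return key


# Usage (requires the SDK and a real key):
#   from together import Together
#   client = Together(api_key=get_api_key())
```

Failing early on a missing key avoids discovering the problem only after the request file has been built.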

📊 Understanding the Dataset

Let's examine the SimpleQA dataset structure to understand what we're working with.

Here we can see that every row has a factual question and a corresponding answer.
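
The loading cell is not shown here. The sketch below parses CSV text into question/answer pairs. It assumes the columns are named `problem` and `answer`, as in OpenAI's published SimpleQA test set; check the names against the file you download. The inline sample reuses a question that appears later in this notebook.

```python
import csv
import io


def load_qa_rows(csv_text, question_col="problem", answer_col="answer"):
    """Parse CSV text into a list of {'question', 'answer'} dicts."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {"question": row[question_col], "answer": row[answer_col]}
        for row in reader
    ]


# Tiny inline sample standing in for the real file (column names are assumed).
sample_csv = (
    "problem,answer\n"
    '"Which architects designed the Abasto?",'
    '"José Luis Delpini, Viktor Sulčič and Raúl Bes"\n'
)
rows = load_qa_rows(sample_csv)
print(len(rows), rows[0]["question"])
```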

🎯 Step 1: Preparing Student Model Requests

Now we'll create batch requests for DeepSeek V3 to answer all the SimpleQA questions.

Batch API Request Format

The Batch API expects JSONL format where each line contains a request with:

  • custom_id: Unique identifier for tracking responses
  • body: The actual API request payload
{"custom_id": "request-1", "body": {"model": "deepseek-ai/DeepSeek-V3", "messages": [{"role": "user", "content": "Hello, world!"}]}}
Created simpleqa_batch_student.jsonl with 4326 batch requests
Each request uses model: deepseek-ai/DeepSeek-V3
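
A request file like the one reported above could be produced along these lines. This is a sketch rather than the notebook's own cell: the prompt wording follows the sample requests printed in this section, while the function names `build_requests`, `to_jsonl`, and `write_jsonl` are hypothetical.

```python
import json
import uuid

STUDENT_MODEL = "deepseek-ai/DeepSeek-V3"

# Wording follows the sample requests printed in this notebook.
STUDENT_PROMPT = """
You are a helpful language model answering factual questions correctly. Give your best answer
to the following factual question, focusing on concision and correctness. Do not output anything else
besides your answer to the question.
Question: {question}
Predicted Answer:
"""


def build_requests(questions, model=STUDENT_MODEL):
    """Return one batch request per question, each with a fresh UUID custom_id."""
    return [
        {
            "custom_id": str(uuid.uuid4()),
            "body": {
                "model": model,
                "messages": [
                    {"role": "user", "content": STUDENT_PROMPT.format(question=q)}
                ],
            },
        }
        for q in questions
    ]


def to_jsonl(requests):
    """Serialize requests as JSONL: one JSON object per line."""
    return "".join(json.dumps(request) + "\n" for request in requests)


def write_jsonl(requests, path):
    """Write the JSONL text to disk for upload."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(to_jsonl(requests))
```

Keep each `custom_id` next to its dataset row (for example, as an extra column) so that responses can be joined back to the gold answers later.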
Sample first few requests:
Custom ID: 042fffda-ce74-4c38-8f42-45429b8ee3dc
Question: 
You are a helpful language model answering factual questions correctly. Give your best answer 
to the following factual question, focusing on concision and correctness. Do not output anything else 
besides your answer to the question. 
Question: Who received the IEEE Frank Rosenblatt Award in 2010?
Predicted Answer: 

---
Custom ID: f1305d3b-0688-439b-8fb1-5fe5725575a7
Question: 
You are a helpful language model answering factual questions correctly. Give your best answer 
to the following factual question, focusing on concision and correctness. Do not output anything else 
besides your answer to the question. 
Question: Who was awarded the Oceanography Society's Jerlov Award in 2018?
Predicted Answer: 

---
Custom ID: 917ccf80-be0e-4043-8836-481275015830
Question: 
You are a helpful language model answering factual questions correctly. Give your best answer 
to the following factual question, focusing on concision and correctness. Do not output anything else 
besides your answer to the question. 
Question: What's the name of the women's liberal arts college in Cambridge, Massachusetts?
Predicted Answer: 

---

⚡ Step 2: Running the Student Model Batch Job

Time to submit our batch job and wait for DeepSeek V3-0324 to generate answers.
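
The submission cells are not reproduced here. A minimal sketch of the upload-then-create flow follows. It assumes the SDK exposes `client.files.upload(file=..., purpose="batch-api")` and `client.batches.create_batch(file_id, endpoint=...)`; verify both against the SDK version you have installed. The wrapper name `submit_batch` is hypothetical.

```python
def submit_batch(client, jsonl_path, endpoint="/v1/chat/completions"):
    """Upload a JSONL request file and start a batch job.

    Returns (file_id, batch_id). The two client calls below are assumed
    SDK methods, not confirmed by this notebook's visible output.
    """
    file_resp = client.files.upload(file=jsonl_path, purpose="batch-api")
    batch = client.batches.create_batch(file_resp.id, endpoint=endpoint)
    return file_resp.id, batch.id


# Usage (requires a configured client):
#   file_id, batch_id = submit_batch(client, "simpleqa_batch_student.jsonl")
#   print(f"Batch file ID: {file_id}")
#   print(f"Batch created with ID: {batch_id}")
```

Passing the client in as an argument keeps the function easy to exercise with a stub before spending credits on a real job.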

Uploading file simpleqa_batch_student.jsonl: 100%|██████████| 2.24M/2.24M [00:00<00:00, 4.24MB/s]
Batch file ID: file-73585c5e-0a77-49f6-ac0c-604ebd49727b
Batch created with ID: 1ea94710-592e-4f11-a92f-1a91a4ba3454
BatchJobStatus.IN_PROGRESS
[BatchJob(id='a2e03a80-a083-4c73-8790-43cac5d40715', user_id='66f0bd504fb9511df3489b9a', input_file_id='file-c6131920-f4b3-44ff-8daa-f8e3f5cc4e70', file_size_bytes=1056749, status=<BatchJobStatus.COMPLETED: 'COMPLETED'>, job_deadline=datetime.datetime(2025, 6, 10, 21, 14, 25, 429867, tzinfo=TzInfo(UTC)), created_at=datetime.datetime(2025, 6, 9, 21, 14, 25, 429872, tzinfo=TzInfo(UTC)), endpoint='/v1/chat/completions', progress=100.0, model_id='deepseek-ai/DeepSeek-R1', output_file_id='file-94924b8e-ac08-4166-932d-74055bbda804', error_file_id='file-ac6939fc-b542-438d-9487-cb034afad215', error=None, completed_at=datetime.datetime(2025, 6, 9, 23, 8, 52, 900997, tzinfo=TzInfo(UTC))),
, BatchJob(id='a14c0f60-63c3-4ce4-9181-d6a69d48655e', user_id='66f0bd504fb9511df3489b9a', input_file_id='file-c6131920-f4b3-44ff-8daa-f8e3f5cc4e70', file_size_bytes=1056749, status=<BatchJobStatus.COMPLETED: 'COMPLETED'>, job_deadline=datetime.datetime(2025, 6, 10, 21, 13, 54, 381593, tzinfo=TzInfo(UTC)), created_at=datetime.datetime(2025, 6, 9, 21, 13, 54, 381596, tzinfo=TzInfo(UTC)), endpoint='/v1/chat/completions', progress=100.0, model_id='deepseek-ai/DeepSeek-R1', output_file_id='file-cd65f265-538c-47d2-8373-0fb5eef41f0b', error_file_id='file-7927bcf4-ba29-4ce4-8cc9-fb96af6c8db4', error=None, completed_at=datetime.datetime(2025, 6, 9, 23, 8, 20, 271809, tzinfo=TzInfo(UTC))),
, BatchJob(id='1ea94710-592e-4f11-a92f-1a91a4ba3454', user_id='66f0bd504fb9511df3489b9a', input_file_id='file-73585c5e-0a77-49f6-ac0c-604ebd49727b', file_size_bytes=2237747, status=<BatchJobStatus.IN_PROGRESS: 'IN_PROGRESS'>, job_deadline=datetime.datetime(2025, 6, 12, 6, 11, 22, 752345, tzinfo=TzInfo(UTC)), created_at=datetime.datetime(2025, 6, 11, 6, 11, 22, 752350, tzinfo=TzInfo(UTC)), endpoint='/v1/chat/completions', progress=71.75, model_id='deepseek-ai/DeepSeek-V3', output_file_id=None, error_file_id=None, error=None, completed_at=None),
, BatchJob(id='1e82cb6d-e79e-4e7a-993a-ed59b24a21e4', user_id='66f0bd504fb9511df3489b9a', input_file_id='file-c6131920-f4b3-44ff-8daa-f8e3f5cc4e70', file_size_bytes=1056749, status=<BatchJobStatus.COMPLETED: 'COMPLETED'>, job_deadline=datetime.datetime(2025, 6, 10, 22, 18, 17, 586343, tzinfo=TzInfo(UTC)), created_at=datetime.datetime(2025, 6, 9, 22, 18, 17, 586348, tzinfo=TzInfo(UTC)), endpoint='/v1/chat/completions', progress=100.0, model_id='deepseek-ai/DeepSeek-R1', output_file_id='file-1f20c3d2-1d53-429f-9b16-acc0b986931d', error_file_id='file-c432ae3a-0ff0-4968-b73b-826b20c7c9c0', error=None, completed_at=datetime.datetime(2025, 6, 9, 23, 9, 17, 546888, tzinfo=TzInfo(UTC)))]
Downloading file simpleqa_v3_output.jsonl: 100%|██████████| 2.39M/2.39M [00:00<00:00, 6.00MB/s]

🔍 Step 3: Analyzing Model Responses

Let's examine some of the generated responses before we grade them.
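
To inspect responses, the output file has to be parsed and joined back to the gold answers by `custom_id`. The sketch below assumes each output line nests the chat completion under `response.body`, which should be checked against a real output file. The function names are hypothetical.

```python
import json


def extract_answers(output_jsonl):
    """Map custom_id -> generated text from batch output lines."""
    answers = {}
    for line in output_jsonl.split("\n"):
        if not line.strip():
            continue
        record = json.loads(line)
        # Assumed layout: {"custom_id": ..., "response": {"body": <chat completion>}}
        body = record["response"]["body"]
        answers[record["custom_id"]] = body["choices"][0]["message"]["content"].strip()
    return answers


def join_with_gold(answers, gold_by_id):
    """Pair each gold answer with its prediction; missing predictions become None."""
    return [
        {"custom_id": cid, "predicted": answers.get(cid), "gold": gold}
        for cid, gold in gold_by_id.items()
    ]


# Example with one hand-built output line.
sample_line = json.dumps({
    "custom_id": "q-1",
    "response": {"body": {"choices": [
        {"message": {"content": "Juan Antonio Buschiazzo and Virgilio Colombo."}}
    ]}},
})
answers = extract_answers(sample_line + "\n")
pairs = join_with_gold(answers, {"q-1": "José Luis Delpini, Viktor Sulčič and Raúl Bes"})
print(pairs[0])
```

Joining on `custom_id` rather than on line order matters because batch output is not guaranteed here to preserve the input order.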

First few rows of the combined output:
Input Prompt: 
You are a helpful language model answering factual questions correctly. Give your best answer 
to the following factual question, focusing on concision and correctness. Do not output anything else 
besides your answer to the question. 
Question: Which architects designed the Abasto?
Predicted Answer: 

Predicted Answer: Juan Antonio Buschiazzo and Virgilio Colombo.
Gold Answer: José Luis Delpini, Viktor Sulčič and Raúl Bes

As you can see, the model got this particular example wrong: none of the architects it named match the gold answer.

📋 Step 4: Preparing Grader Requests

Now we'll create batch requests for our judge model to evaluate the answers.
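
A grader request has the same shape as a student request, with the question, gold answer, and predicted answer filled into a grading prompt. In the sketch below, `GRADER_TEMPLATE` is an abbreviated placeholder written for illustration; the grader prompt from the SimpleQA paper is much longer and includes worked examples. The function name is hypothetical.

```python
GRADER_MODEL = "meta-llama/Llama-3.3-70B-Instruct-Turbo"

# Abbreviated placeholder, NOT the SimpleQA paper's grader prompt.
GRADER_TEMPLATE = (
    "Grade the predicted answer against the gold target.\n"
    "Question: {question}\n"
    "Gold target: {target}\n"
    "Predicted answer: {predicted}\n"
    "Reply with a single letter: A (CORRECT), B (INCORRECT), or C (NOT_ATTEMPTED)."
)


def build_grader_request(custom_id, question, target, predicted, model=GRADER_MODEL):
    """Build one batch request asking the judge to grade a single answer."""
    prompt = GRADER_TEMPLATE.format(
        question=question, target=target, predicted=predicted
    )
    return {
        # Reusing the student request's custom_id keeps grades aligned with rows.
        "custom_id": custom_id,
        "body": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    }


example = build_grader_request(
    "q-1",
    "Which architects designed the Abasto?",
    "José Luis Delpini, Viktor Sulčič and Raúl Bes",
    "Juan Antonio Buschiazzo and Virgilio Colombo.",
)
```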

Grading Criteria

The grader uses strict criteria from the SimpleQA paper:

  • CORRECT: Answer contains all required information without contradictions
  • INCORRECT: Answer contains factual errors or contradictions
  • NOT_ATTEMPTED: Answer doesn't provide the required information but doesn't contradict
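
The judge's free-text reply has to be mapped onto one of these three grades. The sketch below assumes the judge is prompted to reply with a single letter (A, B, or C), the convention used by OpenAI's reference SimpleQA grader; the notebook's actual parsing is not shown. Unparseable replies fall back to NOT_ATTEMPTED.

```python
import re

LETTER_TO_GRADE = {"A": "CORRECT", "B": "INCORRECT", "C": "NOT_ATTEMPTED"}


def parse_grade(grader_reply):
    """Return the grade named by the first standalone A, B, or C in the reply."""
    match = re.search(r"\b(A|B|C)\b", grader_reply)
    return LETTER_TO_GRADE[match.group(1)] if match else "NOT_ATTEMPTED"


print(parse_grade("B"))
```

The word-boundary match is deliberately simple and would misread a reply that begins with the English article "A", so it is worth spot-checking parsed grades against raw judge output.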
Created simpleqa_batch_grader.jsonl with 4326 batch requests
Each request uses model: meta-llama/Llama-3.3-70B-Instruct-Turbo

⚖️ Step 5: Running the Grader Batch Job

Submit the grading requests and wait for evaluation results.
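
Waiting for a batch amounts to polling its status until it reaches a terminal state. The sketch below takes a status-fetching callable, so it does not depend on any particular SDK call. The set of terminal status names is an assumption, and `wait_for_batch` is a hypothetical helper. The default of 1,440 one-minute polls matches the 24-hour gap between `created_at` and `job_deadline` in the job listing shown earlier.

```python
import time

# Assumed terminal states; confirm the exact names in the Batch API docs.
TERMINAL_STATUSES = {"COMPLETED", "FAILED", "EXPIRED", "CANCELLED"}


def wait_for_batch(get_status, poll_seconds=60, max_polls=1440, sleep=time.sleep):
    """Poll get_status() until it returns a terminal status, then return it."""
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError("Batch did not reach a terminal status in time.")


# Usage (assumed SDK calls, requires a configured client):
#   status = wait_for_batch(lambda: client.batches.get_batch(batch_id).status.value)
#   if status == "COMPLETED":
#       job = client.batches.get_batch(batch_id)
#       client.files.retrieve_content(
#           id=job.output_file_id, output="simpleqa_grader_output.jsonl"
#       )
```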

Uploading simpleqa_batch_grader.jsonl...
Uploading file simpleqa_batch_grader.jsonl: 100%|██████████| 28.9M/28.9M [00:05<00:00, 5.64MB/s]
Grader batch file ID: file-45de36cc-41a7-4b34-9b00-2f730903d0d3
Creating batch job...
Grader batch created with ID: f5b32360-48b0-40c2-8c27-92d0d0f7e029
Initial status: BatchJobStatus.IN_PROGRESS
Batch is complete! Downloading results...
Downloading file simpleqa_grader_output.jsonl: 100%|██████████| 2.34M/2.34M [00:00<00:00, 4.39MB/s]

📈 Step 6: Final Results Analysis

Let's analyze DeepSeek V3's performance on SimpleQA.

Key Metrics

  • Accuracy: Overall accuracy across all questions
  • Accuracy for attempted: Accuracy when the model provided an answer
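
Both metrics follow directly from the grade counts: overall accuracy divides correct answers by all questions, while accuracy on attempted questions excludes NOT_ATTEMPTED from the denominator. A minimal sketch, with a hypothetical function name:

```python
from collections import Counter


def summarize(grades):
    """Compute counts and both accuracy metrics from a list of grade labels."""
    counts = Counter(grades)
    correct = counts["CORRECT"]
    incorrect = counts["INCORRECT"]
    not_attempted = counts["NOT_ATTEMPTED"]
    total = correct + incorrect + not_attempted
    attempted = correct + incorrect
    return {
        "correct": correct,
        "incorrect": incorrect,
        "not_attempted": not_attempted,
        "total": total,
        "accuracy": correct / total if total else 0.0,
        "accuracy_attempted": correct / attempted if attempted else 0.0,
    }


# Example on a tiny list of grades.
summary = summarize(["CORRECT", "INCORRECT", "INCORRECT", "NOT_ATTEMPTED"])
print(f"Accuracy = {summary['accuracy']:.1%}")
print(f"Accuracy (attempted) = {summary['accuracy_attempted']:.1%}")
```

The two numbers diverge only as far as the model abstains, so a small NOT_ATTEMPTED count keeps them close together.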

Final Results:
Correct: 919
Incorrect: 3333
Not Attempted: 74
Total: 4326
Accuracy = 21.2%
Accuracy (attempted) = 21.6%

🎉 Conclusion

We successfully evaluated DeepSeek V3-0324 on 4,326 SimpleQA questions using Together AI's Batch Inference API, achieving an overall accuracy of 21.2% (919 of 4,326 questions).

🔑 Key Takeaways

  • Batch Processing Efficiency: Batch inference reduces cost and avoids the rate-limit handling that thousands of individual requests would need; here, all 4,326 questions ran as a single job
  • Automated Evaluation: LLM-as-a-judge enables scalable assessment without manual annotation
  • Systematic Methodology: Proper request formatting, progress monitoring, and result alignment are crucial for reliable evaluations
  • Evaluation Complexity: Even straightforward factual questions pose significant challenges for state-of-the-art models

📚 Learn More

For detailed documentation on batch inference capabilities and implementation: 👉 Together AI Batch Inference Documentation