Evaluating LLMs on SimpleQA Using Batch Inference API
Overview
This notebook demonstrates how to use Together AI's Batch Inference API to evaluate large language models on OpenAI's SimpleQA benchmark. We'll evaluate DeepSeek-V3-0324's performance on factual question answering and use an LLM-as-a-Judge grading system.
By the end of this tutorial, you'll understand how to:
- Format requests for the Batch Inference API
- Evaluate LLMs on factual QA benchmarks
- Implement automated grading workflows
- Calculate evaluation metrics for model performance
What is SimpleQA?
SimpleQA is a benchmark consisting of factual questions with short answers. It tests a model's ability to provide accurate, concise responses to straightforward factual queries. The benchmark includes questions across various domains like science, geography, sports, and arts.
Methodology
Our evaluation follows a two-step process:
- Answer Generation: Use DeepSeek-V3-0324 to generate answers to SimpleQA questions
- Automated Grading: Use Llama 3.3 70B as a judge to grade the answers as CORRECT, INCORRECT, or NOT_ATTEMPTED
🚀 Setup and Dependencies
📊 Understanding the Dataset
Let's examine the SimpleQA dataset structure to understand what we're working with.
Here we can see that every row has a factual question and a corresponding answer.
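As a minimal sketch of that row structure, the snippet below parses a tiny in-memory CSV with the standard library. The column names `problem` and `answer` are an assumption about how the dataset file is laid out, and `load_rows` is a hypothetical helper; the single sample row reuses a question and gold answer that appear later in this notebook.

```python
import csv
import io

# Hypothetical in-memory stand-in for the dataset file. The column names
# "problem" and "answer" are assumptions, not confirmed by this notebook.
SAMPLE_CSV = '''problem,answer
Which architects designed the Abasto?,"José Luis Delpini, Viktor Sulčič and Raúl Bes"
'''


def load_rows(csv_text):
    """Return one {'problem': ..., 'answer': ...} dict per CSV row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{"problem": r["problem"], "answer": r["answer"]} for r in reader]


rows = load_rows(SAMPLE_CSV)
for row in rows:
    print("Question:", row["problem"])
    print("Gold answer:", row["answer"])
```

Keeping each row as a plain dictionary makes it easy to attach a request ID to it later, which is what ties a model response back to its gold answer.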
🎯 Step 1: Preparing Student Model Requests
Now we'll create batch requests for DeepSeek V3 to answer all the SimpleQA questions.
Batch API Request Format
The Batch API expects JSONL format where each line contains a request with:
- custom_id: Unique identifier for tracking responses
- body: The actual API request payload
{"custom_id": "request-1", "body": {"model": "deepseek-ai/DeepSeek-V3", "messages": [{"role": "user", "content": "Hello, world!"}]}}
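A request file in this shape could be assembled roughly as follows. This is a sketch rather than the notebook's own code: `build_request` and `to_jsonl` are hypothetical helper names, the line breaks inside the prompt template are assumed, and UUIDs are used for `custom_id` because the sample output in this notebook shows UUID-style IDs.

```python
import json
import uuid

MODEL = "deepseek-ai/DeepSeek-V3"

# Prompt wording taken from the sample requests in this notebook;
# the exact line breaks are an assumption.
PROMPT_TEMPLATE = (
    "You are a helpful language model answering factual questions correctly. "
    "Give your best answer to the following factual question, focusing on "
    "concision and correctness. Do not output anything else besides your "
    "answer to the question.\n\n"
    "Question: {question}\n"
    "Predicted Answer:"
)


def build_request(question, model=MODEL):
    """Return (custom_id, request) for a single question."""
    custom_id = str(uuid.uuid4())
    body = {
        "model": model,
        "messages": [
            {"role": "user", "content": PROMPT_TEMPLATE.format(question=question)}
        ],
    }
    return custom_id, {"custom_id": custom_id, "body": body}


def to_jsonl(requests):
    """Serialize requests as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(r) for r in requests) + "\n"


questions = ["Who received the IEEE Frank Rosenblatt Award in 2010?"]
id_to_question = {}  # kept so responses can be matched back to questions
requests = []
for q in questions:
    cid, req = build_request(q)
    id_to_question[cid] = q
    requests.append(req)

jsonl_text = to_jsonl(requests)
print(jsonl_text)
```

Storing the `custom_id` to question mapping at creation time matters because batch results are matched by ID, so the output order should not be relied on.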
Created simpleqa_batch_student.jsonl with 4326 batch requests
Each request uses model: deepseek-ai/DeepSeek-V3
Sample first few requests:
Custom ID: 042fffda-ce74-4c38-8f42-45429b8ee3dc
Question: You are a helpful language model answering factual questions correctly. Give your best answer to the following factual question, focusing on concision and correctness. Do not output anything else besides your answer to the question.
Question: Who received the IEEE Frank Rosenblatt Award in 2010?
Predicted Answer:
---
Custom ID: f1305d3b-0688-439b-8fb1-5fe5725575a7
Question: You are a helpful language model answering factual questions correctly. Give your best answer to the following factual question, focusing on concision and correctness. Do not output anything else besides your answer to the question.
Question: Who was awarded the Oceanography Society's Jerlov Award in 2018?
Predicted Answer:
---
Custom ID: 917ccf80-be0e-4043-8836-481275015830
Question: You are a helpful language model answering factual questions correctly. Give your best answer to the following factual question, focusing on concision and correctness. Do not output anything else besides your answer to the question.
Question: What's the name of the women's liberal arts college in Cambridge, Massachusetts?
Predicted Answer:
---
⚡ Step 2: Running the Student Model Batch Job
Time to submit our batch job and wait for DeepSeek-V3-0324 to generate answers.
Uploading file simpleqa_batch_student.jsonl: 100%|██████████| 2.24M/2.24M [00:00<00:00, 4.24MB/s]
Batch file ID: file-73585c5e-0a77-49f6-ac0c-604ebd49727b
Batch created with ID: 1ea94710-592e-4f11-a92f-1a91a4ba3454
BatchJobStatus.IN_PROGRESS
[BatchJob(id='a2e03a80-a083-4c73-8790-43cac5d40715', user_id='66f0bd504fb9511df3489b9a', input_file_id='file-c6131920-f4b3-44ff-8daa-f8e3f5cc4e70', file_size_bytes=1056749, status=<BatchJobStatus.COMPLETED: 'COMPLETED'>, job_deadline=datetime.datetime(2025, 6, 10, 21, 14, 25, 429867, tzinfo=TzInfo(UTC)), created_at=datetime.datetime(2025, 6, 9, 21, 14, 25, 429872, tzinfo=TzInfo(UTC)), endpoint='/v1/chat/completions', progress=100.0, model_id='deepseek-ai/DeepSeek-R1', output_file_id='file-94924b8e-ac08-4166-932d-74055bbda804', error_file_id='file-ac6939fc-b542-438d-9487-cb034afad215', error=None, completed_at=datetime.datetime(2025, 6, 9, 23, 8, 52, 900997, tzinfo=TzInfo(UTC))),
 BatchJob(id='a14c0f60-63c3-4ce4-9181-d6a69d48655e', user_id='66f0bd504fb9511df3489b9a', input_file_id='file-c6131920-f4b3-44ff-8daa-f8e3f5cc4e70', file_size_bytes=1056749, status=<BatchJobStatus.COMPLETED: 'COMPLETED'>, job_deadline=datetime.datetime(2025, 6, 10, 21, 13, 54, 381593, tzinfo=TzInfo(UTC)), created_at=datetime.datetime(2025, 6, 9, 21, 13, 54, 381596, tzinfo=TzInfo(UTC)), endpoint='/v1/chat/completions', progress=100.0, model_id='deepseek-ai/DeepSeek-R1', output_file_id='file-cd65f265-538c-47d2-8373-0fb5eef41f0b', error_file_id='file-7927bcf4-ba29-4ce4-8cc9-fb96af6c8db4', error=None, completed_at=datetime.datetime(2025, 6, 9, 23, 8, 20, 271809, tzinfo=TzInfo(UTC))),
 BatchJob(id='1ea94710-592e-4f11-a92f-1a91a4ba3454', user_id='66f0bd504fb9511df3489b9a', input_file_id='file-73585c5e-0a77-49f6-ac0c-604ebd49727b', file_size_bytes=2237747, status=<BatchJobStatus.IN_PROGRESS: 'IN_PROGRESS'>, job_deadline=datetime.datetime(2025, 6, 12, 6, 11, 22, 752345, tzinfo=TzInfo(UTC)), created_at=datetime.datetime(2025, 6, 11, 6, 11, 22, 752350, tzinfo=TzInfo(UTC)), endpoint='/v1/chat/completions', progress=71.75, model_id='deepseek-ai/DeepSeek-V3', output_file_id=None, error_file_id=None, error=None, completed_at=None),
 BatchJob(id='1e82cb6d-e79e-4e7a-993a-ed59b24a21e4', user_id='66f0bd504fb9511df3489b9a', input_file_id='file-c6131920-f4b3-44ff-8daa-f8e3f5cc4e70', file_size_bytes=1056749, status=<BatchJobStatus.COMPLETED: 'COMPLETED'>, job_deadline=datetime.datetime(2025, 6, 10, 22, 18, 17, 586343, tzinfo=TzInfo(UTC)), created_at=datetime.datetime(2025, 6, 9, 22, 18, 17, 586348, tzinfo=TzInfo(UTC)), endpoint='/v1/chat/completions', progress=100.0, model_id='deepseek-ai/DeepSeek-R1', output_file_id='file-1f20c3d2-1d53-429f-9b16-acc0b986931d', error_file_id='file-c432ae3a-0ff0-4968-b73b-826b20c7c9c0', error=None, completed_at=datetime.datetime(2025, 6, 9, 23, 9, 17, 546888, tzinfo=TzInfo(UTC)))]
Downloading file simpleqa_v3_output.jsonl: 100%|██████████| 2.39M/2.39M [00:00<00:00, 6.00MB/s]
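The wait between submitting the job and downloading its output can be sketched as a simple polling loop. The helper below is hypothetical and deliberately independent of any SDK: it takes a `get_status` callable that returns the job status as a plain string, and how that string is obtained from the client is left to the caller. Only `IN_PROGRESS` and `COMPLETED` appear in the output above; the other terminal state names are assumptions. The 24-hour default timeout mirrors the gap between `created_at` and `job_deadline` in the job listing.

```python
import time

# COMPLETED is confirmed by the job listing above; the remaining
# terminal state names are assumptions for illustration.
TERMINAL_STATES = {"COMPLETED", "FAILED", "EXPIRED", "CANCELLED"}


def wait_for_batch(get_status, poll_seconds=30.0,
                   timeout_seconds=24 * 3600, sleep=time.sleep):
    """Call get_status() until it returns a terminal state.

    Raises TimeoutError if no terminal state is seen within
    timeout_seconds of accumulated waiting.
    """
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"batch still {status} after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Usage with a fake status source, so the sketch runs without network access.
fake_statuses = iter(["IN_PROGRESS", "IN_PROGRESS", "COMPLETED"])
final_status = wait_for_batch(lambda: next(fake_statuses),
                              poll_seconds=0, sleep=lambda s: None)
print(final_status)
```

Passing `sleep` as a parameter keeps the loop easy to test; in a real run the default `time.sleep` applies and only a `COMPLETED` result should lead to downloading the output file.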
🔍 Step 3: Analyzing Model Responses
Let's examine some of the generated responses before we grade them.
First few rows of the combined output:
Input Prompt: You are a helpful language model answering factual questions correctly. Give your best answer to the following factual question, focusing on concision and correctness. Do not output anything else besides your answer to the question. Question: Which architects designed the Abasto? Predicted Answer:
Predicted Answer: Juan Antonio Buschiazzo and Virgilio Colombo.
Gold Answer: José Luis Delpini, Viktor Sulčič and Raúl Bes
As you can see, the model got this particular example wrong: the predicted architects do not match the gold answer.
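Combining the batch output with the gold answers comes down to a join on `custom_id`. The sketch below assumes each output line is a JSON object with a `custom_id` and a chat completion nested under `response.body`; that layout, the `example-1` ID, and the helper names `extract_answers` and `align` are assumptions for illustration, not something confirmed by this notebook.

```python
import json


def extract_answers(output_jsonl):
    """Map custom_id -> predicted answer text from batch output JSONL.

    Assumes each line looks like:
    {"custom_id": ..., "response": {"body": {"choices": [{"message": {"content": ...}}]}}}
    """
    answers = {}
    for line in output_jsonl.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        content = record["response"]["body"]["choices"][0]["message"]["content"]
        answers[record["custom_id"]] = content.strip()
    return answers


def align(answers, gold_by_id):
    """Pair predictions with gold answers; a missing prediction becomes ''."""
    return [
        {"custom_id": cid, "predicted": answers.get(cid, ""), "gold": gold}
        for cid, gold in gold_by_id.items()
    ]


# Hypothetical single-line output, reusing the example shown above.
sample_output = json.dumps({
    "custom_id": "example-1",
    "response": {"body": {"choices": [{"message": {
        "content": "Juan Antonio Buschiazzo and Virgilio Colombo."}}]}},
})
gold_by_id = {"example-1": "José Luis Delpini, Viktor Sulčič and Raúl Bes"}

combined = align(extract_answers(sample_output), gold_by_id)
print(combined[0]["predicted"], "|", combined[0]["gold"])
```

Iterating over the gold answers rather than the model outputs means a request that failed or went missing still yields a row, which the grader can then mark as not attempted instead of silently shrinking the evaluation set.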
📋 Step 4: Preparing Grader Requests
Now we'll create batch requests for our judge model to evaluate the answers.
Grading Criteria
The grader uses strict criteria from the SimpleQA paper:
- CORRECT: Answer contains all required information without contradictions
- INCORRECT: Answer contains factual errors or contradictions
- NOT_ATTEMPTED: Answer doesn't provide the required information but doesn't contradict the gold answer either
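Turning the judge's reply into one of these three labels could look like the sketch below. It assumes the judge is instructed to answer with a single letter (A for CORRECT, B for INCORRECT, C for NOT_ATTEMPTED); that letter convention and the `parse_grade` helper are assumptions, and falling back to NOT_ATTEMPTED for unparseable replies is a design choice rather than something this notebook specifies.

```python
import re

# Assumed letter convention for the judge's reply.
LETTER_TO_GRADE = {"A": "CORRECT", "B": "INCORRECT", "C": "NOT_ATTEMPTED"}


def parse_grade(judge_reply):
    """Map a judge reply to CORRECT, INCORRECT, or NOT_ATTEMPTED."""
    text = judge_reply.strip()
    # Accept a spelled-out label such as "incorrect".
    if text.upper() in LETTER_TO_GRADE.values():
        return text.upper()
    # Otherwise look for the first standalone letter A, B, or C.
    match = re.search(r"\b([ABC])\b", text)
    if match is None:
        return "NOT_ATTEMPTED"  # conservative fallback for unparseable replies
    return LETTER_TO_GRADE[match.group(1)]


for reply in ["A", " B\n", "C", "incorrect", ""]:
    print(repr(reply), "->", parse_grade(reply))
```

A conservative fallback keeps malformed judge output from inflating the correct count, though it is worth tracking how often the fallback fires.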
Created simpleqa_batch_grader.jsonl with 4326 batch requests Each request uses model: meta-llama/Llama-3.3-70B-Instruct-Turbo
⚖️ Step 5: Running the Grader Batch Job
Submit the grading requests and wait for evaluation results.
Uploading simpleqa_batch_grader.jsonl...
Uploading file simpleqa_batch_grader.jsonl: 100%|██████████| 28.9M/28.9M [00:05<00:00, 5.64MB/s]
Grader batch file ID: file-45de36cc-41a7-4b34-9b00-2f730903d0d3
Creating batch job...
Grader batch created with ID: f5b32360-48b0-40c2-8c27-92d0d0f7e029
Initial status: BatchJobStatus.IN_PROGRESS
Batch is complete! Downloading results...
Downloading file simpleqa_grader_output.jsonl: 100%|██████████| 2.34M/2.34M [00:00<00:00, 4.39MB/s]
📈 Step 6: Final Results Analysis
Let's analyze DeepSeek V3's performance on SimpleQA.
Key Metrics
- Accuracy: Overall accuracy across all questions
- Accuracy for attempted: Accuracy when the model provided an answer
Final Results:
Correct: 919
Incorrect: 3333
Not Attempted: 74
Total: 4326
Accuracy = 21.2%
Accuracy (attempted) = 21.6%
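Both metrics follow directly from the three grade counts: overall accuracy divides the correct answers by all questions, while accuracy for attempted divides by correct plus incorrect only. The helper below is a small sketch with a hypothetical name, `simpleqa_metrics`; plugging in the counts reported above reproduces the two percentages.

```python
def simpleqa_metrics(correct, incorrect, not_attempted):
    """Compute overall accuracy and accuracy over attempted questions."""
    total = correct + incorrect + not_attempted
    attempted = correct + incorrect
    return {
        "total": total,
        "attempted": attempted,
        # Guard against division by zero on empty inputs.
        "accuracy": correct / total if total else 0.0,
        "accuracy_attempted": correct / attempted if attempted else 0.0,
    }


metrics = simpleqa_metrics(correct=919, incorrect=3333, not_attempted=74)
print(f"Total: {metrics['total']}")                                       # Total: 4326
print(f"Accuracy = {metrics['accuracy']:.1%}")                            # Accuracy = 21.2%
print(f"Accuracy (attempted) = {metrics['accuracy_attempted']:.1%}")      # Accuracy (attempted) = 21.6%
```

The two numbers are close here because only 74 of 4,326 questions were left unattempted; a model that abstains more often would show a wider gap.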
🎉 Conclusion
We successfully evaluated DeepSeek-V3-0324 on 4,326 SimpleQA questions using Together AI's Batch Inference API, achieving an overall accuracy of 21.2% (919 of 4,326 questions).
🔑 Key Takeaways
- Batch Processing Efficiency: Batch inference dramatically reduces costs and eliminates rate limiting for large-scale evaluations
- Automated Evaluation: LLM-as-a-judge enables scalable assessment without manual annotation
- Systematic Methodology: Proper request formatting, progress monitoring, and result alignment are crucial for reliable evaluations
- Evaluation Complexity: Even straightforward factual questions pose significant challenges for state-of-the-art models
📚 Learn More
For detailed documentation on batch inference capabilities and implementation: 👉 Together AI Batch Inference Documentation