GEPA Summarization Optimization with LLM Judge Evaluation
Introduction
This notebook demonstrates how to optimize summarization prompts with a GEPA-style loop and the Together Evaluations API. GEPA originally stands for Genetic-Pareto; this notebook frames its simplified loop as Generate, Evaluate, Propose, Adapt. We'll:
- Load the CNN/DailyMail dataset containing news articles
- Start with a baseline summarization prompt
- Use an optimizer LLM to iteratively improve the prompt
- Compare prompts head-to-head using a judge model
- Track improvement over multiple iterations
Concepts Covered:
- GEPA Optimization: Iterative prompt engineering using LLM feedback (see the GEPA paper for details)
- LLM-as-a-Judge: Using a language model to evaluate and compare outputs
- Batch Evaluation: Efficient comparison of multiple summaries
- Prompt Engineering: Systematic improvement of instruction prompts
📦 Setup and Installation
⚙️ Configuration
Set up your API key and configure the models we'll use:
- Summarizer Model: Generates the summaries
- Judge Model: Evaluates which summary is better
- Optimizer Model: Proposes improvements to the prompt
✓ API key loaded from Colab secrets
✓ Configuration complete
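The configuration cell's code is not shown here. As a minimal sketch of the three model roles listed above, the following assumes the key is read from a `TOGETHER_API_KEY` environment variable; `GepaConfig`, `load_config`, and the model ID placeholders are hypothetical names, not the notebook's own.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class GepaConfig:
    """Model roles used by the optimization (all IDs are placeholders)."""
    summarizer_model: str  # generates the summaries
    judge_model: str       # decides which summary is better
    optimizer_model: str   # proposes improvements to the prompt
    api_key: str


def load_config(env=os.environ) -> GepaConfig:
    # TOGETHER_API_KEY is an assumed variable name; adjust to your setup.
    api_key = env.get("TOGETHER_API_KEY", "")
    if not api_key:
        raise RuntimeError("Set TOGETHER_API_KEY before running the notebook")
    return GepaConfig(
        summarizer_model="<summarizer-model-id>",
        judge_model="<judge-model-id>",
        optimizer_model="<optimizer-model-id>",
        api_key=api_key,
    )
```

Keeping the three roles in one frozen object makes it harder to pass the judge model where the summarizer was intended.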
📝 Baseline and Judge Prompts
We start with a simple baseline prompt for summarization. The GEPA process will iteratively improve this prompt based on performance feedback.
Baseline Prompt:
Summarize this news article in 3-5 key points.
Write a brief summary covering:
- The main news event
- Key people or organizations involved
- Important details or outcomes
- Any significant context
Keep it to 3-5 sentences total.

Judge Prompt:
Compare these two summaries of the same news article.
Which summary better:
- Captures the main news story
- Includes important details
- Is clear and concise
- Avoids unnecessary information
Choose A or B and explain why briefly.
📂 Loading the CNN/DailyMail Dataset
The CNN/DailyMail dataset contains news articles paired with human-written highlights. We'll use the articles as our source text and split the data into train, validation, and test sets.
Dataset Structure:
- article: The full news article text
- highlights: Human-written bullet-point summary
- We'll use the articles for summarization and evaluate our generated summaries
================================================================================
📂 LOADING DATA
================================================================================
Loading CNN/DailyMail dataset...
✓ Loaded 11490 examples
Sample article: (CNN)The Palestinian Authority officially became the 123rd member of the International Criminal Cour...
Sample highlights: Membership gives the ICC jurisdiction over alleged crimes committed in Palestinian territories since...
✓ Converted to 11490 items
✓ Split: Train=150, Val=300, Test=300
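The conversion and split steps in the output above can be sketched as follows. This is an illustration under stated assumptions: `to_items` and `split_items` are hypothetical helpers, the shuffle seed is arbitrary, and synthetic rows stand in for the 11,490 loaded examples so the snippet runs without downloading anything.

```python
import random


def to_items(rows):
    """Keep only the two fields described in the dataset structure."""
    return [{"article": r["article"], "highlights": r["highlights"]} for r in rows]


def split_items(items, n_train=150, n_val=300, n_test=300, seed=0):
    """Shuffle once, then carve out disjoint train/val/test slices."""
    needed = n_train + n_val + n_test
    if len(items) < needed:
        raise ValueError(f"need {needed} items, got {len(items)}")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:needed]
    return train, val, test


# Stand-in rows; the notebook uses real CNN/DailyMail examples instead.
rows = [{"article": f"article {i}", "highlights": f"highlights {i}", "id": i}
        for i in range(1000)]
train, val, test = split_items(to_items(rows))
print(f"Split: Train={len(train)}, Val={len(val)}, Test={len(test)}")
```

Slicing a single shuffled list guarantees that no article appears in more than one split, which matters because the final test comparison is only meaningful on articles the optimizer never saw.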
🤖 Summarization Module
We create a DSPy module that wraps our summarization task. This module can be configured with different instruction prompts, which is key to the GEPA optimization process.
✓ Summarization module defined
📊 Batch Summary Generation
This function generates summaries for a batch of articles using a given prompt. It includes error handling and progress tracking.
✓ Batch generation function defined
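A minimal sketch of batch generation with error handling and progress reporting, assuming the model call is injected as a plain function. `generate_summaries`, `summarize`, and `report_every` are hypothetical names; the notebook's own function goes through the DSPy module and shows a progress bar instead.

```python
from typing import Callable, List, Optional


def generate_summaries(
    articles: List[str],
    prompt: str,
    summarize: Callable[[str, str], str],
    report_every: int = 50,
) -> List[Optional[str]]:
    """Summarize each article; a failed call yields None instead of aborting."""
    summaries: List[Optional[str]] = []
    for i, article in enumerate(articles, start=1):
        try:
            summaries.append(summarize(prompt, article))
        except Exception as exc:  # keep going so one bad request does not lose the batch
            print(f"article {i} failed: {exc}")
            summaries.append(None)
        if i % report_every == 0 or i == len(articles):
            print(f"{i}/{len(articles)} done")
    return summaries


def fake_summarize(prompt: str, article: str) -> str:
    """Stand-in for a model call: fails on empty input, else truncates."""
    if not article:
        raise ValueError("empty article")
    return article[:20]


results = generate_summaries(["A long article about X.", ""], "Summarize:", fake_summarize)
```

Returning `None` for failures keeps the output aligned with the input list, so summaries from two prompts can still be paired by position later.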
🧠 Optimizer LLM Wrapper
This wrapper allows us to use an LLM to propose improvements to our summarization prompt based on current performance.
✓ Optimizer LLM wrapper defined
🤔 Reflection and Prompt Improvement
This function uses the optimizer LLM to analyze the current prompt and performance, then propose an improved version.
Key Constraints:
- Keep prompts under 150 words for clarity
- Focus on simple, direct instructions
- Target 4-6 sentence summaries
- Avoid overly complex requirements
✓ Reflection function defined
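The constraints above can be enforced in two places: in the request sent to the optimizer model and in a check on what comes back. The sketch below assumes hypothetical helper names (`build_reflection_request`, `validate_candidate`) and illustrative request wording; the notebook's actual reflection prompt is not shown.

```python
MAX_PROMPT_WORDS = 150  # limit stated in the constraints above


def build_reflection_request(current_prompt: str, win_rate: float) -> str:
    """Assemble the text sent to the optimizer model (wording is illustrative)."""
    return (
        "You are improving a news summarization prompt.\n"
        f"Current prompt:\n{current_prompt}\n\n"
        f"Win rate in the last comparison: {win_rate:.2%}\n\n"
        "Propose a revised prompt that:\n"
        f"- stays under {MAX_PROMPT_WORDS} words\n"
        "- uses simple, direct instructions\n"
        "- asks for a 4-6 sentence summary\n"
        "- avoids overly complex requirements\n"
        "Return only the revised prompt."
    )


def validate_candidate(candidate: str, fallback: str) -> str:
    """Fall back to the current prompt when the proposal is empty or too long."""
    cleaned = candidate.strip()
    if not cleaned or len(cleaned.split()) >= MAX_PROMPT_WORDS:
        return fallback
    return cleaned
```

Checking the word count after the fact is worthwhile because an instruction to stay under a limit does not guarantee the model follows it.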
🔄 Head-to-Head Prompt Comparison
This function compares two prompts by:
- Generating summaries with both prompts
- Creating a comparison dataset
- Using the Together AI evaluation API with a judge model
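- Computing win rates over decisive comparisons (ties are excluded)
The evaluation uses a two-pass approach to reduce position bias: each pair is judged twice, with the order of the two summaries swapped.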
✓ Comparison function defined
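A rough sketch of the local parts of this comparison: building the paired dataset and combining two judge passes. The field names (`summary_a`, `summary_b`), the helper names, and the rule that inconsistent verdicts count as ties are all assumptions for illustration; the Together evaluation service applies its own two-pass logic, and no API calls are shown here.

```python
import json


def build_comparison_rows(articles, summaries_a, summaries_b):
    """Pair the two summaries of each article, skipping failed generations."""
    rows = []
    for article, a, b in zip(articles, summaries_a, summaries_b):
        if a is None or b is None:
            continue
        rows.append({"article": article, "summary_a": a, "summary_b": b})
    return rows


def to_jsonl(rows) -> str:
    """Serialize rows as JSON Lines, one comparison per line."""
    return "".join(json.dumps(row, ensure_ascii=False) + "\n" for row in rows)


def resolve_two_pass(original_order: str, swapped_order: str) -> str:
    """Combine two verdicts; the second was made with the summaries swapped.

    Each verdict is "A", "B" (by position) or "Tie". A win counts only when
    the same summary is preferred in both orders; anything else is a tie.
    """
    unswap = {"A": "B", "B": "A", "Tie": "Tie"}
    second = unswap[swapped_order]
    return original_order if original_order == second else "Tie"
```

Under a rule like this, a judge that simply favors whichever summary comes first produces a tie rather than a win, which is the point of the second pass.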
🧬 GEPA Optimization Loop
This is the main optimization loop that implements the GEPA algorithm:
- Generate: Create summaries with current prompt
- Evaluate: Compare the candidate against the current best prompt (the baseline at first) using the judge model
- Propose: Use optimizer LLM to suggest improvements
- Adapt: Accept a candidate only if it beats the current best prompt (win rate above 50%)
The process repeats for multiple iterations, tracking the best prompt found.
✓ GEPA optimization function defined
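The loop can be sketched as a greedy accept-if-better search. This is a simplified illustration, not the notebook's function: `optimize_prompt`, `propose`, and `compare` are hypothetical names, and the model and judge calls are injected so the control flow can be run on its own.

```python
from typing import Callable, List, Tuple


def optimize_prompt(
    baseline: str,
    propose: Callable[[str, float], str],
    compare: Callable[[str, str], float],
    iterations: int = 5,
) -> Tuple[str, List[dict]]:
    """Greedy accept-if-better loop over candidate prompts.

    propose(best_prompt, last_rate) returns a candidate prompt.
    compare(best_prompt, candidate) returns the candidate's win rate in [0, 1].
    """
    best = baseline
    last_rate = 0.5  # neutral value before any comparison has run
    history = []
    # Iteration 1 only establishes the baseline, so proposals start at 2.
    for step in range(2, iterations + 1):
        candidate = propose(best, last_rate)
        last_rate = compare(best, candidate)
        accepted = last_rate > 0.5
        history.append({"iteration": step, "candidate_rate": last_rate, "accepted": accepted})
        if accepted:
            best = candidate
    return best, history


# Replay with stub functions, using candidate win rates like those in the run log.
logged_rates = iter([0.5469, 0.4603, 0.3333, 0.4231])
counter = iter(range(1, 100))
best, history = optimize_prompt(
    "baseline prompt",
    propose=lambda current, rate: f"candidate {next(counter)}",
    compare=lambda current, cand: next(logged_rates),
)
```

With these rates only the first candidate is accepted, so `best` ends as `"candidate 1"`. Because each candidate is compared against the current best rather than the original baseline, the bar rises after every accepted change.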
🚀 Run the Optimization
Now we'll execute the full GEPA optimization process. This will:
- Set up the summarizer and optimizer models
- Run multiple iterations of prompt improvement
- Evaluate the final optimized prompt on the test set
- Display comprehensive results
================================================================================
🎯 GEPA SUMMARIZATION - TOGETHER AI BATCH EVAL
================================================================================
================================================================================
🧬 MANUAL GEPA OPTIMIZATION
================================================================================
================================================================================
ITERATION 1/5
================================================================================
Iteration 0: Establishing baseline (no comparison yet)
================================================================================
ITERATION 2/5
================================================================================
🤔 REFLECTION (Iteration 1)
✓ Generated new prompt (63 words)
✓ Generated candidate prompt (404 chars)
================================================================================
🔄 COMPARING PROMPTS: iter1_val
================================================================================
Generating summaries with Prompt A...
Using prompt: Summarize this news article in 3-5 key points. Write a brief summary covering: - The main news even...
Prompt A: 100%|██████████| 300/300 [14:30<00:00, 2.90s/it]
Generating summaries with Prompt B...
Using prompt: Summarize this news article in 4-6 sentences, focusing on clarity and concision. Please cover the f...
Prompt B: 100%|██████████| 300/300 [17:16<00:00, 3.46s/it]
📤 Uploading for comparison...
Uploading file temp_compare_iter1_val_20251222_170518.jsonl: 100%|██████████| 1.59M/1.59M [00:00<00:00, 2.82MB/s]
🚀 Launching comparison...
⏳ Waiting (ID: eval-94eb-1766423120)...
✓ Results: Prompt A wins=29, Prompt B wins=35, Ties=236
✓ Prompt A win rate: 45.31%
Current best: 45.31%
New candidate: 54.69%
🎉 New best! (+4.69pp)
================================================================================
ITERATION 3/5
================================================================================
🤔 REFLECTION (Iteration 2)
✓ Generated new prompt (58 words)
✓ Generated candidate prompt (389 chars)
================================================================================
🔄 COMPARING PROMPTS: iter2_val
================================================================================
Generating summaries with Prompt A...
Using prompt: Summarize this news article in 4-6 sentences, focusing on clarity and concision. Please cover the f...
Prompt A: 100%|██████████| 300/300 [00:39<00:00, 7.68it/s]
Generating summaries with Prompt B...
Using prompt: Write a 4-6 sentence summary of this news article, prioritizing clarity and accuracy. Clearly stat...
Prompt B: 100%|██████████| 300/300 [15:55<00:00, 3.18s/it]
📤 Uploading for comparison...
Uploading file temp_compare_iter2_val_20251222_173300.jsonl: 100%|██████████| 1.62M/1.62M [00:00<00:00, 3.48MB/s]
🚀 Launching comparison...
⏳ Waiting (ID: eval-6faf-1766424783)...
✓ Results: Prompt A wins=34, Prompt B wins=29, Ties=237
✓ Prompt A win rate: 53.97%
Current best: 53.97%
New candidate: 46.03%
No improvement
================================================================================
ITERATION 4/5
================================================================================
🤔 REFLECTION (Iteration 3)
✓ Generated new prompt (87 words)
✓ Generated candidate prompt (578 chars)
================================================================================
🔄 COMPARING PROMPTS: iter3_val
================================================================================
Generating summaries with Prompt A...
Using prompt: Summarize this news article in 4-6 sentences, focusing on clarity and concision. Please cover the f...
Prompt A: 100%|██████████| 300/300 [00:37<00:00, 8.08it/s]
Generating summaries with Prompt B...
Using prompt: Summarize this news article in 4-6 sentences, focusing on the most important facts. Provide a clear ...
Prompt B: 100%|██████████| 300/300 [15:51<00:00, 3.17s/it]
📤 Uploading for comparison...
Uploading file temp_compare_iter3_val_20251222_181544.jsonl: 100%|██████████| 1.65M/1.65M [00:00<00:00, 2.48MB/s]
🚀 Launching comparison...
⏳ Waiting (ID: eval-1788-1766427347)...
✓ Results: Prompt A wins=44, Prompt B wins=22, Ties=234
✓ Prompt A win rate: 66.67%
Current best: 66.67%
New candidate: 33.33%
No improvement
================================================================================
ITERATION 5/5
================================================================================
🤔 REFLECTION (Iteration 4)
✓ Generated new prompt (77 words)
✓ Generated candidate prompt (547 chars)
================================================================================
🔄 COMPARING PROMPTS: iter4_val
================================================================================
Generating summaries with Prompt A...
Using prompt: Summarize this news article in 4-6 sentences, focusing on clarity and concision. Please cover the f...
Prompt A: 100%|██████████| 300/300 [00:40<00:00, 7.47it/s]
Generating summaries with Prompt B...
Using prompt: Summarize this news article in 4-6 sentences, focusing on accuracy, brevity, and clarity. Clearly s...
Prompt B: 100%|██████████| 300/300 [16:34<00:00, 3.32s/it]
📤 Uploading for comparison...
Uploading file temp_compare_iter4_val_20251222_184909.jsonl: 100%|██████████| 1.62M/1.62M [00:00<00:00, 1.77MB/s]
🚀 Launching comparison...
⏳ Waiting (ID: eval-1e94-1766429353)...
✓ Results: Prompt A wins=45, Prompt B wins=33, Ties=222
✓ Prompt A win rate: 57.69%
Current best: 57.69%
New candidate: 42.31%
No improvement
================================================================================
📊 FINAL TEST EVALUATION
================================================================================
⏱️ OPTIMIZATION TIME:
Total: 2h 31m 48s
================================================================================
🔄 COMPARING PROMPTS: final_test
================================================================================
Generating summaries with Prompt A...
Using prompt: Summarize this news article in 3-5 key points. Write a brief summary covering: - The main news even...
Prompt A: 100%|██████████| 300/300 [16:05<00:00, 3.22s/it]
Generating summaries with Prompt B...
Using prompt: Summarize this news article in 4-6 sentences, focusing on clarity and concision. Please cover the f...
Prompt B: 100%|██████████| 300/300 [18:27<00:00, 3.69s/it]
📤 Uploading for comparison...
Uploading file temp_compare_final_test_20251222_193951.jsonl: 100%|██████████| 1.57M/1.57M [00:00<00:00, 2.74MB/s]
🚀 Launching comparison...
⏳ Waiting (ID: eval-ff84-1766432395)...
✓ Results: Prompt A wins=25, Prompt B wins=41, Ties=234
✓ Prompt A win rate: 37.88%
================================================================================
🎉 FINAL RESULTS
================================================================================
TEST SET:
Baseline prompt: 37.88%
Optimized prompt: 62.12%
Improvement: +12.12pp from neutral
💾 Saved to: results/prompts_20251222_195058.txt
✅ Complete!
📊 Analyzing the Results
Let's examine the optimized prompt and compare it to the baseline.
================================================================================
📝 PROMPT COMPARISON
================================================================================
BASELINE PROMPT:
--------------------------------------------------------------------------------
Summarize this news article in 3-5 key points.
Write a brief summary covering:
- The main news event
- Key people or organizations involved
- Important details or outcomes
- Any significant context
Keep it to 3-5 sentences total.

OPTIMIZED PROMPT:
--------------------------------------------------------------------------------
Summarize this news article in 4-6 sentences, focusing on clarity and concision.
Please cover the following key aspects:
- What is the main news event being reported?
- Who are the key people or organizations involved?
- What are the most important details or outcomes of the event?
Provide relevant background information if necessary, but prioritize the essential facts and avoid unnecessary details.

PERFORMANCE COMPARISON:
--------------------------------------------------------------------------------
Baseline Win Rate: 37.88%
Optimized Win Rate: 62.12%
Improvement: +12.12 percentage points from neutral
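The reported percentages follow from the final test counts once ties are left out of the denominator. The short check below reproduces that arithmetic; `win_rate` is a hypothetical helper, and the counts are copied from the final test log (baseline 25 wins, optimized 41 wins, 234 ties).

```python
def win_rate(wins: int, losses: int) -> float:
    """Share of decisive comparisons won; ties are left out of the denominator."""
    decisive = wins + losses
    return wins / decisive if decisive else 0.5


# Final test-set counts from the log: baseline (A) 25, optimized (B) 41, ties 234.
baseline_rate = win_rate(25, 41)
optimized_rate = win_rate(41, 25)
gain_pp = (optimized_rate - 0.5) * 100  # distance from a 50% coin flip

print(f"Baseline:  {baseline_rate:.2%}")
print(f"Optimized: {optimized_rate:.2%}")
print(f"Improvement: {gain_pp:+.2f}pp from neutral")
print(f"Decisive comparisons: {25 + 41} of {25 + 41 + 234}")
```

Only 66 of the 300 test comparisons were decisive, so the 62.12% figure describes 22% of the test set; the remaining 78% were ties.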
🔑 Key Findings
GEPA Optimization Process:
- Iteratively improves prompts through LLM-guided reflection
- Uses head-to-head comparisons with a judge model
- Tracks the best prompt found so far and accepts a candidate only when it beats that prompt
Benefits of This Approach:
- Automated: No manual prompt iteration beyond writing the baseline, judge, and reflection prompts
- Data-driven: Decisions based on actual performance metrics
- Scalable: Can optimize for any task with appropriate data
- Transparent: Clear tracking of improvements across iterations
Next Steps:
- Try with different datasets or domains
- Experiment with different judge criteria
- Adjust the optimizer's reflection prompt
- Increase iterations for potentially better results