Notebooks
O
OpenAI
Prompt Optimization Cookbook

Prompt Optimization Cookbook

chatgptopenaigpt-4examplesgpt-5openai-apiopenai-cookbook

GPT-5 Prompt Migration and Improvement using the new prompt optimizer

The GPT-5 Family of models are the smartest models we’ve released to date, representing a step change in the models’ capabilities across the board. GPT-5 is particularly specialized in agentic task performance, coding, and steerability, making it a great fit for everyone from curious users to advanced researchers.

GPT-5 will benefit from all the traditional prompting best practices, but to make optimizations and migrations easier, we are introducing the GPT-5 Prompt Optimizer in our Playground to help users get started on improving existing prompts and migrating prompts for GPT-5 and other OpenAI models.

Prompt Optimizer demo

In this cookbook we will show you how to use the Prompt Optimzer to get spun up quickly to solve your tasks with GPT-5, while demonstrating how prompt optimize can have measurable improvements.

Migrating and Optimizing Prompts

Crafting effective prompts is a critical skill when working with LLMs. The goal of the Prompt Optimizer is to give your prompt the best practices and formatting most effective for our models. The Optimizer also removes common prompting failure modes such as:

• Contradictions in the prompt instructions
• Missing or unclear format specifications
• Inconsistencies between the prompt and few-shot examples

Along with tuning the prompt for the target model, the Optimizer is cognizant of the specific task you are trying to accomplish and can apply crucial practices to boost performance in Agentic Workflows, Coding and Multi-Modality. Let's walk through some before-and-afters to see where prompt optimization shines.

Remember that prompting is not a one-size-fits-all experience, so we recommend running thorough experiments and iterating to find the best solution for your problem.

Ensure you have set up your OpenAI API Key set as OPENAI_API_KEY and have access to GPT-5

[1]
OPENAI_API_KEY is set!
[ ]

Coding and Analytics: Streaming Top‑K Frequent Words

We start with a task in a field that model has seen significant improvements: Coding and Analytics. We will ask the model to generate a Python script that computes the exact Top‑K most frequent tokens from a large text stream using a specific tokenization spec. Tasks like these are highly sensitive to poor prompting as they can push the model toward the wrong algorithms and approaches (approximate sketches vs multi‑pass/disk‑backed exact solutions), dramatically affecting accuracy and runtime.

For this task, we will evaluate:

  1. Compilation/Execution success over 30 runs
  2. Average runtime (successful runs)
  3. Average peak memory (successful runs)
  4. Exactness: output matches ground‑truth Top‑K with tie‑break: by count desc, then token asc

Note: Evaluated on an M4 Max MacBook Pro; adjust constraints if needed.

Our Baseline Prompt

For our example, let's look at a typical starting prompt with some minor contradictions in the prompt, and ambiguous or underspecified instructions. Contradictions in instructions often reduce performance and increase latency, especially in reasoning models like GPT-5, and ambiguous instructions can cause unwanted behaviors.

[4]

This baseline prompt is something that you could expect from asking ChatGPT to write you a prompt, or talking to a friend who is knowledgeable about coding but not particularly invested in your specific use case. Our baseline prompt is intentionally shorter and friendlier, but it hides mixed signals that can push the model into inconsistent solution families.

First, we say to prefer the standard library, then immediately allow external packages “if they make things simpler.” That soft permission can nudge the model toward non‑portable dependencies or heavier imports that change performance and even execution success across environments.

Next, we encourage single‑pass streaming to keep memory low, but we also say it’s fine to reread or cache “if that makes the solution clearer.” That ambiguity opens the door to multi‑pass designs or in‑memory caches that defeat the original streaming constraint and can alter runtime and memory profiles.

We also ask for exact results while permitting approximate methods “when they don’t change the outcome in practice.” This is a judgment call the model can’t reliably verify. It may introduce sketches or heuristics that subtly shift counts near the Top‑K boundary, producing results that look right but fail strict evaluation.

We advise avoiding global state, yet suggest exposing a convenient global like top_k. That mixes interface contracts: is the function supposed to return data, or should callers read globals? Models may implement both, causing side effects that complicate evaluation and reproducibility.

Documentation guidance is similarly split: “keep comments minimal” but “add brief explanations.” Depending on how the model interprets this, you can get under‑explained code or prose interleaved with logic, which sometimes leaks outside the required output format.

Finally, we ask for “natural, human‑friendly” sorting while also mentioning strict tie rules. These aren’t always the same. The model might pick convenience ordering (e.g., Counter.most_common) and drift from the evaluator’s canonical (-count, token) sort, especially on ties—leading to subtle correctness misses.

Why this matters: the softened constraints make the prompt feel easy to satisfy, but they create forks in the road. The model may pick different branches across runs—stdlib vs external deps, one‑pass vs reread/cache, exact vs approximate—yielding variability in correctness, latency, and memory.

Our evaluator remains strict: fixed tokenization [a-z0-9]+ on lowercased text and deterministic ordering by (-count, token). Any divergence here will penalize exactness even if the rest of the solution looks reasonable.

Let's see how it performs: Generating 30 code scripts with the baseline prompt

Using the OpenAI Responses API we'll invoke the model 30 times with our baseline prompt and save each response as a Python file in the results_topk_baseline. This may take some time.

[ ]

Evaluate Generated Scripts - Baseline Prompt

We then benchmark every script in results_topk_baseline On larger datasets this evaluation is intentionally heavy and can take several minutes.

[ ]

Optimizing our Prompt

Now let's use the prompt optimization tool in the console to improve our prompt and then review the results. We can start by going to the OpenAI Optimize Playground, and pasting our existing prompt in the Developer Message section.

From there press the Optimize button. This will open the optimization panel. At this stage, you can either provide specific edits you'd like to see reflected in the prompt or simply press Optimize to have it refined according to best practices for the target model and task. To start let's do just this.

optimize_image

Once it's completed you'll see the result of the prompt optimization. In our example below you'll see many changes were made to the prompt. It will also give you snippets of what it changed and why the change was made. You can interact with these by opening the comments up or using the inline reviewer mode.

We'll add an additional change we'd like which include:

  • Enforcing the single-pass streaming

This is easy using the iterative process of the Prompt Optimizer.

optimize_image

Once we are happy with the optimized version of our prompt, we can save it as a Prompt Object using a button on the top right of the optimizer. We can use this object within our API Calls which can help with future iteration, version management, and reusability across different applications.

optimize_image

Let's see how it performs: Evaluating our improved prompt

For visibility we will provide our new optimized prompt here, but you can also pass the prompt_id and version. Let's start by writing out our optimized prompt.

[18]

Generating 30 code scripts with the Optimized prompt

[ ]

Evaluate Generated Scripts - Optimized Prompt

We run the same evaluation as above, but now with our optimized prompt to see if there were any improvements

[ ]

Adding LLM-as-a-Judge Grading

Along with more quantitative evaluations we can measure the models performance on more qualitative metrics like code quality, and task adherence. We have created a sample prompt for this called llm_as_judge.txt.

[21]
[ ]
[ ]

Summarizing the results

We can now demonstrate from both a quantitative standpoint, along with a qualitative standpoint from our LLM as Judge results.

[6]
Output
### Prompt Optimization Results - Coding Tasks

| Metric                      | Baseline | Optimized | Δ (Opt − Base) |
|----------------------------|---------:|----------:|---------------:|
| Avg Time (s)                |    7.906 |     6.977 |        -0.929 |
| Peak Memory (KB)            |   3626.3 |     577.5 |       -3048.8 |
| Exact (%)                   |    100.0 |     100.0 |           0.0 |
| Sorted (%)                  |    100.0 |     100.0 |           0.0 |
| LLM Adherence (1–5)         |     4.40 |      4.90 |         +0.50 |
| Code Quality (1–5)          |     4.73 |      4.90 |         +0.16 |

Even though GPT-5 already produced correct code, prompt optimization tightened constraints and clarified any ambiguity. Showing overall improvements to the results!


Context and Retrieval: Simulating a Financial Question Answering

Most production use cases face imperfect queries and noisy context. FailSafeQA is an excellent benchmark that deliberately perturbs both the query (misspellings, incompleteness, off-domain phrasing) and the context (missing, OCR-corrupted, or irrelevant docs) and reports Robustness, Context Grounding, and Compliance—i.e., can the model answer when the signal exists and abstain when it doesn’t.

FailSafeQA diagram

Links

We will run FailSafeQA evaluations via the helper script and compare Baseline vs Optimized prompts side by side.

[3]

We can use the prompt optimizer once again to construct a new prompt that is more suitable for this use case. Drawing on best practices for long-context question answering, we know that we should remind our answer model to rely on information in the context section and refuse answers to questions if the context is insufficient. By using the Optimize button once without any arguments we get a reasonable structure for the prompt and end up with this as our optimized prompt.

optimize_image

[4]

Let's now run our evaluations, for demonstration we will display the results of a single comparison, but you can also run the full evaluation. Note: This will take time.

[ ]
[1]
## FailSafeQA — Summary

**Compliance threshold:** ≥ 6

| Metric                                    | Baseline | Optimized | Δ (Opt − Base) |
| ----------------------------------------- | -------- | --------- | -------------- |
| Robustness (avg across datapoints)        | 0.320    | 0.540     | +0.220         |
| Context Grounding (avg across datapoints) | 0.800    | 0.950     | +0.150         |

_Source files:_ `results_failsafeqa.csv` · `results_failsafeqa.csv`

GPT-5-mini crushes this task, so even the baseline prompt gets scores of >= 4 almost all of the time. However if we compare the percent of perfect scores (6/6) for the judge, we see that the optimize prompt has way significantly more perfect answers when evaluated in the two categories of FailSafeQA answer quality: robustness and context grounding.

Conclusion

We’re excited for everyone to try Prompt Optimization for GPT-5 in the OpenAI Playground. GPT-5 brings state-of-the-art intelligence, and a strong prompt helps it reason more reliably, follow constraints, and produce cleaner, higher quality results.

Give the Prompt Optimizer a try on your task today!