Optimizing JSON Webpage Prompts with the Arize Prompt Learning SDK
In this cookbook, we demonstrate a use case of the Arize Prompt Learning SDK by optimizing a system prompt for GPT-4o. The goal is to improve the model’s ability to generate accurate JSON representations of webpages in response to user queries. The dataset consists of prompts asking GPT to generate webpages, and we define 10 specific rules that the JSON outputs must satisfy. Using the SDK, we iteratively refine the prompt to achieve high accuracy on the training set, and then evaluate its performance on a separate test set.
Configuration
NUM_SAMPLES: Controls how many rows to sample from the full dataset. Set to 0 to use all available data, or a positive number to limit the sample size for faster experimentation.
TRAIN_SPLIT_FRACTION: Determines the train/test split ratio. 0.8 means 80% of data goes to training set, 20% to test set.
NUM_RULES: Specifies the number of rules to use for evaluation. This determines which prompt files to load (e.g., evaluator-prompt-10.txt vs evaluator-prompt-50.txt).
NUM_OPTIMIZATION_LOOPS: Sets how many optimization iterations to run per experiment. Each loop generates outputs, evaluates them, and refines the prompt.
These variables control the experiment scope, data splitting, evaluation criteria, and optimization intensity.
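A minimal configuration cell might look like the following (illustrative values; tune them for your own run):

```python
# Experiment configuration (illustrative values).
NUM_SAMPLES = 50              # 0 = use the full dataset
TRAIN_SPLIT_FRACTION = 0.8    # 80% train / 20% test
NUM_RULES = 10                # selects e.g. evaluator-prompt-10.txt
NUM_OPTIMIZATION_LOOPS = 5    # optimization iterations per experiment
```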
OpenAI Key
We will use OpenAI to generate the webpage JSON outputs.
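One way to wire up the key (the placeholder string is just that, a placeholder to replace with your own key):

```python
import os

# Use an existing OPENAI_API_KEY from the environment if present;
# otherwise fall back to a placeholder that you must replace.
os.environ.setdefault("OPENAI_API_KEY", "sk-your-key-here")
```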
Training and Test Datasets
Create training and test datasets, and export to Arize.
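A sketch of the sampling and splitting logic, assuming the dataset is a pandas DataFrame (the helper name and column layout are illustrative, not the SDK's API):

```python
import pandas as pd

def split_dataset(df: pd.DataFrame, num_samples: int = 0,
                  train_fraction: float = 0.8, seed: int = 42):
    """Optionally subsample, shuffle, then split into train/test frames."""
    if num_samples > 0:
        df = df.sample(n=min(num_samples, len(df)), random_state=seed)
    df = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    cut = int(len(df) * train_fraction)
    return df.iloc[:cut], df.iloc[cut:]
```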
Initial System Prompt
Initialize your system prompt. This is the original prompt that will be tested and optimized.
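An illustrative starting prompt in this spirit (the notebook's actual prompt may differ):

```python
# A simple baseline system prompt; the optimizer will rewrite it.
system_prompt = (
    "You are a web designer. Given a user's request, return a single JSON "
    "object that fully describes the requested webpage, including its "
    "structure, components, and styling. Respond with valid JSON only."
)
```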
Evaluators
This cell initializes two evaluators that use LLMs as judges to assess the quality of generated outputs.
evaluate_output: A comprehensive evaluator that assesses JSON webpage correctness against the input query and evaluation rules. It provides:
- Correctness labels: "correct" or "incorrect"
- Detailed explanations: Reasoning for the evaluation decision
rule_checker: A specialized evaluator that performs granular rule-by-rule analysis. It:
- Checks individual rules: Evaluates compliance with each rule separately
Both evaluators generate feedback that the optimization loop uses to iteratively improve the system prompt. The explanations and rule violations guide the PromptLearningOptimizer in creating more effective prompts.
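Since the judges return free-text verdicts, a small normalization helper (hypothetical, not part of the SDK) is useful for mapping them onto the binary labels used for scoring:

```python
def parse_label(raw: str) -> str:
    """Normalize a judge's free-text verdict to 'correct'/'incorrect'.

    Checks for 'incorrect' first, since 'correct' is a substring of it.
    """
    text = raw.strip().lower()
    if "incorrect" in text:
        return "incorrect"
    return "correct" if "correct" in text else "incorrect"
```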
Output Generation
This cell defines the function that generates JSON webpage outputs using the current system prompt.
Model: Uses GPT-4.1 with JSON response format and zero temperature for consistent outputs.
Function: Takes a dataset and system prompt, generates outputs for all rows, and returns the results for evaluation.
Usage: Called during each optimization iteration to produce outputs that the evaluators will assess.
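A sketch of such a generation function, assuming the dataset exposes a `query` column (the function name and column are assumptions; a client can be injected for testing):

```python
def generate_outputs(dataset, system_prompt, client=None, model="gpt-4.1"):
    """Generate one JSON webpage per row.

    temperature=0 plus the JSON response format keeps outputs
    deterministic and parseable.
    """
    if client is None:
        from openai import OpenAI  # deferred import so a stub can be injected
        client = OpenAI()
    outputs = []
    for query in dataset["query"]:
        resp = client.chat.completions.create(
            model=model,
            temperature=0,
            response_format={"type": "json_object"},
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": query},
            ],
        )
        outputs.append(resp.choices[0].message.content)
    return outputs
```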
Additional Metrics
Optimization Loop
This cell implements the core prompt optimization algorithm. The loop follows a 3-step process:
1. Generate & Evaluate: Generate outputs using the current prompt on the test dataset and evaluate their correctness
2. Train & Optimize: If results are unsatisfactory, generate outputs on the training set, evaluate them, and use the feedback to create an improved prompt
3. Iterate: Repeat until either the threshold is met or all loops are completed
The algorithm tracks metrics across iterations and returns detailed results including train/test accuracy scores, optimized prompts, and raw evaluation data. The optimization uses the PromptLearningOptimizer to iteratively refine the system prompt based on evaluator feedback.
Key parameters:
- threshold: Target accuracy score to stop optimization
- loops: Maximum number of optimization iterations
- scorer: Metric to optimize (accuracy, f1, precision, recall)
- num_rules: Number of evaluation rules to use
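The loop can be sketched as a simplified skeleton with injected `generate`, `evaluate`, and `refine` callables standing in for the real SDK calls (in the notebook, `refine` is played by PromptLearningOptimizer):

```python
def optimize(system_prompt, train_df, test_df, generate, evaluate, refine,
             threshold=1.0, loops=5):
    """Skeleton of the generate -> evaluate -> refine loop.

    generate(df, prompt) returns model outputs; evaluate(outputs, df)
    returns an accuracy score; refine(prompt, outputs, df) returns an
    improved prompt.
    """
    history, prompt = [], system_prompt
    for i in range(loops):
        test_acc = evaluate(generate(test_df, prompt), test_df)
        history.append({"iteration": i, "prompt": prompt,
                        "test_accuracy": test_acc})
        if test_acc >= threshold:
            break  # target accuracy reached; stop early
        train_outputs = generate(train_df, prompt)
        history[-1]["train_accuracy"] = evaluate(train_outputs, train_df)
        prompt = refine(prompt, train_outputs, train_df)
    return history
```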
Results Saving Functions
This cell defines two utility functions for saving experiment results in different formats:
save_experiment_results(): Saves complete experiment data to JSON format with timestamps. Useful for preserving all experiment details including raw evaluation data and metadata.
save_single_experiment_csv(): Creates lightweight CSV files with iteration-level data including:
- Iteration number
- Number of rules used
- Test and train accuracy scores
- Optimized prompt text
The CSV format makes it easy to analyze performance trends and prompt evolution over time. Files are automatically timestamped to avoid overwriting previous results.
Output format: Each row represents one optimization iteration with metrics and the corresponding optimized prompt.
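A sketch of the two helpers, assuming the iteration history is a list of dicts like the one produced by the optimization loop (names and fields are illustrative):

```python
import csv
import json
import time
from pathlib import Path

def save_experiment_results(results, out_dir="."):
    """Dump full experiment data (with any timestamps it carries) to JSON."""
    path = Path(out_dir) / "experiment_results.json"
    path.write_text(json.dumps(results, indent=2, default=str))
    return path

def save_single_experiment_csv(history, num_rules, out_dir="."):
    """Write one CSV row per optimization iteration, timestamped filename."""
    stamp = time.strftime("%Y%m%d_%H%M%S")
    path = Path(out_dir) / f"experiment_{stamp}.csv"
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["iteration", "num_rules", "test_accuracy",
                         "train_accuracy", "prompt"])
        for row in history:
            writer.writerow([row["iteration"], num_rules,
                             row.get("test_accuracy"),
                             row.get("train_accuracy"), row["prompt"]])
    return path
```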
Experiment Execution
This cell runs a prompt optimization experiment and saves the results.
Execution: Runs the optimization loop with the specified evaluators and configuration parameters, tracking performance across iterations.
Results Saving:
- JSON format: Saves complete experiment data with timestamps for detailed analysis
- CSV format: Creates lightweight CSV files with iteration data, metrics, and prompts for easy visualization
The CSV output includes columns for iteration number, number of rules, test/train accuracy scores, and the optimized prompt text, making it easy to analyze performance trends and prompt evolution over time.
Output files:
- experiment_results.json - Complete experiment data
- experiment_YYYYMMDD_HHMMSS.csv - Timestamped CSV with iteration metrics
🚀 Starting prompt optimization with 5 iterations (scorer: accuracy, threshold: 1) 📊 Initial evaluation:
Now you have your optimized system prompt!
Here is the prompt that achieved the best test accuracy across the optimization iterations.
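A selection helper along these lines picks the winner (hypothetical names, assuming the list-of-dicts iteration history from the optimization loop):

```python
def get_best_prompt(history):
    """Return (prompt, accuracy, iteration) for the best test accuracy."""
    best = max(history, key=lambda r: r.get("test_accuracy", 0.0))
    return best["prompt"], best["test_accuracy"], best["iteration"]
```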