Arizeax Support Query Classification
Arize AX: Improving Classification with LLMs using Prompt Learning
In this notebook, we use the PromptLearningOptimizer, developed at Arize, to improve the accuracy of LLMs on classification tasks. Specifically, we classify support queries into 30 different classes, including:
Account Creation
Login Issues
Password Reset
Two-Factor Authentication
Profile Updates
Billing Inquiry
Refund Request
and 24 more.
You can view the dataset in datasets/support_queries.csv.
Note: This notebook, arizeax_support_query_classification.ipynb, complements support_query_classification.ipynb by using Arize AX datasets, experiments, and prompt management for Prompt Learning. It gives you a more end-to-end way to visualize your iterative prompt improvement and see how each prompt performs on the train/test sets, and it also uses methods for more advanced features.
Note: you may need to restart the kernel to use updated packages.
Setup
Make train/test sets
We use an 80/20 train/test split to train our prompt. The optimizer will use the training set to visualize and analyze its errors and successes, and make prompt updates based on these results. We will then test on the test set to see how that prompt performs on unseen data.
We will be exporting these datasets to Arize AX. In Arize you will be able to view the experiments we run on the train/test sets.
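As a rough illustration of the 80/20 split, here is a minimal, dependency-free sketch. The helper name `train_test_split_rows`, the seed, and the synthetic rows are assumptions made for this example; the notebook itself may split the CSV differently. The row count of 154 is chosen only so that the split sizes match the 123/31 sizes visible in the experiment logs.

```python
import random


def train_test_split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle rows deterministically and split them into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Synthetic stand-in rows; the real data lives in datasets/support_queries.csv.
rows = [
    {"query": f"example query {i}", "ground_truth": "Billing Inquiry"}
    for i in range(154)
]
train_rows, test_rows = train_test_split_rows(rows)
print(len(train_rows), len(test_rows))
```

Fixing the seed keeps the split reproducible, so accuracy changes between runs come from the prompt rather than from a different test set.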
train dataset id: RGF0YXNldDozMDQyNDM6dmRqYw==
test dataset id: RGF0YXNldDozMDQyNDQ6U0tTSw==
Base Prompt for Optimization
This is our base prompt - our 0th iteration. This is the prompt we will be optimizing for our task.
We also upload our prompt to Arize AX. Arize's Prompt Hub serves as a repository for your prompts. You will be able to view every iteration of your prompt as it's optimized, along with some metrics.
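To make the role of the `{query}` placeholder concrete, here is a small sketch of how a prompt template could be filled in for one support query. The template below is abbreviated and partly hypothetical (the instruction line and the three-class list are stand-ins; the notebook's base prompt lists all 30 classes), and `render_prompt` is a name introduced only for this example.

```python
# Abbreviated, hypothetical template; the real base prompt lists all 30 classes.
BASE_PROMPT = """support query: {query}
Classify the query into one of these categories:
Account Creation
Login Issues
Password Reset
Return just the category, no other text."""


def render_prompt(template: str, query: str) -> str:
    """Substitute the query into the template's {query} placeholder."""
    # str.replace (rather than str.format) leaves any other braces untouched.
    return template.replace("{query}", query)


print(render_prompt(BASE_PROMPT, "I can't sign in to my account"))
```

Using `str.replace` instead of `str.format` is a deliberate choice here: an optimized prompt may contain literal braces, for instance in an inline JSON example, which `str.format` would try to interpret.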
Output Generator
This function calls OpenAI with our prompt on every row of our dataset to generate outputs. It uses llm_generate, a Phoenix function, to call the LLM concurrently.
We return the output column, which contains one output per row of our dataset, that is, one predicted class per support query.
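The pattern can be sketched without any LLM dependency. The example below is an assumption-laden stand-in, not the notebook's implementation: `fake_llm` replaces the real OpenAI call, a thread pool replaces llm_generate, and `normalize_label` is a hypothetical helper added because the experiment logs show some outputs wrapped in markdown bold (for example `**Product Return**`).

```python
from concurrent.futures import ThreadPoolExecutor


def fake_llm(prompt: str) -> str:
    """Stand-in for a real chat-completion call; always returns one label."""
    return "**Login Issues**"


def normalize_label(text: str) -> str:
    """Trim whitespace and surrounding markdown bold markers from a label."""
    return text.strip().strip("*").strip()


def generate_outputs(queries, template, llm=fake_llm, max_workers=8):
    """Render one prompt per query and call the LLM concurrently."""
    prompts = [template.replace("{query}", q) for q in queries]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so outputs line up with queries.
        return [normalize_label(out) for out in pool.map(llm, prompts)]


outputs = generate_outputs(
    ["I forgot my password", "Where is my order?"],
    "support query: {query}\nReturn just the category, no other text.",
)
print(outputs)
```

Normalizing labels before comparison matters when the check is an exact string match, since a correct class wrapped in asterisks would otherwise count as wrong.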
Evaluator
In this section we define our LLM-as-judge eval.
Prompt Learning works by generating natural language evaluations on your outputs. These evaluations help guide the prompt optimizer towards building an optimized prompt.
You should spend time thinking about how to write an informative eval. Your eval makes or breaks this prompt optimizer. With helpful feedback, our prompt optimizer will be able to generate a stronger optimized prompt much more effectively than with sparse or unhelpful feedback.
Below is an example of a strong eval. It returns several fields for each output:
- correctness: correct/incorrect. Whether the support query was classified correctly.
- explanation: Brief explanation of why the predicted classification is correct or incorrect, referencing the correct label if relevant.
- confusion_reason: If incorrect, explains why the model may have made this choice instead of the correct classification, focusing on likely sources of confusion. If correct, 'no confusion'.
- error_type: One of 'broad_vs_specific', 'keyword_bias', 'multi_intent_confusion', 'ambiguous_query', 'off_topic', 'paraphrase_gap', 'other', or 'none' if correct. Includes the definition of the chosen error type; the definitions are passed into the evaluator's prompt.
- evidence_span: Exact phrase(s) from the query that strongly indicate the correct classification.
- prompt_fix_suggestion: One clear instruction to add to the classifier prompt to prevent this error.
Take a look at support_query_classification/evaluator_prompt.txt for the full prompt!
Our evaluator again uses llm_generate to build these LLM evals concurrently. We use an output parser to ensure that each eval is returned as valid JSON.
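An output parser of this kind might look roughly like the sketch below. It assumes the evaluator replies with a single JSON object containing the six fields listed above; the function name `parse_eval` and the choice to fall back to empty strings on malformed output are assumptions for illustration, not the notebook's actual parser.

```python
import json

REQUIRED_FIELDS = (
    "correctness",
    "explanation",
    "confusion_reason",
    "error_type",
    "evidence_span",
    "prompt_fix_suggestion",
)


def parse_eval(raw: str) -> dict:
    """Parse an evaluator response into a dict with every required field."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    # Missing fields become empty strings so downstream columns always exist.
    return {field: str(data.get(field, "")) for field in REQUIRED_FIELDS}


good = parse_eval('{"correctness": "incorrect", "error_type": "keyword_bias"}')
bad = parse_eval("not json at all")
print(good["error_type"], repr(bad["correctness"]))
```

Returning a fixed set of keys even for malformed responses keeps one bad evaluation from breaking the whole batch.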
Metrics
Below we define the metrics that are computed on each iteration of prompt optimization. They measure how our classifier performs with the current iteration's prompt.
Specifically, we use scikit-learn to compute precision, recall, F1 score, and simple accuracy.
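To show what these metrics compute, here is a plain-Python sketch. The notebook uses scikit-learn; this version is a stand-in that assumes macro averaging (each class weighted equally), which is only one of several averaging options and may not be the one the notebook uses.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1."""
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


metrics = classification_metrics(
    ["Login Issues", "Login Issues", "Password Reset", "Password Reset"],
    ["Login Issues", "Password Reset", "Password Reset", "Password Reset"],
)
print(metrics)
```

With 30 classes and a small test set, macro-averaged scores can swing noticeably when a single rare class is missed, which is worth keeping in mind when comparing iterations.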
Experiment Processor
This function pulls an Arize experiment and loads the data into a pandas dataframe so it can run through the optimizer.
Specifically it:
- Pulls the experiment data from Arize
- Adds the input column to the dataframe
- Adds the evals to the dataframe
- Adds the output to the dataframe
- Returns the dataframe
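The steps above can be sketched with plain dictionaries in place of a pandas DataFrame and an Arize client. Everything here is an assumption for illustration: the row keys mirror the column names visible in the experiment output (`example_id`, `result`, `eval.output_evaluator.explanation`), and the `key: value;` layout of the explanation string is inferred from the truncated previews in the logs.

```python
EVAL_FIELDS = (
    "correctness",
    "explanation",
    "confusion_reason",
    "error_type",
    "evidence_span",
    "prompt_fix_suggestion",
)


def parse_explanation(text):
    """Split semicolon-and-newline separated 'key: value' pairs into a dict."""
    parsed = {}
    for part in text.split(";\n"):
        key, sep, value = part.strip().partition(":")
        if sep and key.strip() in EVAL_FIELDS:
            parsed[key.strip()] = value.strip()
    return parsed


def process_experiment(experiment_rows, examples_by_id):
    """Join experiment rows with their dataset examples and flatten evals."""
    records = []
    for row in experiment_rows:
        example = examples_by_id[row["example_id"]]
        record = {
            "query": example["query"],
            "ground_truth": example["ground_truth"],
            "output": row["result"],
        }
        record.update(parse_explanation(row["eval.output_evaluator.explanation"]))
        records.append(record)
    return records


examples = {"ex-1": {"query": "Refund my order", "ground_truth": "Refund Request"}}
experiment = [
    {
        "example_id": "ex-1",
        "result": "Billing Inquiry",
        "eval.output_evaluator.explanation": (
            "correctness: incorrect;\n explanation: Predicted a billing label;\n"
            " error_type: keyword_bias"
        ),
    }
]
records = process_experiment(experiment, examples)
print(records[0])
```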
Prompt Optimization Loop with Arize Experiments
This code implements an iterative prompt optimization system that uses Arize AX experiments to evaluate and improve prompts based on feedback from LLM evaluators.
Overview
The optimize_loop function automates prompt engineering by:
- Evaluating prompts using Arize experiments
- Collecting detailed feedback from LLM evaluators
- Optimizing prompts via a learning-based optimizer
- Iterating until the performance threshold is met or the loop limit is reached
Step-by-Step Breakdown
Each of these step numbers appears as a comment in the code.
1. Initialization
- Set up tracking variables: train_metrics, test_metrics, and raw_dfs for storing evaluation results
- Convert training dataset to a DataFrame for easy updates
2. Baseline Evaluation
- Run an initial experiment using the test set
- Establish a baseline metric (e.g., accuracy, F1) to compare against future improvements
3. Early Exit Check
- If the initial prompt already meets the performance threshold, skip further optimization to save time and compute
4. Main Optimization Loop
For each iteration (up to loops):
4a. Run Training Experiment
- Execute the current prompt on the training set
- Use LLM evaluators to generate natural language feedback
4b. Process Feedback
- Extract structured information from evaluator outputs:
- Correctness
- Explanation
- Confusion reason
- Error type
- Prompt fix suggestions
- Update the training DataFrame with this feedback
4c. Generate Learning Annotations
- Convert feedback into structured annotations for the optimizer to learn from
- This allows learning from evaluator insights in a consistent format
4d. Optimize the Prompt
- Pass feedback to the PromptLearningOptimizer
- Generate an improved prompt that attempts to correct issues found in the previous iteration
4e. Evaluate on Test Set
- Evaluate the updated prompt on the held-out test set
- Assess generalization beyond the training data
4f. Track Metrics
- Log metrics for:
- Training set performance
- Test set performance
- Store raw results for further analysis or visualization
4g. Convergence Check
- If the new prompt's test metric meets or exceeds the threshold, exit the loop early
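The control flow described in the breakdown can be sketched as follows. This is a simplified skeleton, not the notebook's optimize_loop: the Arize experiments, evaluators, and PromptLearningOptimizer are replaced by three caller-supplied functions (`evaluate_train`, `evaluate_test`, `optimize`), whose names and signatures are assumptions made for this example. The stubs at the bottom exist only to make the sketch runnable.

```python
def optimize_loop(prompt, evaluate_train, evaluate_test, optimize,
                  threshold=1.0, loops=5):
    """Iteratively improve a prompt until the test score meets the threshold."""
    # 1. Initialization
    prompts, train_metrics = [prompt], []
    # 2. Baseline evaluation on the test set
    test_metrics = [evaluate_test(prompt)]
    # 3. Early exit check
    if test_metrics[0] >= threshold:
        return prompts, train_metrics, test_metrics
    # 4. Main optimization loop
    for _ in range(loops):
        # 4a-4c. Run on the training set and collect evaluator feedback
        train_score, feedback = evaluate_train(prompt)
        # 4d. Ask the optimizer for an improved prompt
        prompt = optimize(prompt, feedback)
        # 4e. Evaluate the new prompt on the held-out test set
        test_score = evaluate_test(prompt)
        # 4f. Track metrics
        train_metrics.append(train_score)
        prompts.append(prompt)
        test_metrics.append(test_score)
        # 4g. Convergence check
        if test_score >= threshold:
            break
    return prompts, train_metrics, test_metrics


# Toy stubs: each "!" appended to the prompt raises the score by 0.25.
def toy_score(p):
    return min(1.0, 0.5 + 0.25 * p.count("!"))


prompts, train_metrics, test_metrics = optimize_loop(
    "classify",
    evaluate_train=lambda p: (toy_score(p), ["be more specific"]),
    evaluate_test=toy_score,
    optimize=lambda p, feedback: p + "!",
)
print(test_metrics)
```

Keeping every prompt and its test score, rather than only the latest, is what allows the best iteration to be selected afterwards even when later iterations regress.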
Starting prompt optimization with 5 iterations (scorer: accuracy, threshold: 1)
Initial evaluation:
arize.utils.logging | INFO | Experiment started.
running tasks |██████████| 31/31 (100.0%) | ⏳ 00:27<00:00 | 1.22it/s
arize.utils.logging | INFO | Task runs completed.
Tasks Summary (09/03/25 06:10 AM -0700)
---------------------------------------
   n_examples  n_runs  n_errors
0          31      31         0
running tasks |██████████| 31/31 (100.0%) | ⏳ 00:29<00:00 | 1.06it/s
arize.utils.logging | INFO | All evaluators completed.
running experiment evaluations |██████████| 31/31 (100.0%) | ⏳ 00:18<00:00 | 1.63it/s
id example_id result \
0 EXP_ID_a571db e229198e-c98b-4dee-a5b8-a6f3f36d18e4 Account Creation
1 EXP_ID_97b182 0e780f04-2457-469c-9bc9-954db24b3583 Billing Inquiry
2 EXP_ID_7325f2 abe65d8c-297a-4759-b187-2f48cd510b3d Order Status
3 EXP_ID_96fe1e 8287e65d-8fb9-470f-9bcf-cc68491f230a Password Reset
4 EXP_ID_a0ab3d 8fc5bd05-f4fa-41b4-894e-2d76955a5cf1 Login Issues
result.trace.id result.trace.timestamp \
0 503b767c1a09da8da1a540e281602b90 1756905015754
1 a33f37c414b7d8b665e6629a244ee5af 1756905016765
2 2d9bf1d868203b103ff11241b81ddd0c 1756905017755
3 029370f301629ef18760545795751f05 1756905018738
4 5277b47c30c9b29847dd989daef74b29 1756905019707
eval.test_evaluator.score eval.test_evaluator.label \
0 1.0 True
1 0.0 False
2 0.0 False
3 0.0 False
4 1.0 True
eval.test_evaluator.explanation eval.test_evaluator.trace.id \
0 placeholder f41ec274799bd2751dc0342d75801595
1 placeholder 841bfa7746365e83ab32d68fe7f4c4b7
2 placeholder 11da39365a2937152e242ae8652266ce
3 placeholder 31bfbfd39e32e8c822e9bb281a07192a
4 placeholder 6273a74a4bba9f2fc59635c8df50ff16
eval.test_evaluator.trace.timestamp
0 1756905042875
1 1756905043001
2 1756905043109
3 1756905043203
4 1756905043319
Initial accuracy: 0.6129032258064516
Loop 1: Optimizing prompt...
arize.utils.logging | INFO | Experiment started.
running tasks |██████████| 123/123 (100.0%) | ⏳ 01:20<00:00 | 2.41it/s
arize.utils.logging | INFO | Task runs completed.
Tasks Summary (09/03/25 06:12 AM -0700)
---------------------------------------
   n_examples  n_runs  n_errors
0         123     123         0
running tasks |██████████| 123/123 (100.0%) | ⏳ 01:21<00:00 | 1.51it/s
running experiment evaluations |██████████| 246/246 (100.0%) | ⏳ 00:55<00:00 | 1.89s/it
arize.utils.logging | INFO | All evaluators completed.
id example_id \
0 EXP_ID_9b66fb dd17a2a0-2912-4d4c-bf7e-b3b3027b75fe
1 EXP_ID_41a1ad c820437c-026e-4cf5-9e0a-1d190fa39fae
2 EXP_ID_6e739f 787aa17a-da44-4c75-b4fd-cc303be0bd56
3 EXP_ID_856820 36e38c54-b02b-4219-94d1-8da3bb5e14d9
4 EXP_ID_35cea8 ce75bd3c-9abe-4e4c-aaea-3f1fa1361886
result result.trace.id \
0 Privacy Policy Question cef78b4e554c06dc6fcb9759b87f8d0e
1 Billing Inquiry 1d0143c3ff57677d955029da9dabfb55
2 Return label prints blank 39fbdb431830a2a9618ad829fd9deae6
3 Billing Inquiry f3e0b076a74b2a85cb3e8f9b16aa2ca2
4 General Feedback f72a2e1f78425abacd5c196aa7500a87
result.trace.timestamp eval.output_evaluator.score \
0 1756905074351 1.0
1 1756905075341 1.0
2 1756905076368 0.0
3 1756905077354 1.0
4 1756905078229 0.0
eval.output_evaluator.label \
0 correct
1 correct
2 incorrect
3 correct
4 incorrect
eval.output_evaluator.explanation \
0 correctness: correct;\n explanation: Th...
1 correctness: correct;\n explanation: Th...
2 correctness: incorrect;\n explanation: ...
3 correctness: correct;\n explanation: Th...
4 correctness: incorrect;\n explanation: ...
eval.output_evaluator.trace.id eval.output_evaluator.trace.timestamp \
0 debad9b4e8e7fbf7dc85fc02a59df734 1756905154824
1 d6b0a3553b1858b72805951ff5d3d891 1756905154904
2 de822b096d51c5649a7a1237e21d9213 1756905154978
3 a7b8ad05b1dd43218fa724d18bb6e0be 1756905155053
4 51422939fc2e003f884f3ac6f17e7239 1756905155129
eval.test_evaluator.score eval.test_evaluator.label \
0 1.0 True
1 1.0 True
2 0.0 False
3 1.0 True
4 0.0 False
eval.test_evaluator.explanation eval.test_evaluator.trace.id \
0 placeholder e3ce94880d6d6e513a37413f12608b12
1 placeholder 45ac669f8a120d016629fc666dc51975
2 placeholder d6c87ae564a8a1a186c08c233ab32996
3 placeholder 2e8c48b0bf735bfbfbc79d5418fd87b8
4 placeholder 6577f835c5ddb073157e1dab85c8a70c
eval.test_evaluator.trace.timestamp
0 1756905154827
1 1756905154906
2 1756905154982
3 1756905155058
4 1756905155133
running experiment evaluations |██████████| 246/246 (100.0%) | ⏳ 01:16<00:00 | 3.22it/s
Training accuracy: 0.5772357723577236
Running annotator...
['query', 'ground_truth', 'created_at', 'updated_at', 'id', '__index_level_0__', 'correctness', 'explanation', 'confusion_reason', 'error_type', 'evidence_span', 'prompt_fix_suggestion', 'output']
Creating batches with 90,000 token limit
Processing 123 examples in 1 batches
Batch 1/1: Optimized
arize.utils.logging | INFO | Experiment started.
running tasks |██████████| 31/31 (100.0%) | ⏳ 00:27<00:00 | 1.10s/it
arize.utils.logging | INFO | Task runs completed.
Tasks Summary (09/03/25 06:15 AM -0700)
---------------------------------------
   n_examples  n_runs  n_errors
0          31      31         0
running tasks |██████████| 31/31 (100.0%) | ⏳ 00:29<00:00 | 1.07it/s
arize.utils.logging | INFO | All evaluators completed.
running experiment evaluations |██████████| 31/31 (100.0%) | ⏳ 00:17<00:00 | 1.79it/s
id example_id \
0 EXP_ID_821972 e229198e-c98b-4dee-a5b8-a6f3f36d18e4
1 EXP_ID_cdfe5b 0e780f04-2457-469c-9bc9-954db24b3583
2 EXP_ID_18c746 abe65d8c-297a-4759-b187-2f48cd510b3d
3 EXP_ID_dca0d6 8287e65d-8fb9-470f-9bcf-cc68491f230a
4 EXP_ID_324a0c 8fc5bd05-f4fa-41b4-894e-2d76955a5cf1
result result.trace.id \
0 Login Issues 8ef1f6a5aaeab6793cf77ad12c20f09a
1 Subscription Upgrade/Downgrade fbf1485120e77c40ae37a2587ff42bdc
2 Data Export d12803c1ec835fb429c15b6ff2ce7cbb
3 Password Reset bcb301b819adc1b749a932e492a3a102
4 Login Issues dda6f07d37517e0b77f234231c78f30d
result.trace.timestamp eval.test_evaluator.score \
0 1756905288188 0.0
1 1756905289239 0.0
2 1756905290152 1.0
3 1756905291183 0.0
4 1756905294006 1.0
eval.test_evaluator.label eval.test_evaluator.explanation \
0 False placeholder
1 False placeholder
2 True placeholder
3 False placeholder
4 True placeholder
eval.test_evaluator.trace.id eval.test_evaluator.trace.timestamp
0 0196d3ce4bad886f7012710e66f01e73 1756905315394
1 1ddbf0f8b15bb41880cda106269dfca0 1756905315459
2 7230d74ce681810042458268960c2f43 1756905315531
3 182fd8efae4159e20a19f0a1ad9f95ad 1756905315611
4 100f9ec4f34ab6ba5c5c792bccc076b5 1756905315690
Test accuracy: 0.5483870967741935
Loop 2: Optimizing prompt...
arize.utils.logging | INFO | Experiment started.
running tasks |██████████| 123/123 (100.0%) | ⏳ 01:20<00:00 | 2.28it/s
arize.utils.logging | INFO | Task runs completed.
Tasks Summary (09/03/25 06:17 AM -0700)
---------------------------------------
   n_examples  n_runs  n_errors
0         123     123         0
running tasks |██████████| 123/123 (100.0%) | ⏳ 01:21<00:00 | 1.51it/s
arize.utils.logging | INFO | All evaluators completed.
id example_id \
0 EXP_ID_cc8656 dd17a2a0-2912-4d4c-bf7e-b3b3027b75fe
1 EXP_ID_a92e3b c820437c-026e-4cf5-9e0a-1d190fa39fae
2 EXP_ID_50d9e5 787aa17a-da44-4c75-b4fd-cc303be0bd56
3 EXP_ID_10e0c2 36e38c54-b02b-4219-94d1-8da3bb5e14d9
4 EXP_ID_63d2ae ce75bd3c-9abe-4e4c-aaea-3f1fa1361886
result result.trace.id \
0 Privacy Policy Question e2fac03a0c9e7641f5fd38322a95fb85
1 Billing Inquiry d71732cd94f7e31b93f2735ef6f27339
2 Technical Bug Report cb31a4ff728acf07b8947a4df2ec5c60
3 Billing Inquiry cf4477d59646e553a639734cafe62393
4 Feature Request 0b95ab486fb01caa2872551d24f37645
result.trace.timestamp eval.output_evaluator.score \
0 1756905344864 1.0
1 1756905345869 1.0
2 1756905346811 0.0
3 1756905347836 1.0
4 1756905348784 1.0
eval.output_evaluator.label \
0 correct
1 correct
2 incorrect
3 correct
4 correct
eval.output_evaluator.explanation \
0 correctness: correct;\n explanation: Th...
1 correctness: correct;\n explanation: Th...
2 correctness: incorrect;\n explanation: ...
3 correctness: correct;\n explanation: Th...
4 correctness: correct;\n explanation: Th...
eval.output_evaluator.trace.id eval.output_evaluator.trace.timestamp \
0 72e6492cd609e60b796b4951b8c492f1 1756905425193
1 6ba82384429c5a8b916136604dc8b5cf 1756905425278
2 5e3b858a144af8625ec89dcbc25340f4 1756905425349
3 c051d3544857162adb9c668fa6c24256 1756905425431
4 6d96e4eaa8ad29039639f364c6d68587 1756905425511
eval.test_evaluator.score eval.test_evaluator.label \
0 1.0 True
1 1.0 True
2 0.0 False
3 1.0 True
4 1.0 True
eval.test_evaluator.explanation eval.test_evaluator.trace.id \
0 placeholder a6e039d4b513c408d4c149024bb5868c
1 placeholder 9f55a4325e95dcdd1d503baaafda9f66
2 placeholder 6babdc4241812a0d5de999141e4b7c37
3 placeholder 6c866dc45d3fd27ed7e9feb468526bd1
4 placeholder fff334454606f0c36907361d6c86c326
eval.test_evaluator.trace.timestamp
0 1756905425195
1 1756905425279
2 1756905425354
3 1756905425435
4 1756905425515
running experiment evaluations |██████████| 246/246 (100.0%) | ⏳ 01:04<00:00 | 3.81it/s
Training accuracy: 0.7723577235772358
Running annotator...
['query', 'ground_truth', 'created_at', 'updated_at', 'id', '__index_level_0__', 'correctness', 'explanation', 'confusion_reason', 'error_type', 'evidence_span', 'prompt_fix_suggestion', 'output']
Creating batches with 90,000 token limit
Processing 123 examples in 1 batches
Batch 1/1: Optimized
arize.utils.logging | INFO | Experiment started.
running tasks |██████████| 31/31 (100.0%) | ⏳ 00:27<00:00 | 1.13it/s
arize.utils.logging | INFO | Task runs completed.
Tasks Summary (09/03/25 06:19 AM -0700)
---------------------------------------
   n_examples  n_runs  n_errors
0          31      31         0
running tasks |██████████| 31/31 (100.0%) | ⏳ 00:29<00:00 | 1.06it/s
arize.utils.logging | INFO | All evaluators completed.
running experiment evaluations |██████████| 31/31 (100.0%) | ⏳ 00:03<00:00 | 8.68it/s
id example_id \
0 EXP_ID_2261c4 e229198e-c98b-4dee-a5b8-a6f3f36d18e4
1 EXP_ID_ab4f1f 0e780f04-2457-469c-9bc9-954db24b3583
2 EXP_ID_8e6039 abe65d8c-297a-4759-b187-2f48cd510b3d
3 EXP_ID_fb7e4b 8287e65d-8fb9-470f-9bcf-cc68491f230a
4 EXP_ID_bd75a0 8fc5bd05-f4fa-41b4-894e-2d76955a5cf1
result result.trace.id \
0 Login Issues 1a967a4db8659598294fd0104bd0ec28
1 Subscription Upgrade/Downgrade 29c8ecdc3c251b454c145087fa4546c1
2 Data Export 64781d712b854c281d8ca4a0c9684ecd
3 Password Reset b731c5b4068400b50c8d99abcd03ea31
4 Login Issues c4f9ade1c49198f8809f626afc743198
result.trace.timestamp eval.test_evaluator.score \
0 1756905545045 0.0
1 1756905546218 0.0
2 1756905547009 1.0
3 1756905548053 0.0
4 1756905548924 1.0
eval.test_evaluator.label eval.test_evaluator.explanation \
0 False placeholder
1 False placeholder
2 True placeholder
3 False placeholder
4 True placeholder
eval.test_evaluator.trace.id eval.test_evaluator.trace.timestamp
0 4b20ac7e0bb0888288863dd364400417 1756905572421
1 2bad86f7b224179dcbed783fa6f20f55 1756905572502
2 5dc1b75aecd25f97ebb2cdffdbd4b447 1756905572583
3 819e043bf3e58c54f200bc2ded9e0100 1756905572682
4 9461a2fe40027bcc7c60c8856801d560 1756905572762
Test accuracy: 0.6451612903225806
Loop 3: Optimizing prompt...
arize.utils.logging | INFO | Experiment started.
running tasks |██████████| 123/123 (100.0%) | ⏳ 01:20<00:00 | 1.26it/s
arize.utils.logging | INFO | Task runs completed.
Tasks Summary (09/03/25 06:21 AM -0700)
---------------------------------------
   n_examples  n_runs  n_errors
0         123     123         0
running tasks |██████████| 123/123 (100.0%) | ⏳ 01:21<00:00 | 1.51it/s
arize.utils.logging | INFO | All evaluators completed.
id example_id \
0 EXP_ID_280a6b dd17a2a0-2912-4d4c-bf7e-b3b3027b75fe
1 EXP_ID_ea1bb3 c820437c-026e-4cf5-9e0a-1d190fa39fae
2 EXP_ID_a8d783 787aa17a-da44-4c75-b4fd-cc303be0bd56
3 EXP_ID_f6c57f 36e38c54-b02b-4219-94d1-8da3bb5e14d9
4 EXP_ID_a73937 ce75bd3c-9abe-4e4c-aaea-3f1fa1361886
result result.trace.id \
0 Privacy Policy Question 7c504594577d69930848c19ef17cecf2
1 Billing Inquiry 1738f24c66f0647f8863d31efcea45a3
2 **Product Return** fa46c4bb5d6b416228f04664bc8486d7
3 Billing Inquiry 66c31275028840e540d4977e2702e387
4 Feature Request f9ff934cac9192e3d2f99c5db4a937c5
result.trace.timestamp eval.output_evaluator.score \
0 1756905602392 1.0
1 1756905603535 1.0
2 1756905604424 1.0
3 1756905605304 1.0
4 1756905608210 1.0
eval.output_evaluator.label \
0 correct
1 correct
2 correct
3 correct
4 correct
eval.output_evaluator.explanation \
0 correctness: correct;\n explanation: Th...
1 correctness: correct;\n explanation: Th...
2 correctness: correct;\n explanation: Th...
3 correctness: correct;\n explanation: Th...
4 correctness: correct;\n explanation: Th...
eval.output_evaluator.trace.id eval.output_evaluator.trace.timestamp \
0 b39d91cbff173390f8f8931bea520645 1756905682784
1 07b64855ec3fb28b35a7f4a90584217a 1756905682859
2 9010961dc83a367f4b4972d2a8520348 1756905682937
3 cd5fd96232c4f7ea556501b44752d318 1756905683016
4 f173fea6a3121399b1ccca2afc66d33c 1756905683099
eval.test_evaluator.score eval.test_evaluator.label \
0 1.0 True
1 1.0 True
2 0.0 False
3 1.0 True
4 1.0 True
eval.test_evaluator.explanation eval.test_evaluator.trace.id \
0 placeholder ea55a6d0bb4dbcc65d61be4e0b553623
1 placeholder 28537931d9e5d0a47d30ca1f226bf6fd
2 placeholder de61b5ac773ee51e6a9eb3c9e8a95cfc
3 placeholder aef724d09f5581afd4118ba5ac109db2
4 placeholder 0f556117248afe49f8b8c93122fa22e8
eval.test_evaluator.trace.timestamp
0 1756905682786
1 1756905682860
2 1756905682939
3 1756905683021
4 1756905683103
running experiment evaluations |██████████| 246/246 (100.0%) | ⏳ 01:00<00:00 | 4.05it/s
Training accuracy: 0.7967479674796748
Running annotator...
['query', 'ground_truth', 'created_at', 'updated_at', 'id', '__index_level_0__', 'correctness', 'explanation', 'confusion_reason', 'error_type', 'evidence_span', 'prompt_fix_suggestion', 'output']
Creating batches with 90,000 token limit
Processing 123 examples in 1 batches
Batch 1/1: Optimized
arize.utils.logging | INFO | Experiment started.
running tasks |██████████| 31/31 (100.0%) | ⏳ 00:27<00:00 | 1.31it/s
arize.utils.logging | INFO | Task runs completed.
Tasks Summary (09/03/25 06:23 AM -0700)
---------------------------------------
   n_examples  n_runs  n_errors
0          31      31         0
running tasks |██████████| 31/31 (100.0%) | ⏳ 00:29<00:00 | 1.05it/s
arize.utils.logging | INFO | All evaluators completed.
running experiment evaluations |██████████| 31/31 (100.0%) | ⏳ 00:03<00:00 | 8.96it/s
id example_id \
0 EXP_ID_da0286 e229198e-c98b-4dee-a5b8-a6f3f36d18e4
1 EXP_ID_285fea 0e780f04-2457-469c-9bc9-954db24b3583
2 EXP_ID_3e8095 abe65d8c-297a-4759-b187-2f48cd510b3d
3 EXP_ID_3d21a8 8287e65d-8fb9-470f-9bcf-cc68491f230a
4 EXP_ID_1896a9 8fc5bd05-f4fa-41b4-894e-2d76955a5cf1
result result.trace.id \
0 Login Issues 9312b7cd948a832424c244402e261dd3
1 **Subscription Upgrade/Downgrade** c74967c366f29051aacfcd73579e92f6
2 Data Export 20552b811a6cf3980c971ee9a83a4cb4
3 Password Reset 13ac8580fe6a1d1a970ca35d70c65688
4 Login Issues 62700bb18a0bf97bbab76ab1f8706cab
result.trace.timestamp eval.test_evaluator.score \
0 1756905795072 0.0
1 1756905796068 0.0
2 1756905797104 1.0
3 1756905798054 0.0
4 1756905798976 1.0
eval.test_evaluator.label eval.test_evaluator.explanation \
0 False placeholder
1 False placeholder
2 True placeholder
3 False placeholder
4 True placeholder
eval.test_evaluator.trace.id eval.test_evaluator.trace.timestamp
0 f374d1c44937abbe7ba461c80bf67141 1756905822628
1 14a584517d509c6eed2264aec132e212 1756905822710
2 7dd9ba6fc3cab8e743274576c5367d70 1756905822797
3 e4cbb5f20e07f9b96337f6fba289cd0e 1756905822871
4 4b2b73e84b1985905e1c57bccab213b0 1756905822949
Test accuracy: 0.4838709677419355
Loop 4: Optimizing prompt...
arize.utils.logging | INFO | Experiment started.
running tasks |██████████| 123/123 (100.0%) | ⏳ 01:20<00:00 | 2.32it/s
arize.utils.logging | INFO | Task runs completed.
Tasks Summary (09/03/25 06:25 AM -0700)
---------------------------------------
   n_examples  n_runs  n_errors
0         123     123         0
running tasks |██████████| 123/123 (100.0%) | ⏳ 01:21<00:00 | 1.50it/s
arize.utils.logging | INFO | All evaluators completed.
id example_id \
0 EXP_ID_c13b46 dd17a2a0-2912-4d4c-bf7e-b3b3027b75fe
1 EXP_ID_a44f6d c820437c-026e-4cf5-9e0a-1d190fa39fae
2 EXP_ID_8be281 787aa17a-da44-4c75-b4fd-cc303be0bd56
3 EXP_ID_4778d0 36e38c54-b02b-4219-94d1-8da3bb5e14d9
4 EXP_ID_921271 ce75bd3c-9abe-4e4c-aaea-3f1fa1361886
result result.trace.id \
0 **Privacy Policy Question** a3f5ca6ead158a43e56fff8bf830971b
1 **Billing Inquiry** 8945406b1fac08b576bcf6804309e004
2 **Product Return** 0d7765ab60304021fe36e8b08dca337c
3 **Billing Inquiry** b43f27f77812dec7d8932dfc1a533e64
4 **Feature Request** bc63716b4fffb357890a15ca9692695b
result.trace.timestamp eval.output_evaluator.score \
0 1756905852002 1.0
1 1756905853128 1.0
2 1756905853973 1.0
3 1756905854953 1.0
4 1756905855943 1.0
eval.output_evaluator.label \
0 correct
1 correct
2 correct
3 correct
4 correct
eval.output_evaluator.explanation \
0 correctness: correct;\n explanation: Th...
1 correctness: correct;\n explanation: Th...
2 correctness: correct;\n explanation: Th...
3 correctness: correct;\n explanation: Th...
4 correctness: correct;\n explanation: Th...
eval.output_evaluator.trace.id eval.output_evaluator.trace.timestamp \
0 b98ebc5be125af757fceac8e2e6d71d9 1756905932819
1 b2c07f31c5d1c0ff44ab7900f23013b3 1756905932903
2 64deef1ec5d3899ff8f61fd482bb1322 1756905933014
3 0687fe3ad82bd15159bf61ba2b124053 1756905933091
4 2f579d765fae5590d93371cafd11e766 1756905933171
eval.test_evaluator.score eval.test_evaluator.label \
0 0.0 False
1 0.0 False
2 0.0 False
3 0.0 False
4 0.0 False
eval.test_evaluator.explanation eval.test_evaluator.trace.id \
0 placeholder bdbc63932481983770d242a35f1376bd
1 placeholder 9faec5fb685a62b46da58dfaa507574a
2 placeholder 7d6e42c53f81f0c15d6143047cdec0e4
3 placeholder 559ca6f4b5baca1b2c3155709fc0c52a
4 placeholder 8c7c15b360568dccecf0945deabbac98
eval.test_evaluator.trace.timestamp
0 1756905932821
1 1756905932905
2 1756905933019
3 1756905933095
4 1756905933174
running experiment evaluations |██████████| 246/246 (100.0%) | ⏳ 01:03<00:00 | 3.88it/s
Training accuracy: 0.8292682926829268
Running annotator...
['query', 'ground_truth', 'created_at', 'updated_at', 'id', '__index_level_0__', 'correctness', 'explanation', 'confusion_reason', 'error_type', 'evidence_span', 'prompt_fix_suggestion', 'output']
Creating batches with 90,000 token limit
Processing 123 examples in 1 batches
Batch 1/1: Optimized
arize.utils.logging | INFO | Experiment started.
running tasks |██████████| 31/31 (100.0%) | ⏳ 00:26<00:00 | 1.18it/s
arize.utils.logging | INFO | Task runs completed.
Tasks Summary (09/03/25 06:27 AM -0700)
---------------------------------------
   n_examples  n_runs  n_errors
0          31      31         0
running tasks |██████████| 31/31 (100.0%) | ⏳ 00:28<00:00 | 1.10it/s
arize.utils.logging | INFO | All evaluators completed.
running experiment evaluations |██████████| 31/31 (100.0%) | ⏳ 00:16<00:00 | 1.91it/s
id example_id \
0 EXP_ID_482a23 e229198e-c98b-4dee-a5b8-a6f3f36d18e4
1 EXP_ID_91969e 0e780f04-2457-469c-9bc9-954db24b3583
2 EXP_ID_a2c4a3 abe65d8c-297a-4759-b187-2f48cd510b3d
3 EXP_ID_1abacc 8287e65d-8fb9-470f-9bcf-cc68491f230a
4 EXP_ID_a51a15 8fc5bd05-f4fa-41b4-894e-2d76955a5cf1
result result.trace.id \
0 Login Issues 2f97ae33e5b4ffe9f7f50da6c5089f1a
1 Subscription Upgrade/Downgrade 71d71ce8ed8a3df4696bac06a9457f53
2 Data Export b9c699da6ac2b22c8a3d27ff5376d8ee
3 Password Reset 2a22b58c8cbe61e9a0dd4bdddf174037
4 Login Issues d11bd95ffa2564bf782d84aadcb0e171
result.trace.timestamp eval.test_evaluator.score \
0 1756906050492 0.0
1 1756906051558 0.0
2 1756906052501 1.0
3 1756906054450 0.0
4 1756906055347 1.0
eval.test_evaluator.label eval.test_evaluator.explanation \
0 False placeholder
1 False placeholder
2 True placeholder
3 False placeholder
4 True placeholder
eval.test_evaluator.trace.id eval.test_evaluator.trace.timestamp
0 92e4df8147f25f1e4bb0aa4e1c223fd5 1756906076756
1 659d779b6152c5fac4394c601961b5c4 1756906076864
2 0d86510a4cfebd8537973ae516a1a04b 1756906076936
3 ca476acc1300a8d1928b50d1340613ae 1756906077026
4 c4aa7eab5911b0dd511a39e0941079db 1756906077109
Test accuracy: 0.5806451612903226
Loop 5: Optimizing prompt...
arize.utils.logging | INFO | Experiment started.
running tasks |██████████| 123/123 (100.0%) | ⏳ 01:20<00:00 | 2.44it/s
arize.utils.logging | INFO | Task runs completed.
Tasks Summary (09/03/25 06:29 AM -0700)
---------------------------------------
   n_examples  n_runs  n_errors
0         123     123         0
running tasks |██████████| 123/123 (100.0%) | ⏳ 01:21<00:00 | 1.51it/s
arize.utils.logging | INFO | All evaluators completed.
id example_id \
0 EXP_ID_c7b2e6 dd17a2a0-2912-4d4c-bf7e-b3b3027b75fe
1 EXP_ID_456c75 c820437c-026e-4cf5-9e0a-1d190fa39fae
2 EXP_ID_1529e6 787aa17a-da44-4c75-b4fd-cc303be0bd56
3 EXP_ID_cfae9d 36e38c54-b02b-4219-94d1-8da3bb5e14d9
4 EXP_ID_aaa517 ce75bd3c-9abe-4e4c-aaea-3f1fa1361886
result result.trace.id \
0 Privacy Policy Question 91b6ec2bf1eded6fb42939356b792c74
1 Billing Inquiry 7b65b544896bfc7ab59968096bc4014f
2 **Product Return** c9e22a245f9b4e5f8a15e3a5206f4850
3 Billing Inquiry 39906e374314920e46969e57f89052b3
4 Feature Request 2cdbd3896a94862ab3b2a70f362fd3a1
result.trace.timestamp eval.output_evaluator.score \
0 1756906106208 1.0
1 1756906107240 1.0
2 1756906108210 1.0
3 1756906109132 1.0
4 1756906110133 1.0
eval.output_evaluator.label \
0 correct
1 correct
2 correct
3 correct
4 correct
eval.output_evaluator.explanation \
0 correctness: correct;\n explanation: Th...
1 correctness: correct;\n explanation: Th...
2 correctness: correct;\n explanation: Th...
3 correctness: correct;\n explanation: Th...
4 correctness: correct;\n explanation: Th...
eval.output_evaluator.trace.id eval.output_evaluator.trace.timestamp \
0 281724994b5b0281627f98bb5531fef0 1756906186423
1 4aff9b3dcf136bcd1f27379966bda3bb 1756906186508
2 ed2849d87ad2ad9d5c5bd2f1f5beed53 1756906186586
3 832e8be56da6f5b740ed86d7d0a8e192 1756906186659
4 a75ef3c513048e9facb9f86d4c588cb0 1756906186750
eval.test_evaluator.score eval.test_evaluator.label \
0 1.0 True
1 1.0 True
2 0.0 False
3 1.0 True
4 1.0 True
eval.test_evaluator.explanation eval.test_evaluator.trace.id \
0 placeholder 7be48e1df79272e915b74dca2bac7897
1 placeholder 0d9e4d7136eb396504985a728870d7fd
2 placeholder 23775fcc87cc433ef34c2b4d94ddde07
3 placeholder c4e46aba3b1506771fa271ba210f58e6
4 placeholder f0874563b1e212d22dada526e463cd17
eval.test_evaluator.trace.timestamp
0 1756906186425
1 1756906186509
2 1756906186590
3 1756906186663
4 1756906186753
running experiment evaluations |██████████| 246/246 (100.0%) | ⏳ 01:05<00:00 | 3.77it/s
Training accuracy: 0.8130081300813008
Running annotator...
['query', 'ground_truth', 'created_at', 'updated_at', 'id', '__index_level_0__', 'correctness', 'explanation', 'confusion_reason', 'error_type', 'evidence_span', 'prompt_fix_suggestion', 'output']
Creating batches with 90,000 token limit
Processing 123 examples in 1 batches
Batch 1/1: Optimized
arize.utils.logging | INFO | Experiment started.
running tasks |██████████| 31/31 (100.0%) | ⏳ 00:27<00:00 | 1.15it/s
arize.utils.logging | INFO | Task runs completed.
Tasks Summary (09/03/25 06:32 AM -0700)
---------------------------------------
   n_examples  n_runs  n_errors
0          31      31         0
running tasks |██████████| 31/31 (100.0%) | ⏳ 00:29<00:00 | 1.06it/s
arize.utils.logging | INFO | All evaluators completed.
running experiment evaluations |██████████| 31/31 (100.0%) | ⏳ 00:17<00:00 | 1.81it/s
id example_id \
0 EXP_ID_83044b e229198e-c98b-4dee-a5b8-a6f3f36d18e4
1 EXP_ID_52ef60 0e780f04-2457-469c-9bc9-954db24b3583
2 EXP_ID_d1e5b6 abe65d8c-297a-4759-b187-2f48cd510b3d
3 EXP_ID_709e32 8287e65d-8fb9-470f-9bcf-cc68491f230a
4 EXP_ID_69a659 8fc5bd05-f4fa-41b4-894e-2d76955a5cf1
result result.trace.id \
0 Login Issues 35bd4b72241ec8aa9db44eb730deda39
1 Subscription Upgrade/Downgrade a0bd73ae0f449c51d116700e0dc7c267
2 Data Export a547a17c4d280bb4e518008f0e28a6e4
3 Password Reset e08015917c6acf5199dbe40e55d463c2
4 Login Issues 2fa27b32aeee195eb139f26f3b4ee918
result.trace.timestamp eval.test_evaluator.score \
0 1756906320028 0.0
1 1756906321039 0.0
2 1756906321997 1.0
3 1756906322968 0.0
4 1756906323983 1.0
eval.test_evaluator.label eval.test_evaluator.explanation \
0 False placeholder
1 False placeholder
2 True placeholder
3 False placeholder
4 True placeholder
eval.test_evaluator.trace.id eval.test_evaluator.trace.timestamp
0 6e1d473957f18fb9470bc9c36da08db9 1756906347459
1 929f2526d03aac70c101f63ab3eed51c 1756906347535
2 9e7fbf6dc56defae863e7a6f4d95ac8f 1756906347630
3 05e28dc0576349c457bedee1541d59fd 1756906347704
4 226fb91e1e1ee7e82ebdecb10f379284 1756906347780
Test accuracy: 0.5161290322580645
Prompt Optimized!
The code below picks the prompt with the highest score on the test set, and displays the training/test metrics and delta for that prompt.
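A minimal sketch of that selection step, assuming the prompts and their test scores are kept in two parallel lists with the base prompt at index 0. The function name `pick_best`, the prompt strings, and the scores below are illustrative placeholders, not results from this notebook.

```python
def pick_best(prompts, test_metrics):
    """Return (best_prompt, best_score, delta_vs_initial)."""
    # max() returns the first maximum, so ties favor the earlier prompt.
    best_idx = max(range(len(test_metrics)), key=test_metrics.__getitem__)
    best_score = test_metrics[best_idx]
    return prompts[best_idx], best_score, best_score - test_metrics[0]


# Placeholder values purely for illustration.
prompts = ["base prompt", "iteration 1 prompt", "iteration 2 prompt"]
test_metrics = [0.50, 0.60, 0.55]

best_prompt, best_score, delta = pick_best(prompts, test_metrics)
print(f"Best prompt: {best_prompt}")
print(f"Initial test accuracy: {test_metrics[0]}")
print(f"Optimized test accuracy: {best_score} (delta {delta:.4f})")
```

Because the best prompt is chosen by its test-set score, the reported test accuracy is slightly optimistic; a separate validation split would give a cleaner estimate if one is available.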
Best Prompt Found:
support query: {query}
Account Creation
Login Issues
Password Reset
Two-Factor Authentication
Profile Updates
Billing Inquiry
Refund Request
Subscription Upgrade/Downgrade
Payment Method Update
Invoice Request
Order Status
Shipping Delay
Product Return
Warranty Claim
Technical Bug Report
Feature Request
Integration Help
Data Export
Security Concern
Terms of Service Question
Privacy Policy Question
Compliance Inquiry
Accessibility Support
Language Support
Mobile App Issue
Desktop App Issue
Email Notifications
Marketing Preferences
Beta Program Enrollment
General Feedback
Return just the category, no other text.
Initial Test Accuracy: 0.5806451612903226
Optimized Test Accuracy: 0.7096774193548387 (Δ 0.1290)