Arize AI
Llama Support Query Optimization

🪣 Clone the Prompt-Learning Repository

This section clones the Arize Prompt-Learning repo and adds it to sys.path
so that modules like optimizer_sdk, phoenix, and utilities can be imported directly.
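A minimal sketch of this step is shown here. The repository URL and the helper name `clone_and_add_to_path` are assumptions made for illustration, not taken from the notebook's own cell; adjust them to match the repo you actually clone.

```python
import subprocess
import sys
from pathlib import Path

# Assumed repository URL; replace it with the one you actually use.
REPO_URL = "https://github.com/Arize-ai/prompt-learning.git"


def clone_and_add_to_path(repo_url: str, dest: str) -> Path:
    """Clone repo_url into dest (unless dest already exists) and put dest on sys.path."""
    repo_dir = Path(dest).resolve()
    if not repo_dir.exists():
        # Shallow clone: the notebook only needs the current source tree.
        subprocess.run(
            ["git", "clone", "--depth", "1", repo_url, str(repo_dir)],
            check=True,
        )
    if str(repo_dir) not in sys.path:
        # Insert at the front so the repo's modules win over similarly named packages.
        sys.path.insert(0, str(repo_dir))
    return repo_dir
```

Calling `clone_and_add_to_path(REPO_URL, "prompt-learning")` once per runtime should then make the repo's top-level modules importable. The function is safe to re-run because it skips the clone and the path insert when they have already happened.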

[ ]

⚙️ Install Dependencies

Install all required packages for Google ADK, LiteLLM, Phoenix SDK, and Vertex AI integration.
These libraries enable you to run and evaluate LLMs through Vertex AI and track results in Arize Phoenix.
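Before installing, it can help to check which packages are already present in the runtime. The sketch below assumes a particular mapping from pip package names to import names; the exact package list and any version pins used by the notebook are not shown here, so treat the mapping as illustrative.

```python
import importlib.util

# Assumed mapping from pip package name to importable module name.
REQUIRED = {
    "google-adk": "google.adk",
    "litellm": "litellm",
    "arize-phoenix": "phoenix",
    "google-cloud-aiplatform": "vertexai",
}


def missing_packages(required: dict) -> list:
    """Return the pip names whose modules cannot be found in this environment."""
    missing = []
    for pip_name, module_name in required.items():
        try:
            found = importlib.util.find_spec(module_name) is not None
        except (ImportError, ValueError):
            # find_spec raises if a parent package (e.g. "google") is absent.
            found = False
        if not found:
            missing.append(pip_name)
    return missing
```

In a notebook you could then run `%pip install` followed by the names returned from `missing_packages(REQUIRED)`.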

[ ]

🔐 Authenticate and Connect Phoenix

Authenticate your Google Cloud account and connect to your Phoenix workspace.
You’ll be prompted for:

  • Phoenix Collector Endpoint
  • Phoenix API Key

These allow experiment tracking and dataset creation within Phoenix.
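One way to wire the two prompted values into the environment is sketched below. `PHOENIX_COLLECTOR_ENDPOINT` and `PHOENIX_API_KEY` are the environment variables the Phoenix client reads; the helper `configure_phoenix` and its validation rules are assumptions added for illustration.

```python
import os


def configure_phoenix(endpoint: str, api_key: str, env=None):
    """Validate the two values and store them where the Phoenix client looks for them."""
    env = os.environ if env is None else env
    endpoint = endpoint.strip().rstrip("/")
    api_key = api_key.strip()
    if not endpoint.startswith(("http://", "https://")):
        raise ValueError("Phoenix Collector Endpoint must start with http:// or https://")
    if not api_key:
        raise ValueError("Phoenix API Key must not be empty")
    env["PHOENIX_COLLECTOR_ENDPOINT"] = endpoint
    env["PHOENIX_API_KEY"] = api_key
    return env


# Typical interactive use (getpass keeps the key out of the cell output):
#   from getpass import getpass
#   configure_phoenix(input("Phoenix Collector Endpoint: "), getpass("Phoenix API Key: "))
```

Passing a plain dict as `env` makes the helper easy to try without touching the real process environment.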

[ ]
[ ]
[ ]
[ ]

☁️ Configure Vertex AI Environment

Set up your Google Cloud project, region, and GCS bucket for Vertex AI.
This ensures that all Vertex and ADK calls use your correct project context.

[ ]
PROJECT_ID
LOCATION
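The project context can be expressed as environment variables, sketched below. To my understanding, `GOOGLE_CLOUD_PROJECT`, `GOOGLE_CLOUD_LOCATION`, and `GOOGLE_GENAI_USE_VERTEXAI` are read by Google ADK, and `VERTEXAI_PROJECT` and `VERTEXAI_LOCATION` by LiteLLM; verify these against the versions you install. The helper name `configure_vertex` is hypothetical, and the GCS bucket is left out because the notebook's variable name for it is not shown.

```python
import os


def configure_vertex(project_id: str, location: str, env=None):
    """Export PROJECT_ID and LOCATION under the names ADK and LiteLLM are assumed to read."""
    env = os.environ if env is None else env
    if not project_id or not location:
        raise ValueError("Both PROJECT_ID and LOCATION are required")
    # Assumed to be read by Google ADK / the Google GenAI SDK.
    env["GOOGLE_CLOUD_PROJECT"] = project_id
    env["GOOGLE_CLOUD_LOCATION"] = location
    env["GOOGLE_GENAI_USE_VERTEXAI"] = "TRUE"
    # Assumed to be read by LiteLLM's Vertex AI provider.
    env["VERTEXAI_PROJECT"] = project_id
    env["VERTEXAI_LOCATION"] = location
    return env
```

Setting the values once, up front, avoids passing the project and region into every later call.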

🧠 Define the System Prompt and Upload to Phoenix

Here we define the classifier prompt listing all supported support-ticket categories.
Then we upload it to Phoenix Prompt Hub using upload_prompt_phoenix(),
which versions and stores the prompt for tracking.
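As an illustration of what such a classifier prompt might look like, the sketch below builds one from a list of categories. The wording of the prompt and the helper `build_classifier_prompt` are hypothetical, and the category names in the usage note are placeholders, not the notebook's actual list. The call to upload_prompt_phoenix() is omitted because its signature is not shown here.

```python
def build_classifier_prompt(categories):
    """Return a system prompt that asks for exactly one of the given categories."""
    if not categories:
        raise ValueError("At least one category is required")
    bullet_list = "\n".join(f"- {name}" for name in categories)
    return (
        "You are a support-ticket classifier.\n"
        "Assign the query to exactly one of these categories:\n"
        f"{bullet_list}\n"
        "Reply with the category name only."
    )
```

For example, `build_classifier_prompt(["Billing", "Login Issue"])` (placeholder names) yields a prompt with one bullet per category. Asking for the category name only keeps the output easy to compare against a ground-truth label.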

[ ]

📊 Load and Split Dataset

Load the support_queries.csv dataset, then split it into training (70%) and test (30%) sets.
Each dataset is uploaded to Phoenix as a tracked dataset for experimentation.
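The 70/30 split can be sketched as follows. This version uses only the standard library so it runs anywhere; the notebook itself may use pandas, and the fixed seed is an assumption added so the split is reproducible. The upload to Phoenix is not shown.

```python
import csv
import random


def load_rows(path):
    """Read a CSV file into a list of dicts, one per row."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def train_test_split(rows, train_frac=0.7, seed=42):
    """Shuffle a copy of rows deterministically and cut it at train_frac."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]
```

With `rows = load_rows("support_queries.csv")`, calling `train_test_split(rows)` returns the training and test lists. Shuffling before cutting matters when the CSV is sorted by category, since an unshuffled cut would leave some classes out of one split.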

[ ]

🤖 Create Agent, Runner, and Session Utilities

Initialize the Llama-3.3-70B-Instruct (Vertex) model via LiteLlm.
Define helper functions to:

  • Create a reusable agent + runner
  • Manage sessions (get_or_create_session)
  • Generate completions
  • Wrap tasks for experiment execution
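The get-or-create pattern behind get_or_create_session can be sketched as below. `InMemorySessions` is a hypothetical stand-in written for this example, not the ADK session service; it only assumes that a lookup returns `None` when no session exists, which you should confirm against the service you actually use.

```python
import asyncio


class InMemorySessions:
    """Hypothetical stand-in for a session service, keyed by (app, user, session)."""

    def __init__(self):
        self._store = {}

    async def get_session(self, *, app_name, user_id, session_id):
        return self._store.get((app_name, user_id, session_id))

    async def create_session(self, *, app_name, user_id, session_id):
        session = {"id": session_id, "events": []}
        self._store[(app_name, user_id, session_id)] = session
        return session


async def get_or_create_session(service, app_name, user_id, session_id):
    """Reuse the existing session if there is one, otherwise create it."""
    session = await service.get_session(
        app_name=app_name, user_id=user_id, session_id=session_id
    )
    if session is None:
        session = await service.create_session(
            app_name=app_name, user_id=user_id, session_id=session_id
        )
    return session
```

Outside a notebook, `asyncio.run(get_or_create_session(InMemorySessions(), "app", "user", "s1"))` exercises the helper; inside a notebook you would `await` it directly. Reusing sessions this way avoids creating a new one for every completion call.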
[ ]

🧪 Phoenix Experiments

Iterating on a prompt means re-testing it after every change. Phoenix lets you run a prompt over a large dataset as an experiment and score the results with LLM and code evaluators.

To run an experiment, you must define a task function and one or more evaluator functions.

task function: The task defines how the experiment generates its output. Here, it calls our completion function with the proper input.

evaluators: We define two evaluators.

test_evaluator: This is a simple code evaluator that compares the generated class with the ground truth class. This gives us accuracy.

output_evaluator: This is an LLM evaluator that generates the feedback used for optimization. It tries to explain why an output was wrong and why the model made that decision. You can see the entire prompt used for evaluation in prompts -> support_query_classification -> simple_evaluator_prompt.txt.

We'll use a stronger Llama model, Llama-3.3-70B-Instruct, for evals. Better evaluations give the optimizer better feedback, which lets us build a prompt that works even with a smaller model.
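A code evaluator in the role of test_evaluator can be sketched as below. The function name `exact_match_evaluator`, the `ground_truth` key, and the normalization rule are assumptions for this example. The parameter names `output` and `expected` follow the convention Phoenix uses to bind evaluator arguments, to my understanding.

```python
def normalize(label) -> str:
    """Make label comparison insensitive to case and surrounding whitespace."""
    return str(label).strip().lower()


def exact_match_evaluator(output, expected) -> float:
    """Score 1.0 when the generated class equals the ground-truth class, else 0.0."""
    truth = expected.get("ground_truth", "")  # assumed key name
    return 1.0 if normalize(output) == normalize(truth) else 0.0


def accuracy(outputs, expecteds) -> float:
    """Mean of the per-example scores, i.e. the experiment's accuracy."""
    scores = [exact_match_evaluator(o, e) for o, e in zip(outputs, expecteds)]
    return sum(scores) / len(scores) if scores else 0.0
```

Averaging the 0/1 scores over a dataset is what turns the per-example comparison into an accuracy figure.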

[ ]
[ ]
[ ]

🧾 Process Experiment Results

Fetch completed experiment data from Phoenix via its REST API and merge the results
back into a pandas DataFrame. This step adds columns such as feedback, ground_truth,
and output for use in later optimization.
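The merge itself can be sketched as below, using plain lists of dicts rather than a DataFrame so the example has no dependencies. The record shape, the `example_id` join key, and the helper `merge_results` are hypothetical; the REST call is omitted because its endpoint and response format are not shown here.

```python
def merge_results(rows, runs, id_key="example_id"):
    """Attach each run's output and feedback to the dataset row with the same id."""
    runs_by_id = {run[id_key]: run for run in runs}
    merged = []
    for row in rows:
        run = runs_by_id.get(row[id_key], {})
        merged.append(
            {
                **row,  # keeps the original columns, including ground_truth
                "output": run.get("output"),
                "feedback": run.get("feedback"),
            }
        )
    return merged
```

Rows without a matching run keep `None` in the new columns, which makes incomplete experiments easy to spot before optimization.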

[ ]

Unfortunately, as of now, Prompt Learning only supports OpenAI models for its meta-prompting stage. We will add support for other models, like Llama, soon!

[ ]

🔁 Define the Optimization Loop

The optimize_loop() function:

  1. Runs model + evaluators on the dataset.
  2. Retrieves feedback from Phoenix.
  3. Uses GPT-4o to refine the system prompt.
  4. Uploads the improved prompt to Phoenix.
  5. Repeats for multiple loops (default = 5).
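The control flow of the steps above can be sketched as follows. This is a skeleton under stated assumptions, not the notebook's optimize_loop(): the three callables (`run_experiment`, `refine_prompt`, `upload_prompt`) are hypothetical placeholders for the experiment run, the GPT-4o refinement, and the Phoenix upload.

```python
def optimize_loop(initial_prompt, run_experiment, refine_prompt, upload_prompt, loops=5):
    """Evaluate, refine, and upload the prompt `loops` times.

    Returns (prompts, accuracies). prompts[i] is the prompt evaluated in
    iteration i and accuracies[i] is its score; the last entry of prompts is
    the final refined prompt, which has not been evaluated yet.
    """
    prompts = [initial_prompt]
    accuracies = []
    prompt = initial_prompt
    for _ in range(loops):
        accuracy, feedback = run_experiment(prompt)  # steps 1 and 2
        accuracies.append(accuracy)
        prompt = refine_prompt(prompt, feedback)     # step 3
        upload_prompt(prompt)                        # step 4
        prompts.append(prompt)
    return prompts, accuracies
```

Passing the steps in as functions keeps the loop itself free of API details, so it can be exercised with simple stand-ins before wiring in real model calls.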
[ ]

See your prompts and their accuracies!

In the Phoenix UI, you can view your experiments and their accuracies to see how much your prompts improved after each iteration of prompt optimization.

To grab the prompts associated with those experiment runs, you can index into the prompts array you generated above.
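Picking the best-scoring prompt by index can be sketched as below. It assumes you kept a list of accuracies in the same order as the prompts array; that list and the helper `best_prompt` are hypothetical names for this example.

```python
def best_prompt(prompts, accuracies):
    """Return (index, prompt) for the highest accuracy; the earliest wins on ties."""
    if not accuracies or len(prompts) < len(accuracies):
        raise ValueError("Need at least one accuracy and a prompt for each accuracy")
    best_index = max(range(len(accuracies)), key=accuracies.__getitem__)
    return best_index, prompts[best_index]
```

The returned index also identifies the matching experiment run, assuming runs were recorded in the same order.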

[ ]