Weaviate
Query Agent Evaluation With TruLens

Tags: vector-search, vector-database, retrieval-augmented-generation, llm-frameworks, operations, trulens, function-calling, weaviate-recipes, integrations, Python, generative-ai

Query Agent Evaluation

In this recipe, we extend the E-Commerce Query Agent and show how to evaluate and improve its performance.

We will use TruLens to trace and evaluate the agent. By using TruLens we can identify opportunities to improve the agent, track our experiments, and compare across versions of the app.

To evaluate the agent, we can access metadata from the intermediate steps in the trace. Then, we can use this metadata to evaluate things like the relevance of the filter used by the query agent, or the relevance of the intermediate search results.

Custom evaluations are particularly valuable here, because they allow us to easily extend existing feedbacks to unique scenarios. In this example, we show how to record a Query Agent run. We also show how to use custom instructions to customize an existing LLM judge to provide tailored feedback for our situation.

By evaluating this e-commerce agent, we can identify opportunities for improvement when the search results include items that do not match what the customer is looking for.

Follow along!

1. Setup Keys and Install Packages
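
A minimal sketch of the key setup, assuming the credentials are supplied through environment variables. The variable names and the install line in the comment are assumptions made for illustration, not taken from the recipe.

```python
import os

# Assumed install line for the packages used in this recipe:
#   pip install "weaviate-client[agents]" trulens-core trulens-providers-openai trulens-dashboard

# Assumed variable names; adjust them to match your own setup.
REQUIRED_KEYS = ("WEAVIATE_URL", "WEAVIATE_API_KEY", "OPENAI_API_KEY")


def missing_keys(env=None):
    """Return the required keys that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]


missing = missing_keys()
if missing:
    print("Set these environment variables before continuing:", ", ".join(missing))
```

Checking for missing keys up front avoids a less obvious authentication error later in the run.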

2. Create the Weaviate client
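
One way to create the client, sketched under the assumption that the cluster runs on Weaviate Cloud and that the URL and API key live in the hypothetical environment variables `WEAVIATE_URL` and `WEAVIATE_API_KEY`:

```python
import os


def connect_to_cluster(env=None):
    """Open a Weaviate Cloud connection using credentials from `env`."""
    env = os.environ if env is None else env
    # Read the settings first so a missing variable fails fast with a KeyError.
    cluster_url = env["WEAVIATE_URL"]
    api_key = env["WEAVIATE_API_KEY"]

    # Imported lazily so this sketch can be loaded without weaviate-client installed.
    import weaviate
    from weaviate.classes.init import Auth

    return weaviate.connect_to_weaviate_cloud(
        cluster_url=cluster_url,
        auth_credentials=Auth.api_key(api_key),
    )
```

Call `client = connect_to_cluster()` once and reuse the client for the remaining steps. Remember to call `client.close()` when you are done.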

3. Prepare the Collections

4. Create the Query Agent
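
A minimal sketch of creating the agent, assuming the `QueryAgent` class from the Weaviate agents extra. The collection names `ECommerce` and `Brands` are placeholders for whatever collections were prepared in the previous step.

```python
def build_query_agent(client, collections=("ECommerce", "Brands"), system_prompt=None):
    """Create a Query Agent over the given collections."""
    # Imported lazily so the sketch loads without the agents extra installed.
    from weaviate.agents.query import QueryAgent

    kwargs = {"client": client, "collections": list(collections)}
    if system_prompt is not None:
        # Optional guidance for the agent; left out when not provided.
        kwargs["system_prompt"] = system_prompt
    return QueryAgent(**kwargs)
```

Keeping `system_prompt` as a parameter makes it easy to build a second version of the agent later with different instructions.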

5. Set up logging

By default, TruLens logs traces and evaluations locally in a SQLite file.

TruLens can also connect to any SQLAlchemy-compatible database to store the logs, including Snowflake.
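
As a hedged sketch, a session can be pointed at a database through a SQLAlchemy-style URL. The helper names below are hypothetical. The `database_url` argument to `TruSession` is used here on the assumption that a recent TruLens release is installed.

```python
from pathlib import Path


def sqlite_url(path="default.sqlite"):
    """Build a SQLAlchemy-style URL for a local SQLite file."""
    return f"sqlite:///{Path(path).as_posix()}"


def start_session(database_url=None):
    """Start a TruLens session that logs to `database_url` (local SQLite if omitted)."""
    # Imported lazily so the URL helper works without TruLens installed.
    from trulens.core import TruSession

    return TruSession(database_url=database_url or sqlite_url())
```

To log somewhere else, pass the URL of that database to `start_session` instead of relying on the local file.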

6. Define evaluations

For this example, we will define three feedback functions to use as evaluators.

  • Answer relevance evaluates the relevance of the final answer end-to-end
  • Filter relevance assesses the effectiveness of the filter the agent produces as an intermediate step
  • Context relevance evaluates the relevance of the search results the agent produces as an intermediate step

In addition to scoring relevance, all three feedback functions also provide chain-of-thought reasons that explain the score. These explanations, along with the scores, are accessible in the TruLens dashboard.
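
To make the shape of such an evaluator concrete, here is a toy sketch. It returns a score between 0 and 1 together with a dictionary of reasons, which is the same shape a chain-of-thought judge returns. The token-overlap scoring is only a stand-in for the LLM judge and is not a real relevance measure. The function name is hypothetical.

```python
import re


def _tokens(text):
    """Lowercase alphanumeric tokens of `text`, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def toy_relevance_with_reasons(question, candidate):
    """Score how much of the question's vocabulary appears in `candidate`.

    Returns (score in [0, 1], reasons dict). Token overlap is a placeholder
    for the LLM judge used in the recipe.
    """
    asked = _tokens(question)
    if not asked:
        return 0.0, {"reason": "empty question"}
    shared = asked & _tokens(candidate)
    score = len(shared) / len(asked)
    return score, {"reason": f"matched terms: {sorted(shared)}"}


# The same scorer can be pointed at the answer, the filter, or the search results.
filter_score, filter_reasons = toy_relevance_with_reasons(
    "red dress", "category = dress AND color = red"
)
```

The useful part of the pattern is the return shape: the score drives aggregation, while the reasons explain each individual result.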

7. Register the agent

Registering the agent allows us to track our experiments. We give it a name and a version, and list the feedback functions we want to use to evaluate application runs.
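
A hedged sketch of the registration step, assuming the `TruApp` wrapper from a recent TruLens release (older releases expose a similar wrapper under a different import path, and some versions may need extra arguments). The app name and helper names are placeholders.

```python
def registration_kwargs(app_version, feedbacks, app_name="Query Agent"):
    """Collect the name, version, and feedback functions used to register the app."""
    return {
        "app_name": app_name,
        "app_version": app_version,
        "feedbacks": list(feedbacks),
    }


def register_agent(agent, app_version, feedbacks):
    """Wrap `agent` so its runs are traced and evaluated under `app_version`."""
    # Imported lazily so the helper above works without TruLens installed.
    from trulens.apps.app import TruApp

    return TruApp(agent, **registration_kwargs(app_version, feedbacks))
```

Using a distinct `app_version` for each experiment is what later makes side-by-side comparison possible.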

8. Run and record the agent

Using the tru_agent we just defined as a context manager, we can run our agent. This records the traces and kicks off evaluations each time the agent runs.
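
The recording pattern can be sketched as follows. The helper works with any context manager, so the dry run below uses `nullcontext` and a stand-in agent in place of `tru_agent` and the real Query Agent. The sample query is made up for illustration.

```python
from contextlib import nullcontext


def run_and_record(recorder, agent, queries):
    """Run each query inside the recorder's context and collect the answers."""
    answers = []
    with recorder:
        for query in queries:
            answers.append(agent.run(query))
    return answers


class EchoAgent:
    """Stand-in agent so the sketch runs without Weaviate."""

    def run(self, query):
        return f"answer to: {query}"


# Dry run without TruLens: nullcontext records nothing.
dry_run = run_and_record(nullcontext(), EchoAgent(), ["vintage shoes under $70"])
```

With the real objects, the call would be `run_and_record(tru_agent, agent, queries)`, so every call made inside the `with` block is traced and evaluated.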

9. Run the dashboard

TruLens ships with a Streamlit dashboard that can be launched using run_dashboard. The dashboard reads from the evaluation and trace logs.

This is the primary way you can explore traces, view evaluation results, and compare app versions.
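
A minimal sketch of launching the dashboard, assuming `session` is the TruLens session created earlier. The wrapper name `launch_dashboard` is hypothetical.

```python
def launch_dashboard(session, port=None):
    """Start the TruLens dashboard for `session`, optionally on a fixed port."""
    # Imported lazily so the sketch loads without the dashboard package installed.
    from trulens.dashboard import run_dashboard

    return run_dashboard(session, port=port)
```

Leaving `port` unset lets TruLens choose a port. Pass an explicit value if you need a stable URL.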

10. Identify an issue using the TruLens dashboard

By evaluating the query agent, we notice it occasionally returns non-clothing items even though the customer specifically asks for clothing.

[Image: trulens identify issues]

11. Improve the agent

Let's add an instruction to the system prompt that guides the agent to return only the type of result the user is looking for.
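
As a sketch of what that change could look like, the snippet below appends one extra rule to a base prompt. Both prompt strings are hypothetical wording written for illustration. The resulting string would then be passed to the agent as its system prompt.

```python
# Hypothetical base prompt; the recipe's actual wording is not shown here.
BASE_PROMPT = "You are a helpful assistant for an e-commerce store."

# The added guidance: keep results to the item type the customer asked for.
ITEM_TYPE_RULE = (
    "Only return items of the type the customer asks for. "
    "If the customer asks for clothing, do not return non-clothing items."
)


def improved_system_prompt(base=BASE_PROMPT, rule=ITEM_TYPE_RULE):
    """Append the item-type rule to the base prompt, separated by a blank line."""
    return f"{base.rstrip()}\n\n{rule}"
```

Keeping the rule separate from the base prompt makes the difference between the two app versions explicit and easy to revert.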

12. Validate performance

Last, we'll register the improved version of the app and validate the performance improvement.

In the dashboard, we can compare application versions and their evaluation results.

Comparing the two versions, we see that context relevance improves.
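
The same comparison can be sketched outside the dashboard by averaging a metric per app version. The row layout below (one dict per record with an `app_version` key and one key per metric) is an assumption made for illustration, not the TruLens export format.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_version(rows, metric):
    """Average `metric` across rows, grouped by each row's app_version."""
    grouped = defaultdict(list)
    for row in rows:
        # Skip records where this metric was not computed.
        if metric in row:
            grouped[row["app_version"]].append(row[metric])
    return {version: mean(scores) for version, scores in grouped.items()}
```

Calling `mean_score_by_version(rows, "Context Relevance")` on records from both versions gives one average per version, which is the number to compare before and after the prompt change.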

[Image: trulens validate]