10 Autogen Evaluation
AutoGen Agents in Production: Observability & Evaluation
In this tutorial, we will learn how to monitor the internal steps (traces) of AutoGen agents and evaluate their performance using Langfuse.
This guide covers online and offline evaluation metrics used by teams to bring agents to production fast and reliably.
Why AI agent evaluation is important:
- Debugging issues when tasks fail or produce suboptimal results
- Monitoring costs and performance in real-time
- Improving reliability and safety through continuous feedback
Step 1: Set Environment Variables
Get your Langfuse API keys by signing up for Langfuse Cloud or self-hosting Langfuse.
Note: Self-hosters can use Terraform modules to deploy Langfuse on Azure. Alternatively, you can deploy Langfuse on Kubernetes using the Helm chart.
This step requires the langfuse and openlit packages, installed with pip.
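A minimal sketch of setting the variables, assuming the standard Langfuse names LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST. The key values are placeholders, and the host shown is an assumption (the EU Langfuse Cloud endpoint); use your own region or self-hosted URL.

```python
import os

# Placeholder credentials: replace with the keys from your Langfuse project settings.
os.environ.setdefault("LANGFUSE_PUBLIC_KEY", "pk-lf-...")
os.environ.setdefault("LANGFUSE_SECRET_KEY", "sk-lf-...")
# Assumed host (EU cloud region); point this at your own deployment when self-hosting.
os.environ.setdefault("LANGFUSE_HOST", "https://cloud.langfuse.com")

required = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY", "LANGFUSE_HOST")
missing = [name for name in required if not os.environ.get(name)]
print("Missing variables:", missing or "none")
```

setdefault leaves any value already exported in your shell untouched, so real keys are not overwritten by the placeholders.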
With the environment variables set, we can now initialize the Langfuse client. get_client() initializes the Langfuse client using the credentials provided in the environment variables.
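As a sketch, the initialization could look like the following, assuming an installed Langfuse SDK version that provides get_client() and auth_check(). The wrapper function init_langfuse_client is a hypothetical helper introduced here so the import only happens when it is called.

```python
def init_langfuse_client():
    """Create the Langfuse client from environment variables and verify the credentials."""
    # Imported lazily so this sketch can be loaded without the package installed.
    from langfuse import get_client

    client = get_client()
    # auth_check() sends a request to the Langfuse API, so it needs network access.
    if client.auth_check():
        print("Langfuse client is authenticated and ready!")
    else:
        print("Authentication failed. Please check your credentials and host.")
    return client
```

Calling langfuse = init_langfuse_client() in a notebook cell should print the confirmation message when the keys are valid.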
Langfuse client is authenticated and ready!
Step 2: Initialize OpenLit Instrumentation
Now, we initialize the OpenLit instrumentation. OpenLit automatically captures AutoGen operations and exports OpenTelemetry (OTel) spans to Langfuse.
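A hedged sketch of this step follows. It assumes that openlit.init() accepts tracer and disable_batch arguments and that the Langfuse client exposes its OpenTelemetry tracer as the private attribute _otel_tracer; both may differ between versions, so check the documentation for the versions you have installed. init_openlit is a hypothetical helper name.

```python
def init_openlit(langfuse_client):
    """Route OpenLit's automatic instrumentation through the Langfuse tracer."""
    # Imported lazily so this sketch can be loaded without the package installed.
    import openlit

    # Assumption: _otel_tracer is a private attribute and may change between SDK versions.
    # disable_batch=True sends spans immediately, which is convenient in notebooks.
    openlit.init(tracer=langfuse_client._otel_tracer, disable_batch=True)
```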
Step 3: Run your agent
Now we set up a multi-turn agent to test our instrumentation.
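One way to build such a setup is sketched below with the AutoGen AgentChat API (AssistantAgent, RoundRobinGroupChat, TextMentionTermination). The agent names match the output shown later in this guide, but the system messages and the helper names build_meal_team and run_conversation are assumptions made for illustration.

```python
def build_meal_team(model_client):
    """Build a two-agent round-robin team that stops once 'APPROVE' is mentioned."""
    # Lazy imports: these require the autogen-agentchat package.
    from autogen_agentchat.agents import AssistantAgent
    from autogen_agentchat.conditions import TextMentionTermination
    from autogen_agentchat.teams import RoundRobinGroupChat

    meal_planner = AssistantAgent(
        "meal_planner_agent",
        model_client=model_client,
        # Hypothetical prompt, not taken from the original notebook.
        system_message="Suggest one meal with a short ingredient list.",
    )
    nutritionist = AssistantAgent(
        "nutritionist_agent",
        model_client=model_client,
        # Hypothetical prompt, not taken from the original notebook.
        system_message="Review the meal. Reply 'APPROVE' if it is balanced, "
                       "otherwise suggest one change.",
    )
    return RoundRobinGroupChat(
        [meal_planner, nutritionist],
        termination_condition=TextMentionTermination("APPROVE"),
    )


async def run_conversation(team, tasks):
    """Send each task to the team in turn, printing messages and the stop reason."""
    stop_reasons = []
    for task in tasks:
        result = await team.run(task=task)
        for message in result.messages:
            print(message)
        print("Stop Reason:", result.stop_reason)
        stop_reasons.append(result.stop_reason)
    return stop_reasons
```

In a notebook you would pass a model client (for example an OpenAI or Azure OpenAI chat completion client from autogen_ext) to build_meal_team and then await run_conversation(team, ["Create a meal with potatoes", "I am allergic to gluten."]).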
Trace Structure
Langfuse records a trace that contains spans, which represent each step of your agent’s logic. Here, the trace contains the overall agent run and sub-spans for:
- The meal planner agent
- The nutritionist agent
You can inspect these to see precisely where time is spent, how many tokens are used, and so on:

Online Evaluation
Online Evaluation refers to evaluating the agent in a live, real-world environment, i.e. during actual usage in production. This involves monitoring the agent’s performance on real user interactions and analyzing outcomes continuously.
Common Metrics to Track in Production
- Costs — The instrumentation captures token usage, which you can transform into approximate costs by assigning a price per token.
- Latency — Observe the time it takes to complete each step, or the entire run.
- User Feedback — Users can provide direct feedback (thumbs up/down) to help refine or correct the agent.
- LLM-as-a-Judge — Use a separate LLM to evaluate your agent’s output in near real-time (e.g., checking for toxicity or correctness).
Below, we show examples of these metrics.
1. Costs
Below is a screenshot showing usage for gpt-4o-mini calls. This is useful to see costly steps and optimize your agent.
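To illustrate how token usage translates into an approximate cost, here is a small sketch. The token counts are the models_usage values printed in the run output in the User Feedback section of this guide; the prices per million tokens are hypothetical placeholders, so substitute your provider's current rates.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    prompt_tokens: int
    completion_tokens: int


# Hypothetical prices in USD per one million tokens (assumption, not real pricing).
PRICE_PER_M_INPUT = 0.15
PRICE_PER_M_OUTPUT = 0.60


def approximate_cost(usages):
    """Sum prompt and completion tokens and apply a flat price to each kind."""
    prompt = sum(u.prompt_tokens for u in usages)
    completion = sum(u.completion_tokens for u in usages)
    return (prompt * PRICE_PER_M_INPUT + completion * PRICE_PER_M_OUTPUT) / 1_000_000


# Four model calls: two by the meal planner, two by the nutritionist.
usages = [Usage(95, 30), Usage(132, 4), Usage(145, 32), Usage(184, 4)]
cost = approximate_cost(usages)
print(f"Approximate cost: ${cost:.7f}")
```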

2. Latency
We can also see how long it took to complete each step. In the example below, the entire run took about 3 seconds, which you can break down by step. This helps you identify bottlenecks and optimize your agent.
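Outside the Langfuse UI, a rough per-step latency can also be estimated from the created_at timestamps on the agent messages. The sketch below uses the first three timestamps from the run output in the User Feedback section; note that these are message creation times, so the result only approximates the span durations Langfuse records.

```python
from datetime import datetime, timezone

# (source, created_at) pairs copied from the first conversation turn.
events = [
    ("user", datetime(2025, 7, 2, 16, 20, 43, 732669, tzinfo=timezone.utc)),
    ("meal_planner_agent", datetime(2025, 7, 2, 16, 20, 45, 186423, tzinfo=timezone.utc)),
    ("nutritionist_agent", datetime(2025, 7, 2, 16, 20, 45, 581059, tzinfo=timezone.utc)),
]


def step_latencies(events):
    """Return (source, seconds since the previous message) for each step after the first."""
    return [
        (source, (ts - prev_ts).total_seconds())
        for (_, prev_ts), (source, ts) in zip(events, events[1:])
    ]


latencies = step_latencies(events)
total = (events[-1][1] - events[0][1]).total_seconds()
for source, seconds in latencies:
    print(f"{source}: {seconds:.3f} s")
print(f"total: {total:.3f} s")
```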

3. User Feedback
If your agent is embedded into a user interface, you can record direct user feedback (like a thumbs-up/down in a chat UI).
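A sketch of recording such feedback is shown below. It assumes the Langfuse span object offers score_trace(name, value, data_type, comment), as in recent SDK versions; record_user_feedback and the score name "user-feedback" are illustrative choices, not names confirmed by this guide.

```python
def record_user_feedback(span, thumbs_up, comment=None):
    """Attach a thumbs-up (1) or thumbs-down (0) score to the trace of the given span."""
    value = 1 if thumbs_up else 0
    span.score_trace(
        name="user-feedback",  # assumed score name
        value=value,
        data_type="NUMERIC",
        comment=comment,
    )
    return value


# Typical use (assumes a Langfuse client named langfuse and a running agent team):
# with langfuse.start_as_current_span(name="meal-request") as span:
#     ...run the agent and show the answer to the user...
#     record_user_feedback(span, thumbs_up=True, comment="The user liked the meal.")
```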
id='da068880-22ae-4f01-9f01-2bb231939089' source='user' models_usage=None metadata={} created_at=datetime.datetime(2025, 7, 2, 16, 20, 43, 732669, tzinfo=datetime.timezone.utc) content='Create a meal with potatoes' type='TextMessage'
id='ad937ce4-3534-493f-824b-ca9c226b5287' source='meal_planner_agent' models_usage=RequestUsage(prompt_tokens=95, completion_tokens=30) metadata={} created_at=datetime.datetime(2025, 7, 2, 16, 20, 45, 186423, tzinfo=datetime.timezone.utc) content='Potato and Spinach Frittata \n- Eggs \n- Potatoes \n- Fresh spinach \n- Onion \n- Cheese (optional) ' type='TextMessage'
id='50fd33c1-057f-49fe-afad-ee86d164296d' source='nutritionist_agent' models_usage=RequestUsage(prompt_tokens=132, completion_tokens=4) metadata={} created_at=datetime.datetime(2025, 7, 2, 16, 20, 45, 581059, tzinfo=datetime.timezone.utc) content='APPROVE' type='TextMessage'
Stop Reason: Text 'APPROVE' mentioned
id='e371de6c-e5fc-42c1-8eda-e5b8cd5accab' source='user' models_usage=None metadata={} created_at=datetime.datetime(2025, 7, 2, 16, 20, 45, 585048, tzinfo=datetime.timezone.utc) content='I am allergic to gluten.' type='TextMessage'
id='7e20b23a-7434-4704-8cbb-4464bf018aec' source='meal_planner_agent' models_usage=RequestUsage(prompt_tokens=145, completion_tokens=32) metadata={} created_at=datetime.datetime(2025, 7, 2, 16, 20, 46, 700087, tzinfo=datetime.timezone.utc) content='Quinoa and Roasted Vegetable Bowl \n- Quinoa \n- Roasted sweet potatoes \n- Bell peppers \n- Zucchini \n- Avocado ' type='TextMessage'
id='24077bbf-e613-43f8-94f6-c8d73778eaca' source='nutritionist_agent' models_usage=RequestUsage(prompt_tokens=184, completion_tokens=4) metadata={} created_at=datetime.datetime(2025, 7, 2, 16, 20, 47, 66855, tzinfo=datetime.timezone.utc) content='APPROVE' type='TextMessage'
Stop Reason: Text 'APPROVE' mentioned
User feedback is then captured in Langfuse:

4. Automated LLM-as-a-Judge Scoring
LLM-as-a-Judge is another way to automatically evaluate your agent's output. You can set up a separate LLM call to gauge the output’s correctness, toxicity, style, or any other criteria you care about.
Workflow:
- You define an Evaluation Template, e.g., "Check if the text is toxic."
- You set a model that is used as the judge model; in this case gpt-4o-mini, queried via Azure.
- Each time your agent generates output, you pass that output to your "judge" LLM with the template.
- The judge LLM responds with a rating or label that you log to your observability tool.
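The workflow above can be sketched in a few lines. This is not how Langfuse implements its evaluators; it is a minimal illustration in which ask_judge is a hypothetical callable that sends a prompt to whatever judge model you use and returns its text reply.

```python
# Evaluation template based on the example criterion "Check if the text is toxic."
EVALUATION_TEMPLATE = (
    "Check if the text is toxic. Answer with exactly one label: "
    "'toxic' or 'not toxic'.\n\nText:\n{output}"
)


def judge_output(agent_output, ask_judge):
    """Build the judge prompt, query the judge, and normalize its reply to a label."""
    prompt = EVALUATION_TEMPLATE.format(output=agent_output)
    reply = ask_judge(prompt).strip().lower()
    # Check "not toxic" first because it also contains the word "toxic".
    if reply.startswith("not toxic"):
        return "not toxic"
    if reply.startswith("toxic"):
        return "toxic"
    return "unknown"


# Stub judge for demonstration only; replace with a real model call.
label = judge_output("Chicken Alfredo with Spinach", lambda prompt: "Not toxic.")
print(label)
```

The returned label is what you would log to your observability tool as a score on the trace.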
Example from Langfuse:

id='eefc628d-502f-451a-8f70-be486f62f8c5' source='user' models_usage=None metadata={} created_at=datetime.datetime(2025, 7, 2, 16, 38, 29, 171393, tzinfo=datetime.timezone.utc) content='I am a picky eater and not sure if you find something for me.' type='TextMessage'
id='13b3e14b-bcf7-42a5-80d6-54b0c7be765e' source='meal_planner_agent' models_usage=RequestUsage(prompt_tokens=352, completion_tokens=27) metadata={} created_at=datetime.datetime(2025, 7, 2, 16, 38, 30, 433516, tzinfo=datetime.timezone.utc) content='Chicken Alfredo Pasta \n- Gluten-free pasta \n- Grilled chicken breast \n- Heavy cream \n- Parmesan cheese \n- Garlic ' type='TextMessage'
id='550f2dee-0e08-4bbd-b67f-991b467328f1' source='nutritionist_agent' models_usage=RequestUsage(prompt_tokens=386, completion_tokens=17) metadata={} created_at=datetime.datetime(2025, 7, 2, 16, 38, 31, 505173, tzinfo=datetime.timezone.utc) content='Consider incorporating some vegetables, like spinach or broccoli, to increase the nutrient variety.' type='TextMessage'
id='4d249733-56d8-4f74-ab72-2387dbca812a' source='meal_planner_agent' models_usage=RequestUsage(prompt_tokens=402, completion_tokens=30) metadata={} created_at=datetime.datetime(2025, 7, 2, 16, 38, 32, 635637, tzinfo=datetime.timezone.utc) content='Chicken Alfredo with Spinach \n- Gluten-free pasta \n- Grilled chicken breast \n- Heavy cream \n- Parmesan cheese \n- Fresh spinach ' type='TextMessage'
id='12747fbc-b09f-4118-a9ed-420a4c1434e1' source='nutritionist_agent' models_usage=RequestUsage(prompt_tokens=439, completion_tokens=4) metadata={} created_at=datetime.datetime(2025, 7, 2, 16, 38, 33, 245912, tzinfo=datetime.timezone.utc) content='APPROVE' type='TextMessage'
Stop Reason: Text 'APPROVE' mentioned
You can see that the answer of this example is judged as "not toxic".

5. Observability Metrics Overview
All of these metrics can be visualized together in dashboards. This enables you to quickly see how your agent performs across many sessions and helps you to track quality metrics over time.

Offline Evaluation
Online evaluation is essential for live feedback, but you also need offline evaluation—systematic checks before or during development. This helps maintain quality and reliability before rolling changes into production.
Dataset Evaluation
In offline evaluation, you typically:
- Have a benchmark dataset (with prompt and expected output pairs)
- Run your agent on that dataset
- Compare outputs to the expected results or use an additional scoring mechanism
Below, we demonstrate this approach with the search-dataset, a q&a dataset from Hugging Face that contains questions and expected answers.
First few rows of search-dataset:
| | id | question | expected_answer | category | area |
|---|---|---|---|---|---|
| 0 | 20caf138-0c81-4ef9-be60-fe919e0d68d4 | steve jobs statue location budapst | The Steve Jobs statue is located in Budapest, ... | Arts | Knowledge |
| 1 | 1f37d9fd-1bcc-4f79-b004-bc0e1e944033 | Why is the Battle of Stalingrad considered a t... | The Battle of Stalingrad is considered a turni... | General News | News |
| 2 | 76173a7f-d645-4e3e-8e0d-cca139e00ebe | In what year did 'The Birth of a Nation' surpa... | This question is based on a false premise. 'Th... | Entertainment | News |
| 3 | 5f5ef4ca-91fe-4610-a8a9-e15b12e3c803 | How many Russian soldiers surrendered to AFU i... | About 300 Russian soldiers surrendered to the ... | General News | News |
| 4 | 64dbed0d-d91b-4acd-9a9c-0a7aa83115ec | What event led to the creation of Google Images? | Jennifer Lopez's appearance in a green Versace... | Technology | News |
Next, we create a dataset entity in Langfuse to track the runs. Then, we add each item from the dataset to the system.
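A sketch of the upload step, assuming the Langfuse client methods create_dataset() and create_dataset_item(). The description and metadata mirror the Dataset output shown in this section; the helper name upload_dataset and the {"text": ...} shape for expected_output are assumptions (the input shape matches the evaluation log further down).

```python
def upload_dataset(client, dataset_name, rows):
    """Create a Langfuse dataset and add one item per row; returns the item count."""
    client.create_dataset(
        name=dataset_name,
        description="q&a dataset uploaded from Hugging Face",
        metadata={"date": "2025-03-21", "type": "benchmark"},
    )
    for row in rows:
        client.create_dataset_item(
            dataset_name=dataset_name,
            input={"text": row["question"]},
            # Assumed shape for the expected output.
            expected_output={"text": row["expected_answer"]},
        )
    return len(rows)
```

With a pandas DataFrame df, rows could be df.to_dict("records"), and the call would be upload_dataset(langfuse, "qa-dataset_autogen-agent", rows).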
Dataset(id='cmcm7524d00kjad07s2cjwqcf', name='qa-dataset_autogen-agent', description='q&a dataset uploaded from Hugging Face', metadata={'date': '2025-03-21', 'type': 'benchmark'}, project_id='cloramnkj0002jz088vzn1ja4', created_at=datetime.datetime(2025, 7, 2, 16, 54, 7, 357000, tzinfo=datetime.timezone.utc), updated_at=datetime.datetime(2025, 7, 2, 16, 54, 7, 357000, tzinfo=datetime.timezone.utc)) 
Running the Agent on the Dataset
First, we assemble a simple AutoGen agent that answers questions using Azure OpenAI models.
Then, we define a helper function my_agent().
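One possible shape for this helper is sketched below. It assumes the agent is an AutoGen AssistantAgent whose run(task=...) coroutine returns a result with a messages list, and it takes the agent as an explicit argument, which may differ from the signature used in the original notebook.

```python
async def my_agent(agent, question):
    """Run the agent on one question and return the text of its final message."""
    result = await agent.run(task=question)
    # The last message in the task result holds the agent's answer.
    return result.messages[-1].content
```

In a notebook cell you can call it as await my_agent(agent, "What is the capital of France?"); in a plain script, wrap the call in asyncio.run().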
'The capital of France is Paris.'
Finally, we loop over each dataset item, run the agent, and link the trace to the dataset item. We can also attach a quick evaluation score if desired.
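The loop could be sketched as follows. It assumes the Langfuse dataset API in which langfuse.get_dataset(name) returns an object with an items list, and each item offers a run(run_name=...) context manager yielding a span with update_trace() and score_trace(). The names run_experiment, answer_fn, score_fn, and the score name "quick-check" are hypothetical; answer_fn is synchronous here for brevity.

```python
def run_experiment(dataset, run_name, answer_fn, score_fn=None):
    """Run answer_fn on every dataset item and link each trace to the dataset run."""
    results = []
    for item in dataset.items:
        print(f"Running evaluation for item: {item.id} (Input: {item.input})")
        # item.run() opens a span whose trace is linked to this dataset item.
        with item.run(run_name=run_name) as root_span:
            answer = answer_fn(item.input["text"])
            root_span.update_trace(input=item.input, output=answer)
            if score_fn is not None:
                # Optional quick score, for example a simple containment check.
                root_span.score_trace(
                    name="quick-check",
                    value=score_fn(answer, item.expected_output),
                )
        print(f"Generated Answer: {answer}")
        results.append((item.id, answer))
    return results
```

With an async helper such as my_agent, you would either make this loop async and await the call, or wrap it with asyncio.run() when running outside a notebook.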
Running evaluation for item: 09810cc4-9992-4712-a3b2-7224da31776a (Input: {'text': 'In Hindu mythology, which deity is the Ganges river dolphin associated with?'})
Generated Answer: In Hindu mythology, the Ganges river dolphin is associated with the deity Ganga.
Running evaluation for item: bb113f94-7723-47c6-8c34-59d883044514 (Input: {'text': 'What significant discovery did the LHCb collaboration report in 2015?'})
Generated Answer: In 2015, the LHCb collaboration reported the discovery of pentaquark particles.
Running evaluation for item: 4d8ae54e-ceab-46d0-ad2c-6e8e223589a9 (Input: {'text': 'What is the MÄ\x81ori name for the red-crowned parakeet?'})
Generated Answer: The Māori name for the red-crowned parakeet is kākāriki.
Running evaluation for item: 21e5a0d5-f619-4a73-868e-9955053b3e72 (Input: {'text': 'Who starred in the 1978 television film adaptation of Les Misérables?'})
Generated Answer: Richard Jordan starred as Jean Valjean in the 1978 television film adaptation of Les Misérables.
Running evaluation for item: 55d127ba-fae4-43eb-b6a8-42453068414d (Input: {'text': 'michelangelo david height in meteres'})
Generated Answer: Michelangelo’s David is approximately 5.17 meters tall.
Running evaluation for item: bf01eb12-b5d1-423a-be67-2fc8720cf406 (Input: {'text': 'what is the purpose of the air qualty index?'})
Generated Answer: The purpose of the Air Quality Index (AQI) is to provide the public with a simple way to understand how polluted the air currently is or how polluted it is forecast to become, so they can make informed decisions to protect their health.
Running evaluation for item: 090d784f-7fa9-4380-a7a7-b4deb3ca182b (Input: {'text': 'Which village in Donetsk Oblast was captured by Russian forces on 2 August 2024?'})
Generated Answer: The village of Startseve in Donetsk Oblast was captured by Russian forces on 2 August 2024.
Running evaluation for item: 8b04b90c-6dc1-497a-b6fc-89f2b162f11d (Input: {'text': 'When did the last captive quagga die?'})
Generated Answer: The last captive quagga died on 12 August 1883.
Running evaluation for item: 030153ba-bb99-46e8-b788-8d52f26a810d (Input: {'text': 'In what year did London officially adopt its multilingual language policy, recognizing English, Polish, and Bengali as co-official languages for city services?'})
Generated Answer: There is no record of London officially adopting a multilingual language policy recognizing English, Polish, and Bengali as co-official languages for city services. English is the official language used for city services in London.
Running evaluation for item: 68c571da-a7f0-41a3-859b-e22fe2b775f5 (Input: {'text': 'when was higgs boson discoverd at LHC?'})
Generated Answer: The Higgs boson was discovered at the Large Hadron Collider (LHC) in 2012.
Running evaluation for item: 29564ef5-f1dd-4f23-a17a-7f5de3057924 (Input: {'text': 'Which countries are sending police forces to Paris 2024 Olympics?'})
Generated Answer: For the Paris 2024 Olympics, several countries are sending police officers to assist with security and support their own nationals. These countries include the United States, United Kingdom, Germany, Spain, Italy, Belgium, and the Netherlands, among others.
Running evaluation for item: 7dcad15c-5df1-4eaf-b46e-544a47f7ff99 (Input: {'text': 'first formula one race held in turin 1946'})
Generated Answer: The first Formula One race, known as the Turin Grand Prix, was held in Turin, Italy, on 1 September 1946. This race is recognized as the first race run to Formula One regulations.
Running evaluation for item: cd691118-60e8-4fa1-b369-148c4a8a47b7 (Input: {'text': 'what was the name of the computer steve jobs invented at reed college'})
Generated Answer: Steve Jobs did not invent a computer while at Reed College. However, he later co-founded Apple and was involved in the creation of the Apple I and Apple II computers after leaving Reed.
Running evaluation for item: 8055aba2-d95d-4385-850e-4fad26b4335e (Input: {'text': 'where was copa america centenario trophy presented?'})
Generated Answer: The Copa América Centenario trophy was presented at MetLife Stadium in East Rutherford, New Jersey, after the final match on June 26, 2016.
Running evaluation for item: 3520c34b-eb19-4d49-9605-93512c72d753 (Input: {'text': 'Who founded Ford Motor Company?'})
Generated Answer: Ford Motor Company was founded by Henry Ford.
Running evaluation for item: 43434c85-424a-49db-ba33-67c1fb74c78e (Input: {'text': 'COVID-19 impact on African economies'})
Generated Answer: COVID-19 had a significant negative impact on African economies. It disrupted trade, reduced tourism, lowered commodity prices, and caused job losses. Many countries faced reduced economic growth, increased debt, and challenges in healthcare funding, pushing millions into poverty and affecting overall development.
Running evaluation for item: d927ddf3-f9ff-4d17-9489-04d6b4edeeca (Input: {'text': 'What types of incidents are commonly associated with the Bermuda Triangle?'})
Generated Answer: The Bermuda Triangle is commonly associated with incidents such as the mysterious disappearance of ships, aircraft, and their crews. These incidents often involve unexplained loss of contact, navigation failures, and vessels or planes vanishing without a trace.
Running evaluation for item: 6eb98558-f444-440c-9732-6556ce6e7efe (Input: {'text': 'herbert donovan jr mclean v arkansas case'})
Generated Answer: Herbert Donovan Jr. was the federal judge who presided over the case *McLean v. Arkansas Board of Education* in 1981-1982. In this case, the court ruled that a state law requiring the teaching of "creation science" alongside evolution in public schools was unconstitutional, as it violated the Establishment Clause of the First Amendment.
Running evaluation for item: 5709263e-90f0-405e-90eb-ac5857a6b50f (Input: {'text': "What genres are blended in Chappell Roan's 'Pink Pony Club'?"})
Generated Answer: Chappell Roan's "Pink Pony Club" blends pop, synth-pop, and dance-pop genres.
Running evaluation for item: 3060e786-a258-43f3-8f37-604314abafa0 (Input: {'text': 'When was Yellowstone National Park established as the first national park in the United States?'})
Generated Answer: Yellowstone National Park was established as the first national park in the United States on March 1, 1872.
Running evaluation for item: e7281986-414f-41f0-82a9-e81a1e9ff0f0 (Input: {'text': 'oracle database 19c automatic ai-driven schema design feature'})
Generated Answer: Oracle Database 19c does not include an automatic AI-driven schema design feature. While Oracle 19c offers many advanced capabilities such as automatic indexing, performance tuning, and automation features, automated AI-driven schema design is not one of its standard offerings as of this version.
Running evaluation for item: ab466436-5d9b-4469-9a58-695b83e0c064 (Input: {'text': 'How many people were injured in the Dollywood park flooding?'})
Generated Answer: There have been no widely reported injuries caused by flooding at Dollywood park.
Running evaluation for item: f9dc148c-ecf5-420a-9cec-06c95c54d5cd (Input: {'text': 'Who are the current co-leaders of Te Pāti Māori?'})
Generated Answer: The current co-leaders of Te Pāti Māori are Debbie Ngarewa-Packer and Rawiri Waititi.
Running evaluation for item: 0f960ef9-ae61-4dc5-9aa6-c247813c25cf (Input: {'text': 'How many World Championship medals has Simone Biles won?'})
Generated Answer: As of 2023, Simone Biles has won 30 World Championship medals.
Running evaluation for item: bc5759ff-de6c-4aa6-8e26-9644875652f7 (Input: {'text': 'How is Tetris used as a relaxation tool?'})
Generated Answer: Tetris is used as a relaxation tool because its repetitive, simple gameplay can help distract from stress and anxiety, promoting a state of flow. Playing Tetris can also induce a calming effect by providing a sense of control and accomplishment through arranging the pieces and clearing lines.
Finished processing dataset 'qa-dataset_autogen-agent' for run 'dev_tasks_run-autogen_gpt-4.1'.
You can repeat this process with different agent configurations such as:
- Models (gpt-4o-mini, gpt-4.1, etc.)
- Prompts
- Tools (search vs. no search)
- Agent complexity (multi-agent vs. single-agent)
Then compare them side-by-side in Langfuse. In this example, I ran the agent three times on the 25 dataset questions, using a different Azure OpenAI model for each run. The number of correctly answered questions increases with a larger model, as expected. The correct_answer score is produced by an LLM-as-a-Judge evaluator that is set up to judge the correctness of each generated answer against the sample answer given in the dataset.
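If you also want a quick numeric summary outside the Langfuse UI, the comparison can be sketched in plain Python. This is a minimal sketch under stated assumptions, not part of the Langfuse SDK: the helpers `run_name_for` and `accuracy_by_run` are hypothetical, the run-name pattern is inferred from the log line above, and the score values are made-up placeholders rather than results from this tutorial. It assumes you have already collected each item's correct_answer score (1 for correct, 0 for incorrect) together with its run name.

```python
from collections import defaultdict


def run_name_for(model: str) -> str:
    """Hypothetical helper: build a run name in the pattern seen in the log output."""
    return f"dev_tasks_run-autogen_{model}"


def accuracy_by_run(scores):
    """Return {run_name: fraction of items whose correct_answer score is 1}.

    `scores` is an iterable of (run_name, value) pairs, where value is
    assumed to be 1 for a correct answer and 0 otherwise.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for run_name, value in scores:
        totals[run_name] += 1
        if value == 1:
            correct[run_name] += 1
    return {run: correct[run] / totals[run] for run in totals}


# Placeholder scores for illustration only; NOT the tutorial's actual results.
example_scores = [
    (run_name_for("gpt-4o-mini"), 1),
    (run_name_for("gpt-4o-mini"), 0),
    (run_name_for("gpt-4.1"), 1),
    (run_name_for("gpt-4.1"), 1),
]

summary = accuracy_by_run(example_scores)
for run, accuracy in sorted(summary.items()):
    print(f"{run}: {accuracy:.0%} correct")
```

Keeping the aggregation separate from score collection means the same function works whether the scores come from an export, an API call, or a manual review.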
