Evaluation with UpTrain

Evaluate Langfuse LLM Traces with UpTrain

UpTrain's open-source library provides a set of evaluation metrics for assessing LLM applications.

This notebook demonstrates how to run UpTrain's evaluation metrics on the traces generated by Langfuse. In Langfuse, you can then monitor these scores over time or use them to compare experiments.

Setup

You need Langfuse API keys, which you can create in your Langfuse project settings, and an OpenAI API key.
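A minimal sketch of the credential setup, assuming the `langfuse`, `uptrain`, and `openai` packages are already installed. The key values are placeholders, and the host shown is an assumption (Langfuse Cloud, EU region) that you should adjust to your own deployment.

```python
import os

# Placeholder values: replace them with your own keys.
# setdefault keeps any value that is already set in the environment.
os.environ.setdefault("LANGFUSE_PUBLIC_KEY", "pk-lf-...")
os.environ.setdefault("LANGFUSE_SECRET_KEY", "sk-lf-...")
# Assumed host (Langfuse Cloud, EU region); change it for other regions or self-hosting.
os.environ.setdefault("LANGFUSE_HOST", "https://cloud.langfuse.com")
os.environ.setdefault("OPENAI_API_KEY", "sk-...")

REQUIRED_VARS = (
    "LANGFUSE_PUBLIC_KEY",
    "LANGFUSE_SECRET_KEY",
    "LANGFUSE_HOST",
    "OPENAI_API_KEY",
)

# Report any variable that is still unset or empty before running the notebook.
missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
print("Missing variables:", missing or "none")
```

Reading the keys from the environment keeps them out of the notebook source, which matters if the notebook is committed to version control.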

Sample Dataset

We use this dataset to represent traces that you have logged to Langfuse. In a production environment, you would use your own data.
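As a rough illustration of the shape such a dataset can take, the sketch below builds a small list of records with `question`, `context`, and `response` keys, which is the layout UpTrain's evaluators expect. The records themselves and the `validate_records` helper are hypothetical and are not the data used in the original notebook.

```python
# Illustrative records only; they are not the notebook's original dataset.
data = [
    {
        "question": "Which planet is known as the Red Planet?",
        "context": "Mars is often called the Red Planet because iron oxide "
                   "on its surface gives it a reddish appearance.",
        "response": "Mars is known as the Red Planet.",
    },
    {
        "question": "What is the chemical formula of water?",
        "context": "A water molecule consists of two hydrogen atoms bonded "
                   "to one oxygen atom, written as H2O.",
        "response": "The chemical formula of water is H2O.",
    },
]

REQUIRED_KEYS = {"question", "context", "response"}


def validate_records(records):
    """Return the number of records, raising if any record lacks a required key."""
    for index, record in enumerate(records):
        absent = REQUIRED_KEYS - record.keys()
        if absent:
            raise ValueError(f"record {index} is missing keys: {sorted(absent)}")
    return len(records)


print(validate_records(data), "records ready for evaluation")
```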

Run Evaluations using UpTrain

This notebook uses the following three metrics from UpTrain's open-source library:

  1. Context Relevance: Evaluates how relevant the retrieved context is to the question.

  2. Factual Accuracy: Evaluates whether the generated response is factually correct and grounded in the provided context.

  3. Response Completeness: Evaluates whether the response addresses all aspects of the question.

See the UpTrain documentation for the complete list of supported metrics.
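The three metrics above can be requested in a single call. The following is a sketch, assuming UpTrain's `EvalLLM` class and `Evals` enum and a list of records with `question`, `context`, and `response` keys. The helper names `run_uptrain_checks` and `extract_scores` are hypothetical, and the `score_*` result keys are an assumption about UpTrain's output format that you should verify against your installed version.

```python
import os


def run_uptrain_checks(data, openai_api_key=None):
    """Run the three checks on a list of question/context/response records."""
    # Imported lazily so this module still loads where uptrain is not installed.
    from uptrain import EvalLLM, Evals

    eval_llm = EvalLLM(openai_api_key=openai_api_key or os.environ["OPENAI_API_KEY"])
    return eval_llm.evaluate(
        data=data,
        checks=[
            Evals.CONTEXT_RELEVANCE,
            Evals.FACTUAL_ACCURACY,
            Evals.RESPONSE_COMPLETENESS,
        ],
    )


# Assumed names of the score fields in each result row.
SCORE_KEYS = (
    "score_context_relevance",
    "score_factual_accuracy",
    "score_response_completeness",
)


def extract_scores(row):
    """Pick only the score fields from one result row (None if a field is absent)."""
    return {key: row.get(key) for key in SCORE_KEYS}
```

Calling `run_uptrain_checks(data)` sends requests to OpenAI, so it incurs API costs. `extract_scores` is a pure helper and can be applied to each returned row.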

2025-06-17 10:43:14.568 | WARNING  | uptrain.operators.language.llm:fetch_responses:268 - Detected a running event loop, scheduling requests in a separate thread.
100%|██████████| 4/4 [00:01<00:00,  2.85it/s]
/Users/jannik/Documents/GitHub/langfuse-docs/.venv/lib/python3.13/site-packages/uptrain/operators/language/llm.py:271: RuntimeWarning: coroutine 'LLMMulticlient.async_fetch_responses' was never awaited
  with ThreadPoolExecutor(max_workers=1) as executor:
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
2025-06-17 10:43:15.996 | WARNING  | uptrain.operators.language.llm:fetch_responses:268 - Detected a running event loop, scheduling requests in a separate thread.
100%|██████████| 4/4 [00:01<00:00,  2.74it/s]
2025-06-17 10:43:17.464 | WARNING  | uptrain.operators.language.llm:fetch_responses:268 - Detected a running event loop, scheduling requests in a separate thread.
100%|██████████| 4/4 [00:03<00:00,  1.19it/s]
2025-06-17 10:43:20.860 | WARNING  | uptrain.operators.language.llm:fetch_responses:268 - Detected a running event loop, scheduling requests in a separate thread.
100%|██████████| 4/4 [00:01<00:00,  3.13it/s]
2025-06-17 10:43:22.148 | INFO     | uptrain.framework.evalllm:evaluate:376 - Local server not running, start the server to log data and visualize in the dashboard!

Using Langfuse

There are two main ways to run evaluations:

  1. Score each trace (in development): Run the UpTrain evaluations on each trace as it is created.

  2. Score in batches (in production): Fetch production traces periodically and score them with the UpTrain evaluators. This notebook simulates that workflow. You will often want to sample the traces rather than score all of them to control evaluation costs.

Development: Score each trace as it's created

Langfuse client is authenticated and ready!

We mock the instrumentation of your application by using the sample dataset. See the quickstart to integrate Langfuse with your application.

We reuse the scores previously calculated for the traces in the sample dataset. In development, you would instead run the UpTrain evaluations on each trace as it is created.

Figure: UpTrain evals on a single trace in Langfuse

Production: Add scores to traces in batches

To simulate a production environment, we will log our sample dataset to Langfuse.

We can now retrieve the traces like regular production data and evaluate them using UpTrain.
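One way to retrieve the traces is to page through the Langfuse trace listing. The sketch below assumes a listing callable that accepts `name`, `limit`, and a 1-indexed `page` and returns an object with a `data` list, which is how `langfuse.api.trace.list` behaves in the Langfuse Python SDK v3 as far as I know. The `fetch_traces` helper and the trace name in the usage note are hypothetical.

```python
def fetch_traces(list_page, name, page_size=50, max_pages=10):
    """Collect traces page by page until a short page or max_pages is reached.

    list_page is any callable with the signature (name, limit, page) that
    returns an object exposing a `data` list, for example the SDK's trace
    listing method (an assumption, see the note above).
    """
    traces = []
    for page in range(1, max_pages + 1):
        response = list_page(name=name, limit=page_size, page=page)
        batch = list(response.data)
        traces.extend(batch)
        # A page shorter than page_size means there is nothing left to fetch.
        if len(batch) < page_size:
            break
    return traces
```

With an authenticated client, a call could look like `fetch_traces(langfuse.api.trace.list, name="uptrain-batch")`. The `max_pages` cap keeps an unexpectedly large project from triggering an unbounded number of API calls.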

Optional: create a random sample to reduce evaluation costs.
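A minimal sketch of such sampling using only the Python standard library. The `sample_traces` helper is hypothetical, and the optional `seed` is there only to make a sample reproducible across runs.

```python
import random


def sample_traces(traces, k, seed=None):
    """Return up to k traces chosen uniformly at random without replacement."""
    traces = list(traces)
    # If fewer than k traces exist, evaluate all of them.
    if k >= len(traces):
        return traces
    rng = random.Random(seed)
    return rng.sample(traces, k)
```

Since every evaluated trace triggers several LLM calls, the sample size `k` scales evaluation cost roughly linearly.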

Convert the data into a dataset to be used for evaluation with UpTrain.
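The conversion depends on how your application logs inputs and outputs. The sketch below assumes a hypothetical trace shape in which `trace.input` is a dict with `question` and `context` keys and `trace.output` is a dict with a `response` key; adapt the field access to your own traces. The trace IDs are returned separately so the evaluation rows contain only the keys UpTrain expects.

```python
def traces_to_dataset(traces):
    """Split traces into UpTrain rows and a parallel list of trace IDs.

    Assumes each trace has `id`, `input` (dict with question/context) and
    `output` (dict with response). This shape is an illustration, not a
    Langfuse requirement.
    """
    rows, trace_ids = [], []
    for trace in traces:
        rows.append(
            {
                "question": trace.input["question"],
                "context": trace.input["context"],
                "response": trace.output["response"],
            }
        )
        trace_ids.append(trace.id)
    return rows, trace_ids
```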

Evaluate the batch using UpTrain.

2025-06-17 10:46:35.647 | WARNING  | uptrain.operators.language.llm:fetch_responses:268 - Detected a running event loop, scheduling requests in a separate thread.
100%|██████████| 4/4 [00:01<00:00,  3.06it/s]
/Users/jannik/Documents/GitHub/langfuse-docs/.venv/lib/python3.13/site-packages/uptrain/operators/language/llm.py:271: RuntimeWarning: coroutine 'LLMMulticlient.async_fetch_responses' was never awaited
  with ThreadPoolExecutor(max_workers=1) as executor:
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
2025-06-17 10:46:36.963 | WARNING  | uptrain.operators.language.llm:fetch_responses:268 - Detected a running event loop, scheduling requests in a separate thread.
100%|██████████| 4/4 [00:02<00:00,  1.88it/s]
2025-06-17 10:46:39.097 | WARNING  | uptrain.operators.language.llm:fetch_responses:268 - Detected a running event loop, scheduling requests in a separate thread.
100%|██████████| 4/4 [00:03<00:00,  1.12it/s]
2025-06-17 10:46:42.703 | WARNING  | uptrain.operators.language.llm:fetch_responses:268 - Detected a running event loop, scheduling requests in a separate thread.
100%|██████████| 4/4 [00:01<00:00,  3.87it/s]
2025-06-17 10:46:43.749 | INFO     | uptrain.framework.evalllm:evaluate:376 - Local server not running, start the server to log data and visualize in the dashboard!

Add the trace_id back to the dataset. It was omitted in the previous step to keep the data compatible with UpTrain.
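A sketch of this step, assuming the evaluation results come back as a list of dicts in the same order as the input rows. That ordering is an assumption worth verifying for your UpTrain version, and `attach_trace_ids` is a hypothetical helper.

```python
def attach_trace_ids(results, trace_ids):
    """Return copies of the result rows with a trace_id field added by position."""
    if len(results) != len(trace_ids):
        raise ValueError(
            f"got {len(results)} results but {len(trace_ids)} trace IDs"
        )
    # Build new dicts so the original result rows are left unmodified.
    return [{**row, "trace_id": trace_id} for row, trace_id in zip(results, trace_ids)]
```

The length check guards against silently pairing scores with the wrong traces if a row was dropped during evaluation.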

Now that we have the evaluations, we can add them back to the traces in Langfuse as scores.
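The write-back can be sketched as two small steps: turn each `score_*` field into a score payload, then send the payloads. The code assumes a client exposing `create_score(trace_id=..., name=..., value=...)` and `flush()`, as the Langfuse Python SDK v3 does to my knowledge (older SDK versions named the method `score`). The helpers `build_score_payloads` and `push_scores` are hypothetical.

```python
def build_score_payloads(rows, prefix="score_"):
    """Create one payload per non-empty score field in each evaluated row."""
    payloads = []
    for row in rows:
        for key, value in row.items():
            # Skip non-score fields and metrics that failed to produce a value.
            if not key.startswith(prefix) or value is None:
                continue
            payloads.append(
                {
                    "trace_id": row["trace_id"],
                    "name": key[len(prefix):],
                    "value": float(value),
                }
            )
    return payloads


def push_scores(client, payloads):
    """Send every payload through the client and flush pending events."""
    for payload in payloads:
        client.create_score(**payload)
    # Scores are sent in the background, so flush before the script exits.
    client.flush()
    return len(payloads)
```

Keeping payload construction separate from sending makes it possible to inspect or log the scores before anything is written to Langfuse.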

In Langfuse, you can now see the scores for each trace and monitor them over time.

Figure: UpTrain evals on a list of traces in Langfuse