# Evaluation With UpTrain

description: This notebook demonstrates how to run UpTrain's evaluation metrics on the traces generated by Langfuse.
category: Evaluation

## Evaluate Langfuse LLM Traces with UpTrain
UpTrain's open-source library offers a series of evaluation metrics to assess LLM applications.
This notebook demonstrates how to run UpTrain's evaluation metrics on the traces generated by Langfuse. In Langfuse you can then monitor these scores over time or use them to compare different experiments.
## Sample Dataset
We use this dataset to represent traces that you have logged to Langfuse. In a production environment, you would use your own data.
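A minimal dataset of this shape might look as follows. The rows are illustrative; the field names `question`, `context`, and `response` match the schema UpTrain's evaluators expect:

```python
# Illustrative sample data: each row stands in for one logged trace.
# Field names follow UpTrain's expected schema.
data = [
    {
        "question": "What is Langfuse?",
        "context": "Langfuse is an open-source LLM engineering platform for tracing, evaluation, and prompt management.",
        "response": "Langfuse is an open-source platform for tracing and evaluating LLM applications.",
    },
    {
        "question": "What does UpTrain provide?",
        "context": "UpTrain is an open-source library offering evaluation metrics for LLM applications.",
        "response": "UpTrain provides open-source evaluation metrics for LLM applications.",
    },
]
```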
## Run Evaluations using UpTrain
We use the following three metrics from UpTrain's open-source library:

- Context Relevance: Evaluates how relevant the retrieved context is to the specified question.
- Factual Accuracy: Evaluates whether the generated response is factually correct and grounded in the provided context.
- Response Completeness: Evaluates whether the response answers all aspects of the specified question.
You can find the complete list of UpTrain's supported metrics here.
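Running these three checks with UpTrain's `EvalLLM` client can be sketched as follows. This assumes the `uptrain` package is installed and an OpenAI API key is available; the import is deferred so the sketch loads even without the package:

```python
def run_uptrain_evals(data, openai_api_key):
    """Run the three checks on a list of {question, context, response} rows."""
    # Deferred import so this sketch can be loaded without uptrain installed.
    from uptrain import EvalLLM, Evals

    eval_llm = EvalLLM(openai_api_key=openai_api_key)
    # evaluate() returns one result dict per input row, with keys such as
    # score_context_relevance, score_factual_accuracy, score_response_completeness.
    return eval_llm.evaluate(
        data=data,
        checks=[
            Evals.CONTEXT_RELEVANCE,
            Evals.FACTUAL_ACCURACY,
            Evals.RESPONSE_COMPLETENESS,
        ],
    )
```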
```
2025-06-17 10:43:14.568 | WARNING  | uptrain.operators.language.llm:fetch_responses:268 - Detected a running event loop, scheduling requests in a separate thread.
100%|██████████| 4/4 [00:01<00:00,  2.85it/s]
/Users/jannik/Documents/GitHub/langfuse-docs/.venv/lib/python3.13/site-packages/uptrain/operators/language/llm.py:271: RuntimeWarning: coroutine 'LLMMulticlient.async_fetch_responses' was never awaited
  with ThreadPoolExecutor(max_workers=1) as executor:
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
2025-06-17 10:43:15.996 | WARNING  | uptrain.operators.language.llm:fetch_responses:268 - Detected a running event loop, scheduling requests in a separate thread.
100%|██████████| 4/4 [00:01<00:00,  2.74it/s]
2025-06-17 10:43:17.464 | WARNING  | uptrain.operators.language.llm:fetch_responses:268 - Detected a running event loop, scheduling requests in a separate thread.
100%|██████████| 4/4 [00:03<00:00,  1.19it/s]
2025-06-17 10:43:20.860 | WARNING  | uptrain.operators.language.llm:fetch_responses:268 - Detected a running event loop, scheduling requests in a separate thread.
100%|██████████| 4/4 [00:01<00:00,  3.13it/s]
2025-06-17 10:43:22.148 | INFO     | uptrain.framework.evalllm:evaluate:376 - Local server not running, start the server to log data and visualize in the dashboard!
```
## Using Langfuse
There are two main ways to run evaluations:

- Score each trace (in development): Run the UpTrain evaluations for each trace item as it is created.
- Score in batches (in production): Periodically fetch production traces and score them using the UpTrain evaluators. Often, you'll want to sample the traces instead of scoring all of them to control evaluation costs.
### Development: Score each trace while it's created
Langfuse client is authenticated and ready!
We mock the instrumentation of your application by using the sample dataset. See the quickstart to integrate Langfuse with your application.
We reuse the scores previously calculated for the traces in the sample dataset. In development, you would run the UpTrain evaluations on each trace as it is created.
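Attaching scores to a freshly created trace can be sketched like this. The trace name is hypothetical, the call style follows the v2 Python SDK (method names may differ in other SDK versions), and the import is deferred so the sketch loads without the package:

```python
def score_single_trace(question, context, response, eval_result):
    """Log one trace and attach its UpTrain scores (v2-style SDK sketch)."""
    # Deferred import so this sketch can be loaded without langfuse installed.
    from langfuse import Langfuse

    langfuse = Langfuse()  # reads LANGFUSE_* credentials from the environment

    # Log the trace for this single request; the trace name is hypothetical.
    trace = langfuse.trace(
        name="uptrain-single-trace",
        input=question,
        output=response,
        metadata={"context": context},
    )

    # UpTrain prefixes numeric results with "score_"; attach each as a score.
    for key, value in eval_result.items():
        if key.startswith("score_") and value is not None:
            trace.score(name=key.removeprefix("score_"), value=value)
    return trace
```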

### Production: Add scores to traces in batches
To simulate a production environment, we will log our sample dataset to Langfuse.
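Logging the sample rows as traces might look as follows (v2-style SDK sketch; the trace name is hypothetical):

```python
def log_dataset_to_langfuse(data):
    """Create one Langfuse trace per sample row and return the trace ids."""
    # Deferred import so this sketch can be loaded without langfuse installed.
    from langfuse import Langfuse

    langfuse = Langfuse()
    trace_ids = []
    for row in data:
        trace = langfuse.trace(
            name="uptrain-batch-demo",  # hypothetical name, used to fetch later
            input=row["question"],
            output=row["response"],
            metadata={"context": row["context"]},
        )
        trace_ids.append(trace.id)
    langfuse.flush()  # ensure events are sent before fetching them back
    return trace_ids
```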
We can now retrieve the traces like regular production data and evaluate them using UpTrain.
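Fetching the traces back can be sketched with the v2-style read API (`fetch_traces` returns a paginated response whose `data` attribute holds the trace objects; the trace name matches the one used when logging):

```python
def fetch_batch(name="uptrain-batch-demo"):
    """Fetch the previously logged traces by name (v2-style SDK sketch)."""
    # Deferred import so this sketch can be loaded without langfuse installed.
    from langfuse import Langfuse

    langfuse = Langfuse()
    return langfuse.fetch_traces(name=name).data
```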
Optional: create a random sample to reduce evaluation costs.
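A simple random sample over the fetched traces, with a seed for reproducibility:

```python
import random


def sample_traces(traces, fraction=0.5, seed=42):
    """Randomly sample a fraction of traces to keep evaluation costs down."""
    k = max(1, int(len(traces) * fraction))
    return random.Random(seed).sample(traces, k)
```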
Convert the data into a dataset to be used for evaluation with UpTrain.
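Mapping the traces to UpTrain's expected rows can be sketched as follows; this assumes `input`, `output`, and `metadata` were logged as in the earlier step, and the `trace_id` is dropped because UpTrain does not know about it:

```python
def to_uptrain_rows(traces):
    """Map Langfuse traces to the {question, context, response} rows UpTrain expects."""
    return [
        {
            "question": t.input,
            "context": (t.metadata or {}).get("context", ""),
            "response": t.output,
        }
        for t in traces
    ]
```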
Evaluate the batch using UpTrain.
```
2025-06-17 10:46:35.647 | WARNING  | uptrain.operators.language.llm:fetch_responses:268 - Detected a running event loop, scheduling requests in a separate thread.
100%|██████████| 4/4 [00:01<00:00,  3.06it/s]
/Users/jannik/Documents/GitHub/langfuse-docs/.venv/lib/python3.13/site-packages/uptrain/operators/language/llm.py:271: RuntimeWarning: coroutine 'LLMMulticlient.async_fetch_responses' was never awaited
  with ThreadPoolExecutor(max_workers=1) as executor:
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
2025-06-17 10:46:36.963 | WARNING  | uptrain.operators.language.llm:fetch_responses:268 - Detected a running event loop, scheduling requests in a separate thread.
100%|██████████| 4/4 [00:02<00:00,  1.88it/s]
2025-06-17 10:46:39.097 | WARNING  | uptrain.operators.language.llm:fetch_responses:268 - Detected a running event loop, scheduling requests in a separate thread.
100%|██████████| 4/4 [00:03<00:00,  1.12it/s]
2025-06-17 10:46:42.703 | WARNING  | uptrain.operators.language.llm:fetch_responses:268 - Detected a running event loop, scheduling requests in a separate thread.
100%|██████████| 4/4 [00:01<00:00,  3.87it/s]
2025-06-17 10:46:43.749 | INFO     | uptrain.framework.evalllm:evaluate:376 - Local server not running, start the server to log data and visualize in the dashboard!
```
Add the trace_id back to the dataset as it was omitted in the previous step to be compatible with UpTrain.
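Re-attaching the ids relies on UpTrain returning results in the same order as the input rows; a minimal sketch:

```python
def attach_trace_ids(results, traces):
    """Re-attach each trace's id to its evaluation result, matching by position."""
    for result, trace in zip(results, traces):
        result["trace_id"] = trace.id
    return results
```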
Now that we have the evaluations, we can add them back to the traces in Langfuse as scores.
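Writing the scores back can be sketched with the v2-style `langfuse.score` call (method names may differ in other SDK versions; the import is deferred so the sketch loads without the package):

```python
def push_scores_to_langfuse(results):
    """Attach each UpTrain score to its trace in Langfuse (v2-style SDK sketch)."""
    # Deferred import so this sketch can be loaded without langfuse installed.
    from langfuse import Langfuse

    langfuse = Langfuse()
    for row in results:
        for key, value in row.items():
            # UpTrain prefixes numeric results with "score_".
            if key.startswith("score_") and value is not None:
                langfuse.score(
                    trace_id=row["trace_id"],
                    name=key.removeprefix("score_"),
                    value=value,
                )
    langfuse.flush()
```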
In Langfuse, you can now see the scores for each trace and monitor them over time.
