Evaluation With Langchain
description: Cookbook that demonstrates how to run Langchain evaluations on data in Langfuse.
category: Evaluation
Run Langchain Evaluations on data in Langfuse
This cookbook shows how model-based evaluations can be used to automate the evaluation of production completions in Langfuse. The example uses Langchain and is adaptable to other evaluation libraries; which library is best depends heavily on the use case.
This cookbook follows three steps:
- Fetch production `generations` stored in Langfuse
- Evaluate these `generations` using Langchain
- Ingest results back into Langfuse as `scores`
Not using Langfuse yet? Get started by capturing LLM events.
Setup
First, you need to install Langfuse and Langchain via pip and then set the required environment variables.
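A minimal setup sketch; the exact package list and the region host are assumptions, adjust them for your project:

```python
%pip install langfuse langchain langchain-openai --upgrade
```

```python
import os

# Get keys for your project from the Langfuse project settings
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."
# EU region; use "https://us.cloud.langfuse.com" for the US region
os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com"

# The Langchain evaluators below use an OpenAI model as the judge
os.environ["OPENAI_API_KEY"] = "sk-..."
```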
Initialize the Langfuse Python SDK; more information here.
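A sketch assuming the client reads the credentials set above from the environment:

```python
from langfuse import Langfuse

langfuse = Langfuse()

# Verify that the credentials and host are valid
if langfuse.auth_check():
    print("Langfuse client is authenticated and ready!")
```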
Langfuse client is authenticated and ready!
Fetching data
Load all generations from Langfuse, filtered by name, in this case `OpenAI`. Names are used in Langfuse to identify different types of generations within an application. Change it to the name you want to evaluate.
Check out the docs on how to set the name when ingesting an LLM Generation.
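A paginated fetch sketch, assuming the v2 Python SDK's `langfuse.get_generations()`; the helper name and the trailing id lookup are illustrative:

```python
def fetch_all_pages(name=None, user_id=None, limit=50):
    # Page through all matching generations; the API returns at most `limit` per page
    page = 1
    all_data = []

    while True:
        response = langfuse.get_generations(name=name, user_id=user_id, limit=limit, page=page)
        if not response.data:
            break

        all_data.extend(response.data)
        page += 1

    return all_data

generations = fetch_all_pages(name="OpenAI")

# Inspect the id of the first fetched generation
generations[0].id
```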
```
'adb5ba6beab14984ab89006ee09e9cd6'
```
Set up evaluation functions
In this section, we define functions to set up the Langchain eval based on the entries in EVAL_TYPES. Hallucination requires its own function because it is graded against a reference. More on the Langchain evals can be found here.
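A sketch of these helpers, assuming an OpenAI chat model as the judge; `EVAL_TYPES` and the function names are illustrative:

```python
from langchain.evaluation import load_evaluator
from langchain_openai import ChatOpenAI

# Toggle which evaluations to run
EVAL_TYPES = {
    "hallucination": True,
    "conciseness": True,
    "relevance": True,
    "coherence": True,
    "harmfulness": True,
}

def get_evaluator_for_key(key: str):
    # Criteria evaluators grade a single completion against one criterion
    llm = ChatOpenAI(temperature=0)
    return load_evaluator("criteria", criteria=key, llm=llm)

def get_hallucination_eval():
    # Hallucination is graded against a reference, hence the labeled_criteria evaluator
    criteria = {
        "hallucination": (
            "Does this submission contain information"
            " not present in the input or reference?"
        ),
    }
    llm = ChatOpenAI(temperature=0)
    return load_evaluator("labeled_criteria", criteria=criteria, llm=llm)
```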
Execute evaluation
Below, we execute the evaluation for each Generation loaded above. Each score is ingested into Langfuse via `langfuse.score()`.
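A sketch of the evaluation loop; mapping `generation.input` and `generation.output` to the evaluator inputs is an assumption about the shape of your data:

```python
def execute_eval_and_score():
    # All enabled criteria except hallucination, which needs a reference
    criteria = [
        key for key, value in EVAL_TYPES.items()
        if value and key != "hallucination"
    ]

    for generation in generations:
        for criterion in criteria:
            eval_result = get_evaluator_for_key(criterion).evaluate_strings(
                prediction=str(generation.output),
                input=str(generation.input),
            )
            # Attach the score to both the trace and the specific generation
            langfuse.score(
                name=criterion,
                trace_id=generation.trace_id,
                observation_id=generation.id,
                value=eval_result["score"],
                comment=eval_result["reasoning"],
            )

execute_eval_and_score()
```

The hallucination eval follows the same pattern but additionally passes a reference (here the generation's input, an assumption that fits simple completion use cases) for the evaluator to compare against:

```python
if EVAL_TYPES.get("hallucination"):
    chain = get_hallucination_eval()

    for generation in generations:
        eval_result = chain.evaluate_strings(
            prediction=str(generation.output),
            input=str(generation.input),
            reference=str(generation.input),
        )
        langfuse.score(
            name="hallucination",
            trace_id=generation.trace_id,
            observation_id=generation.id,
            value=eval_result["score"],
            comment=eval_result["reasoning"],
        )

# The SDK sends events asynchronously; flush before the script exits
langfuse.flush()
```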
See Scores in Langfuse
In the Langfuse UI, you can filter Traces by Scores and look into the details for each. Check out Langfuse Analytics to understand the impact of new prompt versions or application releases on these scores.
Example trace with conciseness score
Get in touch
Looking for a specific way to score your production data in Langfuse? Join the Discord and discuss your use case!