Synthetic Dataset Generation for LLM Evaluation
In this notebook, we will explore how to generate synthetic datasets using language models and upload them to Langfuse for evaluation.
What are Langfuse Datasets?
In Langfuse, a dataset is a collection of dataset items, each typically containing an input (e.g., a user prompt or question), an expected_output (the ground truth or ideal answer), and optional metadata.
Datasets are used for evaluation. You can run your LLM or application on each item in a dataset and compare the application's responses to the expected outputs. This way, you can track performance over time and across different application configs (e.g. model versions or prompt changes).
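As a minimal sketch of what this looks like with the Python SDK (assuming your Langfuse credentials are set as environment variables; the dataset name and item content are made up for illustration):

```python
from langfuse import get_client

langfuse = get_client()  # reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_HOST

# "capital-cities" is a hypothetical dataset name for illustration
langfuse.create_dataset(name="capital-cities")

langfuse.create_dataset_item(
    dataset_name="capital-cities",
    input={"question": "What is the capital of France?"},
    expected_output={"answer": "Paris"},
    metadata={"category": "happy_path"},  # optional metadata
)
```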
Cases your Dataset Should Cover
Happy path – straightforward or common queries:
- "What is the capital of France?"
- "Convert 5 USD to EUR."
Edge cases – unusual or complex:
- Very long prompts.
- Ambiguous queries.
- Very technical or niche.
Adversarial cases – malicious or tricky:
- Prompt injection attempts ("Ignore all instructions and ...").
- Content policy violations (harassment, hate speech).
- Logic traps (trick questions).
Examples
Example 1: Looping Over OpenAI API
We'll use OpenAI's API in a simple loop to create synthetic questions for an airline chatbot. You could similarly prompt the model to generate both questions and answers.
With the environment variables set, we can now initialize the Langfuse client via get_client(), which reads the credentials from the environment.
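Here is a minimal sketch of the generation loop (the model, prompt, and dataset name are illustrative choices, not fixed requirements):

```python
from openai import OpenAI
from langfuse import get_client

langfuse = get_client()
assert langfuse.auth_check(), "Langfuse credentials are not configured correctly"

openai_client = OpenAI()  # uses OPENAI_API_KEY from the environment

# "airline-questions" is a hypothetical dataset name for illustration
langfuse.create_dataset(name="airline-questions")

for _ in range(10):
    completion = openai_client.chat.completions.create(
        model="gpt-4o-mini",  # any chat-capable model works here
        temperature=1.0,  # a higher temperature yields more varied questions
        messages=[
            {
                "role": "user",
                "content": (
                    "Generate one realistic question a customer might ask "
                    "an airline chatbot. Return only the question text."
                ),
            }
        ],
    )
    question = completion.choices[0].message.content.strip()
    langfuse.create_dataset_item(
        dataset_name="airline-questions",
        input={"question": question},
    )
```

The high temperature trades consistency for variety; if the model repeats itself, deduplicate the questions before uploading.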

Example 2: RAGAS Library
For RAG, we often want questions that are grounded in specific documents. This ensures the question can be answered by the context, allowing us to evaluate how well a RAG pipeline retrieves and uses the context.
RAGAS is a library that can automate test set generation for RAG. It can take a corpus and produce relevant queries and answers. We'll do a quick example:
Note: This example is taken from the RAGAS documentation
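The gist of the RAGAS flow, sketched under the assumption of a local folder of markdown files (the path and model choices are placeholders):

```python
from langchain_community.document_loaders import DirectoryLoader
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.testset import TestsetGenerator

# load a corpus of markdown files; "sample_docs/" is a placeholder path
loader = DirectoryLoader("sample_docs/", glob="**/*.md")
docs = loader.load()

generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
testset = generator.generate_with_langchain_docs(docs, testset_size=10)

# inspect the generated question/answer pairs
print(testset.to_pandas().head())
```

Each generated row can then be uploaded to Langfuse as a dataset item, with the question as input and the reference answer as expected_output.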

Example 3: DeepEval Library
DeepEval is a library that helps generate synthetic data systematically using the Synthesizer class.
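A minimal sketch (the context strings are made up for illustration; the Synthesizer can also generate goldens directly from document files):

```python
from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()  # uses an OpenAI model by default via OPENAI_API_KEY

# each inner list is one retrieval context that grounds the generated questions
goldens = synthesizer.generate_goldens_from_contexts(
    contexts=[
        ["Carry-on bags may weigh at most 7 kg on our airline."],
        ["Bookings can be changed free of charge within 24 hours."],
    ],
)

for golden in goldens:
    # expected_output may be None depending on synthesizer configuration
    print(golden.input, "->", golden.expected_output)
```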

Example 4: No-Code via Hugging Face Dataset Generator
If you prefer a UI-based approach, check out Hugging Face's Synthetic Data Generator. You can generate examples in the Hugging Face UI, download them as a CSV, and upload it in the Langfuse UI.


Example 5: RAG Dataset Generation
If you have an existing vector database or prefer not to use specialized libraries like RAGAS or DeepEval, you can generate a RAG testset by directly looping through your vector store. This approach gives you full control over the generation process; a sketch follows the list below.
This is useful when you:
- Want lightweight code without additional dependencies
- Need to customize the question generation logic
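A minimal sketch of such a loop, assuming chunks holds the text passages fetched from your vector store (the dataset name is hypothetical):

```python
from openai import OpenAI
from langfuse import get_client

langfuse = get_client()
openai_client = OpenAI()

# stand-in for passages pulled from your vector store,
# e.g. chroma_collection.get()["documents"] for a Chroma collection
chunks = ["..."]

langfuse.create_dataset(name="rag-testset")

for chunk in chunks:
    completion = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": (
                    "Write one question that can be answered using only the "
                    f"following passage:\n\n{chunk}"
                ),
            }
        ],
    )
    question = completion.choices[0].message.content.strip()
    langfuse.create_dataset_item(
        dataset_name="rag-testset",
        input={"question": question},
        metadata={"source_chunk": chunk},  # keep the grounding passage for evals
    )
```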
You can now evaluate your application using this dataset. Check our RAG Observability and Evals blog post to learn more.

Example 6: Torque - Declarative Dataset Generation
Torque is a declarative, type-safe DSL for building synthetic datasets. It lets you compose conversations like React components, making it particularly useful for generating complex multi-turn conversations with tool calls.
This approach is ideal when you need:
- Structured conversations with tool usage patterns
- Type-safe dataset generation with full TypeScript support
- Reproducible datasets with seeded generation
- Complex multi-turn dialogs that follow specific patterns
Key advantages of Torque:
- Type-safe conversations: Full TypeScript support with Zod schemas ensures your synthetic data matches your production types
- Declarative patterns: Compose complex conversation flows with times(), oneOf(), and other combinators
- Tool simulation: Built-in support for tool calls and results, perfect for evaluating agentic applications
- Reproducible: Seeded generation ensures identical datasets across runs
- Realistic variations: AI generates natural variations while following your structural constraints
This approach is particularly powerful for evaluating AI agents with tool usage, as it generates structurally consistent but semantically diverse conversations.
Next Steps
- Explore your dataset in Langfuse: each dataset and its items are visible in the UI.
- Run experiments: you can now evaluate your application using this dataset.
- Compare runs over time or across models, prompts, or chain logic.
For more details on how to run experiments on a dataset, see the Langfuse docs.
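As a rough sketch of what such an experiment run can look like with the Python SDK (my_app is a stand-in for your application, and the dataset name is the hypothetical one from Example 1):

```python
from langfuse import get_client

langfuse = get_client()

def my_app(question: str) -> str:
    # placeholder for your real application logic
    return f"(answer to: {question})"

dataset = langfuse.get_dataset("airline-questions")

for item in dataset.items:
    # item.run() opens a trace that Langfuse links to this dataset item
    with item.run(run_name="gpt-4o-mini-v1") as root_span:
        output = my_app(item.input["question"])
        root_span.update(input=item.input, output=output)

langfuse.flush()  # make sure all events are sent before the script exits
```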