Product Question Answer Generator

Tags: gpu-acceleration, retrieval-augmented-generation, llm-inference, tensorrt, qa-generation, nvidia-generative-ai-examples, self-hosted-tutorials, large-language-models, microservice, triton-inference-server, community-contributions, LLM, rag, nemo, NeMo-Data-Designer

🎨 NeMo Data Designer: Product Information Dataset Generator with Q&A

📚 What you'll learn

This notebook demonstrates how to use NeMo Data Designer to create a synthetic dataset of product information with corresponding questions and answers.


👋 IMPORTANT: Environment Setup

  • If you haven't already, follow the instructions in the README to install the necessary dependencies.

  • You may need to restart your notebook's kernel after setting up the environment.

  • In this notebook, we assume you have a self-hosted instance of Data Designer up and running.

  • For deployment instructions, see the Installation Options section of the NeMo Data Designer documentation.

📦 Import the essentials

  • The data_designer module of nemo_microservices exposes Data Designer's high-level SDK.

  • The essentials module provides quick access to the most commonly used objects.
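
A minimal sketch of the import cell, assuming the SDK is installed as the `nemo_microservices` package and that the module path is `nemo_microservices.data_designer.essentials`, as the bullets above suggest. `NeMoDataDesignerClient` and `ModelConfig` are named in this notebook; `DataDesignerConfigBuilder` is an assumed class name for the config builder. The guard only keeps the cell runnable when the SDK is absent.

```python
# Guarded import: the module path and DataDesignerConfigBuilder are assumptions.
try:
    from nemo_microservices.data_designer.essentials import (
        DataDesignerConfigBuilder,
        ModelConfig,
        NeMoDataDesignerClient,
    )

    SDK_AVAILABLE = True
except ImportError:
    # Either the package is not installed or one of the names differs.
    SDK_AVAILABLE = False

print(f"Data Designer SDK available: {SDK_AVAILABLE}")
```

If this prints `False`, revisit the README's dependency instructions before continuing.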


⚙️ Initialize the NeMo Data Designer Client

  • NeMoDataDesignerClient is responsible for submitting generation requests to the microservice.
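
One way the client might be created is sketched below. The URL is a placeholder for a local self-hosted deployment, and passing it as a `base_url` keyword is an assumption about the constructor, so check both against your installation.

```python
# Assumed address of a self-hosted deployment; replace with your own.
NEMO_MICROSERVICES_BASE_URL = "http://localhost:8080"


def make_client(base_url: str = NEMO_MICROSERVICES_BASE_URL):
    """Return a Data Designer client, or None if the SDK is not importable."""
    try:
        from nemo_microservices.data_designer.essentials import NeMoDataDesignerClient
    except ImportError:
        return None
    # `base_url` as a keyword argument is an assumption about the constructor.
    return NeMoDataDesignerClient(base_url=base_url)


data_designer_client = make_client()
```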

🎛️ Define model configurations

  • Each ModelConfig defines a model that can be used during the generation process.

  • The "model alias" is used to reference the model in the Data Designer config (as we will see below).

  • The "model provider" is the external service that hosts the model (see the model config docs for more details).

  • By default, the microservice uses build.nvidia.com as the model provider.
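
To illustrate how an alias ties a column to a concrete model, here is a stand-in written with a plain dataclass. `ModelConfigSketch`, its field names, the alias, and the model identifier are all hypothetical; the SDK's real `ModelConfig` may differ.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ModelConfigSketch:
    """Stand-in for the SDK's ModelConfig; field names are illustrative."""

    alias: str     # short name that column configs refer to
    model: str     # model identifier at the provider (hypothetical below)
    provider: str  # external service that hosts the model
    inference_parameters: dict = field(default_factory=dict)


model_configs = [
    ModelConfigSketch(
        alias="qa-writer",
        model="example-org/example-model",
        provider="build.nvidia.com",
        inference_parameters={"temperature": 0.6, "max_tokens": 1024},
    ),
]

# Columns name an alias; the alias resolves to a concrete model.
by_alias = {cfg.alias: cfg for cfg in model_configs}
print(by_alias["qa-writer"].model)
```

Keeping the alias separate from the model identifier means a model can be swapped in one place without touching the column definitions.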


๐Ÿ—๏ธ Initialize the Data Designer Config Builder

  • The Data Designer config defines the dataset schema and generation process.

  • The config builder provides an intuitive interface for building this configuration.

  • The list of model configs is provided to the builder at initialization.
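
The builder idea can be sketched with a toy class that collects column definitions and rejects references to unknown model aliases. `ConfigBuilderSketch` and its methods are hypothetical and only mirror the pattern; they are not the SDK's builder.

```python
class ConfigBuilderSketch:
    """Toy builder: collects column definitions and checks model aliases."""

    def __init__(self, model_aliases):
        self._aliases = set(model_aliases)
        self._columns = []

    def add_column(self, name, kind, model_alias=None, **params):
        # Fail early if a column points at a model that was never registered.
        if model_alias is not None and model_alias not in self._aliases:
            raise ValueError(f"unknown model alias: {model_alias!r}")
        self._columns.append(
            {"name": name, "kind": kind, "model_alias": model_alias, **params}
        )
        return self  # allow chaining

    def build(self):
        return {"model_aliases": sorted(self._aliases), "columns": list(self._columns)}


builder = ConfigBuilderSketch(model_aliases=["qa-writer"])
builder.add_column("category", kind="sampler", values=["Electronics", "Books"])
builder.add_column(
    "question",
    kind="llm-text",
    model_alias="qa-writer",
    prompt="Ask a question about a {{ category }} product.",
)
config = builder.build()
print([col["name"] for col in config["columns"]])
```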


๐Ÿ—๏ธ Defining Data Structures

Now we'll define the data models and evaluation rubrics for our product information dataset.
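
As a rough illustration of what such a data model could look like, the sketch below uses a standard-library dataclass and parses one JSON record into it. The `ProductInfo` fields and the sample record are hypothetical, and the original notebook may use a different modeling library.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ProductInfo:
    """Hypothetical schema for one generated product."""

    product_name: str
    key_features: list[str]
    description: str
    price_usd: float


def parse_product(raw_json: str) -> ProductInfo:
    """Parse model output; raises TypeError on missing or unexpected fields."""
    return ProductInfo(**json.loads(raw_json))


raw = json.dumps(
    {
        "product_name": "Trail Lamp",
        "key_features": ["rechargeable", "clip-on"],
        "description": "A compact lamp for camping.",
        "price_usd": 24.99,
    }
)
product = parse_product(raw)
print(asdict(product)["product_name"])
```

A fixed schema like this makes malformed generations fail loudly at parse time instead of leaking into the dataset.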


🎲 Adding Sampler Columns

  • Sampler columns offer non-LLM based generation of synthetic data.

  • They are particularly useful for steering the diversity of the generated data, as we demonstrate below.
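
Conceptually, a sampler column draws values from a fixed distribution without calling an LLM. The standard-library sketch below imitates three such samplers; the column names, categories, ranges, and weights are hypothetical, and this is not the SDK's sampler API.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable


def sample_category(values, weights=None):
    """Pick one value, optionally with unequal weights."""
    return rng.choices(values, weights=weights, k=1)[0]


def sample_uniform(low, high):
    """Draw a float between low and high."""
    return rng.uniform(low, high)


def sample_bernoulli(p):
    """Return 1 with probability p, else 0."""
    return int(rng.random() < p)


def sample_row():
    return {
        "category": sample_category(["Electronics", "Clothing", "Home"], weights=[2, 1, 1]),
        "price_usd": round(sample_uniform(5.0, 500.0), 2),
        "on_sale": sample_bernoulli(0.3),
    }


rows = [sample_row() for _ in range(5)]
for row in rows:
    print(row)
```

Because the categories and weights are set explicitly, the mix of products is controlled up front rather than left to the LLM's own tendencies.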


🦜 LLM-generated columns

  • When prompting the LLM, we can use Jinja templating to reference other columns in the dataset.

  • As we see below, nested JSON fields can be accessed using dot notation.
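
To show what dot-notation access in a `{{ ... }}` placeholder amounts to, here is a small standard-library stand-in. It is not Jinja and supports only plain dotted lookups; the row contents and template are hypothetical.

```python
import re

# Matches placeholders such as {{ category }} or {{ product.name }}.
_PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")


def resolve(path: str, row: dict):
    """Follow a dotted path such as 'product.name' through nested dicts."""
    value = row
    for key in path.split("."):
        value = value[key]
    return value


def render(template: str, row: dict) -> str:
    """Replace each placeholder with the value found in the row."""
    return _PLACEHOLDER.sub(lambda m: str(resolve(m.group(1), row)), template)


row = {"category": "Home", "product": {"name": "Trail Lamp", "price_usd": 24.99}}
prompt = render("Write a question about {{ product.name }} in {{ category }}.", row)
print(prompt)  # Write a question about Trail Lamp in Home.
```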


๐Ÿ” Quality Assessment: LLM-as-a-Judge

When generating a synthetic dataset, we need a way to measure the quality of the generated data. We use the LLM-as-a-Judge strategy to do this.

To do so, we define the rubric that the LLM should use to assess generation quality, along with a prompt that provides relevant instructions.
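
A minimal sketch of the rubric-plus-prompt idea follows, using plain Python rather than the SDK's judge column. The rubric wording, the score levels, and the `Score: <number>` reply format are all hypothetical.

```python
import re

# Hypothetical rubric: one criterion with three scoring levels.
RUBRIC = {
    "name": "accuracy",
    "description": "Is the answer supported by the product information?",
    "options": {
        3: "Fully supported by the product information.",
        2: "Mostly supported, with minor unsupported details.",
        1: "Contradicts or invents product details.",
    },
}


def build_judge_prompt(product_info, question, answer, rubric=RUBRIC):
    """Assemble the instructions sent to the judge model."""
    levels = "\n".join(
        f"{score}: {text}"
        for score, text in sorted(rubric["options"].items(), reverse=True)
    )
    return (
        f"You are grading an answer for {rubric['name']}.\n"
        f"{rubric['description']}\n\n"
        f"Product information:\n{product_info}\n\n"
        f"Question: {question}\nAnswer: {answer}\n\n"
        f"Scoring levels:\n{levels}\n\n"
        "Reply with 'Score: <number>' followed by one sentence of reasoning."
    )


def parse_score(reply, rubric=RUBRIC):
    """Extract a valid rubric score from the judge's reply, or None."""
    match = re.search(r"Score:\s*(\d+)", reply)
    if not match:
        return None
    score = int(match.group(1))
    return score if score in rubric["options"] else None


judge_prompt = build_judge_prompt("Trail Lamp, rechargeable", "Is it rechargeable?", "Yes.")
print(parse_score("Score: 3. The answer matches the listed features."))
```

Rejecting scores outside the rubric keeps malformed judge replies from being recorded as valid ratings.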


๐Ÿ” Iteration is key โ€“ย preview the dataset!

  1. Use the preview method to generate a sample of records quickly.

  2. Inspect the results for quality and format issues.

  3. Adjust column configurations, prompts, or parameters as needed.

  4. Re-run the preview until satisfied.
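
Step 2 can be partly automated. The sketch below scans preview records for missing or empty fields; it assumes the preview has already been converted to a list of dicts, and the sample records and field names are hypothetical.

```python
def find_issues(records, required_fields):
    """Return (row_index, message) pairs for missing or empty fields."""
    issues = []
    for index, record in enumerate(records):
        for name in required_fields:
            value = record.get(name)
            if value is None:
                issues.append((index, f"missing field '{name}'"))
            elif isinstance(value, str) and not value.strip():
                issues.append((index, f"empty field '{name}'"))
    return issues


# Hypothetical preview output with two deliberate problems.
preview_records = [
    {"question": "Is the lamp rechargeable?", "answer": "Yes, via USB."},
    {"question": "How long does the battery last?", "answer": "  "},
    {"question": "What colors are available?"},
]

issues = find_issues(preview_records, ["question", "answer"])
for row_index, message in issues:
    print(f"row {row_index}: {message}")
```

A check like this catches format problems quickly, while judging content quality still calls for reading a handful of records by hand.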


📊 Analyze the generated data

  • Data Designer automatically generates a basic statistical analysis of the generated data.

  • This analysis is available via the analysis property of generation result objects.
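
For intuition about what a basic per-column analysis might contain, here is a standard-library sketch. It is not Data Designer's analysis output; the chosen statistics and the sample records are assumptions.

```python
from collections import Counter
from statistics import mean


def summarize(records, column):
    """Basic stats: numeric summary for numbers, top values otherwise."""
    values = [r.get(column) for r in records]
    present = [v for v in values if v is not None]
    summary = {"count": len(present), "missing": len(values) - len(present)}
    if present and all(isinstance(v, (int, float)) for v in present):
        summary.update(min=min(present), max=max(present), mean=mean(present))
    else:
        summary["top_values"] = Counter(map(str, present)).most_common(3)
    return summary


# Hypothetical generated records.
records = [
    {"category": "Home", "price_usd": 10.0},
    {"category": "Home", "price_usd": 20.0},
    {"category": "Books", "price_usd": None},
]

price_summary = summarize(records, "price_usd")
category_summary = summarize(records, "category")
print(price_summary)
print(category_summary)
```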


🆙 Scale up!

  • Happy with your preview data?

  • Use the create method to submit larger Data Designer generation jobs.
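
A job submission could be wrapped as below. The notebook names a `create` method, but passing the config builder together with a `num_records` keyword is an assumed call shape. `_RecordingClient` is a test double standing in for the real client so the sketch runs without a deployment.

```python
def submit_generation_job(client, config_builder, num_records: int):
    """Submit a full generation job through the client's `create` method."""
    if num_records <= 0:
        raise ValueError("num_records must be positive")
    # Passing the builder plus `num_records` is an assumed call shape.
    return client.create(config_builder, num_records=num_records)


class _RecordingClient:
    """Test double standing in for NeMoDataDesignerClient."""

    def __init__(self):
        self.calls = []

    def create(self, config_builder, num_records):
        self.calls.append(num_records)
        return {"status": "submitted", "num_records": num_records}


fake_client = _RecordingClient()
job = submit_generation_job(fake_client, config_builder=object(), num_records=100)
print(job)
```

With a real deployment, pass the client and config builder created earlier in place of the stand-ins.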
