🎨 NeMo Data Designer 101: The Basics
In this notebook, we will demonstrate the basics of Data Designer by generating a simple product review dataset.
💾 Install dependencies
IMPORTANT 👉 If you haven't already, follow the instructions in the README to install the necessary dependencies. Note that you may need to restart your kernel after setting up the environment.
⚙️ Initialize the NeMo Data Designer Client
- The Data Designer client is responsible for submitting generation requests to the Data Designer microservice.
- In this notebook, we connect to the managed Data Designer service. Alternatively, you can connect to your own instance of Data Designer by following the deployment instructions here.
- If you have an instance of Data Designer running locally, you can connect to it as follows:
data_designer_client = DataDesignerClient(client=NeMoMicroservices(base_url="http://localhost:8080"))
🏗️ Initialize the Data Designer Config Builder
- The Data Designer config defines the dataset schema and generation process.
- The config builder provides an intuitive interface for building this configuration.
- You must provide a list of model configs to the builder at initialization.
- This list contains the models you can choose from (via the model_alias argument) during the generation process.
Note: The NeMo Data Designer managed service has access to a specific set of models. Visit https://build.nvidia.com/nemo/data-designer for the latest list of available models.
🎲 Getting started with sampler columns
- Sampler columns offer non-LLM based generation of synthetic data.
- They are particularly useful for steering the diversity of the generated data, as we demonstrate below.
Let's start designing our product review dataset by adding product category and subcategory columns.
Next, let's add samplers to generate data related to the customer and their review.
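To make the idea of sampler columns concrete, here is a minimal plain-Python sketch of what category, subcategory, and rating samplers do conceptually. It does not use the Data Designer API: the mapping, the column names, the age range, and the rating weights are all hypothetical values chosen for illustration.

```python
import random

# Hypothetical category -> subcategory mapping, for illustration only.
SUBCATEGORIES = {
    "Electronics": ["Smartphones", "Laptops", "Headphones"],
    "Clothing": ["Shirts", "Shoes", "Jackets"],
    "Home & Kitchen": ["Cookware", "Furniture", "Bedding"],
}


def sample_row(rng: random.Random) -> dict:
    """Draw one record without calling an LLM."""
    # Uniform draw over the top-level categories.
    category = rng.choice(list(SUBCATEGORIES))
    # The subcategory is conditioned on the category drawn above.
    subcategory = rng.choice(SUBCATEGORIES[category])
    # Weighted draw: the (assumed) weights skew ratings toward higher scores.
    stars = rng.choices([1, 2, 3, 4, 5], weights=[1, 1, 2, 3, 3], k=1)[0]
    return {
        "product_category": category,
        "product_subcategory": subcategory,
        "customer_age": rng.randint(18, 70),  # assumed age range
        "number_of_stars": stars,
    }


rng = random.Random(0)  # fixed seed so repeated runs give the same rows
rows = [sample_row(rng) for _ in range(5)]
for row in rows:
    print(row)
```

Changing the value lists or the weights changes the mix of records, which is the sense in which samplers steer the diversity of the dataset before any LLM is involved.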
🦜 LLM-generated columns
- The real power of Data Designer comes from leveraging LLMs to generate text, code, and structured data.
- For our product review dataset, we will use LLM-generated text columns to generate product names and customer reviews.
- When prompting the LLM, we can use Jinja templating to reference other columns in the dataset.
- As we see below, nested JSON columns can be accessed using dot notation.
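The sketch below illustrates how a prompt template can pull values from other columns, including nested ones, through dot notation. It is a simplified stand-in written with the standard library only, not Jinja itself and not the Data Designer API: it handles nothing beyond `{{ path }}` placeholders, and the record and its field names are hypothetical.

```python
import re

# Matches placeholders such as {{ product_category }} or {{ customer.age }}.
_PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")


def resolve(record: dict, dotted_path: str):
    """Walk a dotted path such as 'customer.age' through nested dicts."""
    value = record
    for key in dotted_path.split("."):
        value = value[key]
    return value


def render(template: str, record: dict) -> str:
    """Replace each {{ path }} placeholder with the matching value from the record."""
    return _PLACEHOLDER.sub(lambda m: str(resolve(record, m.group(1))), template)


# Hypothetical record: one flat column and one nested JSON column.
record = {
    "product_category": "Electronics",
    "customer": {"first_name": "Ana", "age": 34},
}

prompt = render(
    "Write a review for a product in the {{ product_category }} category, "
    "from the point of view of {{ customer.first_name }} (age {{ customer.age }}).",
    record,
)
print(prompt)
```

Each generated record fills the same template with its own values, so the sampled columns shape every prompt the LLM receives.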
👀 Preview the dataset
- Iteration is key to generating high-quality synthetic data.
- Use the preview method to generate 10 records for inspection.
⏭️ Next Steps
Now that you've seen the basics of Data Designer, check out the following notebooks to learn more about: