2 Structured Outputs And Jinja Expressions
🎨 NeMo Data Designer 101: Structured Outputs and Jinja Expressions
Note: To run this notebook, you must have the NeMo Data Designer microservice deployed locally via Docker Compose. See the deployment guide for more details.
In this notebook, we will continue our exploration of Data Designer, demonstrating more advanced data generation using structured outputs and Jinja expressions.
If this is your first time using Data Designer, we recommend starting with the first notebook in this 101 series.
💾 Install dependencies
IMPORTANT 👉 If you haven't already, follow the instructions in the README to install the necessary dependencies. Note that you may need to restart your kernel after setting up the environment.
⚙️ Initialize the NeMo Data Designer Client
- The Data Designer client is responsible for submitting generation requests to the Data Designer microservice.
- In this notebook, we connect to the managed service of Data Designer. Alternatively, you can connect to your own instance of Data Designer by following the deployment instructions here.
- If you have an instance of Data Designer running locally, you can connect to it as follows:
data_designer_client = DataDesignerClient(client=NeMoMicroservices(base_url="http://localhost:8080"))
🏗️ Initialize the Data Designer Config Builder
- The Data Designer config defines the dataset schema and generation process.
- The config builder provides an intuitive interface for building this configuration.
- You must provide a list of model configs to the builder at initialization.
- This list contains the models you can choose from (via the `model_alias` argument) during the generation process.
Note: The NeMo Data Designer managed service has access to a specific set of models. Visit https://build.nvidia.com/nemo/data-designer for the latest list of available models.
🧑‍🎨 Designing our data
- We will again create a product review dataset, but this time we will use structured outputs and Jinja expressions.
- Structured outputs let you specify the exact schema of the data you want to generate.
- Data Designer supports schemas specified using either JSON Schema or Pydantic data models (recommended).
We'll define our structured outputs using Pydantic data models:
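As a minimal sketch of what such models might look like, assuming Pydantic v2 is installed: the class names, field names, and value ranges below are illustrative assumptions, not necessarily the ones used in this notebook. The last lines show that a Pydantic model can be exported as JSON Schema and that out-of-range values are rejected at validation time.

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError


class Product(BaseModel):
    """Illustrative schema for a generated product (field names are assumptions)."""

    name: str = Field(description="The name of the product")
    price: float = Field(description="The price of the product in USD", ge=10, le=1000)


class ProductReview(BaseModel):
    """Illustrative schema for a generated review (field names are assumptions)."""

    rating: int = Field(description="Star rating of the product", ge=1, le=5)
    customer_mood: Literal["irritated", "mad", "happy", "neutral", "excited"] = Field(
        description="The mood of the customer"
    )
    review: str = Field(description="The text of the review")


# The same schema expressed as JSON Schema (Pydantic v2 API).
review_schema = ProductReview.model_json_schema()

# A conforming record validates; an out-of-range rating is rejected.
ok = ProductReview(rating=4, customer_mood="happy", review="Works as advertised.")
try:
    ProductReview(rating=6, customer_mood="happy", review="Too good to be true.")
    rejected = False
except ValidationError:
    rejected = True
```

The `description` and range constraints end up in the exported schema, which is one reason the Pydantic form tends to be more convenient than writing JSON Schema by hand.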
Next, let's design our product review dataset using a few more tricks compared to the previous notebook:
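To illustrate the Jinja side on its own, the sketch below uses the standalone `jinja2` package as a stand-in for how a prompt template might reference other columns. It does not call the Data Designer API. The record contents and column names (`product`, `customer`, `product_category`) are hypothetical. It shows three common expression features: dotted access into nested fields of a structured output, filters, and a conditional.

```python
from jinja2 import Template

# A hypothetical record, as it might look once earlier columns are generated.
record = {
    "product_category": "Electronics",
    "product": {"name": "Noise-Cancelling Headphones", "price": 129.99},
    "customer": {"first_name": "Ana", "age": 34},
}

# Dotted access reaches into nested fields; filters and conditionals
# adjust the prompt for each record.
prompt_template = Template(
    "Write a review of {{ product.name }} (${{ '%.2f' | format(product.price) }}) "
    "in the {{ product_category | lower }} category, "
    "from the point of view of {{ customer.first_name }}"
    "{% if customer.age < 18 %}, a teenager{% endif %}."
)

prompt = prompt_template.render(**record)
print(prompt)
```

Because the template is rendered per record, the same prompt definition yields a different concrete prompt for every row of the dataset.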
👀 Preview the dataset
- Iteration is key to generating high-quality synthetic data.
- Use the `preview` method to generate 10 records for inspection.
- Setting `verbose_logging=True` prints logs within each task of the generation process.
⏭️ Next Steps
Check out the following notebooks to learn more about: