Safe Synthesizer 101
🎛️ NeMo Safe Synthesizer 101: The Basics
⚠️ Warning: NeMo Safe Synthesizer is in Early Access and not recommended for production use.
In this notebook, we demonstrate how to create a synthetic version of a tabular dataset using the NeMo Microservices Python SDK. The notebook should take about 20 minutes to run.
After completing this notebook, you'll be able to:
- Use the NeMo Microservices SDK to interact with Safe Synthesizer
- Create novel synthetic data that follows the statistical properties of your input dataset
- Access an evaluation report on synthetic data quality and privacy
💾 Install dependencies
IMPORTANT 👉 Ensure you have a NeMo Microservices Platform deployment available. Follow the quickstart or Helm chart instructions in your environment's setup guide. You may need to restart your kernel after installing dependencies.
⚙️ Initialize the NeMo Safe Synthesizer Client
- The Python SDK provides a wrapper around the NeMo Microservices Platform APIs.
- http://localhost:8080 is the default URL for the client's base_url in the quickstart.
- If using a managed or remote deployment, ensure correct base URLs and tokens.
NeMo DataStore is launched as one of the services, and we use it to manage storage, so the client setup also includes its connection settings.
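One way to keep these settings out of the notebook body is to read them from the environment. The sketch below is a minimal example under stated assumptions: the environment variable names (NMP_BASE_URL, DATASTORE_ENDPOINT, DATASTORE_TOKEN) and the "endpoint" and "token" keys are illustrative and not defined by the SDK. Only the default base URL comes from the quickstart.

```python
import os

# Default base URL used by the quickstart deployment.
DEFAULT_BASE_URL = "http://localhost:8080"


def resolve_settings(env=None):
    """Collect connection settings for the client and the datastore.

    All environment variable names and dictionary keys here are
    hypothetical; adapt them to what your deployment and SDK expect.
    """
    env = os.environ if env is None else env
    # Fall back to the quickstart default and drop any trailing slash.
    base_url = env.get("NMP_BASE_URL", DEFAULT_BASE_URL).rstrip("/")
    datastore_config = {
        "endpoint": env.get("DATASTORE_ENDPOINT"),  # None when unset
        "token": env.get("DATASTORE_TOKEN"),        # None when unset
    }
    return base_url, datastore_config


base_url, datastore_config = resolve_settings()
```

The resulting base_url would be passed to the client constructor and datastore_config to the builder. For a managed or remote deployment, set the variables before starting the kernel.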
📥 Load input data
Safe Synthesizer learns the patterns and correlations in your input dataset to produce synthetic data with similar properties. For this tutorial, we will use a small public sample dataset. Replace it with your own data if desired.
The sample dataset used here is a set of women's clothing reviews, including age, product category, rating, and review text. Some of the reviews contain Personally Identifiable Information (PII), such as height, weight, age, and location.
🏗️ Create a Safe Synthesizer job
The SafeSynthesizerBuilder provides a fluent interface to configure and submit jobs.
The following code creates and submits a job:
- SafeSynthesizerBuilder(client): initialize with the NeMo Microservices client.
- .with_data_source(df): set the input data source.
- .with_datastore(datastore_config): configure model artifact storage.
- .with_replace_pii(): enable automatic replacement of PII.
- .synthesize(): train and generate synthetic data.
- .create_job(): submit the job to the platform.
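The calls listed above chain together as in the following sketch. The method names are the ones described in this notebook. The helper function and its builder_cls parameter are hypothetical: the builder class is passed in so that the example does not assume a particular import path, which may differ between SDK versions.

```python
def submit_safe_synthesizer_job(builder_cls, client, df, datastore_config):
    """Configure and submit a job with the fluent chain described above.

    builder_cls is expected to be the SDK's SafeSynthesizerBuilder;
    this wrapper itself is illustrative and not part of the SDK.
    """
    return (
        builder_cls(client)                # initialize with the platform client
        .with_data_source(df)              # set the input data source
        .with_datastore(datastore_config)  # configure model artifact storage
        .with_replace_pii()                # enable automatic PII replacement
        .synthesize()                      # train and generate synthetic data
        .create_job()                      # submit the job to the platform
    )
```

In the notebook, the call would look like submit_safe_synthesizer_job(SafeSynthesizerBuilder, client, df, datastore_config), with SafeSynthesizerBuilder imported from your installed SDK.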
👀 View synthetic data
After the job completes, fetch the generated synthetic dataset.
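Because training and generation can take several minutes, the notebook has to wait before fetching results. The SDK's job object may already offer its own way to block until completion; if it does, prefer that. Otherwise, a generic polling helper along these lines could be used. Everything here is a hypothetical sketch: is_done stands for any zero-argument callable that reports whether the job has finished, and the default timeout and interval are arbitrary choices.

```python
import time


def wait_until(is_done, timeout_s=1800.0, interval_s=10.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll is_done() until it returns True or the timeout elapses.

    Returns True if the condition was met and False on timeout.
    sleep and clock are injectable so the helper can be tested quickly.
    """
    deadline = clock() + timeout_s
    while True:
        if is_done():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A return value of False signals a timeout, at which point the job status and logs are worth checking before retrying.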
📊 View evaluation report
An evaluation comparing the synthetic data to the input data is performed automatically. You can:
- Inspect key scores: overall synthetic data quality and privacy.
- Download the full HTML report: includes charts and detailed metrics.
- Display the report inline: useful when viewing in notebook environments.