Text To Python Evol

๐Ÿ‘จโ€๐Ÿ’ป NeMo Data Designer: Text-to-Python with Evolution

📚 What you'll learn

  • This notebook demonstrates how to use NeMo Data Designer to create a synthetic data generation pipeline for Python code examples,
    with a focus on evolutionary improvements.

  • We'll build a system that generates Python code based on natural language instructions, validates it, analyzes issues, and then improves the code based on feedback.


👋 IMPORTANT – Environment Setup

  • If you haven't already, follow the instructions in the README to install the necessary dependencies.

  • You may need to restart your notebook's kernel after setting up the environment.

  • In this notebook, we assume you have a self-hosted instance of Data Designer up and running.

  • For deployment instructions, see the Installation Options section of the NeMo Data Designer documentation.

📦 Import the essentials

  • The data_designer module of nemo_microservices exposes Data Designer's high-level SDK.

  • The essentials module provides quick access to the most commonly used objects.
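
A minimal import sketch, assuming the two bullets above mean that the essentials module is importable as `nemo_microservices.data_designer.essentials`. The three names imported here are the ones this notebook mentions by name. The `try`/`except` guard and the `SDK_AVAILABLE` flag are our own illustrative additions, not part of the SDK.

```python
# Assumed module path, inferred from the description above.
try:
    from nemo_microservices.data_designer.essentials import (
        CodeLang,
        ModelConfig,
        NeMoDataDesignerClient,
    )
    SDK_AVAILABLE = True
except ImportError:
    # The package is missing or laid out differently in this environment.
    SDK_AVAILABLE = False

print("Data Designer SDK importable:", SDK_AVAILABLE)
```

If the flag is `False`, revisit the environment setup steps before continuing.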

โš™๏ธ Initialize the NeMo Data Designer Client

  • NeMoDataDesignerClient is responsible for submitting generation requests to the microservice.

๐ŸŽ›๏ธ Define model configurations

  • Each ModelConfig defines a model that can be used during the generation process.

  • The "model alias" is used to reference the model in the Data Designer config (as we will see below).

  • The "model provider" is the external service that hosts the model (see the model config docs for more details).

  • By default, the microservice uses build.nvidia.com as the model provider.
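
To illustrate why the alias is useful, here is a plain-Python sketch of an alias-to-model lookup. It does not use the SDK: `ModelSpec`, `build_registry`, the field names, and the model identifier are all hypothetical. Only the provider name `build.nvidia.com` comes from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical stand-in for a model configuration (not the SDK's ModelConfig)."""
    alias: str     # short name that column definitions refer to
    model: str     # identifier of the hosted model (placeholder value below)
    provider: str  # external service that hosts the model


def build_registry(specs):
    """Index specs by alias, rejecting duplicates so each alias is unambiguous."""
    registry = {}
    for spec in specs:
        if spec.alias in registry:
            raise ValueError(f"duplicate model alias: {spec.alias!r}")
        registry[spec.alias] = spec
    return registry


registry = build_registry([
    ModelSpec(
        alias="code-model",
        model="example-org/example-code-model",  # placeholder, not a real model ID
        provider="build.nvidia.com",
    ),
])

# A column definition only needs the alias; the registry resolves the rest.
resolved = registry["code-model"]
print(resolved.provider)
```

Because columns refer only to the alias, swapping the underlying model means changing one entry rather than every column that uses it.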

๐Ÿ—๏ธ Initialize the Data Designer Config Builder

  • The Data Designer config defines the dataset schema and generation process.

  • The config builder provides an intuitive interface for building this configuration.

  • The list of model configs is provided to the builder at initialization.

🎲 Adding Sampler Columns

  • Sampler columns offer non-LLM based generation of synthetic data.

  • They are particularly useful for steering the diversity of the generated data, as we demonstrate below.
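
The idea behind sampler columns can be sketched with the standard library alone. This is not the Data Designer sampler API: the category values, `sample_seed_rows`, and `render_prompt` are hypothetical. The sketch shows how seed values drawn without an LLM can steer what a later prompt asks for.

```python
import random

# Hypothetical category values used only for this illustration.
TOPICS = {
    "Data Processing": ["CSV parsing", "JSON handling"],
    "Algorithms": ["Sorting", "Graph search"],
}
COMPLEXITY = ["Beginner", "Intermediate", "Advanced"]


def sample_seed_rows(n, seed=0):
    """Draw n rows of seed values; the topic is conditioned on the sampled category."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        category = rng.choice(sorted(TOPICS))
        rows.append({
            "category": category,
            "topic": rng.choice(TOPICS[category]),
            "complexity": rng.choice(COMPLEXITY),
        })
    return rows


def render_prompt(row):
    """Turn one row of sampled values into an instruction-generation prompt."""
    return (
        f"Write a {row['complexity'].lower()}-level Python task about "
        f"{row['topic']} ({row['category']})."
    )


for row in sample_seed_rows(3, seed=42):
    print(render_prompt(row))
```

Fixing the seed makes the sampled values reproducible, which helps when comparing prompt changes between preview runs.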

🦜 Define Initial Code Generation

First, we'll set up the columns for generating the instruction and initial code implementation using the same approach as in the text-to-python notebook.

โšก๏ธ Quality Assessment: Code Validation

  • Now we'll add validation for the initial code and generate analysis of any issues found.

  • NeMo Data Designer includes a built-in code validation feature that automatically checks the syntactic correctness and executable validity of
    generated code snippets.

  • This helps ensure that outputs from language models are not only syntactically correct, but also able to run successfully in the
    intended programming language environment.

  • Running this validation step improves dataset quality by flagging invalid or non-functional code early,
    so that unreliable samples can be filtered out or repaired before they reach the final dataset.

  • NeMo Data Designer supports validation for these languages:

    • Python (CodeLang.PYTHON)

    • SQL dialects:

      • ANSI SQL (CodeLang.SQL_ANSI)

      • MySQL (CodeLang.SQL_MYSQL)

      • PostgreSQL (CodeLang.SQL_POSTGRES)

      • SQLite (CodeLang.SQL_SQLITE)

      • T-SQL (CodeLang.SQL_TSQL)

      • BigQuery (CodeLang.SQL_BIGQUERY)
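
To make the syntactic half of this check concrete, here is a standard-library sketch for Python snippets. It is not Data Designer's built-in validator and it does not run the code. `check_python_syntax` and the shape of its report are hypothetical.

```python
import ast


def check_python_syntax(code):
    """Return a small report: is_valid, plus error details when parsing fails."""
    try:
        ast.parse(code)
    except SyntaxError as exc:
        return {"is_valid": False, "line": exc.lineno, "message": exc.msg}
    return {"is_valid": True, "line": None, "message": None}


good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b)\n    return a + b\n"  # missing colon after the signature

print(check_python_syntax(good))
print(check_python_syntax(bad))
```

A report like this gives a later analysis step something specific to work with: whether the snippet parsed and, if not, where it failed.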

โšก๏ธ Code Evolution

Next, we'll create the improved version of the code based on the analysis and validation.
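
The validate, analyze, improve cycle can be sketched in plain Python. This is an illustration under assumptions, not the notebook's pipeline: `toy_improve` is a hypothetical stand-in for the LLM rewrite, feedback is limited to syntax errors, and the bounded loop generalizes what the notebook does as a single improvement pass.

```python
import ast
from typing import Callable, List, Optional


def syntax_feedback(code: str) -> Optional[str]:
    """Return None when the code parses, else a short description of the error."""
    try:
        ast.parse(code)
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"
    return None


def evolve(code: str, improve: Callable[[str, str], str], max_rounds: int = 3) -> List[str]:
    """Validate the latest version and pass feedback to `improve` until it parses."""
    history = [code]
    for _ in range(max_rounds):
        feedback = syntax_feedback(history[-1])
        if feedback is None:
            break  # nothing left to fix
        history.append(improve(history[-1], feedback))
    return history


def toy_improve(code: str, feedback: str) -> str:
    """Hypothetical stand-in for an LLM rewrite: adds a missing colon to def lines."""
    fixed = []
    for line in code.splitlines():
        if line.startswith("def ") and not line.rstrip().endswith(":"):
            line = line.rstrip() + ":"
        fixed.append(line)
    return "\n".join(fixed) + "\n"


history = evolve("def square(x)\n    return x * x\n", toy_improve)
print(f"{len(history)} versions; final version parses: {syntax_feedback(history[-1]) is None}")
```

Keeping every version in `history` preserves the initial code next to its improved form, so the two can be compared in the resulting dataset.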

๐Ÿ” Quality Assessment: LLM-as-a-Judge

When generating a synthetic dataset, we need a way to assess the quality of the generated data.
We use the LLM-as-a-Judge strategy to do this.

This requires defining the rubric that the LLM should use to assess generation quality, along with a prompt that provides relevant instructions.
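
As a rough illustration of what a rubric and judge prompt might contain, here is a plain-Python sketch. It does not use the Data Designer judge API: `RubricItem`, `render_judge_prompt`, `validate_scores`, and the two criteria are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricItem:
    """One scoring criterion (hypothetical structure, not an SDK class)."""
    name: str
    description: str
    options: dict  # maps each allowed score to its meaning


RUBRIC = [
    RubricItem("Relevance", "Does the code address the instruction?",
               {4: "fully", 2: "partially", 0: "not at all"}),
    RubricItem("Readability", "Is the code clear and idiomatic?",
               {4: "very clear", 2: "acceptable", 0: "hard to follow"}),
]


def render_judge_prompt(instruction, code, rubric):
    """Build the judge prompt from the instruction, the code, and the rubric."""
    parts = [
        "You are reviewing Python code written for the instruction below.",
        f"Instruction:\n{instruction}",
        f"Code:\n{code}",
        "Score each criterion using only the listed options:",
    ]
    for item in rubric:
        levels = "; ".join(f"{score} = {meaning}"
                           for score, meaning in sorted(item.options.items()))
        parts.append(f"- {item.name}: {item.description} ({levels})")
    return "\n\n".join(parts)


def validate_scores(scores, rubric):
    """Check that a judge response has exactly one allowed score per criterion."""
    expected = {item.name: set(item.options) for item in rubric}
    if set(scores) != set(expected):
        return False
    return all(scores[name] in allowed for name, allowed in expected.items())


print(render_judge_prompt("Add two numbers.", "def add(a, b):\n    return a + b", RUBRIC))
```

Restricting each criterion to a small set of described options makes judge responses easier to parse and to compare across records.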

โšก๏ธ Quality Assessment: Code Validation

๐Ÿ” Iteration is key โ€“ย preview the dataset!

  1. Use the preview method to generate a sample of records quickly.

  2. Inspect the results for quality and format issues.

  3. Adjust column configurations, prompts, or parameters as needed.

  4. Re-run the preview until satisfied.

📊 Analyze the generated data

  • Data Designer automatically generates a basic statistical analysis of the generated data.

  • This analysis is available via the analysis property of generation result objects.
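
For intuition about what a basic statistical summary of this dataset could report, here is a standard-library sketch. It is not the output of the `analysis` property. The record fields and the four toy records are made up for illustration, and `summarize` is hypothetical.

```python
from statistics import mean

# Toy records with made-up values; the field names are hypothetical.
records = [
    {"code_is_valid": True, "improved_is_valid": True, "judge_score": 4},
    {"code_is_valid": False, "improved_is_valid": True, "judge_score": 3},
    {"code_is_valid": True, "improved_is_valid": True, "judge_score": 4},
    {"code_is_valid": False, "improved_is_valid": False, "judge_score": 1},
]


def summarize(rows):
    """Compute validity rates before and after improvement, plus the mean judge score."""
    n = len(rows)
    return {
        "num_records": n,
        "initial_valid_rate": sum(r["code_is_valid"] for r in rows) / n,
        "improved_valid_rate": sum(r["improved_is_valid"] for r in rows) / n,
        "mean_judge_score": mean(r["judge_score"] for r in rows),
    }


print(summarize(records))
```

Comparing the two validity rates is one simple way to check whether the evolution step is helping.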

🆙 Scale up!

  • Happy with your preview data?

  • Use the create method to submit larger Data Designer generation jobs.
