NVIDIA 1 Data Preparation

Part I: Preparing Datasets for Fine-tuning and Evaluation

This notebook covers the following:

Download xLAM Dataset
Prepare Data for Customization
Prepare Data for Evaluation

This notebook shows how to transform a dataset for fine-tuning and evaluating an LLM for tool calling with NeMo Microservices.

The following code cell imports necessary libraries.

[1]

The following code cell sets a random seed for reproducibility.

[2]
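A minimal sketch of seeding for reproducibility, using only the standard library. The seed value `42` and the name `SEED` are assumptions for illustration; the notebook's actual cell may also seed other libraries.

```python
import random

SEED = 42  # hypothetical value; any fixed integer gives repeatable results

# Seeding twice with the same value yields the same pseudo-random sequence.
random.seed(SEED)
first_draw = [random.random() for _ in range(3)]

random.seed(SEED)
second_draw = [random.random() for _ in range(3)]

print(first_draw == second_draw)
```

Fixing the seed matters later in the notebook, because the train, validation, and test split depends on shuffling.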

The following code cell defines the data root directory and creates necessary directories for storing processed data.

[3]
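A minimal sketch of the directory setup, assuming one subdirectory per downstream use. The names `DATA_ROOT`, `CUSTOMIZATION_DIR`, and `EVALUATION_DIR`, and the use of the system temporary directory, are illustrative assumptions rather than the notebook's actual layout.

```python
import os
import tempfile

# Hypothetical layout: one folder for fine-tuning data, one for evaluation data.
DATA_ROOT = os.path.join(tempfile.gettempdir(), "xlam_data_example")
CUSTOMIZATION_DIR = os.path.join(DATA_ROOT, "customization")
EVALUATION_DIR = os.path.join(DATA_ROOT, "evaluation")

# exist_ok=True makes the cell safe to re-run without raising an error.
for path in (CUSTOMIZATION_DIR, EVALUATION_DIR):
    os.makedirs(path, exist_ok=True)

print(sorted(os.listdir(DATA_ROOT)))
```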

Step 1: Download xLAM Data

This step loads the xLAM dataset from Hugging Face.

Ensure that you have followed the prerequisites mentioned in the associated README, obtained a Hugging Face access token, and configured it in config.py. In addition to getting an access token, you need to apply for access to the xLAM dataset on its page, which will be approved instantly.

[4]

[5]

{'answers': '[{"name": "live_giveaways_by_type", "arguments": {"type": '
            '"beta"}}, {"name": "live_giveaways_by_type", "arguments": '
            '{"type": "game"}}]',
 'id': 0,
 'query': 'Where can I find live giveaways for beta access and games?',
 'tools': '[{"name": "live_giveaways_by_type", "description": "Retrieve live '
          'giveaways from the GamerPower API based on the specified type.", '
          '"parameters": {"type": {"description": "The type of giveaways to '
          'retrieve (e.g., game, loot, beta).", "type": "str", "default": '
          '"game"}}}]'}
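As the sample record above shows, the `answers` and `tools` fields are JSON-encoded strings rather than nested objects, so they need to be decoded before use. The sketch below copies the record from the output above and decodes both fields with `json.loads`.

```python
import json

# The sample record shown above, reproduced as a plain dict.
record = {
    "id": 0,
    "query": "Where can I find live giveaways for beta access and games?",
    "answers": '[{"name": "live_giveaways_by_type", "arguments": {"type": "beta"}}, '
               '{"name": "live_giveaways_by_type", "arguments": {"type": "game"}}]',
    "tools": '[{"name": "live_giveaways_by_type", "description": "Retrieve live '
             'giveaways from the GamerPower API based on the specified type.", '
             '"parameters": {"type": {"description": "The type of giveaways to '
             'retrieve (e.g., game, loot, beta).", "type": "str", "default": "game"}}}]',
}

# Decode the string fields into Python lists of dicts.
answers = json.loads(record["answers"])
tools = json.loads(record["tools"])

for answer in answers:
    print(answer["name"], answer["arguments"])
print(list(tools[0]["parameters"].keys()))
```

Note that the parameter type is given as the Python-style name `"str"`, which is one of the things the conversion in Step 2 has to translate.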

For more details on the structure of this data, refer to the data structure of the xLAM dataset in the Hugging Face documentation.

Step 2: Prepare Data for Customization

For customization, the NeMo Microservices platform uses the OpenAI data format, which consists of messages and tools:

messages include the user query, as well as the ground truth assistant response to the query. This response contains the function name(s) and associated argument(s) in a "tool_calls" dict.
tools include a list of functions and parameters available to the LLM to choose from, as well as their descriptions.

The following is an example of the data format:

{
    "messages": [
        {
            "role": "user",
            "content": "Where can I find live giveaways for beta access and games?"
        },
        {
            "role": "assistant",
            "tool_calls": [
                {
                    "id": "call_beta",
                    "type": "function",
                    "function": {
                        "name": "live_giveaways_by_type",
                        "arguments": {"type": "beta"}
                    }
                },
                {
                    "id": "call_game",
                    "type": "function",
                    "function": {
                        "name": "live_giveaways_by_type",
                        "arguments": {"type": "game"}
                    }
                }
            ]
        }
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "live_giveaways_by_type",
                "description": "Retrieve live giveaways from the GamerPower API based on the specified type.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "type": {
                            "type": "string",
                            "description": "The type of giveaways to retrieve (e.g., game, loot, beta).",
                            "default": "game"
                        }
                    },
                    "required": []
                }
            }
        }
    ]
}

The following helper functions convert a single xLAM JSON data point into OpenAI format.

[6]
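The conversion can be sketched as follows. This is an illustrative approximation, not the notebook's actual helpers: the type mapping in `TYPE_MAP`, the index-based call IDs, the rule that parameters without a default are required, and the `single_tool_call_only` parameter name are all assumptions.

```python
import json
from typing import Any, Dict, List, Optional

# Assumed mapping from xLAM's Python-style type names to JSON Schema types.
TYPE_MAP = {"str": "string", "int": "integer", "float": "number",
            "bool": "boolean", "list": "array", "dict": "object"}


def normalize_type(param_type: str) -> str:
    """Map a type string such as "str" or "List[int], optional" to a JSON Schema type."""
    base = param_type.split(",")[0].split("[")[0].strip().lower()
    return TYPE_MAP.get(base, "string")  # unknown types fall back to "string"


def convert_tool(tool: Dict[str, Any]) -> Dict[str, Any]:
    """Convert one xLAM tool definition into an OpenAI-style function tool."""
    properties, required = {}, []
    for name, spec in tool.get("parameters", {}).items():
        prop = {"type": normalize_type(spec.get("type", "str")),
                "description": spec.get("description", "")}
        if "default" in spec:
            prop["default"] = spec["default"]
        else:
            required.append(name)  # assumption: no default means required
        properties[name] = prop
    return {"type": "function",
            "function": {"name": tool["name"],
                         "description": tool.get("description", ""),
                         "parameters": {"type": "object",
                                        "properties": properties,
                                        "required": required}}}


def convert_tool_calls(answers: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Convert xLAM answers into OpenAI-style tool calls with hypothetical IDs."""
    return [{"id": f"call_{i}", "type": "function",
             "function": {"name": a["name"], "arguments": a["arguments"]}}
            for i, a in enumerate(answers)]


def convert_example(example: Dict[str, Any],
                    single_tool_call_only: bool = True) -> Optional[Dict[str, Any]]:
    """Return the example in OpenAI format, or None if it is filtered out."""
    answers = json.loads(example["answers"])
    if single_tool_call_only and len(answers) != 1:
        return None
    return {"messages": [{"role": "user", "content": example["query"]},
                         {"role": "assistant",
                          "tool_calls": convert_tool_calls(answers)}],
            "tools": [convert_tool(t) for t in json.loads(example["tools"])]}


sample = {
    "query": "Where can I find live giveaways for beta access and games?",
    "answers": '[{"name": "live_giveaways_by_type", "arguments": {"type": "beta"}}, '
               '{"name": "live_giveaways_by_type", "arguments": {"type": "game"}}]',
    "tools": '[{"name": "live_giveaways_by_type", "description": "Retrieve live giveaways.", '
             '"parameters": {"type": {"description": "The type of giveaways.", '
             '"type": "str", "default": "game"}}}]',
}
converted = convert_example(sample, single_tool_call_only=False)
print(json.dumps(converted["tools"][0]["function"]["parameters"], indent=2))
```

With the default `single_tool_call_only=True`, the sample above is dropped because it contains two tool calls.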

The following code cell converts the example data to the OpenAI format required by NeMo Customizer.

[7]

NOTE: The convert_example function by default only retains data points that have exactly one tool_call in the output. The llama-3.2-1b-instruct model does not support parallel tool calls. For more information, refer to the supported models in the NeMo documentation.

Process Entire Dataset

Convert each example by looping through the dataset.

[8]

Split Dataset

This step splits the dataset into train, validation, and test sets. For demonstration, we use a smaller subset of all the examples. You may modify NUM_EXAMPLES to use a larger subset.

[9]
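A minimal sketch of the split, assuming a shuffled subset of `NUM_EXAMPLES` items divided 70/15/15. The percentages, the seed, and the function name `split_dataset` are illustrative assumptions; the notebook's actual proportions are not shown here.

```python
import random
from typing import Any, List, Tuple


def split_dataset(examples: List[Any], num_examples: int,
                  train_pct: int = 70, val_pct: int = 15,
                  seed: int = 42) -> Tuple[List[Any], List[Any], List[Any]]:
    """Shuffle, take the first num_examples items, and split them three ways."""
    rng = random.Random(seed)   # local generator, so global state is untouched
    shuffled = list(examples)   # copy to avoid mutating the caller's list
    rng.shuffle(shuffled)
    subset = shuffled[:num_examples]

    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = len(subset) * train_pct // 100
    n_val = len(subset) * val_pct // 100
    train = subset[:n_train]
    val = subset[n_train:n_train + n_val]
    test = subset[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


NUM_EXAMPLES = 20  # hypothetical value for this sketch
all_examples = list(range(100))  # stand-in for the converted examples
train, val, test = split_dataset(all_examples, NUM_EXAMPLES)
print(len(train), len(val), len(test))
```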

[10]

Step 3: Prepare Data for Evaluation

For evaluation, the NeMo Microservices platform uses a format with a minor modification to the OpenAI format: tool_calls is moved out of messages into a separate top-level field.

messages includes the user query.
tools includes a list of functions and parameters available to the LLM to choose from, as well as their descriptions.
tool_calls is the ground truth response to the user query. This response contains the function name(s) and associated argument(s) in a "tool_calls" dict.

The following is an example:

{
    "messages": [
        {
            "role": "user",
            "content": "Where can I find live giveaways for beta access?"
        }
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "live_giveaways_by_type",
                "description": "Retrieve live giveaways from the GamerPower API based on the specified type.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "type": {
                            "type": "string",
                            "description": "The type of giveaways to retrieve (e.g., game, loot, beta).",
                            "default": "game"
                        }
                    },
                    "required": []
                }
            }
        }
    ],
    "tool_calls": [
        {
            "id": "call_beta",
            "type": "function",
            "function": {
                "name": "live_giveaways_by_type",
                "arguments": {"type": "beta"}
            }
        }
    ]
}

The following steps transform the test dataset into a format compatible with the NeMo Evaluator microservice. This dataset is for measuring accuracy metrics before and after customization.

[11]
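The transformation described above can be sketched as a function that lifts the assistant's `tool_calls` out of `messages` and into a top-level field. The function name `convert_to_eval_format` is hypothetical, and this is an approximation of the step rather than the notebook's actual code.

```python
import copy
from typing import Any, Dict


def convert_to_eval_format(example: Dict[str, Any]) -> Dict[str, Any]:
    """Move assistant tool calls out of messages into a top-level tool_calls field."""
    example = copy.deepcopy(example)  # leave the training example untouched
    messages, tool_calls = [], []
    for message in example["messages"]:
        if message["role"] == "assistant" and "tool_calls" in message:
            tool_calls.extend(message["tool_calls"])  # ground-truth response
        else:
            messages.append(message)  # keep user (and any other) messages
    return {"messages": messages, "tools": example["tools"], "tool_calls": tool_calls}


training_example = {
    "messages": [
        {"role": "user", "content": "Where can I find live giveaways for beta access?"},
        {"role": "assistant", "tool_calls": [
            {"id": "call_beta", "type": "function",
             "function": {"name": "live_giveaways_by_type",
                          "arguments": {"type": "beta"}}}]},
    ],
    "tools": [],  # tool definitions omitted for brevity
}
eval_example = convert_to_eval_format(training_example)
print(list(eval_example.keys()))
```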

NOTE: We have implemented a workaround for a known bug where tool calls freeze the NIM if a tool description includes a function with a large number of parameters. The workaround limits the dataset to examples whose available tools have at most 8 parameters. This will be resolved in the next NIM release.

[12]
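A minimal sketch of the parameter-count filter from the note above. It assumes the limit applies to each tool individually and that tools are already in OpenAI format; the names `MAX_PARAMS` and `within_param_limit` are hypothetical.

```python
from typing import Any, Dict

MAX_PARAMS = 8  # limit stated in the note above


def within_param_limit(example: Dict[str, Any], max_params: int = MAX_PARAMS) -> bool:
    """Return True if every tool in the example has at most max_params parameters."""
    return all(
        len(tool["function"]["parameters"].get("properties", {})) <= max_params
        for tool in example["tools"]
    )


def make_example(num_params: int) -> Dict[str, Any]:
    """Build a toy example with one tool that has num_params parameters."""
    properties = {f"p{i}": {"type": "string"} for i in range(num_params)}
    return {"tools": [{"type": "function",
                       "function": {"name": "toy_tool",
                                    "parameters": {"type": "object",
                                                   "properties": properties}}}]}


examples = [make_example(n) for n in (2, 8, 9)]
filtered = [ex for ex in examples if within_param_limit(ex)]
print(len(filtered))
```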