Nemotron Super V3 GRPO/DAPO Training with NemoRL

Overview

This guide demonstrates the step-by-step RL training of the Nemotron Super V3 model on an NVIDIA B200 GPU cluster or a GB200-NVL72 rack system running Slurm.

We will carry out GRPO/DAPO training of the model on the DAPO-Math-17k dataset. This is a single-domain reinforcement learning example with verifiable rewards.

Notes:

  • Due to the distributed nature of the setup and training steps, all commands in this notebook are intended to be copied, pasted, and executed in the relevant interactive Docker shell environment on either the head node or worker nodes, rather than from within a single JupyterLab environment.

  • Interactive vs. batch training: in a production setting, it is more convenient to submit training jobs as Slurm batch jobs. However, setting up an interactive training environment allows you to iterate and debug faster. Once the interactive jobs run smoothly, you can submit them as batch training jobs.

  • We start this tutorial with a batch submission guide, followed by a step-by-step guide to set up an interactive training cluster.

Prerequisites

  • Compute: 3xB200 nodes (8 GPUs each, 24 B200 GPUs in total) with an InfiniBand interconnect, or 5xGB200 nodes (4 GPUs each, 20 GB200 GPUs in total) on a single GB200 NVL72 rack. Note that B200/GB200 GPUs have ~183/189 GB of HBM, which is not sufficient for full-weight co-located RL training (i.e., training the policy model and running rollout on the same set of GPUs). We therefore use non-colocated training, with 1 node dedicated to rollout and the remaining nodes (2 on B200, 4 on GB200) for policy training.

    On a Slurm system, you can check the availability of GB200 nodes with something similar to:

    sinfo|grep gb200nvl72
    

    Replace "gb200nvl72" with your correct GB200 Slurm partition name. Then request a specific interactive node with:

    srun -p gb200nvl72 -w gb200-001-compute09 -t 08:00:00 --pty bash
    
  • Storage: A high-speed shared network file system for storing code, models, checkpoints, and other temporary assets. In this guide, we assume that the shared storage is at </YOUR/SHARED/NETWORK/STORAGE> on the host system and is mounted as /shared inside the working Docker container, accessible from all nodes. We also assume the following directory structure:

</YOUR/SHARED/NETWORK/STORAGE> (on host):/shared (inside container)
|______code
|        |____RL  # NemoRL root directory
|        |____Nemotron/usage-cookbook/Nemotron-3-Super  # Repository containing this notebook
|_______models
|        |____NVIDIA-Nemotron-3-Super-120B-A12B-Base-BF16 # base model checkpoint
|_______checkpoints
|_______HF_HOME   # HuggingFace cache directory

Note that each model checkpoint (including the model weights and optimizer state) takes up to ~1 TB of storage. Also account for the number of checkpoints you would like to keep (e.g., the best k=3 checkpoints in the NemoRL training config). In addition, the base BF16 model checkpoint requires ~231 GB of storage, plus another ~231 GB for the Megatron-converted checkpoint.
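
As a rough sizing aid, the figures above can be combined into an upper-bound estimate of the shared storage needed. This is a minimal sketch that uses only the approximate sizes quoted in this section; the constant and function names are illustrative and not part of NemoRL, and 1 TB is taken as 1000 GB.

```python
# Approximate sizes quoted in this guide (decimal units: 1 TB = 1000 GB).
CHECKPOINT_GB = 1000      # upper bound per training checkpoint (weights + optimizer state)
BASE_MODEL_GB = 231       # base BF16 checkpoint
MEGATRON_COPY_GB = 231    # Megatron-converted copy of the base checkpoint


def storage_budget_gb(kept_checkpoints: int) -> int:
    """Upper-bound shared-storage estimate in GB for a given number of kept checkpoints."""
    if kept_checkpoints < 0:
        raise ValueError("kept_checkpoints must be non-negative")
    return kept_checkpoints * CHECKPOINT_GB + BASE_MODEL_GB + MEGATRON_COPY_GB


if __name__ == "__main__":
    # Keeping the best k=3 checkpoints, as in the example above.
    print(f"{storage_budget_gb(3) / 1000:.2f} TB")
```

With k=3 this comes to roughly 3.5 TB. The estimate excludes the HuggingFace cache, datasets, and logs, so leave additional headroom.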

[ ]
  • Docker image: Build the NemoRL Docker container

Reserve a B200/GB200 compute node. From here, clone and check out the NemoRL Super-v3 branch.

[ ]

Next, from the NemoRL repo root directory, build the Docker container and push it to a central registry accessible to all nodes, such as NVIDIA NGC or Docker Hub.

[ ]

Note: This example uses the NVIDIA NGC registry. See the NGC user's guide for how to set up your account and team, obtain the NGC API key for Docker login, and upload an image to NGC. Replace <YOUR_NGC_ORG> in nvcr.io/<YOUR_NGC_ORG>/nemo-rl-gb200:superv3-a90de923d with your NGC organization, or use a Docker Hub tag that all nodes can access.

Also, see the latest NemoRL Docker build guide for more information.

Step 1. Prepare the training config file

This step defines the training recipe for a single-domain RL workload: verifiable mathematical reasoning on DAPO-style data. The policy learns from sampled solutions, and rewards are computed by math verifiers, so this setup is ideal for tasks where correctness can be programmatically checked (e.g., arithmetic/algebra word problems and competition-style math).
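
To make the learning signal concrete: GRPO samples several solutions per prompt, scores each one, and uses each reward relative to the rest of its group as the advantage. The following is a minimal sketch of that idea, not NemoRL's implementation. The exact-match `verify` function is a toy stand-in for the real math verifiers, and the use of the population standard deviation and the `eps` constant are assumptions.

```python
from statistics import mean, pstdev


def verify(answer: str, reference: str) -> float:
    """Toy verifiable reward: 1.0 if the final answer matches the reference, else 0.0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages: center and scale each reward within its prompt group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0.0:
        # All samples scored the same, so the group carries no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    answers = ["42", "41", "42 ", "7"]  # four sampled answers for one prompt
    rewards = [verify(a, "42") for a in answers]
    print(rewards)
    print([round(a, 3) for a in group_advantages(rewards)])
```

Correct answers receive a positive advantage and incorrect ones a negative advantage. A group in which every sample gets the same reward contributes nothing, which is why programmatically checkable tasks with a mix of right and wrong samples suit this setup.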

GB200 Config

The YAML below is tuned for a non-colocated setup on 5xGB200 nodes (4 GPUs per node, 20 in total):

  • 1 node for vLLM generation (rollout)
  • 4 nodes for Megatron policy training

A practical workflow is to first run a short sanity pass (small max_num_steps) to validate cluster/config correctness, then scale up sequence length and rollout volume once stable.

Notes:

  • The configuration file must be written or copied under the NemoRL root directory, e.g. at </YOUR/SHARED/NETWORK/STORAGE>/code/RL/examples/configs/recipes/llm/ on your host file system.
  • ALL PATHS INSIDE THE CONFIG FILES ARE PATHS AS MOUNTED INSIDE THE CONTAINER (e.g. under /shared), NOT HOST PATHS.
[ ]

B200 Config

The YAML below is tuned for a non-colocated setup on 3xB200 nodes (8 GPUs per node, 24 in total):

  • 1 node for vLLM generation (rollout)
  • 2 nodes for Megatron policy training
[ ]
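
The GPU split implied by the two node layouts described above can be worked out as follows. This is a small arithmetic sketch based only on the node counts stated in this guide; the `Cluster` class is a hypothetical helper, not part of NemoRL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    name: str
    nodes: int
    gpus_per_node: int
    rollout_nodes: int = 1  # nodes reserved for vLLM generation

    @property
    def total_gpus(self) -> int:
        return self.nodes * self.gpus_per_node

    @property
    def rollout_gpus(self) -> int:
        return self.rollout_nodes * self.gpus_per_node

    @property
    def training_gpus(self) -> int:
        return self.total_gpus - self.rollout_gpus


GB200 = Cluster("GB200", nodes=5, gpus_per_node=4)
B200 = Cluster("B200", nodes=3, gpus_per_node=8)

if __name__ == "__main__":
    for c in (GB200, B200):
        print(f"{c.name}: {c.total_gpus} GPUs total, "
              f"{c.rollout_gpus} for rollout, {c.training_gpus} for training")
```

Both layouts leave 16 GPUs for Megatron policy training. They differ in the size of the rollout node: 4 GPUs on GB200 versus 8 on B200.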

Step 2. Batch job submission

On a production B200 cluster with 8 GPUs per node, you can launch batch jobs from a Slurm login/head node with the following procedure.

[ ]

On an Arm-based GB200 system with multi-node NVLink (MNNVL), use the following procedure. Work with your system administrator to obtain your own IMEX channel; see the interactive guide below for details.

[ ]

Tweaking training hyperparameters

Once training verification is successful, you can tweak the configuration parameters:

  • Run length / throughput (grpo)

    • max_num_steps, num_prompts_per_step, num_generations_per_prompt
    • These directly control training duration, sample volume, and compute cost per step.
  • Reward behavior (grpo.reward_shaping, grpo.reward_scaling)

    • reward_shaping.enabled, overlong_buffer_length, overlong_buffer_penalty
    • reward_scaling.enabled and min/max ranges
    • Use these to penalize overly long outputs and keep reward magnitude stable.
  • Policy sequence budget (policy)

    • max_total_sequence_length, train_micro_batch_size, logprob_batch_size
    • Increase carefully: longer contexts improve reasoning capacity but significantly raise memory and latency.
  • Distributed training topology (policy.megatron_cfg)

    • tensor_model_parallel_size, pipeline_model_parallel_size, context_parallel_size
    • Must match your available training GPUs and desired DP/TP/PP/CP balance.
  • Generation backend (policy.generation.vllm_cfg)

    • tensor_parallel_size, gpu_memory_utilization, max_model_len
    • Tune for rollout speed and stability on the dedicated generation node.
  • Dataset and validation (data, grpo)

    • data.dataset_name, data.validation.dataset_name, max_val_samples, val_period
    • Keep validation frequent enough to catch regressions, but not so frequent that it slows training.
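
A few of the relationships behind these parameters can be sketched in code. This is an illustration under stated assumptions, not NemoRL code: all numeric values are hypothetical rather than taken from the recipe, the data-parallel formula ignores MoE expert parallelism, and the length penalty follows the soft overlong punishment described in the DAPO paper. How `overlong_buffer_length` and `overlong_buffer_penalty` map onto it in NemoRL is assumed here, so check the NemoRL source before relying on it.

```python
def rollouts_per_step(num_prompts_per_step: int, num_generations_per_prompt: int) -> int:
    """Number of sampled responses generated and scored in one GRPO step."""
    return num_prompts_per_step * num_generations_per_prompt


def data_parallel_size(training_gpus: int, tp: int, pp: int, cp: int) -> int:
    """Data-parallel replicas left after tensor, pipeline, and context parallelism."""
    model_parallel = tp * pp * cp
    if training_gpus % model_parallel != 0:
        raise ValueError(
            f"{training_gpus} GPUs cannot be split evenly into groups of {model_parallel}"
        )
    return training_gpus // model_parallel


def overlong_penalty(response_len: int, max_response_len: int,
                     buffer_len: int, penalty: float) -> float:
    """Soft length penalty: zero until the response enters the last `buffer_len`
    tokens before the limit, then linear down to -penalty at the limit."""
    excess = response_len - (max_response_len - buffer_len)
    if excess <= 0:
        return 0.0
    return -penalty * min(excess / buffer_len, 1.0)


if __name__ == "__main__":
    # Hypothetical values, chosen only to illustrate the arithmetic.
    print(rollouts_per_step(64, 16))
    print(data_parallel_size(training_gpus=16, tp=4, pp=1, cp=1))
    print(overlong_penalty(7680, max_response_len=8192, buffer_len=1024, penalty=1.0))
```

Doubling either `num_prompts_per_step` or `num_generations_per_prompt` doubles the rollout volume per step. A parallelism layout whose product does not divide the number of training GPUs fails the check instead of silently wasting GPUs.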

Interactive Ray training cluster setup (Optional)

As noted in the overview, Slurm batch jobs are more convenient in production, while an interactive environment lets you iterate and debug faster before switching to batch submission.

The rest of this guide outlines this optional path: setting up a persistent interactive Ray cluster for quick development.

Step 4. Reserve worker nodes and start Docker containers

Reserve 5 interactive GB200 nodes (or 3 B200 nodes) as worker nodes for your Ray cluster. If using GB200 nodes on a GB200 NVL72 cluster, these nodes should ideally reside within the same rack. Keeping the nodes in one rack enables the fast multi-node NVLink interconnect, which speeds up training.

Note:

  • Work with your system administrator to obtain your own IMEX channel. IMEX (Internode Memory Exchange) is a system service used in multi-node NVLink environments (such as DGX or HGX systems) to securely share GPU memory between nodes (servers) over the NVLink fabric. IMEX channels are a GPU driver feature that provides user-based memory isolation in a multi-user environment within an IMEX domain.

  • On a GB200 node you can check your IMEX channel with

root@gb200-compute01:/home# ls /dev/nvidia-caps-imex-channels/
channel1035

In this example, the IMEX channel is 1035.

  • For GB200 NVL72 systems, work with your system administrator on how to identify and reserve nodes on the same rack. One possible way is via node naming in Slurm. For example, nodes in the same rack could have consecutive numbers and a shared prefix, such as gb200-001-compute[01-09]. In this case, reserve each node by specifying its name explicitly, such as:
    srun -p gb200nvl72 -w gb200-001-compute01 -t 08:00:00 --pty bash
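
The IMEX channel check shown in the note above can also be scripted, for example to pass the channel number into a launch script. This is a minimal sketch that assumes channels appear as `channel<N>` entries under /dev/nvidia-caps-imex-channels, as in the listing above; the function name is hypothetical.

```python
import re
from pathlib import Path

IMEX_DIR = Path("/dev/nvidia-caps-imex-channels")


def imex_channel_ids(directory: Path = IMEX_DIR) -> list[int]:
    """Return the sorted IMEX channel numbers exposed as `channel<N>` entries."""
    if not directory.is_dir():
        return []  # no IMEX channels visible on this node
    ids = []
    for entry in directory.iterdir():
        match = re.fullmatch(r"channel(\d+)", entry.name)
        if match:
            ids.append(int(match.group(1)))
    return sorted(ids)


if __name__ == "__main__":
    print(imex_channel_ids())
```

An empty list means no channel is visible on the node, which is worth resolving with your administrator before starting the containers.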
    

From each of the 5 interactive bash sessions (one per reserved GB200 node), start the Docker container with the shared directory mounted as follows:

[ ]

Step 5. Manually start the Ray cluster

First, check the IPs of your nodes.

[ ]

Dedicate 1 node as the head node. From within the NemoRL Docker container on that node, start the Ray head process, replacing the IP with the IP of your head node:

[ ]

Then start Ray manually on each worker node and connect it to the head node, replacing the IPs with those of your head node and worker nodes:

[ ]

From the Ray head node, validate your cluster setup with:

[ ]

If successful, this should list 5 available nodes with 20 GPUs in total. This persistent Ray cluster allows you to do quick-turnaround interactive development, shortening the dev-debug cycle.

======== Autoscaler status: 2026-03-03 03:47:10.276939 ========
Node status
---------------------------------------------------------------
Active:
 1 node_a906c992db87ec9c1990ad57e63a191cf183e0f578e2e70f8d348fb1
 1 node_a4b3453d98dccb507f5809a9f2d908c538869de51e7b29a830d7fc02
 1 node_18d83de59b62de00e8f66b6a2b40a1a94e95abdc14fc1ad3120f4763
 1 node_de79694e19e94d3ca67ef19392868e297d461b96a3bd276ec7d01334
 1 node_7c2c04cc7bb9b79320a8dcb8932016936840849d927998b97c303075
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Total Usage:
 0.0/720.0 CPU
 0.0/20.0 GPU
 0B/3.79TiB memory
 0B/931.32GiB object_store_memory

Total Constraints:
 (no request_resources() constraints)
Total Demands:
 (no resource demands)
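
If you want to check the cluster shape automatically rather than by eye, the status text can be parsed. This is a minimal sketch that assumes the output layout shown above, which may differ between Ray versions. `SAMPLE` is an abbreviated copy of that output with shortened node IDs, and `parse_ray_status` is a hypothetical helper. On a live cluster, `ray.cluster_resources()` is an alternative source for the same totals.

```python
import re

# Abbreviated copy of the status output shown above (node IDs shortened).
SAMPLE = """\
Node status
Active:
 1 node_a906c992
 1 node_a4b3453d
 1 node_18d83de5
 1 node_de79694e
 1 node_7c2c04cc
Pending:
 (no pending nodes)
Resources
Total Usage:
 0.0/720.0 CPU
 0.0/20.0 GPU
 0B/3.79TiB memory
"""


def parse_ray_status(text: str) -> tuple[int, dict[str, float]]:
    """Return (active node count, total CPU/GPU counts) from status text."""
    nodes = 0
    totals: dict[str, float] = {}
    in_active = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.endswith(":"):
            # Section headers such as "Active:" or "Pending:".
            in_active = stripped == "Active:"
            continue
        if in_active and re.match(r"\d+ node_", stripped):
            nodes += int(stripped.split()[0])
        usage = re.fullmatch(r"[\d.]+/([\d.]+) (CPU|GPU)", stripped)
        if usage:
            totals[usage.group(2)] = float(usage.group(1))
    return nodes, totals


if __name__ == "__main__":
    nodes, totals = parse_ray_status(SAMPLE)
    ok = nodes == 5 and totals.get("GPU") == 20.0
    print("cluster ready" if ok else f"unexpected cluster shape: {nodes} nodes, {totals}")
```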

Step 6. Start Interactive RL Training

Now launch the RL workload from the Ray head node using the config prepared in Step 1.

Notes:

  • Set HF_HOME to a high-speed shared location so datasets and converted checkpoints are not stored in a small home directory.
  • Set WANDB_API_KEY if you want experiment tracking with Weights & Biases (W&B).
  • Set GLOO_SOCKET_IFNAME to your primary Ethernet interface (check with ifconfig) to avoid distributed-communication issues.
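
The environment settings above can be checked before submitting the job. This is a minimal preflight sketch, not part of NemoRL: `check_training_env` is a hypothetical helper, and it only verifies that the variables are set and that the interface name exists on the local machine.

```python
import os
import socket


def check_training_env(env: dict[str, str], interfaces: list[str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    if not env.get("HF_HOME"):
        problems.append("HF_HOME is not set")
    ifname = env.get("GLOO_SOCKET_IFNAME")
    if not ifname:
        problems.append("GLOO_SOCKET_IFNAME is not set")
    elif ifname not in interfaces:
        problems.append(f"GLOO_SOCKET_IFNAME={ifname} is not one of {interfaces}")
    if not env.get("WANDB_API_KEY"):
        problems.append("WANDB_API_KEY is not set (only needed for W&B tracking)")
    return problems


if __name__ == "__main__":
    # socket.if_nameindex() lists (index, name) pairs for the local network interfaces.
    local_ifaces = [name for _, name in socket.if_nameindex()]
    for problem in check_training_env(dict(os.environ), local_ifaces):
        print("WARNING:", problem)
```

Run it inside the container on the head node, since that is where the variables need to be visible when the job is submitted.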

From the Ray head node, submit the interactive job with:

[ ]

This launches an interactive training job on the 5 nodes: 1 for vLLM rollout and 4 for policy training with the Megatron backend.