GRPO Training Cookbook
Nemotron Super V3 GRPO/DAPO Training with NemoRL
Overview
This guide demonstrates the step-by-step RL training of the Nemotron Super V3 model on an NVIDIA B200 GPU cluster or a GB200-NVL72 rack system running Slurm.
We will carry out GRPO/DAPO training of the model on the DAPO-Math-17k dataset. This is a single-domain reinforcement learning example with verifiable rewards.
Notes:
- Due to the distributed nature of the setup and training steps, all commands in this notebook are intended to be copied, pasted, and executed in the relevant interactive Docker shell on either the head node or the worker nodes, rather than from within a single JupyterLab environment.
- Interactive vs. batch training: in a production setting, it is more convenient to submit training jobs as Slurm batch jobs. However, setting up an interactive training environment allows you to iterate and debug faster. Once the interactive jobs run smoothly, you can submit them as batch training jobs.
- We start this tutorial with a batch submission guide, followed by a step-by-step guide to set up an interactive training cluster.
Prerequisites
- Compute: 3 B200 nodes (8 GPUs each, 24 B200 GPUs in total) with an InfiniBand interconnect, or 5 GB200 nodes (4 GPUs each, 20 GB200 GPUs in total) on a single GB200 NVL72 rack. Note that B200/GB200 GPUs have ~183/189 GB of HBM, which is not sufficient for full-weight co-located RL training (i.e., both training the policy model and running rollout on the same set of GPUs), so we use non-colocated training, with 1 node dedicated to rollout and the remaining nodes (2 on B200, 4 on GB200) for policy training.
On a Slurm system, you can check the availability of GB200 nodes with something similar to:
sinfo | grep gb200nvl72
Replace "gb200nvl72" with your correct GB200 Slurm partition name. Then request a specific interactive node with:
srun -p gb200nvl72 -w gb200-001-compute09 -t 08:00:00 --pty bash
- Storage: A high-speed shared network file system for storing code, models, checkpoints, and other temporary assets. In this guide, we will assume that the shared storage is at </YOUR/SHARED/NETWORK/STORAGE> on the host system, to be mounted as /shared into the working Docker container, accessible from all nodes. We will also assume the following directory structure:
</YOUR/SHARED/NETWORK/STORAGE> (on host):/shared (inside container)
|______code
| |____RL # NemoRL root directory
| |____Nemotron/usage-cookbook/Nemotron-3-Super # Repository containing this notebook
|_______models
| |____NVIDIA-Nemotron-3-Super-120B-A12B-Base-BF16 # base model checkpoint
|_______checkpoints
|_______HF_HOME # HuggingFace cache directory
Note that each training checkpoint (model weights plus optimizer state) takes up to ~1 TB of storage. Multiply this by the number of checkpoints you would like to keep (e.g., the best k=3 checkpoints in the NemoRL training config). In addition, the base BF16 model checkpoint requires ~231 GB of storage, and the Megatron-converted checkpoint requires another ~231 GB.
- Model: download the HuggingFace-format model with the HF CLI tool to the shared location on the high-speed storage. In this example, we start from the Nemotron Super V3 base model https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-Base-BF16, fresh out of pretraining. The DAPO training process can help the model discover advanced math reasoning by itself, the so-called DeepSeek "aha" moment.
- Docker image: Build the NemoRL Docker container
Reserve a B200/GB200 compute node. From here, clone and check out the NemoRL Super-v3 branch.
Next, from the NemoRL repo root directory, build the Docker container and push it to a central registry accessible to all nodes, such as NVIDIA NGC or Docker Hub.
Note: In this example, we make use of the NVIDIA Cloud Registry NGC. See the NGC user's guide on how to set up your account and team, obtain the NGC API key for Docker login, and upload an image to NGC.
Replace nvcr.io/<YOUR_NGC_ORG>/nemo-rl-gb200:superv3-a90de923d with a tag under your own NGC organization, or use a Docker Hub tag that all nodes can access.
Also, see the latest NemoRL Docker build guide for more information.
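As a quick check of the storage figures quoted in the prerequisites above, the arithmetic can be sketched as follows. This is only a rough estimate: the sizes are the approximate values from this guide, and the `in_flight` allowance (room for a checkpoint being written while the kept ones are still on disk) is an assumption for illustration, not a documented NemoRL behavior. The HuggingFace cache and datasets are not included.

```python
# Rough shared-storage budget, using the approximate sizes quoted in this guide.
CHECKPOINT_GB = 1000     # upper bound per training checkpoint (weights + optimizer state)
BASE_MODEL_GB = 231      # HuggingFace BF16 base checkpoint
MEGATRON_COPY_GB = 231   # Megatron-converted copy of the base checkpoint


def storage_budget_gb(keep_top_k: int, in_flight: int = 1) -> int:
    """Return an approximate storage requirement in GB.

    keep_top_k: number of checkpoints retained by the training config.
    in_flight:  hypothetical allowance for checkpoints being written
                before an older one is pruned (an assumption).
    """
    if keep_top_k < 0 or in_flight < 0:
        raise ValueError("keep_top_k and in_flight must be non-negative")
    checkpoints = (keep_top_k + in_flight) * CHECKPOINT_GB
    return checkpoints + BASE_MODEL_GB + MEGATRON_COPY_GB


print(storage_budget_gb(keep_top_k=3))               # 4462
print(storage_budget_gb(keep_top_k=3, in_flight=0))  # 3462
```

With k=3, plan for roughly 3.5 to 4.5 TB of shared storage before adding the HuggingFace cache.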
Step 1. Prepare the training config file
This step defines the training recipe for a single-domain RL workload: verifiable mathematical reasoning on DAPO-style data. The policy learns from sampled solutions, and rewards are computed by math verifiers, so this setup is ideal for tasks where correctness can be programmatically checked (e.g., arithmetic/algebra word problems and competition-style math).
GB200 Config
In this part, the YAML below is tuned for a non-colocated setup on 5 GB200 nodes (4 GPUs per node, 20 in total):
- 1 node for vLLM generation (rollout)
- 4 nodes for Megatron policy training
A practical workflow is to first run a short sanity pass (small max_num_steps) to validate cluster/config correctness, then scale up sequence length and rollout volume once stable.
Notes:
- The configuration file must be written or copied under the NemoRL root directory, e.g. at </YOUR/SHARED/NETWORK/STORAGE>/code/RL/examples/configs/recipes/llm/ on your host file system.
- ALL PATHS INSIDE THE CONFIG FILES ARE PATHS AS MOUNTED INSIDE THE CONTAINER.
B200 Config
In this part, the YAML below is tuned for a non-colocated setup on 3 B200 nodes (8 GPUs per node, 24 in total):
- 1 node for vLLM generation (rollout)
- 2 nodes for Megatron policy training
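The node split for both setups, and the constraint that the Megatron parallel sizes must fit the training GPUs, can be sketched as below. `ClusterLayout` and `data_parallel_size` are hypothetical helpers written for this guide, not part of NemoRL, and the TP/PP values in the usage lines are illustrative rather than the values used in the recipes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterLayout:
    """Non-colocated layout: some nodes for rollout, the rest for training."""
    nodes: int
    gpus_per_node: int
    rollout_nodes: int = 1  # nodes reserved for vLLM generation

    @property
    def rollout_gpus(self) -> int:
        return self.rollout_nodes * self.gpus_per_node

    @property
    def training_gpus(self) -> int:
        return (self.nodes - self.rollout_nodes) * self.gpus_per_node


def data_parallel_size(layout: ClusterLayout, tp: int, pp: int, cp: int = 1) -> int:
    """Data-parallel size left over after tensor/pipeline/context parallelism."""
    model_parallel = tp * pp * cp
    if layout.training_gpus % model_parallel != 0:
        raise ValueError(
            f"TP*PP*CP={model_parallel} does not divide "
            f"{layout.training_gpus} training GPUs"
        )
    return layout.training_gpus // model_parallel


GB200 = ClusterLayout(nodes=5, gpus_per_node=4)
B200 = ClusterLayout(nodes=3, gpus_per_node=8)

print(GB200.rollout_gpus, GB200.training_gpus)  # 4 16
print(B200.rollout_gpus, B200.training_gpus)    # 8 16
# Illustrative parallel sizes only; take the real ones from your YAML.
print(data_parallel_size(GB200, tp=4, pp=2))    # 2
```

Both layouts leave 16 GPUs for policy training; they differ in how many GPUs the rollout node contributes.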
Step 2. Batch job submission
On a production B200 cluster (8 GPUs per node), you can launch batch jobs with the following procedure from a Slurm login/head node.
On an Arm-based GB200 system with multi-node NVLink (MNNVL), use the following procedure. Work with your system administrator to obtain your own IMEX channel; see the interactive guide below for details.
Tweaking training hyperparameters
Once training verification is successful, you can tweak the configuration parameters:
- Run length / throughput (grpo): max_num_steps, num_prompts_per_step, num_generations_per_prompt
  - These directly control training duration, sample volume, and compute cost per step.
- Reward behavior (grpo.reward_shaping, grpo.reward_scaling): reward_shaping.enabled, overlong_buffer_length, overlong_buffer_penalty; reward_scaling.enabled and min/max ranges
  - Use these to penalize overly long outputs and keep reward magnitude stable.
- Policy sequence budget (policy): max_total_sequence_length, train_micro_batch_size, logprob_batch_size
  - Increase carefully: longer contexts improve reasoning capacity but significantly raise memory and latency.
- Distributed training topology (policy.megatron_cfg): tensor_model_parallel_size, pipeline_model_parallel_size, context_parallel_size
  - Must match your available training GPUs and desired DP/TP/PP/CP balance.
- Generation backend (policy.generation.vllm_cfg): tensor_parallel_size, gpu_memory_utilization, max_model_len
  - Tune for rollout speed and stability on the dedicated generation node.
- Dataset and validation (data, grpo): data.dataset_name, data.validation.dataset_name, max_val_samples, val_period
  - Keep validation frequent enough to catch regressions, but not so frequent that it slows training.
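To make the reward-shaping knobs above more concrete, here is a minimal sketch of a soft overlong penalty in the style described in the DAPO paper: responses are free up to a soft limit, then penalized linearly across a buffer. The names `overlong_buffer_length` and `overlong_buffer_penalty` mirror the config keys, but `max_response_length`, the clamping beyond the hard limit, and the numbers in the usage lines are assumptions for illustration; NemoRL's own implementation may differ in detail.

```python
def overlong_penalty(
    response_length: int,
    max_response_length: int,
    overlong_buffer_length: int,
    overlong_buffer_penalty: float = 1.0,
) -> float:
    """Return a non-positive reward adjustment for long responses.

    No penalty up to (max_response_length - overlong_buffer_length); then a
    linear ramp that reaches -overlong_buffer_penalty at max_response_length.
    """
    if overlong_buffer_length <= 0:
        raise ValueError("overlong_buffer_length must be positive")
    soft_limit = max_response_length - overlong_buffer_length
    excess = response_length - soft_limit
    if excess <= 0:
        return 0.0
    # Clamp so responses at or beyond the hard limit get the full penalty.
    fraction = min(excess / overlong_buffer_length, 1.0)
    return -fraction * overlong_buffer_penalty


# Hypothetical budget: 8192 tokens with a 2048-token buffer.
print(overlong_penalty(6000, 8192, 2048))  # 0.0
print(overlong_penalty(7168, 8192, 2048))  # -0.5
print(overlong_penalty(9000, 8192, 2048))  # -1.0
```

The shaped reward would be the verifier's correctness reward plus this adjustment, so a correct but very long answer scores lower than a correct concise one.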
Interactive Ray training cluster setup (Optional)
In a production setting, it is more convenient to submit training jobs as Slurm batch jobs. However, setting up an interactive training environment allows you to iterate and debug faster. Once the interactive jobs run smoothly, you can submit them as batch training jobs.
The rest of this guide outlines this optional path where you can set up an interactive persistent Ray cluster for quick development.
Step 4. Reserve worker nodes and start Docker containers
Reserve the interactive nodes (5 GB200 nodes, or 3 B200 nodes) that will form your Ray cluster. If using GB200 nodes on a GB200 NVL72 cluster, these nodes should ideally reside within the same rack. Keeping the nodes in one rack gives them the fastest interconnect, in particular multi-node NVLink.
Note:
- Work with your system administrator on how to get your own IMEX channel. IMEX stands for Internode Memory Exchange/Management. It is a system service used in multi-node NVLink environments (like DGX or HGX systems) to securely facilitate sharing GPU memory between different nodes (servers) over the NVLink fabric. IMEX channels are a GPU driver feature that allows for user-based memory isolation in a multi-user environment within an IMEX domain.
- On a GB200 node you can check your IMEX channel with
  root@gb200-compute01:/home# ls /dev/nvidia-caps-imex-channels/
  channel1035
  In this example, your IMEX channel is hence 1035.
- For GB200 NVL72 systems, work with your system administrator on how to identify and reserve nodes on the same rack. One possible way is via node naming in Slurm. For example, nodes in the same rack could be named with consecutive numbers sharing the same prefix, such as gb200-001-compute[01-09]. In this case, reserve nodes while specifying an explicit node name, such as:
  srun -p gb200nvl72 -w gb200-001-compute01 -t 08:00:00 --pty bash
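The two checks in the notes above (which IMEX channel you have, and whether your reserved nodes share a rack) can be scripted with a short sketch like the one below. It assumes the example naming scheme `<rack-prefix>-compute<NN>` used in this guide; your cluster's naming may differ, so treat the regular expressions as placeholders to adapt.

```python
import re

# Assumed naming scheme from this guide, e.g. "gb200-001-compute09".
_NODE_RE = re.compile(r"^(?P<rack>.+)-compute(?P<index>\d+)$")
_CHANNEL_RE = re.compile(r"^channel(\d+)$")


def rack_of(node_name: str) -> str:
    """Return the rack prefix of a node name such as 'gb200-001-compute09'."""
    match = _NODE_RE.match(node_name)
    if match is None:
        raise ValueError(f"unexpected node name: {node_name!r}")
    return match.group("rack")


def same_rack(node_names: list[str]) -> bool:
    """True if all given nodes share one rack prefix (False for an empty list)."""
    return len({rack_of(name) for name in node_names}) == 1


def imex_channels(listing: str) -> list[int]:
    """Extract channel numbers from `ls /dev/nvidia-caps-imex-channels/` output."""
    return [
        int(match.group(1))
        for token in listing.split()
        if (match := _CHANNEL_RE.match(token))
    ]


print(same_rack(["gb200-001-compute01", "gb200-001-compute09"]))  # True
print(same_rack(["gb200-001-compute01", "gb200-002-compute01"]))  # False
print(imex_channels("channel1035\n"))                             # [1035]
```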
From each of the 5 interactive bash sessions on 5 reserved GB200 nodes, start the Docker container while mounting the shared directory as follows:
Step 5. Manually start the Ray cluster
First, check the IPs of your nodes.
Dedicate 1 node as the head node and start the Ray head process on it. Run the following from within the NemoRL Docker container on the head node, replacing the IP with the IP of your head node:
Then, manually start the Ray cluster on each worker node and connect to the head node, replacing the IPs with the correct IP of your head node and worker nodes:
From the Ray head node, validate your cluster setup with
If successful, this should list 5 available nodes with 20 GPUs in total. This persistent Ray cluster allows you to do quick-turnaround interactive development, shortening the dev-debug cycle.
======== Autoscaler status: 2026-03-03 03:47:10.276939 ========
Node status
---------------------------------------------------------------
Active:
1 node_a906c992db87ec9c1990ad57e63a191cf183e0f578e2e70f8d348fb1
1 node_a4b3453d98dccb507f5809a9f2d908c538869de51e7b29a830d7fc02
1 node_18d83de59b62de00e8f66b6a2b40a1a94e95abdc14fc1ad3120f4763
1 node_de79694e19e94d3ca67ef19392868e297d461b96a3bd276ec7d01334
1 node_7c2c04cc7bb9b79320a8dcb8932016936840849d927998b97c303075
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Total Usage:
0.0/720.0 CPU
0.0/20.0 GPU
0B/3.79TiB memory
0B/931.32GiB object_store_memory
Total Constraints:
(no request_resources() constraints)
Total Demands:
(no resource demands)
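If you want to check the cluster size programmatically rather than by eye, status text like the above can be parsed with a few regular expressions. This is a minimal sketch that assumes the layout shown in this guide; the output format may change between Ray versions, and `summarize_ray_status` and `SAMPLE_STATUS` are names introduced here for illustration.

```python
import re

# Trimmed copy of the status output shown in this guide.
SAMPLE_STATUS = """\
Node status
Active:
 1 node_a906c992db87ec9c1990ad57e63a191cf183e0f578e2e70f8d348fb1
 1 node_a4b3453d98dccb507f5809a9f2d908c538869de51e7b29a830d7fc02
 1 node_18d83de59b62de00e8f66b6a2b40a1a94e95abdc14fc1ad3120f4763
 1 node_de79694e19e94d3ca67ef19392868e297d461b96a3bd276ec7d01334
 1 node_7c2c04cc7bb9b79320a8dcb8932016936840849d927998b97c303075
Pending:
 (no pending nodes)
Resources
Total Usage:
 0.0/720.0 CPU
 0.0/20.0 GPU
 0B/3.79TiB memory
"""

# Lines such as " 1 node_<id>" and " 0.0/20.0 GPU".
_NODE_RE = re.compile(r"^\s*(\d+)\s+node_\w+\s*$", re.MULTILINE)
_RESOURCE_RE = re.compile(r"^\s*([\d.]+)/([\d.]+)\s+(CPU|GPU)\s*$", re.MULTILINE)


def summarize_ray_status(text: str) -> dict:
    """Return active node count and total CPU/GPU capacity from status text."""
    active_section = text.split("Pending:")[0]
    nodes = sum(int(count) for count in _NODE_RE.findall(active_section))
    totals = {kind: float(total) for _used, total, kind in _RESOURCE_RE.findall(text)}
    return {
        "nodes": nodes,
        "gpus": totals.get("GPU", 0.0),
        "cpus": totals.get("CPU", 0.0),
    }


summary = summarize_ray_status(SAMPLE_STATUS)
print(summary)  # {'nodes': 5, 'gpus': 20.0, 'cpus': 720.0}
assert summary["nodes"] == 5 and summary["gpus"] == 20.0
```

In practice you would feed the function the captured text of the status command run on the head node instead of `SAMPLE_STATUS`.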
Step 6. Start Interactive RL Training
Now launch the RL workload from the Ray head node using the config prepared in Step 1.
Notes:
- Set HF_HOME to a high-speed shared location so datasets and converted checkpoints are not stored in a small home directory.
- Set WANDB_API_KEY if you want experiment tracking on Weights & Biases (W&B).
- Set GLOO_SOCKET_IFNAME to your primary Ethernet interface (check with ifconfig) to avoid distributed-communication issues.
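Before submitting, the settings in the notes above can be sanity-checked with a small preflight sketch like this one. The `preflight` helper is hypothetical, and the `/shared` prefix check simply follows the mount point assumed in this guide; adjust both to your environment.

```python
import socket


def local_interfaces() -> list[str]:
    """Names of network interfaces on this host (Unix-like systems)."""
    return [name for _index, name in socket.if_nameindex()]


def preflight(env: dict, interfaces: list[str], shared_root: str = "/shared") -> list[str]:
    """Return a list of human-readable problems; empty means all checks passed."""
    problems = []
    hf_home = env.get("HF_HOME", "")
    if not hf_home:
        problems.append("HF_HOME is not set")
    elif not hf_home.startswith(shared_root.rstrip("/") + "/"):
        problems.append(f"HF_HOME={hf_home} is not under {shared_root}")
    ifname = env.get("GLOO_SOCKET_IFNAME", "")
    if not ifname:
        problems.append("GLOO_SOCKET_IFNAME is not set")
    elif ifname not in interfaces:
        problems.append(f"GLOO_SOCKET_IFNAME={ifname} is not a local interface")
    if not env.get("WANDB_API_KEY"):
        # Optional: only needed for experiment tracking.
        problems.append("WANDB_API_KEY is not set (experiment tracking disabled)")
    return problems


example_env = {"HF_HOME": "/shared/HF_HOME", "GLOO_SOCKET_IFNAME": "eth0"}
print(preflight(example_env, ["lo", "eth0"]))
# ['WANDB_API_KEY is not set (experiment tracking disabled)']
```

On a real node, call `preflight(dict(os.environ), local_interfaces())` (after `import os`) inside the container on the head node and on each worker.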
From the Ray head node, submit the interactive job with:
This launches an interactive training job on the 5 nodes: 1 for vLLM rollout and 4 for policy training using the Megatron backend.