Trtllm Cookbook
Deploying NVIDIA Nemotron-3-Super with TensorRT LLM
This notebook will walk you through how to run the nvidia/NVIDIA-Nemotron-3-Super-120B-A12B model via TensorRT-LLM.
TensorRT LLM is NVIDIA’s open-source library for accelerating and optimizing LLM inference performance on NVIDIA GPUs.
For more details on the model click here.
Prerequisites:
- NVIDIA GPU with recent drivers (>= 264 GB VRAM for BF16, >= 160 GB VRAM for FP8) and CUDA 12.x.
- Python 3.10+
- TensorRT-LLM (you can refer to NVIDIA documentation, or pull this container)
Launch on NVIDIA Brev
You can simplify the environment setup by using NVIDIA Brev. Click the button to launch this project on a Brev instance with the necessary dependencies pre-configured.
Once deployed, click on the "Open Notebook" button to get started with this guide.
For BF16 (4x H100):
For FP8 (2x H100):
Prerequisites & environment
Set up a containerized environment for TensorRT-LLM by running the following in a host terminal.
For example, on Brev you can set CACHE_ROOT=/ephemeral for model cache and temp files so the container does not run out of space.
If you are not on Brev, set CACHE_ROOT to a writable path on your machine.
export CACHE_ROOT=/ephemeral
mkdir -p "$CACHE_ROOT/trtllm_cache"
mkdir -p "$CACHE_ROOT/trtllm_tmp"
docker run --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all \
-p 8000:8000 \
-v "$CACHE_ROOT":"$CACHE_ROOT" \
-e HF_HOME="$CACHE_ROOT/trtllm_cache" \
-e HUGGINGFACE_HUB_CACHE="$CACHE_ROOT/trtllm_cache" \
-e TMPDIR="$CACHE_ROOT/trtllm_tmp" \
-e TEMP="$CACHE_ROOT/trtllm_tmp" \
-e TMP="$CACHE_ROOT/trtllm_tmp" \
nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc4
You now have TRT-LLM set up!
Verify GPU
Check that CUDA is available and the GPU is detected correctly.
Python: 3.12.12 (main, Feb 12 2026, 00:42:14) [Clang 21.1.4 ] CUDA available: True Num GPUs: 4 GPU[0]: NVIDIA H100 PCIe GPU[1]: NVIDIA H100 PCIe GPU[2]: NVIDIA H100 PCIe GPU[3]: NVIDIA H100 PCIe
OpenAI-compatible server
Start a local OpenAI-compatible server with TensorRT-LLM via the terminal, within the running docker container.
Ensure that the following commands are executed from the docker terminal.
Choose the variant you want to serve:
- BF16:
nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16on 4x H100 (--tp_size 4 --ep_size 4) - FP8:
nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8on 2x H100 (--tp_size 2 --ep_size 2)
Create a YAML file with the required configuration
cat > ./extra-llm-api-config.yml << EOF
kv_cache_config:
enable_block_reuse: false
moe_config:
backend: TRTLLM
cuda_graph_config:
enable_padding: true
batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
EOF
Load the model
BF16 (4x H100)
mpirun -n 1 --allow-run-as-root --oversubscribe \
trtllm-serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
--host 0.0.0.0 \
--port 8000 \
--backend pytorch \
--max_batch_size 128 \
--tp_size 4 --ep_size 4 \
--max_num_tokens 16384 \
--trust_remote_code \
--reasoning_parser nano-v3 \
--tool_parser qwen3_coder \
--extra_llm_api_options extra-llm-api-config.yml
FP8 (2x H100)
mpirun -n 1 --allow-run-as-root --oversubscribe \
trtllm-serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 \
--host 0.0.0.0 \
--port 8000 \
--backend pytorch \
--max_batch_size 128 \
--tp_size 2 --ep_size 2 \
--max_num_tokens 16384 \
--trust_remote_code \
--reasoning_parser nano-v3 \
--tool_parser qwen3_coder \
--extra_llm_api_options extra-llm-api-config.yml
Your server is now running!
Use the API
Use the OpenAI-compatible client to send requests to the TensorRT-LLM server.
Note: The model supports two modes - Reasoning ON (default) vs OFF. This can be toggled by setting enable_thinking to False, as shown below.
Reasoning on
Okay, user wants a haiku about GPUs. Hmm, they're probably a tech enthusiast or developer who appreciates both poetry and hardware. The challenge is capturing GPU essence in 5-7-5 syllables while making it vivid.
First, gotta nail the core GPU functions: parallel processing, rendering graphics, massive computation. But haiku can't be technical - needs natural imagery. "Silent" feels right for idle GPUs, "boom" for when they kick in.
Wait, should I mention "cuda" or "shaders"? No, too jargon-heavy for poetry. "Cores" is perfect - simple but precise. "Cycles" implies the rhythm of processing.
*checks syllable count*
"Silent cores awaken" (5)
"Lightning cycles paint the screen" (7 - "lightning" as 2 syllables, "paint" as 1)
"Endless data flows" (5)
*double-checks*
Yep, "endless" is 2, "data" is 2, "flows" is 1. Nailed it. User'll probably smile at "lightning" - it's techy but poetic. Hope they like the "sinuous" touch too; shows how data moves like liquid light.
...Did I just write a haiku? *quick mental sigh* Yeah. But it's good.
Here's a haiku capturing the essence of GPUs:
**Silent cores awaken,
Lightning cycles paint the screen—
Endless data flows.**
*(5-7-5 syllable count)*
**Why it works:**
- "Silent cores awaken" → GPU cores idle, then spring to life (technical yet poetic).
- "Lightning cycles paint the screen" → Fast processing (lightning) rendering visuals (paints the screen).
- "Endless data flows" → Continuous, massive data streams ("sinuous" implied in "flows").
No jargon—just the *feeling* of a GPU at work. 🖥️✨
Reasoning off
- TensorRT-LLM is NVIDIA's optimized inference engine for large language models, designed to deliver high throughput and low latency on NVIDIA GPUs.
- It supports advanced features like tensor parallelism, pipeline parallelism, and quantization (e.g., FP8, INT4) to maximize performance and reduce memory usage.
- TensorRT-LLM integrates seamlessly with popular frameworks like PyTorch and Hugging Face Transformers, enabling efficient deployment of models such as Llama, GPT, and Mistral.
Streaming response: The first five prime numbers are: 1. 2 2. 3 3. 5 4. 7 5. 11
Tool calling
Use the OpenAI tools schema to call functions via the TensorRT-LLM endpoint.
The user wants to calculate a 15% tip on a $50 bill. I need to use the calculate_tip function. The function requires bill_total and tip_percentage parameters. Bill total is 50, tip percentage is 15. I'll call the function.
[ChatCompletionMessageFunctionToolCall(id='chatcmpl-tool-d2783abb5b494d77b07c9212dd0f011c', function=Function(arguments='{"bill_total": 50, "tip_percentage": 15}', name='calculate_tip'), type='function')]
Controlling Reasoning Budget
The reasoning_budget parameter allows you to limit the length of the model's reasoning trace. When the reasoning output reaches the specified token budget, the model will attempt to gracefully end the reasoning at the next newline character.
If no newline is encountered within 500 tokens after reaching the budget threshold, the reasoning trace will be forcibly terminated at reasoning_budget + 500 tokens to prevent excessive generation.
Reasoning: We need to write a haiku about GPUs. A haiku is 5-7-5 syllable structure. Provide 3 lines, 5 syllables, 7 syllables, 5 syllables. Might mention graphics, processing, cores, etc. Provide a nice poetic haiku. Ensure correct syllable count. Let's craft: "Silicon veins pulse / Parallel rivers compute dreams / Light builds worlds anew" Check syllables: Line1: "Silicon veins pulse" -> Sil-i-con (3) veins (1) pulse (1) = 5? Actually "Silicon" is 3 syllables (sil-i-con. Content: Silicon veins pulse Parallel rivers compute dreams Light builds worlds anew
Cleanup and shutdown
To tear down this TensorRT-LLM workflow:
- In the terminal running
trtllm-serve, pressCtrl+Cto stop the server. - In the Docker shell, run
exitto stop the container (--rmremoves it automatically). - Optionally run the next cell to clear notebook-side CUDA cache before your next run.