Vllm Cookbook
Deploying NVIDIA Nemotron-3-Super with vLLM
This notebook will walk you through how to run the nvidia/NVIDIA-Nemotron-3-Super-120B-A12B model with vLLM.
vLLM is a fast and easy-to-use library for LLM inference and serving.
For more details on the model click here
Prerequisites:
- NVIDIA GPU with recent drivers (>= 264 GB VRAM for BF16, >= 160 GB VRAM for FP8, >= 80 GB VRAM for NVFP4) and CUDA 12.x.
- Python 3.10+
Note: NVFP4 requires Blackwell architecture. This notebook includes B200-focused steps for NVFP4 setup and serving.
Launch on NVIDIA Brev
You can simplify the environment setup by using NVIDIA Brev. Click the button to launch this project on a Brev instance with the necessary dependencies pre-configured.
Once deployed, click on the "Open Notebook" button to get started with this guide.
For BF16 (4x H100):
For FP8 (2x H100):
Install dependencies
For FP8, ensure CUDA toolkit and build tools are available in your environment before first run (required for kernel JIT on some setups):
sudo apt update
sudo apt install -y cuda-toolkit-12-8 ninja-build gcc g++ build-essential
export CUDA_HOME=/usr/local/cuda-12.8
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:${LD_LIBRARY_PATH}
FP8 may take longer on first startup due to kernel compile/autotune; later runs are usually faster from cache.
Looking in links: /tmp/tmpccruqxe6 Processing /tmp/tmpccruqxe6/pip-25.0.1-py3-none-any.whl Installing collected packages: pip Successfully installed pip-25.0.1
Verify GPU
Confirm CUDA is available and your GPU is visible to PyTorch.
CUDA available: True Num GPUs: 4 GPU[0]: NVIDIA H100 PCIe GPU[1]: NVIDIA H100 PCIe GPU[2]: NVIDIA H100 PCIe GPU[3]: NVIDIA H100 PCIe
Load the model
Initialize the Nemotron model in vLLM with BF16, FP8, or NVFP4.
/home/shadeform/.venv/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
INFO 02-23 22:30:52 [utils.py:263] non-default args: {'trust_remote_code': True, 'kv_cache_dtype': 'fp8', 'tensor_parallel_size': 4, 'disable_log_stats': True, 'model': 'nvidia/nemotron-super-sft-020426'}
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
A new version of the following files was downloaded from https://huggingface.co/nvidia/nemotron-super-sft-020426:
- configuration_nemotron_h.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
2026-02-23 22:30:52,661 INFO util.py:154 -- Missing packages: ['ipywidgets']. Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.
INFO 02-23 22:30:59 [model.py:530] Resolved architecture: NemotronHForCausalLM
INFO 02-23 22:30:59 [model.py:1545] Using max model len 262144
INFO 02-23 22:30:59 [cache.py:206] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor.
INFO 02-23 22:30:59 [scheduler.py:229] Chunked prefill is enabled with max_num_batched_tokens=16384.
INFO 02-23 22:30:59 [config.py:543] Updating mamba_ssm_cache_dtype to 'float32' for NemotronH model
INFO 02-23 22:30:59 [config.py:476] Setting attention block size to 4160 tokens to ensure that attention page size is >= mamba page size.
INFO 02-23 22:30:59 [config.py:500] Padding mamba page size by 0.10% to ensure that mamba page size and attention page size are exactly equal.
INFO 02-23 22:30:59 [vllm.py:630] Asynchronous scheduling is enabled.
INFO 02-23 22:30:59 [vllm.py:637] Disabling NCCL for DP synchronization when using async scheduling.
WARNING 02-23 22:31:01 [system_utils.py:136] We must use the `spawn` multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/usage/troubleshooting.html#python-multiprocessing for more information. Reasons: CUDA is initialized
(EngineCore_DP0 pid=14604) INFO 02-23 22:31:06 [core.py:97] Initializing a V1 LLM engine (v0.14.1) with config: model='nvidia/nemotron-super-sft-020426', speculative_config=None, tokenizer='nvidia/nemotron-super-sft-020426', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=nvidia/nemotron-super-sft-020426, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [16384], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None}
(EngineCore_DP0 pid=14604) WARNING 02-23 22:31:06 [multiproc_executor.py:880] Reducing Torch parallelism from 124 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 02-23 22:31:13 [parallel_state.py:1214] world_size=4 rank=3 local_rank=3 distributed_init_method=tcp://127.0.0.1:56631 backend=nccl
INFO 02-23 22:31:13 [parallel_state.py:1214] world_size=4 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:56631 backend=nccl
INFO 02-23 22:31:14 [parallel_state.py:1214] world_size=4 rank=2 local_rank=2 distributed_init_method=tcp://127.0.0.1:56631 backend=nccl
INFO 02-23 22:31:14 [parallel_state.py:1214] world_size=4 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:56631 backend=nccl
INFO 02-23 22:31:14 [pynccl.py:111] vLLM is using nccl==2.27.5
WARNING 02-23 22:31:15 [symm_mem.py:99] SymmMemCommunicator: symmetric memory initialization failed: CUDA driver error: invalid device ordinal Communicator is not available. To suppress this warning set VLLM_ALLREDUCE_USE_SYMM_MEM=0
WARNING 02-23 22:31:15 [symm_mem.py:99] SymmMemCommunicator: symmetric memory initialization failed: CUDA driver error: invalid device ordinal Communicator is not available. To suppress this warning set VLLM_ALLREDUCE_USE_SYMM_MEM=0
WARNING 02-23 22:31:15 [symm_mem.py:99] SymmMemCommunicator: symmetric memory initialization failed: CUDA driver error: invalid device ordinal Communicator is not available. To suppress this warning set VLLM_ALLREDUCE_USE_SYMM_MEM=0
WARNING 02-23 22:31:15 [symm_mem.py:99] SymmMemCommunicator: symmetric memory initialization failed: CUDA driver error: invalid device ordinal Communicator is not available. To suppress this warning set VLLM_ALLREDUCE_USE_SYMM_MEM=0
WARNING 02-23 22:31:15 [custom_all_reduce.py:154] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 02-23 22:31:15 [custom_all_reduce.py:154] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 02-23 22:31:15 [custom_all_reduce.py:154] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 02-23 22:31:15 [custom_all_reduce.py:154] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 02-23 22:31:15 [parallel_state.py:1425] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 3, EP rank 3
INFO 02-23 22:31:15 [parallel_state.py:1425] rank 2 in world size 4 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 2, EP rank 2
INFO 02-23 22:31:15 [parallel_state.py:1425] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
INFO 02-23 22:31:15 [parallel_state.py:1425] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 1, EP rank 1
(Worker_TP0 pid=14843) INFO 02-23 22:31:16 [gpu_model_runner.py:3808] Starting to load model nvidia/nemotron-super-sft-020426...
(Worker_TP0 pid=14843) INFO 02-23 22:31:32 [cuda.py:351] Using FLASH_ATTN attention backend out of potential backends: ('FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN')
(Worker_TP3 pid=14846) INFO 02-23 22:33:48 [weight_utils.py:510] Time spent downloading weights for nvidia/nemotron-super-sft-020426: 135.321143 seconds
Loading safetensors checkpoint shards: 100% Completed | 50/50 [00:23<00:00, 2.09it/s]
(Worker_TP0 pid=14843)
(Worker_TP0 pid=14843) INFO 02-23 22:34:12 [default_loader.py:291] Loading weights took 23.97 seconds
(Worker_TP0 pid=14843) INFO 02-23 22:34:13 [gpu_model_runner.py:3905] Model loading took 56.94 GiB memory and 175.983706 seconds
(Worker_TP0 pid=14843) INFO 02-23 22:34:21 [backends.py:644] Using cache directory: /home/shadeform/.cache/vllm/torch_compile_cache/4852213ca2/rank_0_0/backbone for vLLM's torch.compile
(Worker_TP0 pid=14843) INFO 02-23 22:34:21 [backends.py:704] Dynamo bytecode transform time: 5.45 s
(Worker_TP3 pid=14846) INFO 02-23 22:34:24 [backends.py:261] Cache the graph of compile range (1, 16384) for later use
(Worker_TP0 pid=14843) INFO 02-23 22:34:24 [backends.py:261] Cache the graph of compile range (1, 16384) for later use
(Worker_TP2 pid=14845) INFO 02-23 22:34:24 [backends.py:261] Cache the graph of compile range (1, 16384) for later use
(Worker_TP1 pid=14844) INFO 02-23 22:34:28 [backends.py:261] Cache the graph of compile range (1, 16384) for later use
(Worker_TP2 pid=14845) /home/shadeform/.venv/lib/python3.12/site-packages/torch/_inductor/compile_fx.py:312: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
(Worker_TP2 pid=14845) warnings.warn(
(Worker_TP0 pid=14843) /home/shadeform/.venv/lib/python3.12/site-packages/torch/_inductor/compile_fx.py:312: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
(Worker_TP0 pid=14843) warnings.warn(
(Worker_TP3 pid=14846) /home/shadeform/.venv/lib/python3.12/site-packages/torch/_inductor/compile_fx.py:312: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
(Worker_TP3 pid=14846) warnings.warn(
(Worker_TP1 pid=14844) /home/shadeform/.venv/lib/python3.12/site-packages/torch/_inductor/compile_fx.py:312: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
(Worker_TP1 pid=14844) warnings.warn(
(Worker_TP0 pid=14843) WARNING 02-23 22:34:31 [fused_moe.py:1090] Using default MoE config. Performance might be sub-optimal! Config file not found at /home/shadeform/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=512,N=672,device_name=NVIDIA_H100_PCIe.json
(Worker_TP0 pid=14843) INFO 02-23 22:35:03 [backends.py:278] Compiling a graph for compile range (1, 16384) takes 40.16 s
(Worker_TP0 pid=14843) INFO 02-23 22:35:03 [monitor.py:34] torch.compile takes 45.62 s in total
(Worker_TP0 pid=14843) INFO 02-23 22:35:05 [gpu_worker.py:358] Available KV cache memory: 7.94 GiB
(EngineCore_DP0 pid=14604) INFO 02-23 22:35:06 [kv_cache_utils.py:1305] GPU KV cache size: 690,560 tokens
(EngineCore_DP0 pid=14604) INFO 02-23 22:35:06 [kv_cache_utils.py:1310] Maximum concurrency for 262,144 tokens per request: 14.49x
(Worker_TP2 pid=14845) 2026-02-23 22:35:07,077 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(Worker_TP0 pid=14843) 2026-02-23 22:35:07,077 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(Worker_TP3 pid=14846) 2026-02-23 22:35:07,077 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(Worker_TP1 pid=14844) 2026-02-23 22:35:07,077 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(Worker_TP0 pid=14843) 2026-02-23 22:35:07,497 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
(Worker_TP1 pid=14844) 2026-02-23 22:35:07,497 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
(Worker_TP2 pid=14845) 2026-02-23 22:35:07,497 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
(Worker_TP3 pid=14846) 2026-02-23 22:35:07,497 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 51/51 [00:10<00:00, 5.01it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 51/51 [00:08<00:00, 6.34it/s]
(Worker_TP0 pid=14843) INFO 02-23 22:35:26 [gpu_model_runner.py:4856] Graph capturing finished in 19 secs, took 1.10 GiB
(EngineCore_DP0 pid=14604) INFO 02-23 22:35:26 [core.py:273] init engine (profile, create kv cache, warmup model) took 71.27 seconds
(EngineCore_DP0 pid=14604) INFO 02-23 22:35:29 [vllm.py:630] Asynchronous scheduling is enabled.
INFO 02-23 22:35:29 [llm.py:347] Supported tasks: ['generate']
Model ready
Generate responses
Generate text with vLLM using single, batched, and simple streaming examples.
Single or batch prompts
Send one prompt or a list to run batched generation.
Adding requests: 100%|██████████| 1/1 [00:00<00:00, 138.63it/s] Processed prompts: 100%|██████████| 1/1 [00:23<00:00, 23.58s/it, est. speed input: 1.15 toks/s, output: 8.48 toks/s] Okay, the user is asking for three bullet points about vLLM. Let me recall what I know about vLLM. It's a high-throughput inference library for large language models, right? Developed by the team at UC Berkeley. The main selling point is its efficiency in handling LLM inference, especially for serving. First, I should focus on the key features that make vLLM stand out. The most notable is probably the PagedAttention mechanism. That's a big deal because it solves the memory fragmentation issue in traditional attention implementations. Traditional methods waste a lot of memory due to fragmentation, but PagedAttention uses a more efficient memory management approach. I should explain that clearly but concisely. Next, the user might be interested in performance metrics. vLLM is known for its high throughput and low latency. I remember it achieves significantly higher throughput compared to other frameworks like Hugging Face Transformers or TensorRT-LLM. Maybe mention a specific benchmark, like Adding requests: 100%|██████████| 3/3 [00:00<00:00, 2671.53it/s] Processed prompts: 100%|██████████| 3/3 [00:02<00:00, 1.19it/s, est. speed input: 26.12 toks/s, output: 110.81 toks/s] Prompt 1: 'Hello, my name is' Hello! It's nice to meet you. How can I assist you today? 😊 </think> Hello! It's nice to meet you. How can I assist you today? 😊 Prompt 2: 'The capital of France is' The user asks: "The capital of France is". This is a simple factual question. The answer: Paris. Provide concise answer. </think> The capital of France is **Paris**. Prompt 3: 'Explain quantum computing in simple terms:' Okay, the user asked me to explain quantum computing in simple terms. Let me start by recalling what I know. Quantum computing is a complex topic, so I need to break it down without using too much jargon. The user probably isn't a physicist, so analogies and everyday examples would help. First, I should compare it to classical computing since that's familiar. Classical computers use bits that are either 0 or 1. Quantum computers use qubits, which can be both 0 and 1 at the same time. That's superposition. But how to explain superposition simply? Maybe use a spinning coin—while it's spinning, it's kind of both heads and tails, but when it lands, it's one. But I should clarify that it's not just a coin; it's a quantum state. Then there's entanglement. That's trickier. If two qubits are entangled, changing one instantly affects the other, no matter the distance
Streamed generation
Print characters as they are produced.
Adding requests: 100%|██████████| 1/1 [00:00<00:00, 1740.38it/s] Processed prompts: 100%|██████████| 1/1 [00:02<00:00, 2.23s/it, est. speed input: 10.78 toks/s, output: 95.67 toks/s]Response: The user wants a haiku about GPUs. Haiku is 5-7-5 syllables. Must be about GPUs. Ensure correct syllable count. Provide a haiku. No extra text? Probably just the haiku. Let's craft: "Silicon hearts beat, / Crunching pixels, data streams / Speed of light awakes." Count syllables: "Silicon hearts beat" = Si-li-con (3) hearts(1) beat(1) =5. "Crunching pixels, data streams" = Crunch-ing(2) pix-els(2) da-ta(2) streams(1) =7. "Speed of light awakes" = Speed(1) of(1) light(1) a-wakes(2) =5? Actually "awakes" is 2 syllables (a-wakes). So total 5. Good. Provide that. </think> Silicon hearts beat, Crunching pixels, data streams, Speed of light awakes.
OpenAI-compatible server
Serve the model via an OpenAI-compatible API using vLLM.
Release GPU memory before server mode
This notebook used GPU memory for in-process inference.
Before starting vllm serve, run the cleanup cell below to free memory.
If server startup still runs out of memory, restart the kernel, then continue from this section.
Choose the variant you want to serve:
- BF16:
nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16on 4x H100 - FP8:
nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8on 2x H100 - NVFP4:
nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4on B200
(Worker_TP0 pid=14843) INFO 02-23 22:41:05 [multiproc_executor.py:707] Parent process exited, terminating worker (Worker_TP2 pid=14845) INFO 02-23 22:41:05 [multiproc_executor.py:707] Parent process exited, terminating worker (Worker_TP3 pid=14846) INFO 02-23 22:41:05 [multiproc_executor.py:707] Parent process exited, terminating worker (Worker_TP1 pid=14844) INFO 02-23 22:41:05 [multiproc_executor.py:707] Parent process exited, terminating worker Cleanup complete. If server startup still OOMs, restart the kernel before proceeding.
Prepare terminal environment
Use the same virtual environment where dependencies were installed.
For example, on Brev run:
source /home/shadeform/.venv/bin/activate
If you are not on Brev, replace this with your own virtual environment path.
Start server
After cleanup (or after restart if needed), run this in a terminal.
BF16 (4x H100)
wget "https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/resolve/main/super_v3_reasoning_parser.py"
vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
--async-scheduling \
--dtype auto \
--kv-cache-dtype fp8 \
--tensor-parallel-size 4 \
--pipeline-parallel-size 1 \
--data-parallel-size 1 \
--swap-space 0 \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--enable-chunked-prefill \
--max-num-seqs 512 \
--served-model-name nemotron \
--host 0.0.0.0 \
--port 5000 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser-plugin "./super_v3_reasoning_parser.py" \
--reasoning-parser super_v3
FP8 (2x H100)
Note: the first FP8 server startup may take longer than BF16 while kernels are compiled/autotuned. This is expected.
wget "https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8/resolve/main/super_v3_reasoning_parser.py"
vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 \
--async-scheduling \
--dtype auto \
--kv-cache-dtype fp8 \
--tensor-parallel-size 2 \
--pipeline-parallel-size 1 \
--data-parallel-size 1 \
--swap-space 0 \
--trust-remote-code \
--attention-backend TRITON_ATTN \
--gpu-memory-utilization 0.9 \
--enable-chunked-prefill \
--max-num-seqs 512 \
--served-model-name nemotron \
--host 0.0.0.0 \
--port 5000 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser-plugin "./super_v3_reasoning_parser.py" \
--reasoning-parser super_v3
NVFP4 (B200)
wget "https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4/resolve/main/super_v3_reasoning_parser.py"
vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
--async-scheduling \
--dtype auto \
--kv-cache-dtype fp8 \
--tensor-parallel-size 1 \
--pipeline-parallel-size 1 \
--data-parallel-size 1 \
--swap-space 0 \
--trust-remote-code \
--attention-backend TRITON_ATTN \
--gpu-memory-utilization 0.9 \
--enable-chunked-prefill \
--max-num-seqs 512 \
--served-model-name nemotron \
--host 0.0.0.0 \
--port 5000 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser-plugin "./super_v3_reasoning_parser.py" \
--reasoning-parser super_v3
Your server is now running!
Use the API
Send chat and streaming requests to your vLLM server using the OpenAI-compatible client.
Note: The model supports two modes - Reasoning ON (default) vs OFF. This can be toggled by setting enable_thinking to False, as shown below.
Reasoning on Reasoning: We need to produce a haiku (5-7-5 syllable poem) about GPUs. Should be creative. Probably mention silicon, compute, speed, cooling, etc. Count syllables. Let's craft: "Silicon veins pulse" (5? Sil-i-con =3, veins=1, pulse=1 total 5). Good. Second line 7 syllables: "Thunder of cores, rendering dreams". Count: Thun-der(2) of(1) cores(1)=4, ren-dering(2) =6, dreams(1)=7. Good. Third line 5 syllables: "Cool fans whisper fast". Count: Cool(1) fans(1)=2, whis-per(2)=4, fast(1)=5. Done. Check haiku format: 5-7-5. Good. Provide as answer. Content: Silicon veins pulse Thunder of cores, rendering dreams Cool fans whisper fast Reasoning off Sure! Here are three interesting facts about **vLLM** (Virtual Large Language Model), a high-performance inference engine for large language models: 1. **Pioneering PagedAttention for Memory Efficiency** vLLM introduced **PagedAttention**, a novel memory management technique inspired by virtual memory in operating systems. It allows the model to store key-value (KV) cache in non-contiguous memory pages, drastically reducing memory fragmentation and enabling up to **2x higher throughput** compared to traditional methods. 2. **Optimized for High-Throughput Serving** vLLM is designed specifically for serving LLMs at scale. It uses **continuous batching** and **dynamic scheduling** to efficiently handle multiple requests simultaneously, making it ideal for production environments where low latency and high throughput are critical. 3. **Open-Source and Framework-Agnostic** vLLM is open-source and supports multiple LLM frameworks, including Hugging Face Transformers. It’s compatible with popular models like Llama, Mistral, and Falcon, and integrates easily with tools like FastAPI and Triton Inference Server, making it a go-to choice for researchers and engineers building scalable AI applications. These features make vLL
The first five prime numbers are: 1. **2** 2. **3** 3. **5** 4. **7** 5. **11**
Tool calling
Call functions using the OpenAI Tools schema and inspect returned tool_calls.
The user wants to calculate a 15% tip on a $50 bill. I have access to the calculate_tip function which requires bill_total and tip_percentage as parameters. I need to call this function with bill_total = 50 and tip_percentage = 15. Let me do that.
[ChatCompletionMessageFunctionToolCall(id='chatcmpl-tool-a3926e1998a22969', function=Function(arguments='{"bill_total": 50, "tip_percentage": 15}', name='calculate_tip'), type='function')]
Controlling Reasoning Budget
The reasoning_budget parameter allows you to limit the length of the model's reasoning trace. When the reasoning output reaches the specified token budget, the model will attempt to gracefully end the reasoning at the next newline character.
If no newline is encountered within 500 tokens after reaching the budget threshold, the reasoning trace will be forcibly terminated at reasoning_budget + 500 tokens to prevent excessive generation.
/home/shadeform/.venv/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html from .autonotebook import tqdm as notebook_tqdm
Reasoning: The user asks: Write a haiku about GPUs. Haiku is 5-7-5 syllable poem. So need 5 syllables first line,. Content: Silicon hearts pulse in a cascade of glowing cores dreams rendered in math
Cleanup and shutdown
To free resources after this notebook:
- Release in-process model memory (run the next cell).
- Stop the OpenAI-compatible
vllm serveprocess in the terminal where it was started (Ctrl+C). - If needed, restart the kernel to ensure a clean state.
In-process cleanup complete. If vllm serve is running in a terminal, stop it with Ctrl+C.