Vllm Cookbook
Running NVIDIA Nemotron Nano 2 VL with vLLM
This notebook will walk you through how to run the nvidia/Nemotron-Nano-12B-v2-VL-BF16 model locally with vLLM.
vLLM is a fast and easy-to-use library for LLM inference and serving.
For more details on the model click here
Prerequisites:
- NVIDIA GPU with recent drivers (≥ 24 GB VRAM recommended; BF16-capable) and CUDA 12.x
- Python 3.10+
Prerequisites & environment
Set up a clean Python environment for running vLLM locally.
Create and activate a virtual environment. The sample here uses Conda but feel free to choose whichever tool you prefer. Run these commands in a terminal before using this notebook:
conda create -n nemotron-vllm-env python=3.10 -y
conda activate nemotron-vllm-env
If running notebook locally, install ipykernel and switch the kernel to this environment:
- Installation
pip install ipykernel
- Kernel → Change kernel → Python (nemotron-vllm-env)
Install dependencies
Verify GPU
Confirm CUDA is available and your GPU is visible to PyTorch.
Load the model
Initialize the Nemotron model in vLLM with BF16 for efficient GPU inference.
Generate responses
Once the model is loaded successfully above, you can continue with text generation:
Single or batch prompts
Send one prompt or a list to run batched generation.
Streamed generation
Printing characters as they are produced.
OpenAI-compatible server
Serve the model via an OpenAI-compatible API using vLLM.
Before starting the server:
- Restart the kernel to free GPU memory used by the in-process LLM
- Ensure you use the same virtual environment with installed dependancies in your terminal
- Use
--video-pruning-rateto set EVS. The default EVS is 0.
After restarting the kernel, run this in a terminal:
git clone https://huggingface.co/nvidia/Nemotron-Nano-12B-v2-VL-BF16
git clone https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2
vllm serve nvidia/Nemotron-Nano-12B-v2-VL-BF16 --trust-remote-code --dtype bfloat16 --enable-auto-tool-choice --tool-parser-plugin "NVIDIA-Nemotron-Nano-9B-v2/nemotron_toolcall_parser_no_streaming.py" --tool-call-parser "nemotron_json"
Your server is now running!
Use the API
Send chat and streaming requests to your local vLLM server using the OpenAI-compatible client.
Note: The model supports two modes - Reasoning ON (default) vs OFF. These can be toggled by passing /think vs /no_think as a part of the "system" message.
The /think or /no_think keywords can also be provided in “user” messages for turn-level reasoning control.
Tool calling
Call functions using the OpenAI Tools schema and inspect returned tool_calls.