Notebooks
N
NVIDIA
Vllm Cookbook

Vllm Cookbook

usage-cookbookNemotron-Nano2-VLnvidia-nemotron

Running NVIDIA Nemotron Nano 2 VL with vLLM

This notebook will walk you through how to run the nvidia/Nemotron-Nano-12B-v2-VL-BF16 model locally with vLLM.

vLLM is a fast and easy-to-use library for LLM inference and serving.

For more details on the model click here

Prerequisites:

  • NVIDIA GPU with recent drivers (≥ 24 GB VRAM recommended; BF16-capable) and CUDA 12.x
  • Python 3.10+

Prerequisites & environment

Set up a clean Python environment for running vLLM locally.

Create and activate a virtual environment. The sample here uses Conda but feel free to choose whichever tool you prefer. Run these commands in a terminal before using this notebook:

conda create -n nemotron-vllm-env python=3.10 -y
conda activate nemotron-vllm-env

If running notebook locally, install ipykernel and switch the kernel to this environment:

  • Installation
pip install ipykernel
  • Kernel → Change kernel → Python (nemotron-vllm-env)

Install dependencies

[ ]

Verify GPU

Confirm CUDA is available and your GPU is visible to PyTorch.

[ ]

Load the model

Initialize the Nemotron model in vLLM with BF16 for efficient GPU inference.

[ ]

Generate responses

Once the model is loaded successfully above, you can continue with text generation:

Single or batch prompts

Send one prompt or a list to run batched generation.

[ ]

Streamed generation

Printing characters as they are produced.

[ ]

OpenAI-compatible server

Serve the model via an OpenAI-compatible API using vLLM.

Before starting the server:

  • Restart the kernel to free GPU memory used by the in-process LLM
  • Ensure you use the same virtual environment with installed dependancies in your terminal
  • Use --video-pruning-rate to set EVS. The default EVS is 0.

After restarting the kernel, run this in a terminal:

git clone https://huggingface.co/nvidia/Nemotron-Nano-12B-v2-VL-BF16
git clone https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2

vllm serve nvidia/Nemotron-Nano-12B-v2-VL-BF16 --trust-remote-code --dtype bfloat16 --enable-auto-tool-choice --tool-parser-plugin "NVIDIA-Nemotron-Nano-9B-v2/nemotron_toolcall_parser_no_streaming.py" --tool-call-parser "nemotron_json"

Your server is now running!

Use the API

Send chat and streaming requests to your local vLLM server using the OpenAI-compatible client.

Note: The model supports two modes - Reasoning ON (default) vs OFF. These can be toggled by passing /think vs /no_think as a part of the "system" message.

The /think or /no_think keywords can also be provided in “user” messages for turn-level reasoning control.

[ ]

Tool calling

Call functions using the OpenAI Tools schema and inspect returned tool_calls.

[ ]