Transformers Cookbook
Running NVIDIA Nemotron Nano 9B v2 with Hugging Face Transformers
This notebook walks you through running the nvidia/NVIDIA-Nemotron-Nano-9B-v2 model with Hugging Face Transformers.
Hugging Face Transformers is a model-definition framework for state-of-the-art models across text, vision, audio, video, and multimodal tasks, supporting both training and inference.
For more details on the model, see the model card: https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2
Prerequisites:
- NVIDIA GPU with recent drivers (≥ 24 GB VRAM recommended; BF16-capable) and CUDA 12.x
- Python 3.10+
Environment setup
Set up a clean Python environment for running this locally.
Create and activate a virtual environment. This example uses Conda, but feel free to use whichever tool you prefer. Run these commands in a terminal before using this notebook:
conda create -n nemotron-transformers-env python=3.10 -y
conda activate nemotron-transformers-env
If you are running the notebook locally, install ipykernel and switch the kernel to this environment:
- Install: pip install ipykernel
- Select Kernel → Change kernel → Python (nemotron-transformers-env)
Install dependencies
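Below is a minimal install sketch. The core packages assumed throughout this notebook are torch, transformers, and accelerate (used by device_map="auto"); exact version pins may differ for your setup, and your Transformers version may require additional packages listed on the model card.

pip install -U torch transformers accelerate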
Verify GPU
Check that CUDA is available and the GPU is detected correctly.
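A quick check, assuming PyTorch was installed in the previous step:

import torch

# Confirm PyTorch can see a CUDA device before loading the model.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    total = torch.cuda.get_device_properties(0).total_memory
    print(f"VRAM: {total / 1024**3:.1f} GiB")
    print("BF16 supported:", torch.cuda.is_bf16_supported())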
Generating responses via pipeline
Use the Transformers pipeline with the model’s chat template to quickly generate replies.
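A minimal sketch follows; the prompt and max_new_tokens are illustrative, and trust_remote_code may be unnecessary depending on your Transformers version (check the model card):

import torch
from transformers import pipeline

# The text-generation pipeline applies the model's chat template automatically
# when given a list of chat messages.
pipe = pipeline(
    "text-generation",
    model="nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,  # assumption: may not be needed on newer Transformers
)

messages = [{"role": "user", "content": "Summarize what a GPU does in one sentence."}]
outputs = pipe(messages, max_new_tokens=256)

# With chat input, generated_text holds the full conversation;
# the last message is the assistant's reply.
print(outputs[0]["generated_text"][-1]["content"])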
Loading the model and tokenizer manually
Load AutoTokenizer and AutoModelForCausalLM for full control over inputs and generation settings.
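A sketch of manual loading and generation; the dtype, device placement, and sample prompt are assumptions you can adjust:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "nvidia/NVIDIA-Nemotron-Nano-9B-v2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # BF16 keeps memory use manageable on a 24 GB GPU
    device_map="auto",
    trust_remote_code=True,      # assumption: may not be needed on newer Transformers
)

messages = [{"role": "user", "content": "Write a haiku about GPUs."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=512)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))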
Reasoning
Nemotron Nano supports two reasoning modes: ON (default) and OFF. Toggle via /think or /no_think in the chat messages.
Note: You can include /think or /no_think in system or user messages for turn-level control.
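For example, with reasoning ON (a sketch reusing the tokenizer and model loaded above; the prompt is illustrative):

# Reasoning ON is the default; here it is made explicit with /think in the system message.
messages = [
    {"role": "system", "content": "/think"},
    {"role": "user", "content": "A train travels 120 km in 1.5 hours. What is its average speed?"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=1024)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))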
Compare the same prompt with reasoning disabled using /no_think to see differences in style and latency.
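A sketch of the same prompt with reasoning OFF, with simple wall-clock timing to make the latency difference visible:

import time

# Reasoning OFF: /no_think in the system message suppresses the reasoning trace.
messages = [
    {"role": "system", "content": "/no_think"},
    {"role": "user", "content": "A train travels 120 km in 1.5 hours. What is its average speed?"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

start = time.perf_counter()
output_ids = model.generate(input_ids, max_new_tokens=1024)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
print(f"Elapsed: {time.perf_counter() - start:.1f} s")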
Streamed generation
Stream tokens as they are produced using TextIteratorStreamer to display output incrementally while generate runs in a background thread.
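A sketch reusing the tokenizer and model from above; the prompt and max_new_tokens are illustrative:

from threading import Thread
from transformers import TextIteratorStreamer

messages = [{"role": "user", "content": "Explain KV caching in two sentences."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# skip_prompt=True streams only newly generated tokens, not the prompt.
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# generate() blocks until finished, so run it in a background thread
# and consume tokens from the streamer on the main thread.
thread = Thread(
    target=model.generate,
    kwargs={"inputs": input_ids, "max_new_tokens": 512, "streamer": streamer},
)
thread.start()
for text_chunk in streamer:
    print(text_chunk, end="", flush=True)
thread.join()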