Transformers Cookbook

Running NVIDIA Nemotron Nano 9B v2 with Hugging Face Transformers

This notebook walks you through running the nvidia/NVIDIA-Nemotron-Nano-9B-v2 model with Hugging Face Transformers.

Hugging Face Transformers is a model-definition framework for state-of-the-art models across text, vision, audio, video, and multimodal tasks, supporting both training and inference.

For more details, see the nvidia/NVIDIA-Nemotron-Nano-9B-v2 model card on Hugging Face.

Prerequisites:

  • NVIDIA GPU with recent drivers (≥ 24 GB VRAM recommended; BF16-capable) and CUDA 12.x
  • Python 3.10+

Environment setup

Set up a clean Python environment for running this locally.

Create and activate a virtual environment. The example below uses Conda, but use whichever tool you prefer. Run these commands in a terminal before opening this notebook:

conda create -n nemotron-transformers-env python=3.10 -y
conda activate nemotron-transformers-env

If you are running this notebook locally, install ipykernel and switch the kernel to this environment:

  • Installation
pip install ipykernel
  • Kernel → Change kernel → Python (nemotron-transformers-env)

Install dependencies


Verify GPU

Check that CUDA is available and the GPU is detected correctly.
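A minimal check along these lines confirms the GPU is visible to PyTorch before any model weights are downloaded:

```python
import torch

# Report whether PyTorch can see a CUDA device, and which one.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("BF16 supported:", torch.cuda.is_bf16_supported())
```

If `CUDA available` prints `False`, revisit your driver and PyTorch installation before continuing.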


Generating responses via pipeline

Use the Transformers pipeline with the model’s chat template to quickly generate replies.
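A sketch of pipeline-based generation; the prompt is an illustrative placeholder, and the first run downloads several gigabytes of weights:

```python
import torch
from transformers import pipeline

# Build a text-generation pipeline; device_map="auto" places the model
# on the available GPU, and bfloat16 keeps memory use manageable.
pipe = pipeline(
    "text-generation",
    model="nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Chat-style input: the pipeline applies the model's chat template.
messages = [{"role": "user", "content": "Write a haiku about GPUs."}]
outputs = pipe(messages, max_new_tokens=256)
print(outputs[0]["generated_text"])
```

`trust_remote_code=True` may be unnecessary on recent Transformers versions that support this architecture natively.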


Loading the model and tokenizer manually

Load AutoTokenizer and AutoModelForCausalLM for full control over inputs and generation settings.
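A sketch of the manual path; the prompt is a placeholder, and generation settings are illustrative defaults:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/NVIDIA-Nemotron-Nano-9B-v2"

# Load tokenizer and model; bfloat16 plus device_map="auto" matches the
# pipeline example above.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Format the chat with the model's template, then generate.
messages = [{"role": "user", "content": "Explain attention in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(output_ids[0][inputs.shape[-1]:], skip_special_tokens=True))
```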


Reasoning

Nemotron Nano supports two reasoning modes: ON (default) and OFF. Toggle via /think or /no_think in the chat messages.

Note: You can include /think or /no_think in system or user messages for turn-level control.


Compare the same prompt with reasoning disabled using /no_think to see differences in style and latency.


Streamed generation

Stream tokens as they are produced using TextIteratorStreamer to display output incrementally while generate runs in a background thread.
