vLLM API Cookbook
Running Nemotron-49B-v1.5 with vLLM on NVIDIA GPUs
This notebook provides a comprehensive guide on how to run the Nemotron-49B-v1.5 model using vLLM, a high-performance library for LLM inference and serving.
This notebook is divided into two parts:
- Part 1: Demonstrates how to use the direct vLLM Python API for inference, including batch generation and pseudo-streaming.
- Part 2: Covers how to deploy the model with an OpenAI-compatible web server for robust chat, streaming, and tool-use capabilities.
Launch on NVIDIA Brev
You can simplify the environment setup by using NVIDIA Brev. Click the button below to launch this project on a Brev instance with the necessary dependencies pre-configured.
Once deployed, click on the "Open Notebook" button to get started with this guide.
- Model card: nvidia/Llama-3_3-Nemotron-Super-49B-v1_5
- vLLM Docs: https://docs.vllm.ai/
Part 1: Inference with the Python API
Prerequisites
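Hardware: This notebook requires one or more NVIDIA GPUs with enough combined VRAM to hold the 49B-parameter model (about 98 GB of weights in bfloat16, before the KV cache). The run recorded below used a single NVIDIA H200; with smaller GPUs, plan on at least 2 and set the tensor-parallel size to match.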
Software:
- Python 3.10+
- CUDA 12.x
- PyTorch 2.3+
- vLLM 0.10.x
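The software requirements above can be checked before loading anything heavy. The following is a minimal sketch using only the standard library; the helper names (`installed_version`, `version_tuple`, `check_environment`) are illustrative, and the version comparison looks only at the leading `major.minor` numbers rather than doing full version resolution.

```python
import sys
from importlib import metadata
from itertools import takewhile
from typing import Dict, Optional, Tuple

# Minimum (major, minor) versions taken from the prerequisites list.
REQUIRED: Dict[str, Tuple[int, int]] = {"torch": (2, 3), "vllm": (0, 10)}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version: str, width: int = 2) -> Tuple[int, ...]:
    """Parse the leading numeric components, e.g. '2.3.1+cu121' -> (2, 3)."""
    numbers = []
    for piece in version.split(".")[:width]:
        digits = "".join(takewhile(str.isdigit, piece))
        numbers.append(int(digits) if digits else 0)
    return tuple(numbers)


def check_environment() -> Dict[str, str]:
    """Build a small report: one status string per requirement."""
    report = {
        "python": "ok" if sys.version_info >= (3, 10) else "too old",
    }
    for package, minimum in REQUIRED.items():
        found = installed_version(package)
        if found is None:
            report[package] = "missing"
        elif version_tuple(found) < minimum:
            report[package] = f"too old ({found})"
        else:
            report[package] = f"ok ({found})"
    return report
```

Calling `check_environment()` in a notebook cell returns a dictionary such as `{"python": ..., "torch": ..., "vllm": ...}` that can be printed before moving on to Setup. It does not check the CUDA version.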
Setup
CUDA available: True
Num GPUs: 1
GPU[0]: NVIDIA H200
Loading the Model
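A minimal sketch of loading the model with vLLM's offline `LLM` class. It assumes the same settings as the server command in Part 2 (bfloat16, `trust_remote_code`, a 65536-token context, 0.95 GPU memory utilization, tensor-parallel size 1); the helper names `build_engine_kwargs` and `load_model` are illustrative, not part of vLLM.

```python
from typing import Any, Dict

# Hugging Face model ID, as used in the Part 2 server command.
MODEL_ID = "nvidia/Llama-3_3-Nemotron-Super-49B-v1_5"


def build_engine_kwargs(num_gpus: int = 1) -> Dict[str, Any]:
    """Collect the vllm.LLM constructor arguments in one place."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return {
        "model": MODEL_ID,
        "dtype": "bfloat16",
        "trust_remote_code": True,
        "tensor_parallel_size": num_gpus,  # shard the weights across this many GPUs
        "gpu_memory_utilization": 0.95,    # assumption: mirrors the server flag
        "max_model_len": 65536,            # assumption: mirrors the server flag
    }


def load_model(num_gpus: int = 1):
    """Instantiate the engine. vLLM is imported lazily so the helpers above
    can be inspected on a machine without vLLM or a GPU."""
    from vllm import LLM

    return LLM(**build_engine_kwargs(num_gpus))
```

In the notebook, `llm = load_model()` would then be followed by the generation cells. Lowering `max_model_len` reduces the memory reserved for the KV cache if the engine fails to start.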
Showcasing Reasoning Modes: think vs. no_think
The Nemotron model supports two reasoning modes, which can be controlled via the system message:
- Reasoning ON (default): The model generates a <think> block with its reasoning process before the answer.
- Reasoning OFF (/no_think): By adding /no_think to the system prompt, the model provides a direct answer without the <think> block. This is useful for simple tasks where you want a concise response.
Because the generation calls in Part 1 pass raw prompts to the vLLM Python API without applying the model's chat template, this feature is demonstrated in the OpenAI-compatible server section.
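The two modes differ only in the system message, which can be sketched as a small message builder. The function name `build_messages` is illustrative; the only model-specific detail is the `/no_think` string described above.

```python
from typing import Dict, List


def build_messages(user_prompt: str, reasoning: bool = True) -> List[Dict[str, str]]:
    """Build a chat message list for either reasoning mode.

    Reasoning ON is the default, so no system message is needed.
    Reasoning OFF adds a system message containing /no_think.
    """
    messages: List[Dict[str, str]] = []
    if not reasoning:
        messages.append({"role": "system", "content": "/no_think"})
    messages.append({"role": "user", "content": user_prompt})
    return messages
```

The resulting list can be passed as the `messages` field of a chat completion request once the server from Part 2 is running.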
Single and Batch Generation
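A minimal sketch of single and batch generation, assuming `llm` is an already loaded `vllm.LLM` instance. The three batch prompts are the ones that appear in the output below; `max_tokens=200`, `temperature=0.0`, and the helper names are assumptions made for illustration.

```python
from typing import Any, List, Sequence, Tuple

# Prompts taken from the batch output shown in this section.
BATCH_PROMPTS = [
    "Hello, my name is",
    "The capital of France is",
    "Explain quantum computing in simple terms:",
]


def make_sampling_params(max_tokens: int = 200, temperature: float = 0.0):
    """Create sampling parameters (values here are assumptions).
    vLLM is imported lazily so the rest of the file works without it."""
    from vllm import SamplingParams

    return SamplingParams(temperature=temperature, max_tokens=max_tokens)


def generate_texts(llm: Any, prompts: Sequence[str], sampling_params: Any) -> List[Tuple[str, str]]:
    """Run one generate() call and return (prompt, completion) pairs.

    A single prompt is just a batch of size one. Each result object exposes
    the original prompt and a list of candidate outputs; the first is used.
    """
    outputs = llm.generate(list(prompts), sampling_params)
    return [(out.prompt, out.outputs[0].text) for out in outputs]


# Usage in the notebook (requires a loaded model):
#   params = make_sampling_params()
#   for i, (prompt, text) in enumerate(generate_texts(llm, BATCH_PROMPTS, params), 1):
#       print(f"Prompt {i}: {prompt!r}\n{text}\n")
```

Because these are raw prompts with no chat template, the model continues the text rather than answering as an assistant, which is why the completions below read like document continuations.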
It’s a new version of the popular Nemotron font, which was originally designed for the Commodore 64 computer. Version 2.0 of Nemotron Super includes a full set of 96 characters, including uppercase and lowercase letters, numbers, and symbols. The font is designed to look like the text you'd see on a retro computer or video game screen. It’s perfect for creating a nostalgic or vintage look in your designs. Nemotron Super is a monospaced font, meaning that each character takes up the same amount of horizontal space. This makes it ideal for use in programming, coding, or any situation where alignment is important. The font’s design is inspired by the pixelated text of early computing and gaming, giving it a distinctive and charming aesthetic. One of the standout features of Nemotron Super is its versatility. It can be used in a variety of contexts, from web design and digital interfaces to print media and logos. The font’s retro style can add a unique
Prompt 1: 'Hello, my name is'
(Your Name), and I am a [Your Profession/Student/Parent, etc.] from [Your Location]. I am writing to express my strong support for the proposed regulations to [Briefly Mention the Policy or Regulation, e.g., "strengthen background checks for firearm purchases" or "increase funding for renewable energy projects"]. I believe that [Policy/Regulation] is a critical step towards [Explain the Main Benefit, e.g., "ensuring public safety" or "addressing climate change"]. As someone who [Personal Connection, e.g., "has been affected by gun violence" or "is passionate about environmental conservation"], I have seen firsthand the importance of [Reiterate the Policy's Goal, e.g., "preventing tragedies" or "promoting clean energy"]. The current [Current Situation, e.g., "lax regulations on gun sales" or "reliance on fossil fuels"] poses significant risks to [Affected Group or Community, e.g.,

Prompt 2: 'The capital of France is'
Paris, and the capital of Japan is Tokyo. The capital of the United States is Washington, D.C. The capital of Australia is Canberra, and the capital of Canada is Ottawa. The capital of Brazil is Brasília, and the capital of South Africa is Pretoria (administrative), Cape Town (legislative), and Bloemfontein (judicial). The capital of Egypt is Cairo, and the capital of India is New Delhi. The capital of Mexico is Mexico City, and the capital of Russia is Moscow. The capital of China is Beijing, and the capital of Germany is Berlin. The capital of Italy is Rome, and the capital of Spain is Madrid. The capital of Argentina is Buenos Aires, and the capital of South Korea is Seoul. The capital of Saudi Arabia is Riyadh, and the capital of Iran is Tehran. The capital of Turkey is Ankara, and the capital of Indonesia is Jakarta. The capital of Nigeria is Abuja, and the capital of the United Kingdom is

Prompt 3: 'Explain quantum computing in simple terms:'
how it works, its applications, and why it matters Quantum computing is a revolutionary technology that uses the principles of quantum mechanics to perform calculations and solve problems that are intractable for classical computers. Here's a simple explanation:
### How It Works:
1. **Qubits (Quantum Bits):** Unlike classical bits that are either 0 or 1, qubits can exist in a state of **superposition**, meaning they can be both 0 and 1 simultaneously. This allows quantum computers to process a vast number of possibilities at once.
2. **Entanglement:** Qubits can be **entangled**, so the state of one qubit is directly related to the state of another, no matter the distance between them. This enables instantaneous coordination and enhances processing power.
3. **Quantum Gates:** Similar to logic gates in classical computing, quantum gates manipulate qubits. However, quantum gates operate on probabilities and can create complex states through interference, allowing for more sophisticated
Streaming (Pseudo)
vLLM’s Python API is designed for high throughput and returns complete RequestOutput objects. For true token-by-token streaming, the OpenAI-compatible server (covered in Part 2) is the recommended approach.
However, we can simulate streaming by iterating through the characters of the final generated text. This is useful for seeing the output progressively in a notebook environment but does not reflect true streaming inference.
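The simulation described above can be sketched as a generator that yields the finished text a few characters at a time. The function names, the default delay, and the chunk size are illustrative choices; the model has already finished generating before the first character is printed.

```python
import sys
import time
from typing import Iterator


def pseudo_stream(text: str, chunk_size: int = 1, delay_s: float = 0.01) -> Iterator[str]:
    """Yield an already generated string in small pieces with a short pause.

    This only changes how the text is displayed; it does not stream tokens
    from the engine.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(text), chunk_size):
        yield text[start:start + chunk_size]
        if delay_s > 0:
            time.sleep(delay_s)


def print_stream(text: str, chunk_size: int = 1, delay_s: float = 0.01) -> None:
    """Write the pieces to stdout as they are yielded."""
    for piece in pseudo_stream(text, chunk_size=chunk_size, delay_s=delay_s):
        sys.stdout.write(piece)
        sys.stdout.flush()
    sys.stdout.write("\n")
```

With a result from `llm.generate(...)`, the call would be `print_stream(outputs[0].outputs[0].text)`.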
Response: Here are some ideas to consider: their speed, power consumption, heat generation, parallel processing capabilities, use in gaming, scientific computing, or machine learning. Silent, swift and hot, GPUs crunch numbers with flair, Lighting up the screen. Another one: Circuits blaze with speed, Billions of cores work as one, Gaming's heart beats fast. And another: Cool
Part 2: OpenAI-Compatible Server
vLLM offers an OpenAI-compatible server that allows you to use familiar tools like the OpenAI Python client and curl. This is the recommended way to use features like chat templates, streaming, and tool calling.
Launch Server
Run the following command in your terminal to start the server.
python -m vllm.entrypoints.openai.api_server \
--model "nvidia/Llama-3_3-Nemotron-Super-49B-v1_5" \
--dtype bfloat16 \
--trust-remote-code \
--served-model-name nemotron \
--host 0.0.0.0 \
--port 5000 \
--max-model-len 65536 \
--gpu-memory-utilization 0.95 \
--tensor-parallel-size 1 \
--enable-auto-tool-choice \
--tool-call-parser llama3_json
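Loading a model of this size takes a while, so client code may need to wait until the server accepts requests. The following is a minimal sketch that polls the OpenAI-compatible `/v1/models` endpoint on the host and port used in the command above; the function names, timeout, and polling interval are assumptions.

```python
import time
import urllib.error
import urllib.request


def server_is_ready(base_url: str = "http://127.0.0.1:5000", timeout_s: float = 2.0) -> bool:
    """Return True if GET /v1/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error: not ready yet.
        return False


def wait_for_server(
    base_url: str = "http://127.0.0.1:5000",
    total_timeout_s: float = 600.0,
    poll_interval_s: float = 5.0,
) -> bool:
    """Poll until the server is ready or the time budget runs out."""
    deadline = time.monotonic() + total_timeout_s
    while time.monotonic() < deadline:
        if server_is_ready(base_url):
            return True
        time.sleep(poll_interval_s)
    return False
```

Running `wait_for_server()` in a notebook cell before the first request avoids connection errors while the weights are still loading.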
Chat and Streaming
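A minimal sketch of chat and streaming requests with the OpenAI Python client, assuming the server above is running on port 5000 with the served model name `nemotron`. The helper names and `max_tokens` value are assumptions, and the `api_key` is a placeholder string for a server started without key checking.

```python
from typing import Any, Iterable

BASE_URL = "http://127.0.0.1:5000/v1"
MODEL_NAME = "nemotron"  # matches --served-model-name in the launch command


def make_client():
    """Create an OpenAI client pointed at the local vLLM server.
    The openai package is imported lazily so the helpers below work without it."""
    from openai import OpenAI

    return OpenAI(base_url=BASE_URL, api_key="EMPTY")  # placeholder key


def chat_once(client: Any, prompt: str, max_tokens: int = 256) -> str:
    """Send one non-streaming chat request and return the reply text."""
    resp = client.chat.completions.create(
        model=MODEL_NAME,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    return resp.choices[0].message.content


def collect_stream(chunks: Iterable[Any], echo: bool = True) -> str:
    """Concatenate the text deltas of a streamed chat completion."""
    pieces = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks carry no choices (e.g. usage-only chunks)
        delta = chunk.choices[0].delta.content
        if delta:
            pieces.append(delta)
            if echo:
                print(delta, end="", flush=True)
    return "".join(pieces)


def chat_stream(client: Any, prompt: str, max_tokens: int = 256) -> str:
    """Send a streaming chat request and print tokens as they arrive."""
    stream = client.chat.completions.create(
        model=MODEL_NAME,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    return collect_stream(stream)
```

For example, `chat_stream(make_client(), "Write a short poem about GPUs.")` prints the reply incrementally, including the `<think>` block when reasoning is on.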
<think> Okay, the user is asking for three bullet points about vLLM. Let me start by recalling what vLLM is. I know it's a framework for running large language models efficiently. The main points should highlight its key features. First, I remember that vLLM is designed for high throughput and low latency. That's important for applications needing quick responses. So maybe the first bullet can be about efficient inference with techniques like PagedAttention. Second, it's built on a modular architecture. This allows for customization, like integrating different models or backends. That's a good point for developers who need flexibility.

<think> Okay, the user wants a short poem about GPUs. Let me start by recalling what GPUs are. They're graphics processing units, right? Used for rendering images, video, and also for parallel computing tasks like machine learning. Hmm, I need to make the poem engaging and not too technical. Maybe focus on their speed, power, and applications. Words like "silicon heart" could personify the GPU. Also, terms like "rendering light" or "shadows dance" might evoke imagery related to graphics. I should structure it in stanzas, maybe four quatrains. Rhyming scheme could
Reasoning Modes (think vs. no_think)
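A minimal sketch comparing the two reasoning modes against the running server, using only the standard library. It assumes the endpoint and model name from the launch command; `ask` and `split_reasoning` are illustrative helpers, and the splitting logic assumes the reasoning is wrapped in `<think>` and `</think>` tags in the message content.

```python
import json
import urllib.request
from typing import Tuple

URL = "http://127.0.0.1:5000/v1/chat/completions"


def ask(prompt: str, reasoning: bool = True, max_tokens: int = 512) -> str:
    """Send one chat request; reasoning=False adds the /no_think system message."""
    messages = [{"role": "user", "content": prompt}]
    if not reasoning:
        messages.insert(0, {"role": "system", "content": "/no_think"})
    body = json.dumps(
        {"model": "nemotron", "messages": messages, "max_tokens": max_tokens}
    ).encode("utf-8")
    req = urllib.request.Request(URL, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]


def split_reasoning(text: str) -> Tuple[str, str]:
    """Split a reply into (reasoning, answer).

    Handles three cases: a complete <think>...</think> block, a reply with
    no block at all, and a block cut off by the token limit.
    """
    open_tag, close_tag = "<think>", "</think>"
    start = text.find(open_tag)
    end = text.find(close_tag)
    if end == -1:
        if start != -1:
            # Reasoning started but never closed: the output was truncated.
            return text[start + len(open_tag):].strip(), ""
        return "", text.strip()
    reasoning_start = start + len(open_tag) if 0 <= start < end else 0
    return text[reasoning_start:end].strip(), text[end + len(close_tag):].strip()
```

With the server running, `split_reasoning(ask("What is 17 * 23?"))` and `split_reasoning(ask("What is 17 * 23?", reasoning=False))` can be compared side by side; the question is only an example.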
curl Examples
You can also interact with the server directly using curl.
Chat completion:
curl -sS -X POST http://127.0.0.1:5000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nemotron",
"messages": [
{"role": "user", "content": "What is the capital of France?"}
],
"temperature": 0.0
}'
Streaming chat completion:
curl -N -sS -X POST http://127.0.0.1:5000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nemotron",
"messages": [
{"role": "user", "content": "Write a short poem about GPUs."}
],
"stream": true
}'
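With `"stream": true`, the response arrives as server-sent events: lines that start with `data:` and carry one JSON chunk each, ending with `data: [DONE]`. The following is a minimal sketch of turning those lines into text; the function name `iter_sse_deltas` is illustrative.

```python
import json
from typing import Iterable, Iterator


def iter_sse_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text deltas from a streamed chat completion.

    `lines` is any iterable of text lines, for example sys.stdin when the
    curl output is piped into a script.
    """
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker
        event = json.loads(payload)
        choices = event.get("choices") or []
        if not choices:
            continue
        delta = (choices[0].get("delta") or {}).get("content")
        if delta:
            yield delta
```

Piping the streaming curl command into a script that runs `for piece in iter_sse_deltas(sys.stdin): print(piece, end="", flush=True)` would print the poem as it is generated.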
Cleanup
To stop the OpenAI-compatible server, press CTRL+C in the terminal where it is running.
Resource Notes
- Hardware: Nemotron-49B-v1.5 is a large model. For optimal performance, running on a multi-GPU setup with high-speed interconnects (like NVLink) is recommended.
- Quantization: vLLM supports several quantization methods (for example FP8, AWQ, and GPTQ) that can significantly reduce the model's memory footprint, allowing it to run on GPUs with less VRAM.
- Chat Templates: When using the OpenAI-compatible server, vLLM automatically applies the correct chat template for the model, which is crucial for getting properly formatted and accurate responses in conversational tasks.
- Tool Calling: The --enable-auto-tool-choice flag lets the model decide when to call a tool, and --tool-call-parser selects the parser that extracts tool calls from the model's output.
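A minimal sketch of what a tool-calling request and response look like in the OpenAI chat completions format, which the server accepts when launched with the two flags above. The `get_weather` tool is hypothetical and exists only for illustration, as do the helper names.

```python
import json
from typing import Any, Dict, List, Tuple

# Hypothetical tool definition, used only as an example schema.
WEATHER_TOOL: Dict[str, Any] = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_tool_request(user_prompt: str) -> Dict[str, Any]:
    """Build a chat completion payload that offers the tool to the model."""
    return {
        "model": "nemotron",
        "messages": [{"role": "user", "content": user_prompt}],
        "tools": [WEATHER_TOOL],
        "tool_choice": "auto",  # the model decides whether to call the tool
    }


def extract_tool_calls(response: Dict[str, Any]) -> List[Tuple[str, Dict[str, Any]]]:
    """Return (function name, parsed arguments) for each tool call in a response."""
    message = response["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    # Arguments arrive as a JSON-encoded string and must be decoded.
    return [
        (call["function"]["name"], json.loads(call["function"]["arguments"]))
        for call in calls
    ]
```

The payload from `build_tool_request(...)` can be posted to `/v1/chat/completions` in the same way as the curl examples. Whether the model actually emits a tool call depends on the prompt and on the parser matching the model's output format.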
Conclusion
In this notebook, you have learned how to:
- Run inference with the Nemotron-49B-v1.5 model using the vLLM Python API.
- Deploy the model as an OpenAI-compatible server.
- Interact with the server using both a Python client and curl for chat, streaming, and reasoning mode demonstrations.
- Utilize the model's reasoning modes for different use cases.
This notebook provides a solid foundation for building applications with Nemotron and vLLM.