HyperAI

Supercharge Edge AI with High-Accuracy Reasoning Using NVIDIA Nemotron Nano 2 9B

21 days ago

NVIDIA has introduced Nemotron Nano 2 9B, a new open model designed to supercharge edge AI with high-accuracy reasoning. Built for enterprise-grade agentic AI, the model combines a hybrid Transformer–Mamba architecture with a configurable thinking budget, enabling developers to balance accuracy, speed, and cost for real-world applications. Nemotron Nano 2 leads in accuracy among models of its size on key reasoning tasks such as math, coding, and science, and excels at both instruction following and function calling, making it well suited for autonomous agents that solve complex, multi-step problems.

The hybrid architecture leverages Mamba-2 selective state-space modules for efficient long-context reasoning, reducing memory usage and enabling faster token generation. Interleaved Transformer layers preserve the ability to connect distant information, maintaining high accuracy while boosting throughput.

One standout feature is the thinking budget, which lets developers limit the amount of internal reasoning the model performs. By inserting a tag that closes the reasoning phase, users can cap the chain-of-thought, reducing unnecessary token generation and cutting inference costs by up to 60% without sacrificing accuracy. This is especially valuable in low-latency edge environments such as customer support systems and real-time autonomous agents.

The model was derived from a larger 12B base model, which was fine-tuned and aligned through multiple stages including supervised fine-tuning, reinforcement learning, and preference optimization. To fit within the memory constraints of edge GPUs like the A10G, the team applied model compression using the Minitron framework: pruning across depth, width, and layer dimensions, followed by knowledge distillation from the larger model to recover performance. The final 9B version runs efficiently at 128k context length with optimized memory use.
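To illustrate the idea behind a client-side thinking budget, here is a minimal sketch. It assumes the model streams its chain-of-thought between `<think>` and `</think>` markers; the marker strings, function name, and truncation strategy are illustrative assumptions, not the official NVIDIA implementation (consult the model card for the exact control tokens).

```python
# Hypothetical sketch of client-side thinking-budget enforcement.
# Assumes reasoning tokens arrive between <think> and </think> markers;
# these marker strings are assumptions for illustration.

THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"

def apply_thinking_budget(token_stream, budget):
    """Forward tokens, capping the reasoning span at `budget` tokens.

    Once the budget is exhausted, a closing marker is injected so the
    rest of the output is treated as the final answer, and any overflow
    reasoning tokens are dropped.
    """
    out = []
    thinking = False   # currently inside the reasoning span
    truncated = False  # budget hit; we already injected a close marker
    used = 0
    for tok in token_stream:
        if tok == THINK_OPEN:
            thinking = True
            out.append(tok)
        elif tok == THINK_CLOSE:
            thinking = False
            if not truncated:
                out.append(tok)
            truncated = False
        elif thinking:
            if truncated:
                continue  # drop overflow reasoning tokens
            if used < budget:
                used += 1
                out.append(tok)
            else:
                out.append(THINK_CLOSE)  # cap reached: close the span
                truncated = True
        else:
            out.append(tok)
    return out
```

In a real deployment this logic would sit between the streaming inference endpoint and the application, stopping reasoning-token generation once the budget is spent rather than filtering after the fact.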
Nemotron Nano 2 is available through NVIDIA’s NIM (NVIDIA Inference Microservices) and can be deployed via vLLM. Developers can integrate it into their workflows using a client-side thinking budget implementation that controls reasoning length and ensures efficient inference. The model supports two modes: reasoning on (the default), which generates a chain-of-thought with thinking tokens, and reasoning off, which delivers direct responses.

NVIDIA has open-sourced the model weights, training datasets, and technical methods to support the broader AI community, enabling developers to adapt and enhance the model for specific use cases. With 6x higher throughput than comparable open models and strong accuracy, Nemotron Nano 2 is a powerful tool for building efficient, intelligent AI agents at the edge. Developers can start using it today at build.nvidia.com.
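As a sketch of how the two modes might be toggled against an OpenAI-compatible endpoint (such as a vLLM or NIM deployment), the snippet below builds a chat-completions payload. The `/think` and `/no_think` system-prompt switches and the model identifier are assumptions drawn from the Nemotron model family; verify the exact control strings in the model card before use.

```python
# Illustrative payload builder for toggling reasoning on/off.
# "/think", "/no_think", and the model id are assumed values,
# not confirmed API constants.

def build_request(user_prompt, reasoning=True,
                  model="nvidia/nemotron-nano-2-9b"):  # hypothetical id
    """Build a chat-completions payload with reasoning on or off."""
    system = "/think" if reasoning else "/no_think"
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user_prompt},
        ],
    }

# With an OpenAI-compatible client, the payload would be sent as:
#   client.chat.completions.create(**build_request("Plan a 3-step fix"))
```

Keeping the mode switch in the system prompt lets the same deployment serve both latency-sensitive direct answers and deeper multi-step reasoning without reloading the model.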
