Nemotron 3 Nano 4B emerges for efficient local AI
NVIDIA has launched Nemotron 3 Nano 4B, a compact hybrid language model designed for efficient local and edge AI deployment. The newest member of the Nemotron 3 family uses a hybrid Mamba-Transformer architecture with just 4 billion parameters. It is built to deliver state-of-the-art instruction following and tool use while keeping its VRAM footprint small. The model is optimized for NVIDIA Jetson platforms, including Jetson Thor and Jetson Orin Nano, as well as NVIDIA DGX Spark and GeForce RTX GPUs. Running locally on this hardware enables faster response times, stronger data privacy, and flexible deployment without sacrificing performance, which makes the model well suited to local conversational agents and robotics applications.

The model was developed by compressing and distilling Nemotron Nano 9B v2 with NVIDIA's proprietary Nemotron Elastic framework. The framework applies structured pruning guided by a jointly trained router, which selects the network architecture that meets the 4-billion-parameter target. By pruning along several axes, including model depth, Mamba heads, and intermediate dimensions, it substantially reduces model size without the high cost of training from scratch.

Following compression, the model underwent a two-stage distillation process to recover accuracy. The first stage used 63 billion tokens of short-context training. The second stage used 150 billion tokens and extended the context window to 49,000 tokens to improve long-horizon reasoning.

Further refinement came from supervised fine-tuning and multi-stage reinforcement learning. The model was trained on diverse datasets covering math, coding, science, and agentic tasks, followed by safety-focused training. A three-stage reinforcement learning pipeline using NeMo-RL was then applied to improve instruction following and tool calling.
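To make the distillation step more concrete, the sketch below shows one common way a compressed student model can be trained to match its teacher: minimizing the KL divergence between their next-token distributions. The article does not state which loss NVIDIA used, so the loss form, the `temperature` parameter, and all function names here are assumptions for illustration. Only the token budgets and the 49,000-token context length come from the article.

```python
import math

# Stage budgets as reported in the article. The stage names are labels chosen
# for this sketch, and the article gives no context length for stage 1.
DISTILLATION_STAGES = [
    {"name": "short-context", "tokens": 63_000_000_000, "context_length": None},
    {"name": "long-context", "tokens": 150_000_000_000, "context_length": 49_000},
]


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_norm for x in scaled]


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) for a single token position (assumed loss form)."""
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def total_distillation_tokens(stages=DISTILLATION_STAGES):
    """Sum the token budgets across all distillation stages."""
    return sum(stage["tokens"] for stage in stages)


if __name__ == "__main__":
    teacher = [2.0, 0.5, -1.0]  # made-up logits for illustration
    student = [1.5, 0.7, -0.8]
    print(f"KL(teacher || student) = {distillation_loss(teacher, student):.4f}")
    print(f"Total distillation tokens: {total_distillation_tokens():,}")
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge. Summing the reported budgets gives 213 billion distillation tokens across both stages.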
To maximize efficiency on resource-constrained devices, the model was released in both FP8 and Q4_K_M GGUF quantized formats. The FP8 version uses selective quantization: specific attention and Mamba layers stay in higher precision (BF16) while the rest are quantized to FP8. With this approach, the FP8 version achieved 100% median accuracy recovery relative to the full-precision model while delivering up to 1.8x better latency and throughput. The Q4_K_M version also proved effective on Jetson Orin Nano devices, reaching 18 tokens per second, twice the speed of the larger 9B predecessor.

As an open-source model, Nemotron 3 Nano 4B can be customized and fine-tuned for specific domains. It is compatible with major inference engines such as Transformers, vLLM, TRT-LLM, and llama.cpp, which supports adoption across a wide range of hardware. NVIDIA provides detailed usage examples and step-by-step instructions for Jetson deployments on its AI Lab model page. The model also integrates with the NVIDIA In-Game Inferencing SDK to optimize performance when running alongside intensive graphics workloads. By combining high accuracy with high efficiency, the release sets a new benchmark for small language models that run directly on edge devices.
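As a rough illustration of why the quantized formats matter on memory-limited devices, the sketch below estimates weight storage for a 4-billion-parameter model at different precisions. It is back-of-the-envelope arithmetic, not NVIDIA's measurement. The 4.8 bits-per-weight figure for Q4_K_M is an assumed average, and the 10% BF16 share in the selective FP8 case is purely hypothetical, since the article does not say how many layers stay in higher precision.

```python
PARAMS = 4_000_000_000  # parameter count stated in the article

# Bits per weight for each format. BF16 and FP8 follow from the formats
# themselves; the Q4_K_M value is an assumed average, since that format mixes
# block types and its effective size depends on the model.
BITS_PER_WEIGHT = {"BF16": 16.0, "FP8": 8.0, "Q4_K_M": 4.8}


def weight_gb(params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


def selective_fp8_gb(params, bf16_fraction):
    """Weight storage when a fraction of weights stays in BF16 and the rest is FP8."""
    if not 0.0 <= bf16_fraction <= 1.0:
        raise ValueError("bf16_fraction must be between 0 and 1")
    mixed_bits = bf16_fraction * 16.0 + (1.0 - bf16_fraction) * 8.0
    return weight_gb(params, mixed_bits)


if __name__ == "__main__":
    for name, bits in BITS_PER_WEIGHT.items():
        print(f"{name:>7}: {weight_gb(PARAMS, bits):.1f} GB")
    # Hypothetical share: 10% of weights kept in BF16, the rest in FP8.
    print(f"FP8 with 10% BF16: {selective_fp8_gb(PARAMS, 0.10):.1f} GB")
```

Under these assumptions the weights alone take about 8 GB in BF16, 4 GB in FP8, and roughly 2.4 GB in Q4_K_M. Activations and caches add to these figures at runtime, so they should be read as lower bounds on actual memory use.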
