NVIDIA Releases Nemotron 3 Ultra NVFP4 Checkpoint Using Model Optimizer
NVIDIA has released the Nemotron 3 Ultra NVFP4 checkpoint, a quantized large language model aimed at the growing compute and memory cost of long context windows. Using NVIDIA Model Optimizer, the company quantized its 550-billion-parameter model to NVFP4, a 4-bit floating-point format introduced with the Blackwell GPU architecture. The resulting checkpoint shrinks the model’s storage footprint from 1,121 gigabytes in BF16 to 352.3 gigabytes, a roughly 3.2x reduction, while matching BF16 accuracy on nearly all benchmarks.

Rather than compressing every layer uniformly, as conventional quantization approaches do, NVIDIA used a mixed-precision scheme. Sensitive components such as embedding layers, attention projections, and latent MoE modules stay in BF16, while MoE routed experts, shared experts, and KV caches use NVFP4 or FP8. This selective approach lets the same checkpoint run on both NVIDIA Hopper and Blackwell hardware. On Blackwell, the model uses native W4A4 processing (4-bit weights and 4-bit activations); on Hopper, it automatically switches to W4A16 (4-bit weights, 16-bit activations), which accommodates Multi-Token Prediction without exceeding memory limits.

Calibration was the main technical hurdle, as it typically is for 4-bit quantization. Standard scaling methods, including maximum-value (max) and mean-squared-error (MSE) calibration, were either vulnerable to outlier weights or did not correlate directly with downstream accuracy. NVIDIA researchers addressed this with a four-over-six scaling algorithm, which chooses for each weight block whether the block's maximum is scaled to four or to six on the quantization grid. This approach reduced median reconstruction error by 16.4 percent compared to standard max calibration and achieved 98.5 percent median recovery relative to the original BF16 model.
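
The per-block choice between four and six can be illustrated with a minimal Python sketch. It is not Model Optimizer's implementation: the function names are hypothetical, the selection rule (keep whichever target gives the lower block reconstruction error) is an assumption about how the choice is made, and the block scale is kept in full precision rather than stored in FP8 as real NVFP4 does.

```python
import math
import random

# Magnitudes representable by FP4 E2M1, the element format used by NVFP4.
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # NVFP4 shares one scale across a block of 16 elements


def quantize_block(block, target_max):
    """Fake-quantize a block so that its largest magnitude maps to target_max."""
    amax = max(abs(w) for w in block)
    if amax == 0.0:
        return list(block)
    scale = amax / target_max  # simplification: real NVFP4 stores this scale in FP8
    out = []
    for w in block:
        nearest = min(FP4_GRID, key=lambda g: abs(abs(w) / scale - g))
        out.append(math.copysign(nearest * scale, w))
    return out


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def four_over_six(block):
    """Try target maxima 6 and 4; return (target, dequantized block) with lower error."""
    candidates = {t: quantize_block(block, t) for t in (6.0, 4.0)}
    best = min(candidates, key=lambda t: mse(block, candidates[t]))  # ties keep 6
    return best, candidates[best]


# Compare plain max calibration (always 6) with the per-block choice on random data.
random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(BLOCK_SIZE * 256)]
blocks = [weights[i:i + BLOCK_SIZE] for i in range(0, len(weights), BLOCK_SIZE)]

err_max = sum(mse(b, quantize_block(b, 6.0)) for b in blocks) / len(blocks)
results = [four_over_six(b) for b in blocks]
err_4o6 = sum(mse(b, deq) for b, (_, deq) in zip(blocks, results)) / len(blocks)
share_four = sum(1 for target, _ in results if target == 4.0) / len(results)

print(f"mean block MSE, max calibration: {err_max:.6f}")
print(f"mean block MSE, four-over-six:   {err_4o6:.6f}")
print(f"share of blocks scaled to four:  {share_four:.2%}")
```

Because the sketch always keeps the better of the two candidates, its per-block error can never exceed that of max calibration. The size of the gain depends on the weight distribution, so the numbers it prints for Gaussian toy data should not be read as a reproduction of the figures reported above.
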
Systematic tuning of effective bits per element also identified 5.03 as the optimal operating point, balancing memory efficiency against benchmark performance.

To streamline deployment, the quantization pipeline was built for distributed execution across sixteen GPUs using NVIDIA Megatron-LM and Model Optimizer. This parallelized workflow cut total calibration and export time from two hours to forty-five minutes. The team provides a fully customizable, config-driven quantization recipe through an open-source launcher, so developers can repeat the process on custom checkpoints or in local deployments.

NVIDIA has published the complete Nemotron 3 Ultra NVFP4 recipe on GitHub and is encouraging community contributions. The company positions the release as part of its effort to broaden access to high-performance, hardware-aware AI models while improving inference throughput and memory efficiency.
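
The size figures quoted in the article can be cross-checked with simple arithmetic. The Python sketch below assumes that both checkpoint sizes cover the same set of parameters and that the BF16 checkpoint spends exactly 16 bits on each one. The article does not say how the 5.03 figure was computed, so the match is a consistency check rather than the authors' definition.

```python
BF16_BITS = 16           # bits per element in the BF16 checkpoint (assumed uniform)
bf16_size_gb = 1121.0    # BF16 checkpoint size quoted in the article
nvfp4_size_gb = 352.3    # NVFP4 checkpoint size quoted in the article

# Ratio of the two checkpoint sizes.
compression_ratio = bf16_size_gb / nvfp4_size_gb

# Average bits per element if the same parameters are stored in the smaller file.
effective_bits = BF16_BITS * nvfp4_size_gb / bf16_size_gb

print(f"compression ratio: {compression_ratio:.2f}x")           # 3.18x
print(f"effective bits per element: {effective_bits:.2f}")      # 5.03
```

Under these assumptions the average lands above four bits because of per-block scale metadata and the layers kept in BF16 or FP8.
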
