Nvidia Leverages Hardware Dominance to Offer Free AI Models, Outpacing Closed Competitors with Open Nemotron 3 Stack

Nvidia stands alone in the AI industry as the only company able to give away its models for free while remaining profitable. That unique position stems from its dominant hardware business, which lets it build massive AI training clusters at scale and at low cost. With highly profitable GPU sales, especially the Blackwell series at roughly $35,000 to $45,000 per unit, Nvidia can afford to offer its AI models at little or no cost while charging a comparatively modest $4,500 per GPU per year for its AI Enterprise software stack, a bundle of libraries and tools supporting a wide range of AI and high-performance computing workloads.

The strategy echoes the early days of IBM's System/360 mainframes, when hardware was sold with free software and expert support bundled in. Over time, IBM turned that support into a major revenue stream. Nvidia appears to be following a similar path, aiming for full-stack integration, from chips to software to data centers, and eventually positioning itself as an AI utility rather than just a cloud provider.

Nvidia has long been involved in open source AI: it has supported nearly every major open model and helps run leading closed models such as Google Gemini, Anthropic Claude, and OpenAI's GPT. In a prelaunch briefing for Nemotron 3, Kari Briski, Nvidia's VP of generative AI software for enterprise, said that open source AI frameworks and models have been downloaded around 350 million times over the past 2.5 years, that Hugging Face now hosts more than 2.8 million open models, and that roughly 60% of companies use open source AI tools. In 2025, Nvidia became the top open source contributor on Hugging Face, releasing 650 models and 250 datasets.

Nvidia's journey into open models began with Megatron-LM in 2019, a transformer framework that trained an 8-billion-parameter model across 512 GPUs. With Microsoft, it evolved into Megatron-Turing NLG, reaching 530 billion parameters.
The NeMo toolkit, introduced alongside Megatron-LM, became the foundation for the Nemotron series. The original Nemotron-4 models, launched in June 2024, included a 340-billion-parameter version. The Nemotron 1 series combined Llama 3.1 with Nvidia's reasoning techniques, offering variants at 8B, 49B, 70B, and 235B parameters. Nemotron 2 Nano, released earlier this year, introduced a hybrid architecture blending the traditional transformer with Mamba, a selective state space model developed at Carnegie Mellon and Princeton. Transformers excel at capturing broad data patterns, while Mamba efficiently processes smaller, focused data segments.

Nemotron 3, unveiled this week, takes the hybrid approach further with a mixture-of-experts (MoE) architecture designed for multi-agent systems: the model activates only the necessary experts during inference, improving efficiency. Briski explained that the hybrid design reduces memory usage by avoiding large attention maps and key-value caches, so more experts can be used without increasing the memory load.

The Nemotron 3 family includes three models: Nano, Super, and Ultra. Nano has 30 billion total parameters with only 3 billion active at a time, small enough to run on a single L40S GPU. Super has 100 billion total parameters, activating up to 10 billion at once. Ultra reaches 500 billion total parameters with 50 billion active.

A key innovation in the Super and Ultra models is the latent mixture of experts, in which experts share a common core but keep private components, like chefs sharing one kitchen while using their own spice racks. The design allows four times as many experts without sacrificing performance. Unlike Nemotron 2 Nano, which leaned more on supervised learning, the Nemotron 3 models rely heavily on reinforcement learning, and they support a context window of up to 1 million tokens.
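The sparse-activation idea behind an MoE layer, and the "shared kitchen, private spice racks" analogy, can be sketched in a few lines. This is a toy illustration, not Nvidia's implementation: the dimensions, routing scheme (plain top-k softmax), and the split between a shared projection and small per-expert weights are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, chosen for illustration only (not Nemotron's real dimensions).
d_model, d_shared, d_private, n_experts, top_k = 64, 32, 16, 8, 2

# Shared "kitchen": one projection every expert reuses.
W_shared = rng.standard_normal((d_model, d_shared)) / np.sqrt(d_model)
# Private "spice racks": small per-expert weights.
W_private = rng.standard_normal((n_experts, d_shared, d_private)) / np.sqrt(d_shared)
W_out = rng.standard_normal((n_experts, d_private, d_model)) / np.sqrt(d_private)
# Router that picks which experts fire for a given token.
W_router = rng.standard_normal((d_model, n_experts)) / np.sqrt(d_model)

def moe_forward(x):
    """Route token x through top_k of n_experts; the others stay idle."""
    logits = x @ W_router
    top = np.argsort(logits)[-top_k:]                 # chosen expert indices
    gates = np.exp(logits[top])
    gates /= gates.sum()                              # softmax over chosen experts
    h = np.maximum(x @ W_shared, 0.0)                 # shared computation, done once
    out = np.zeros_like(x)
    for g, e in zip(gates, top):
        out += g * (np.maximum(h @ W_private[e], 0.0) @ W_out[e])
    return out, top

x = rng.standard_normal(d_model)
y, active = moe_forward(x)
print(f"active experts: {sorted(active.tolist())} of {n_experts}")
print(f"fraction of expert parameters used: {top_k / n_experts:.0%}")
```

The point of the sketch is the ratio: only `top_k / n_experts` of the expert parameters do work per token, which is how a 30B-total model can run with just 3B active.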
The models are pretrained on a 25-trillion-token dataset using NVFP4 4-bit precision, boosting inference throughput. Benchmarking by Artificial Analysis shows Nemotron 3 Nano 30B/3B significantly outperforming earlier Nemotron 2 models in both token throughput and accuracy: on a graph plotting intelligence against speed, it lands in the upper right, the region of high performance and high accuracy, and in the Openness Index comparison it scores well on both openness and correctness.

Nvidia may offer technical support for Nemotron 3 as part of its AI Enterprise stack or separately. If so, it could charge just enough to cover development costs, undercutting the increasingly closed models from OpenAI, Anthropic, and Google. In a world where AI is becoming more proprietary, Nvidia's open model strategy, backed by its hardware dominance, may be the only sustainable way to keep AI accessible and competitive.
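To give a feel for what 4-bit floating point means, here is a simplified round-trip through an FP4 (E2M1) grid with one shared scale per block. Real NVFP4 is more elaborate (it stores an FP8 scale per small micro-block plus a tensor-level scale); this toy keeps the scale in full precision and is an assumption-laden sketch, not Nvidia's format implementation.

```python
import numpy as np

# The eight non-negative values representable in FP4 E2M1
# (1 sign bit, 2 exponent bits, 1 mantissa bit).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_block(block):
    """Quantize a block of floats to FP4 with one shared scale.

    Simplification: one full-precision scale per block, rather than
    NVFP4's FP8 micro-block scales."""
    scale = float(np.abs(block).max()) / FP4_GRID[-1]
    if scale == 0.0:
        scale = 1.0
    scaled = block / scale
    # Snap each magnitude to the nearest representable FP4 value.
    idx = np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    return np.sign(scaled) * FP4_GRID[idx], scale

rng = np.random.default_rng(1)
weights = rng.standard_normal(16).astype(np.float32)
q, scale = quantize_fp4_block(weights)
dequantized = q * scale
err = np.abs(weights - dequantized).max()
print(f"max abs error after 4-bit round trip: {err:.3f}")
```

The trade-off the sketch exposes is the usual one: each value shrinks to 4 bits (plus a small per-block scale), quadrupling effective memory bandwidth versus FP16 at the cost of coarse rounding, which training and inference recipes must be designed to tolerate.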
