NVIDIA and Mistral AI Launch Accelerated Open Models for Scalable AI Performance

Mistral AI has launched the Mistral 3 family of open-source, multilingual, and multimodal models, optimized for deployment across NVIDIA's full-stack ecosystem, from supercomputing platforms to edge devices. The release includes Mistral Large 3, a large mixture-of-experts (MoE) model with 675 billion total parameters and 41 billion active parameters, and three smaller dense models: Ministral-3B, Ministral-8B, and Ministral-14B. All models feature a 256K context window, enabling powerful long-form reasoning and document processing. Trained on NVIDIA Hopper GPUs, the models are now available on Hugging Face, with support for multiple precision formats and open-source inference frameworks.

Mistral Large 3 leverages NVIDIA's GB200 NVL72 supercomputing platform, delivering industry-leading performance and efficiency. By using an MoE architecture, the model activates only the most relevant experts for each input, reducing computational waste and improving scalability. This design, combined with NVIDIA's Wide Expert Parallelism (Wide-EP) and low-latency inference optimizations, enables the model to achieve over 5 million tokens per second per megawatt at 40 tokens per second per user, up to 10x faster than the previous-generation H200. Performance is further enhanced by NVIDIA Dynamo, a disaggregated inference framework that optimizes the prefill and decode phases, and by NVFP4 quantization, which maintains accuracy while reducing compute and memory costs through fine-grained block scaling and FP8 scaling factors. The NVFP4 variant is specifically optimized for the NVIDIA Blackwell architecture and runs seamlessly on GB200 NVL72 systems.

Developers can deploy Mistral Large 3 using popular frameworks such as TensorRT-LLM, vLLM, SGLang, and Llama.cpp, with support for speculative decoding, multi-token prediction, and other advanced optimizations. NVIDIA has also partnered with vLLM and SGLang to extend support for Blackwell-specific features, including disaggregation and expanded parallelism.

The Ministral 3 series is designed for edge deployment, offering high performance on NVIDIA GeForce RTX AI PCs, DGX Spark systems, and Jetson devices. These compact models deliver fast inference, reaching up to 385 tokens per second on RTX 5090 GPUs, and run efficiently on Jetson Thor, achieving 52 tokens per second at single concurrency and scaling to 273 tokens per second with eight concurrent users. NVIDIA has integrated the models with Llama.cpp and Ollama, enabling low-latency, privacy-preserving AI on local devices.

Enterprise developers can use Mistral Large 3 and Ministral-14B-Instruct via NVIDIA's API catalog and preview APIs, with downloadable NVIDIA NIM microservices expected soon for on-premises or hybrid deployment. The models are also compatible with NVIDIA NeMo tools, allowing developers to customize them for specific use cases with data design, guardrails, and agent lifecycle management.

Mistral 3 marks a major step toward "distributed intelligence," bridging cutting-edge research with real-world applications. By combining open-source accessibility with deep hardware and software optimization, the models let developers innovate faster, deploy efficiently, and scale across cloud, data center, and edge environments. The release is available now on Hugging Face and at build.nvidia.com/mistralai, with full support across NVIDIA's ecosystem.
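To make the mixture-of-experts idea above concrete, the sketch below shows top-k expert routing in PyTorch: a router scores every expert for each token, but only the k highest-scoring experts actually run. The layer sizes, expert count, and k value are placeholders chosen for readability, not Mistral Large 3's actual architecture or routing code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    """Illustrative top-k MoE layer: only k of the experts run per token."""

    def __init__(self, hidden: int = 64, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(hidden, num_experts)   # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(),
                          nn.Linear(4 * hidden, hidden))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, hidden). Select the k highest-scoring experts per token.
        scores = self.router(x)                         # (tokens, num_experts)
        weights, idx = scores.topk(self.k, dim=-1)      # top-k routing decision
        weights = F.softmax(weights, dim=-1)

        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

# Usage: 16 tokens, each processed by only 2 of the 8 experts.
layer = TinyMoELayer()
tokens = torch.randn(16, 64)
print(layer(tokens).shape)  # torch.Size([16, 64])
```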
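Similarly, "fine-grained block scaling" can be illustrated in a few lines of NumPy: a tensor is split into small blocks, each block gets its own scale factor, and the scaled values are snapped to a small low-precision grid. The 16-element block size and the E2M1-style value grid below are assumptions made for readability; the real NVFP4 path stores its scale factors in FP8 and runs inside optimized Blackwell kernels, not NumPy.

```python
import numpy as np

# Representable magnitudes of a 4-bit E2M1-style float (plus a sign bit),
# used here purely to illustrate block-scaled low-precision quantization.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_blockwise(x: np.ndarray, block: int = 16):
    """Quantize a 1-D tensor in small blocks, storing one scale per block."""
    x = x.reshape(-1, block)
    # One scale per block maps the block's max magnitude onto the grid's max (6.0).
    scales = np.abs(x).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scales = np.where(scales == 0, 1.0, scales)
    scaled = x / scales
    # Snap each scaled value to the nearest representable magnitude, keep the sign.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    return np.sign(scaled) * FP4_GRID[idx], scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q * scales).reshape(-1)

x = np.random.randn(64).astype(np.float32)
q, s = quantize_blockwise(x)
print("max abs error:", np.abs(dequantize(q, s) - x).max())
```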
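For hands-on use, the open-source inference frameworks named above expose ordinary Python APIs. A minimal vLLM sketch follows; the Hugging Face model ID and the tensor-parallel size are placeholder assumptions, and a model of this scale needs a multi-GPU configuration sized to your hardware.

```python
from vllm import LLM, SamplingParams

# Placeholder model ID: check Hugging Face / build.nvidia.com for the actual
# repository name and hardware requirements before running.
MODEL_ID = "mistralai/Mistral-Large-3"  # assumption, not a confirmed repo name

llm = LLM(model=MODEL_ID, tensor_parallel_size=8)   # parallelism depends on your GPUs
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(
    ["Summarize the benefits of mixture-of-experts models."], params
)
print(outputs[0].outputs[0].text)
```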
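On the edge side, Ollama serves models through a local REST endpoint, so a Ministral model pulled onto an RTX AI PC or Jetson device can be queried without any cloud round-trip. The model tag below is a placeholder assumption; the exact tag depends on how the Ministral 3 models are published in the Ollama library.

```python
import json
import urllib.request

payload = {
    "model": "ministral-8b",   # assumption, not a confirmed Ollama tag
    "prompt": "List three on-device use cases for a small language model.",
    "stream": False,
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",   # Ollama's default local endpoint
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```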
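Finally, the hosted previews in NVIDIA's API catalog follow the OpenAI-compatible convention used by other build.nvidia.com endpoints, so they can be called with the standard openai client, as sketched below. The model identifier is a placeholder assumption; confirm the published name and endpoint details at build.nvidia.com/mistralai.

```python
from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # NVIDIA API catalog endpoint
    api_key="nvapi-...",                              # your NVIDIA API key
)

# Placeholder model name: check build.nvidia.com/mistralai for the published ID.
resp = client.chat.completions.create(
    model="mistralai/mistral-large-3",                # assumption
    messages=[{"role": "user", "content": "What is expert parallelism?"}],
    max_tokens=200,
)
print(resp.choices[0].message.content)
```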
