HyperAI

Google has officially released DiffusionGemma, an experimental open-source model that departs from the autoregressive paradigm used by most large language models. Part of the Gemma 4 family, it uses a sparse Mixture-of-Experts (MoE) architecture with 26 billion total parameters, of which only 3.8 billion are active during inference. Rather than generating text one token at a time, DiffusionGemma brings the idea behind image diffusion to text: it starts from random placeholders and runs multiple forward passes, producing 256 tokens in parallel at each step and refining them iteratively until the text converges.

The design targets the main computational bottleneck of local inference. Traditional models leave the GPU underutilized because sequential token prediction is limited by memory bandwidth. DiffusionGemma concentrates its computational load instead, reaching inference speeds above 1,000 tokens per second on NVIDIA H100 GPUs and above 700 tokens per second on the RTX 5090, an overall speedup of up to 4x. Thanks to its bidirectional attention, the model performs particularly well on non-linear tasks such as in-line editing, code completion, mathematical graph structures, and real-time self-correction.

Google emphasizes that DiffusionGemma is designed explicitly for local deployment and low-concurrency scenarios. Parallel generation significantly boosts speed, but overall output quality remains slightly below that of standard Gemma 4, and the model is not suited to cloud services that require high queries per second (QPS).

Released under the Apache 2.0 license, the quantized model needs only 18 GB of VRAM and is supported by mainstream frameworks including vLLM, MLX, and Hugging Face Transformers. It is also deeply optimized for the NVIDIA Blackwell architecture and NVFP4 precision. Developers can now download the weights from Hugging Face to begin experimenting.
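
To make the parallel, iterative decoding described above more concrete, here is a minimal toy sketch in Python. It is not DiffusionGemma's actual sampler, which the announcement does not detail. It assumes a simple confidence-based unmasking schedule, uses a `<mask>` placeholder in place of random placeholders, and replaces the model with a hypothetical `toy_denoiser` that simply peeks at a fixed target sequence. The names `toy_denoiser` and `diffusion_decode` are illustrative only.

```python
import random

MASK = "<mask>"  # placeholder token (assumption; the real model's placeholders may differ)


def toy_denoiser(tokens, target, rng):
    """Stand-in for one model forward pass over the whole block.

    Returns a (token, confidence) proposal for every position at once.
    A real model would predict these from bidirectional context; this toy
    version reads the answer from `target` and assigns random confidences.
    """
    proposals = []
    for i, tok in enumerate(tokens):
        if tok != MASK:
            proposals.append((tok, 1.0))  # already committed, keep as-is
        else:
            proposals.append((target[i], rng.random()))
    return proposals


def diffusion_decode(target, steps=4, seed=0):
    """Fill a fully masked block in at most `steps` parallel refinement passes."""
    rng = random.Random(seed)
    tokens = [MASK] * len(target)
    history = []
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        proposals = toy_denoiser(tokens, target, rng)
        # Commit an even share of the remaining positions each pass
        # (ceiling division), so the final pass fills whatever is left.
        steps_left = steps - step
        k = -(-len(masked) // steps_left)
        best = sorted(masked, key=lambda i: proposals[i][1], reverse=True)[:k]
        for i in best:
            tokens[i] = proposals[i][0]
        history.append(list(tokens))
    return tokens, history


if __name__ == "__main__":
    sentence = "the quick brown fox jumps over the lazy dog".split()
    final, trace = diffusion_decode(sentence, steps=4)
    for n, state in enumerate(trace, start=1):
        print(f"pass {n}: {' '.join(state)}")
```

In this toy setup a nine-token block is completed in four passes rather than nine sequential steps, which illustrates the general trade the article describes: fewer, heavier forward passes in place of many light ones.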

Related Links

Online Tutorial | UC Berkeley/NVIDIA and Others Release Gsplat, an open-source 3DGS Library That Saves 4x GPU Memory and Reduces Training Time by 10%.

DeepMind drops DiffusionGemma
