Deep Dive: LLM Architectures in 2025, from DeepSeek to Kimi K2
Scale AI has confirmed a significant investment from Meta that values the startup at $29 billion. The investment, estimated at around $14.3 billion, gives Meta a 49% stake in Scale AI, which specializes in data labeling for training the large language models (LLMs) used in generative AI. Co-founder and CEO Alexandr Wang will step down from Scale AI to join Meta, where he will focus on Meta's superintelligence efforts; Jason Droege, Scale's Chief Strategy Officer, will serve as interim CEO, and Scale AI will remain an independent entity. The funds will be used to pay out investors and to fuel growth. Scale AI has been expanding its team, hiring highly skilled people such as PhD scientists and senior software engineers, to meet the growing demand for high-quality training data.

Over the past seven years, LLM architectures have seen evolutionary rather than revolutionary changes. The core structure remains similar, but the foundational elements have been steadily refined: positional embeddings have moved from absolute encodings to rotary position embeddings (RoPE), multi-head attention (MHA) has largely been replaced by grouped-query attention (GQA), and activation functions such as GELU have given way to more efficient alternatives like SwiGLU.

DeepSeek V3 and R1

DeepSeek V3, launched in December 2024, and its reasoning-enhanced version, DeepSeek R1 (released in January 2025), represent significant advances in LLM architecture. Key improvements include:

- Multi-Head Latent Attention (MLA): Unlike GQA, which shares key and value projections among multiple heads to save memory, MLA compresses the key and value tensors into a lower-dimensional latent space before they are stored in the KV cache, reducing memory usage without sacrificing performance. Ablations reported in the DeepSeek-V2 paper suggest that MLA performs better than both MHA and GQA. A simplified sketch of the caching path appears after this section.
- Mixture-of-Experts (MoE): DeepSeek V3 uses MoE layers to increase the model's parameter count while keeping inference efficient. The model has 671 billion parameters in total, but only 37 billion are active per token thanks to sparse activation: each MoE module contains 256 experts, of which 9 are activated per token (1 shared expert plus 8 chosen by the router). A minimal routing sketch also follows this section.

OLMo Series

The OLMo models, developed by the Allen Institute for AI, are notable for the transparency of their training data and code. Although they do not top the benchmarks, they offer a clean, well-documented blueprint for LLM development. Key architectural decisions include:

- Normalization layers: OLMo 2 uses RMSNorm in a Post-Norm setting, placing the normalization layers after the attention and FeedForward modules, a deviation from the more common Pre-Norm approach, which places them before those modules. The change improves training stability.
- QK-Norm: an additional RMSNorm layer applied to the queries and keys inside the attention mechanism, before RoPE is applied. Together with Post-Norm, QK-Norm helps stabilize the training loss. A short sketch of its placement is included below.

Gemma Series

Google's Gemma models are known for strong performance and a large vocabulary that supports many languages effectively. The first distinctive feature of Gemma 3 is sliding window attention, which reduces KV-cache memory requirements by limiting attention to a local window around the current query position. It has little effect on modeling performance but makes the model noticeably more efficient. Gemma 3 uses a 5:1 ratio of local to global attention layers, with a sliding window size of 1,024 tokens; the corresponding mask is sketched at the end of the code examples below.
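To make the MLA idea from the DeepSeek section concrete, here is a deliberately simplified PyTorch sketch of the caching path: keys and values are down-projected into one small shared latent, only that latent is cached, and K and V are re-expanded from it at attention time. The class name, dimensions, and single-latent design are illustrative assumptions, and the decoupled RoPE path described in the DeepSeek-V2/V3 papers is omitted.

```python
import torch
import torch.nn as nn

class SimplifiedMLA(nn.Module):
    """Illustrative Multi-Head Latent Attention cache path (decoupled RoPE omitted).

    Instead of caching full per-head keys and values, only a low-dimensional
    latent c_kv is cached; K and V are reconstructed from it on the fly.
    """

    def __init__(self, d_model=1024, n_heads=8, d_latent=128):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_down_kv = nn.Linear(d_model, d_latent, bias=False)  # compress to latent
        self.w_up_k = nn.Linear(d_latent, d_model, bias=False)     # expand latent -> K
        self.w_up_v = nn.Linear(d_latent, d_model, bias=False)     # expand latent -> V
        self.w_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, kv_cache=None):
        # x: (batch, new_tokens, d_model). When a cache is passed, this sketch
        # assumes one new token at a time (standard autoregressive decoding).
        b, t, _ = x.shape
        c_kv = self.w_down_kv(x)                          # (batch, t, d_latent)
        if kv_cache is not None:
            c_kv = torch.cat([kv_cache, c_kv], dim=1)     # only the latent is ever stored
        heads = lambda z: z.view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = heads(self.w_q(x)), heads(self.w_up_k(c_kv)), heads(self.w_up_v(c_kv))
        y = nn.functional.scaled_dot_product_attention(q, k, v,
                                                       is_causal=kv_cache is None)
        y = y.transpose(1, 2).reshape(b, t, -1)
        return self.w_o(y), c_kv                          # c_kv is the new (small) cache

mla = SimplifiedMLA()
out, cache = mla(torch.randn(1, 5, 1024))                 # prefill 5 tokens
print(cache.shape)                                        # torch.Size([1, 5, 128])
out, cache = mla(torch.randn(1, 1, 1024), cache)          # decode one more token
print(cache.shape)                                        # torch.Size([1, 6, 128])
```

The point is visible in the printed shapes: each cached position is a single 128-dimensional latent rather than separate full-width key and value tensors.

Next, a minimal sketch of the sparse routing described above: 256 routed experts, a router that picks the top 8 per token, and 1 shared expert that always runs, so 9 experts are active per token. The tiny dimensions, plain softmax gating, and per-token Python loop are simplifications; DeepSeek V3's actual gating and load-balancing machinery is more involved.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Toy DeepSeek-style MoE layer: 1 always-on shared expert + top-k routed experts."""

    def __init__(self, d_model=64, d_hidden=128, n_experts=256, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)

        def make_expert():
            return nn.Sequential(nn.Linear(d_model, d_hidden), nn.SiLU(),
                                 nn.Linear(d_hidden, d_model))

        self.experts = nn.ModuleList(make_expert() for _ in range(n_experts))
        self.shared_expert = make_expert()                # runs for every token

    def forward(self, x):                                 # x: (n_tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)         # (n_tokens, n_experts)
        topk_gates, topk_idx = gates.topk(self.top_k, dim=-1)
        topk_gates = topk_gates / topk_gates.sum(-1, keepdim=True)  # renormalize gates
        outputs = []
        # Per-token loop for clarity; real implementations batch tokens per expert.
        for token_vec, idx_row, gate_row in zip(x, topk_idx, topk_gates):
            routed = sum(gate * self.experts[int(i)](token_vec)
                         for gate, i in zip(gate_row, idx_row))
            outputs.append(self.shared_expert(token_vec) + routed)  # 1 shared + 8 routed
        return torch.stack(outputs)

moe = SparseMoE()
print(moe(torch.randn(4, 64)).shape)                      # torch.Size([4, 64])
```

The OLMo 2 QK-Norm idea is small enough to show in full: an extra RMSNorm over the query and key activations, applied right after the projections and before RoPE would be. The module below is an illustrative stand-in rather than OLMo's actual code; it normalizes per head (implementations differ on whether the norm acts per head or over the full projection) and leaves RoPE out.

```python
import torch
import torch.nn as nn

class QKNormAttention(nn.Module):
    """Causal attention with QK-Norm (RoPE itself omitted).

    The extra RMSNorm layers act on the query/key vectors right after the
    projections, i.e. before rotary embeddings would be applied.
    Uses torch.nn.RMSNorm, available in recent PyTorch releases.
    """

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)
        self.q_norm = nn.RMSNorm(self.d_head)             # the QK-Norm layers
        self.k_norm = nn.RMSNorm(self.d_head)

    def forward(self, x):
        b, t, _ = x.shape
        heads = lambda z: z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        q = self.q_norm(heads(self.w_q(x)))               # normalize queries
        k = self.k_norm(heads(self.w_k(x)))               # normalize keys
        v = heads(self.w_v(x))
        # ... RoPE would be applied to q and k here, after QK-Norm ...
        y = nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.w_o(y.transpose(1, 2).reshape(b, t, -1))

attn = QKNormAttention()
print(attn(torch.randn(2, 10, 512)).shape)                # torch.Size([2, 10, 512])
```

Finally, a sketch of the sliding-window causal mask behind Gemma 3's local attention layers: each query position may attend only to itself and the most recent tokens inside the window, which is what bounds the KV cache. The 1,024-token default matches the figure quoted above; the helper function itself is illustrative.

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int = 1024) -> torch.Tensor:
    """Boolean mask where True marks key positions a query may attend to."""
    i = torch.arange(seq_len).unsqueeze(1)    # query positions, shape (seq_len, 1)
    j = torch.arange(seq_len).unsqueeze(0)    # key positions,   shape (1, seq_len)
    causal = j <= i                           # never attend to future tokens
    local = (i - j) < window                  # stay inside the sliding window
    return causal & local

print(sliding_window_causal_mask(seq_len=8, window=4).int())
# Each row has at most 4 ones: the current token plus the 3 before it,
# which is what keeps attention compute and KV memory bounded per layer.
```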
Gemma 3's second notable choice concerns normalization: it uses RMSNorm in both Pre-Norm and Post-Norm positions around its attention and FeedForward modules, combining the benefits of both approaches (a block-level sketch appears at the end of this article). Gemma 3n, a version optimized for small devices, introduces Per-Layer Embedding (PLE) parameters, which stream token-specific embeddings from the CPU or SSD on demand to reduce GPU memory usage. It also uses the MatFormer concept to split the model into smaller, independently usable slices, further improving efficiency.

Mistral Small 3.1

Mistral Small 3.1, released in March 2025, outperforms the 27B Gemma 3 on several benchmarks while offering faster inference. Key strategies include:

- Custom tokenizer: improves computational efficiency.
- Reduced KV cache and layer count: minimizes memory usage and latency.

Llama 4

Llama 4 adopts an MoE approach similar to DeepSeek V3, but with some differences:

- Grouped-Query Attention (GQA): Llama 4 uses GQA, whereas DeepSeek V3 employs MLA.
- Different MoE configuration: Llama 4 has fewer but larger experts (2 active experts with a hidden size of 8,192 each) and alternates MoE and dense modules in every other transformer block, unlike DeepSeek V3, which uses MoE in most blocks.

The 400-billion-parameter Llama 4 Maverick offers balanced performance and efficiency, making it a versatile choice for a wide range of applications.

Qwen3

Qwen3, developed by the Qwen team, is a popular model series with both dense and MoE variants. The dense models range from 0.6B to 32B parameters, while the MoE models include 30B-A3B and 235B-A22B. Notable features include:

- Dense and MoE variants: dense models are simpler and easier to fine-tune, while the MoE variants are optimized for efficient scaling.
- Smaller hidden layers and fewer attention heads: Qwen3 0.6B, for example, uses more transformer blocks but smaller hidden layers and fewer attention heads, which makes the smaller models more efficient.

SmolLM3

SmolLM3 is a smaller, 3-billion-parameter model that delivers excellent performance for its size. Its most notable architectural feature is No Positional Embeddings (NoPE): the model omits explicit positional information, relying instead on the causal attention mask to preserve the autoregressive order. Studies suggest that NoPE improves length generalization, reducing the performance decay seen at longer sequence lengths (a minimal sketch appears at the end of this article).

Kimi K2

Kimi K2, a 1-trillion-parameter model, is making waves with performance comparable to proprietary models such as Google's Gemini and OpenAI's ChatGPT. It builds on the DeepSeek V3 architecture, with notable changes:

- Muon optimizer: Kimi K2 was trained with the Muon optimizer instead of AdamW, resulting in smoother training-loss curves and better model performance (a simplified sketch of the Muon update appears at the end of this article).
- Increased experts in the MoE modules: Kimi K2 uses more experts in its MoE modules and fewer heads in the MLA module, improving computational efficiency and performance.

Industry Insights and Company Profiles

The significant investment in Scale AI by Meta underscores the critical role of high-quality data in training robust LLMs. Alexandr Wang's decision to join Meta emphasizes the company's commitment to advancing AI capabilities, while Scale AI's independent status ensures continued innovation in the data-labeling sector.

DeepSeek's pioneering use of MLA and MoE layers has set a new standard for LLM efficiency and performance. The company's research-driven approach and ability to leverage innovative techniques have positioned it as a leader in the AI landscape.

Google's Gemma models highlight the importance of hybrid architectures that balance performance with efficiency.
The introduction of Gemma 3n for small devices demonstrates the company’s focus on making advanced AI accessible across a wide range of computing platforms. Mistral’s focus on inference efficiency aligns with the broader trend of making LLMs more practical for real-world applications. By optimizing the tokenizer and reducing the KV cache and layer count, Mistral Small 3.1 sets a high bar for smaller, faster models. Meta, Scale AI, and other leading AI companies continue to invest heavily in LLM development, driven by the competitive pressure from giants like Google, OpenAI, and Anthropic. These advancements not only push the boundaries of what’s possible with AI but also democratize access to cutting-edge technologies. The use of techniques like MoE, MLA, and NoPE highlights the ongoing innovation in LLM architecture, ensuring that models remain efficient and powerful as they grow in size and complexity.
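Appendix: Code Sketches

As noted in the Gemma section, Gemma 3 normalizes both before and after its attention and FeedForward modules. The block below sketches only that placement: each residual add wraps post_norm(f(pre_norm(x))). The sub-modules themselves (nn.MultiheadAttention, a plain GELU MLP) are generic stand-ins, not Gemma's actual attention or gated feed-forward.

```python
import torch
import torch.nn as nn

class DualNormBlock(nn.Module):
    """Transformer block with Gemma-3-style normalization placement.

    Every sub-module is wrapped in RMSNorm on both sides, inside the residual:
        x = x + post_norm(f(pre_norm(x)))
    """

    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))
        self.pre_attn_norm = nn.RMSNorm(d_model)
        self.post_attn_norm = nn.RMSNorm(d_model)
        self.pre_mlp_norm = nn.RMSNorm(d_model)
        self.post_mlp_norm = nn.RMSNorm(d_model)

    def forward(self, x):
        h = self.pre_attn_norm(x)                         # Pre-Norm before attention
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + self.post_attn_norm(attn_out)             # Post-Norm before the residual add
        h = self.pre_mlp_norm(x)                          # Pre-Norm before the FeedForward
        return x + self.post_mlp_norm(self.mlp(h))        # and Post-Norm after it

block = DualNormBlock()
print(block(torch.randn(2, 16, 512)).shape)               # torch.Size([2, 16, 512])
```

The SmolLM3 section mentioned NoPE: no learned position table is added to the token embeddings and no RoPE rotation is applied to the queries or keys, so the causal mask is the only source of ordering information. The layer below is a minimal, illustrative version of that setup, not SmolLM3's code.

```python
import torch
import torch.nn as nn

class NoPECausalSelfAttention(nn.Module):
    """Causal self-attention with No Positional Embeddings (NoPE)."""

    def __init__(self, vocab_size=32000, d_model=256, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)    # token embeddings only
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, token_ids):                         # (batch, seq_len)
        b, t = token_ids.shape
        x = self.embed(token_ids)                         # note: no positional embedding added
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        heads = lambda z: z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = heads(q), heads(k), heads(v)            # and no RoPE applied to q/k
        # The causal mask alone tells the model which tokens precede which.
        y = nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(y.transpose(1, 2).reshape(b, t, -1))

layer = NoPECausalSelfAttention()
print(layer(torch.randint(0, 32000, (1, 10))).shape)      # torch.Size([1, 10, 256])
```

Kimi K2's use of Muon is a training-time change rather than an architectural one. The sketch below shows the core idea behind the public Muon reference implementation: keep a momentum buffer per 2-D weight matrix and orthogonalize it with a few Newton-Schulz iterations before taking the step. The coefficients and the aspect-ratio scaling follow that reference but are simplified here, and Kimi K2's production optimizer (a Muon variant) differs in its details.

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map a 2-D matrix to the nearest semi-orthogonal matrix
    using the quintic Newton-Schulz iteration from the Muon reference code."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (g.norm() + 1e-7)                 # normalize so the iteration converges
    transposed = x.shape[0] > x.shape[1]
    if transposed:                            # iterate on the wide orientation
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x.T if transposed else x

@torch.no_grad()
def muon_step(params, momenta, lr=0.02, beta=0.95):
    """One simplified Muon update for a list of 2-D weight matrices."""
    for p, m in zip(params, momenta):
        m.mul_(beta).add_(p.grad)             # classic momentum accumulation
        update = newton_schulz_orthogonalize(m)
        # Rough aspect-ratio scaling so tall and wide matrices get comparable steps.
        p.add_(update, alpha=-lr * max(1.0, p.shape[0] / p.shape[1]) ** 0.5)

# Toy usage: one step on a single weight matrix with a dummy gradient.
w = torch.randn(64, 32, requires_grad=True)
w.grad = torch.randn_like(w)
muon_step([w], [torch.zeros_like(w)])
print(w.shape)                                # torch.Size([64, 32]); w updated in place
```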