Kimi K2.5 Review: Powerful Vision and Agent Swarm, But Verbosity Hurts Efficiency
Kimi K2.5, released by Beijing-based Moonshot AI on January 27, 2026, remains a compelling option two weeks after launch, particularly for vision and agentic workloads. At 1.04 trillion total parameters with 32 billion activated per token, it is one of the largest open-weight models, exceeding competitors such as MiniMax-M2.5, Qwen3.5, and GLM-5 in scale. The model uses 384 experts with 8 activated per token, MLA attention, SwiGLU activation, and a 256K context window. This architecture is identical to that of its predecessor, Kimi K2.

What sets K2.5 apart is its training. It builds on a 15T-token text-only pre-training base and continues with approximately 15T mixed visual-text tokens, 1T for ViT training, and 700B for long-context mid-training, for a total of roughly 32T tokens across the pipeline (15T + 15T + 1T + 0.7T ≈ 31.7T). The vision encoder, MoonViT-3D, is a 400M-parameter native-resolution ViT based on SigLIP-SO-400M that uses NaViT packing to handle variable image sizes. For video, frames are grouped in fours and temporally pooled, achieving 4x temporal compression, similar to Qwen3.5’s early fusion approach. Late fusion no longer looks viable for frontier models.

The standout innovation is Agent Swarm, powered by Parallel-Agent Reinforcement Learning (PARL). Instead of using tools sequentially, K2.5 decomposes a task into parallel subtasks and delegates them to frozen sub-agents; only the orchestrator is updated via RL. Because the sub-agents’ weights stay fixed, the reward does not have to be apportioned among jointly trained agents, which sidesteps the credit assignment problem. Training included auxiliary rewards to prevent serial collapse (falling back to sequential execution) and spurious parallelism (spawning sub-agents without a meaningful split of the work); these rewards were later annealed. The gains are significant: with the swarm, BrowseComp rises from 60.6% to 78.4% and WideSearch F1 from 72.7% to 79.0%, while execution time falls by a factor of 3 to 4.5. Notably, Qwen3.5 achieves a similar BrowseComp score (78.6%) without a swarm, but K2.5 is the first open-weight model trained explicitly for true parallelization.

Benchmark performance is strong in several areas: HLE-Full with tools (50.2% vs. GPT-5.2’s 45.5%), OCRBench (92.3%), MathVista (90.1%), and InfoVQA (92.6%).
However, it lags on AIME 2025 (96.1% vs. GPT-5.2’s 100%), SWE-Bench Verified (76.8% vs. Claude Opus 4.5’s 80.9%), GPQA-Diamond (87.6% vs. GPT-5.2’s 92.4%), and Terminal-Bench 2.0 (50.8% vs. Claude’s 59.3%). It also underperforms on WeirdML (46% vs. GPT-5.2’s 72%) and scores -11 on Artificial Analysis’s omniscience index, indicating a higher hallucination rate than top models.

Community feedback highlights both strengths and weaknesses. Coding performance is strong, especially in front-end and visual-to-code tasks, with developers reporting rapid project completion at a fraction of Opus’s cost. However, K2.5’s first drafts are often verbose and over-engineered and need simplifying. Agent Swarm works well for parallel research and data collection, but sub-agents do not always work from consistent definitions, a problem most visible in spreadsheet tasks where the meaning of a column varies from one sub-agent to the next. Vision capabilities are now genuinely competitive: K2.5 matched Gemini 3 on Chinese document transcription and outperformed previous Chinese models. Qwen3.5 also shows strong vision performance, particularly on MathVista and UI understanding.

A major concern is verbosity. K2.5 generated 89 million output tokens in evaluations, six times the median. Although its per-token input and output prices look low, the high token volume raises the effective cost. Kilo Code observed usage exceeding 50B tokens per day during a free trial, undermining input caching benefits.

The technical report also offers valuable multimodal findings: early fusion with a low vision ratio (10%) outperformed late fusion with a high ratio (50%). Surprisingly, text-only fine-tuning (zero-vision SFT) was enough to activate visual reasoning, and visual RL even improved text benchmarks such as MMLU-Pro and GPQA-Diamond.

Geopolitically, K2.5 was trained under U.S. export controls and remains competitive despite the resulting hardware constraints. Moonshot, valued at $4.8B, is funding aggressive user acquisition with generous free tiers. The Modified MIT license allows commercial use below 100M monthly active users.
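The verbosity concern raised above can be made concrete with a little arithmetic. The Python sketch below is illustrative only: the 89 million output tokens and the six-times-the-median ratio come from this review, while the per-million-token price is a made-up placeholder and the function names are hypothetical. Input tokens and caching are deliberately ignored.

```python
"""Back-of-the-envelope cost of verbosity.

Only two numbers come from the review: 89 million output tokens and the
"six times the median" ratio. Every price used here is a placeholder.
"""

K25_OUTPUT_TOKENS = 89_000_000  # from the review
VERBOSITY_RATIO = 6.0           # "six times the median", from the review


def implied_median_tokens(tokens: float = K25_OUTPUT_TOKENS,
                          ratio: float = VERBOSITY_RATIO) -> float:
    """Median output-token count implied by the two figures in the review."""
    return tokens / ratio


def output_cost_usd(output_tokens: float, usd_per_million: float) -> float:
    """Output-token spend in dollars at a given price per million tokens."""
    return output_tokens / 1_000_000 * usd_per_million


def break_even_price(verbose_price: float,
                     ratio: float = VERBOSITY_RATIO) -> float:
    """Per-million price at which a median-verbosity model costs the same."""
    return verbose_price * ratio


if __name__ == "__main__":
    hypothetical_price = 3.00  # USD per 1M output tokens; illustrative only
    median = implied_median_tokens()
    verbose_cost = output_cost_usd(K25_OUTPUT_TOKENS, hypothetical_price)
    median_cost = output_cost_usd(median, hypothetical_price)
    print(f"implied median: {median / 1e6:.1f}M output tokens")
    print(f"verbose run at placeholder price: ${verbose_cost:,.2f}")
    print(f"median run at the same price:     ${median_cost:,.2f}")
    print(f"break-even rival price: ${break_even_price(hypothetical_price):.2f}/M")
```

Under these assumptions, a rival that emits a median number of tokens could charge up to six times more per output token and still cost about the same for the same evaluation run, which is why headline per-token prices alone say little about effective cost.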
Deployment requires serious resources: ~595GB at native INT4, or ~375GB with Unsloth’s 1.8-bit quant. With offloading to system memory, it runs on a single 24GB GPU with 256GB+ RAM at ~10 tokens/sec. Support is available via vLLM, SGLang, and KTransformers, though backend parsing issues persist. GGUF/llama.cpp vision support is not yet available. API providers include Fireworks (fastest), DeepInfra (cheapest), and Baseten (highest throughput).

In summary, Kimi K2.5 is a powerful open-weight model with leading vision and agent swarm capabilities, but verbosity and inconsistent output quality remain key challenges. Its value depends on the use case and is greatest where parallelization and multimodal reasoning matter most. Whether its aggressive pricing is sustainable and whether PARL generalizes remain to be seen.
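
For readers who want a feel for the parallel delegation pattern behind Agent Swarm, the toy Python sketch below contrasts sequential and fan-out execution. It is not Moonshot’s implementation and involves no reinforcement learning: `sub_agent`, `decompose`, the fixed four-way split, and the simulated latency are all hypothetical stand-ins, and the only point is how an orchestrator can overlap the waiting time of independent subtasks.

```python
import asyncio
import time
from dataclasses import dataclass


@dataclass
class SubResult:
    subtask: str
    answer: str


async def sub_agent(subtask: str, latency_s: float = 0.05) -> SubResult:
    """Hypothetical frozen sub-agent; stands in for a model plus tool calls."""
    await asyncio.sleep(latency_s)  # simulated search or browsing latency
    return SubResult(subtask, f"findings for {subtask!r}")


def decompose(task: str, n: int = 4) -> list[str]:
    """Toy fixed split; a real orchestrator would decide the decomposition."""
    return [f"{task} / part {i + 1}" for i in range(n)]


async def run_sequential(task: str) -> list[SubResult]:
    """Baseline: handle one subtask after another."""
    return [await sub_agent(s) for s in decompose(task)]


async def run_swarm(task: str) -> list[SubResult]:
    """Fan subtasks out concurrently, then gather results in input order."""
    return list(await asyncio.gather(*(sub_agent(s) for s in decompose(task))))


def timed(coro_fn, task: str) -> tuple[list[SubResult], float]:
    """Run an async entry point and return its results with wall-clock time."""
    start = time.perf_counter()
    results = asyncio.run(coro_fn(task))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    seq, t_seq = timed(run_sequential, "survey open-weight VLMs")
    par, t_par = timed(run_swarm, "survey open-weight VLMs")
    print(f"sequential: {t_seq:.2f}s  swarm: {t_par:.2f}s  "
          f"speedup: {t_seq / t_par:.1f}x")
```

The speedup in this toy comes purely from overlapping simulated latency. What PARL adds, according to the description above, is training the orchestrator to decide when such a split is worthwhile, which this sketch does not attempt to model.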
