Kimi K2.5 Multimodal VLM Now Available on NVIDIA GPU-Accelerated Endpoints for Advanced AI Development
Kimi K2.5 is the latest addition to the Kimi family of open vision language models (VLMs). It is designed as a general-purpose multimodal model that excels across a wide range of high-demand tasks, including agentic AI workflows, natural language chat, complex reasoning, coding, and mathematical problem solving.

Built with the open-source Megatron-LM framework, Kimi K2.5 leverages advanced GPU-accelerated training techniques through tensor, data, and sequence parallelism, enabling efficient scaling of massive transformer-based architectures. The model features a mixture-of-experts (MoE) design with 384 total experts, including one shared expert, and selects 8 experts per token, resulting in an efficient 3.2% activation rate per token. With 32.86 billion active parameters out of a total of 1 trillion, the model is optimized for both performance and resource efficiency.

Kimi K2.5 supports multiple modalities, including text, images, and video. Its large vocabulary of approximately 164,000 tokens includes specialized tokens for visual input, enabling robust multimodal understanding. The model's visual processing pipeline is powered by MoonViT3d, a custom vision tower developed by Kimi that transforms images and video frames into high-dimensional embeddings, strengthening the model's ability to interpret and reason over visual content.

Developers can begin experimenting with Kimi K2.5 immediately through free access to GPU-accelerated endpoints on build.nvidia.com, part of the NVIDIA Developer Program. This browser-based platform lets users test the model with their own data without requiring local infrastructure. For production use, NVIDIA NIM microservices (containerized inference solutions) are expected to be available soon. The model is also accessible via the NVIDIA-hosted API, available at no cost with registration in the NVIDIA Developer Program, and developers can invoke it using standard OpenAI-compatible API calls.
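As a minimal sketch of what such an OpenAI-compatible call looks like (the endpoint URL, model identifier, and `NVIDIA_API_KEY` environment variable below are illustrative assumptions, not confirmed values; check build.nvidia.com for the actual details):

```python
import os

# Assumed endpoint and model ID for this sketch; verify on build.nvidia.com.
INVOKE_URL = "https://integrate.api.nvidia.com/v1/chat/completions"
MODEL_ID = "moonshotai/kimi-k2.5"

# OpenAI-compatible chat completion payload with streaming,
# temperature, and token-limit parameters.
payload = {
    "model": MODEL_ID,
    "messages": [
        {"role": "user", "content": "Summarize mixture-of-experts in one sentence."}
    ],
    "temperature": 0.6,
    "max_tokens": 512,
    "stream": False,
}

api_key = os.environ.get("NVIDIA_API_KEY")
if api_key:
    # requests is a third-party dependency (pip install requests),
    # imported here so the payload can be inspected without it.
    import requests

    headers = {
        "Authorization": f"Bearer {api_key}",
        "Accept": "application/json",
    }
    response = requests.post(INVOKE_URL, headers=headers, json=payload)
    response.raise_for_status()
    print(response.json()["choices"][0]["message"]["content"])
```

Because the payload follows the OpenAI chat completion schema, the same request works with the official OpenAI Python client by pointing its `base_url` at the NVIDIA-hosted endpoint.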
Example code using Python's requests library demonstrates how to send a chat completion request, including parameters for streaming, temperature, and token limits. The API also supports tool calling through the OpenAI-compatible tools parameter, enabling integration with external tools and agents.

For deployment at scale, Kimi K2.5 can be served using the vLLM inference framework, which offers high throughput and low latency. Detailed instructions and a recipe for deploying Kimi K2.5 with vLLM are available for developers.

Customization and domain-specific adaptation are supported through the NVIDIA NeMo Framework, an open-source suite for scalable model training and post-training. Using NeMo AutoModel, developers can fine-tune Kimi K2.5 directly from Hugging Face checkpoints without conversion, enabling rapid experimentation. The framework supports supervised fine-tuning, parameter-efficient methods, and reinforcement learning, making it well suited to enterprise applications involving multimodal reasoning and agentic workflows.

From data center deployments on NVIDIA's Blackwell architecture to fully managed enterprise solutions via NVIDIA NIM, developers have multiple pathways to integrate Kimi K2.5 into their systems. To get started, visit the Kimi K2.5 model page on Hugging Face, explore the Kimi API Platform, or test the model interactively on the build.nvidia.com playground.
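The tool calling mentioned above follows the OpenAI-compatible tools schema. A minimal sketch, assuming an illustrative tool definition and model ID (neither is taken from official documentation), might look like:

```python
import json

# Illustrative tool definition in the OpenAI-compatible "tools" schema;
# the function name and parameters are assumptions for this sketch.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"}
                },
                "required": ["city"],
            },
        },
    }
]

# The tools list is passed alongside the messages in the chat payload;
# the model responds with a tool_calls entry when it decides to invoke one.
payload = {
    "model": "moonshotai/kimi-k2.5",  # assumed model ID
    "messages": [{"role": "user", "content": "What's the weather in Tokyo?"}],
    "tools": tools,
    "tool_choice": "auto",
}

print(json.dumps(payload, indent=2))
```

An agent loop would then execute the function named in the model's `tool_calls` response and feed the result back as a `tool`-role message.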
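As a deployment sketch only, serving the model with vLLM's OpenAI-compatible server might look like the following; the Hugging Face model ID and parallelism flags are assumptions, so consult the published vLLM recipe for the tested configuration:

```shell
# Install vLLM, then launch an OpenAI-compatible server for Kimi K2.5.
# Model ID and flags below are illustrative assumptions, not a tested recipe.
pip install vllm

vllm serve moonshotai/Kimi-K2.5 \
    --tensor-parallel-size 8 \
    --trust-remote-code
```

Once running, the server exposes the same `/v1/chat/completions` route used by the hosted endpoint, so client code needs only a different base URL.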
