NVIDIA Dynamo Integrations Simplify AI Inference at Data Center Scale
AI inference is evolving from simple, single-node deployments to complex, multi-node systems that require advanced orchestration. As AI models grow in size and complexity, especially in multi-agent workflows, scaling inference across clusters has become essential to support millions of users and deliver low-latency responses. NVIDIA’s full-stack inference platform, including the Dynamo and Grove tools, is designed to meet this challenge by enabling efficient, scalable, high-performance AI serving.

A key innovation is disaggregated inference, which separates the prompt-processing (prefill) and response-generation (decode) phases and assigns each to optimally configured GPUs. This avoids bottlenecks by letting each phase run on hardware best suited to its workload. NVIDIA Dynamo brings this capability to production environments, enabling systems like Baseten to double inference speed and boost throughput by 60% without adding hardware. Benchmarks from SemiAnalysis show that Dynamo on NVIDIA GB200 NVL72 systems delivers the lowest cost per million tokens for large reasoning models such as DeepSeek-R1.

Scaling these systems across dozens or hundreds of nodes requires robust orchestration. Kubernetes, the industry standard for container management, now serves as the backbone for multi-node AI inference. NVIDIA Dynamo enhances Kubernetes with deep integration into NVIDIA’s accelerated computing infrastructure, supporting deployment across Blackwell-based systems such as the GB200 and GB300 NVL72.

To simplify the management of complex inference architectures, NVIDIA has introduced Grove, a Kubernetes API now embedded in Dynamo. Grove lets developers define an entire inference system, including prefill, decode, routing, and vision-encoder components, as a single high-level specification. This declarative approach enables precise control over component placement, startup order, scaling policies, and network topology.
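To make the prefill/decode split concrete, here is a minimal Python sketch of the idea. It is purely illustrative and not Dynamo’s API: the `WorkerPool` and `DisaggregatedServer` names are hypothetical, and the "GPUs" are stand-ins that only record which pool handled which phase.

```python
from dataclasses import dataclass

@dataclass
class WorkerPool:
    """A pool of GPUs configured for one phase of inference (hypothetical)."""
    name: str
    num_gpus: int

    def run(self, phase: str, request_id: str) -> str:
        # A real system would dispatch to a GPU worker here;
        # this toy just records which pool handled which phase.
        return f"{request_id}:{phase}@{self.name}"

class DisaggregatedServer:
    """Toy model of disaggregated serving: prefill and decode run on
    separately sized, separately configured pools."""
    def __init__(self, prefill: WorkerPool, decode: WorkerPool):
        self.prefill = prefill
        self.decode = decode

    def serve(self, request_id: str) -> list[str]:
        # Phase 1: process the whole prompt (compute-bound, so the
        # prefill pool would favor high-throughput GPUs).
        steps = [self.prefill.run("prefill", request_id)]
        # Phase 2: generate the response token by token (memory-
        # bandwidth-bound, so the decode pool is sized differently).
        steps.append(self.decode.run("decode", request_id))
        return steps

server = DisaggregatedServer(
    prefill=WorkerPool("prefill-pool", num_gpus=4),
    decode=WorkerPool("decode-pool", num_gpus=8),
)
print(server.serve("req-1"))
# → ['req-1:prefill@prefill-pool', 'req-1:decode@decode-pool']
```

The point of the split is visible in the two pool sizes: because the phases have different hardware profiles, each pool can be provisioned and scaled for its own workload instead of sharing one compromise configuration.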
Grove’s core components are PodClique (which groups pods by role), ScalingGroup (which bundles dependent components), and PodCliqueSet (which defines the full system). Together they enable multilevel autoscaling, hierarchical gang scheduling, and topology-aware placement. For example, prefill and decode components can scale independently with workload, while each model replica stays co-located on the same high-speed interconnect to minimize latency.

Grove also supports system-level lifecycle management, including recovery and rolling updates, treating the entire inference system as a single operational unit. If a prefill worker fails, it reconnects properly to its leader, and updates preserve low-latency network topologies.

The platform is fully open source and available on GitHub. Developers can deploy disaggregated systems from a simple manifest, as demonstrated with a Qwen3 0.6B model: the workflow installs the Dynamo CRDs, creates a DynamoGraphDeployment, and verifies the deployment through port-forwarding and API testing.

Grove is already being adopted by cloud providers and enterprises. Nebius, for instance, is building its cloud on NVIDIA’s infrastructure with Grove as a key enabler, and the technology is being showcased at KubeCon 2025 in Atlanta.

In summary, the shift to multi-node, disaggregated inference demands smarter orchestration. NVIDIA Grove, integrated into Dynamo and running on Kubernetes, provides a powerful, scalable, open solution for deploying complex AI systems. Developers can focus on building intelligent applications while the platform handles the complexity of distributed inference, making large-scale AI deployment faster, more efficient, and production-ready.
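The three-level hierarchy can be sketched as plain data structures. This is an illustrative model of the concepts, not Grove’s actual Kubernetes API or schema; all class and field names below are assumptions chosen to mirror the terminology.

```python
from dataclasses import dataclass, field

@dataclass
class PodClique:
    """A group of pods sharing one role (e.g. prefill workers)."""
    role: str
    replicas: int
    min_replicas: int = 1

    def scale(self, replicas: int) -> None:
        # Cliques scale independently, never dropping below their floor.
        self.replicas = max(self.min_replicas, replicas)

@dataclass
class ScalingGroup:
    """Cliques that must scale and be placed together, e.g. a model
    replica's workers kept on one high-speed interconnect domain."""
    name: str
    cliques: list[PodClique]

@dataclass
class PodCliqueSet:
    """The whole inference system, treated as a single operational unit
    for gang scheduling, recovery, and rolling updates."""
    name: str
    groups: list[ScalingGroup] = field(default_factory=list)

    def total_pods(self) -> int:
        return sum(c.replicas for g in self.groups for c in g.cliques)

system = PodCliqueSet(
    name="llm-serving",
    groups=[
        ScalingGroup("prefill", [PodClique("prefill-worker", replicas=2)]),
        ScalingGroup("decode", [PodClique("decode-worker", replicas=4)]),
        ScalingGroup("frontend", [PodClique("router", replicas=1)]),
    ],
)

# Prefill demand spikes: scale only the prefill clique, leaving
# decode and routing untouched.
system.groups[0].cliques[0].scale(6)
print(system.total_pods())  # → 11
```

The structure shows why the hierarchy matters: autoscaling decisions attach to individual cliques, placement constraints attach to scaling groups, and lifecycle operations (recovery, rolling updates) attach to the set as a whole.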
