NVIDIA Dynamo Boosts llm-d Community with New Features
The introduction of the llm-d community at Red Hat Summit 2025 represents a significant advancement in the field of generative AI inference for the open source ecosystem. Built on top of vLLM and Inference Gateway, llm-d introduces a Kubernetes-native architecture designed for large-scale inference deployments, aiming to enhance efficiency and performance.

Key Components Supporting llm-d

Accelerated Inference Data Transfer

Large-scale distributed inference relies heavily on efficient data transfer techniques among GPUs. These methods, including tensor, pipeline, and expert parallelism, require low-latency, high-throughput communication. To meet this demand, llm-d leverages NVIDIA NIXL, a component of NVIDIA Dynamo. NIXL is a high-throughput, low-latency point-to-point communication library that facilitates rapid and asynchronous data movement across various memory and storage tiers. Specifically, NIXL accelerates the key-value (KV) cache transfer between prefill and decode GPU workers in environments where these phases are separated, optimizing the disaggregated serving process.

Prefill and Decode Disaggregation

Traditionally, large language models (LLMs) run both the computationally intensive prefill phase and the memory-heavy decode phase on the same GPU, leading to suboptimal resource utilization. Disaggregated serving addresses this issue by separating these phases onto different GPUs or nodes. This approach allows for independent optimization and better use of hardware resources. To support this practice in the open source community, NVIDIA has contributed to the design and implementation of prefill and decode request scheduling algorithms in the vLLM project, enhancing the efficiency of these processes in llm-d.

Future Collaboration and Features

Dynamic GPU Resource Planning

Modern LLM inference workloads present unique challenges, particularly with varying input and output sequence lengths.
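To make the prefill/decode split and the NIXL-style KV cache handoff described above concrete, here is a minimal sketch. All names here are hypothetical illustrations; this is not the NIXL or vLLM API, and the transfer step merely stands in for the asynchronous GPU-to-GPU copy that NIXL performs.

```python
from dataclasses import dataclass, field

# Hypothetical names for illustration only -- not the NIXL or vLLM API.
# Sketch of the disaggregated flow: a prefill worker builds the KV cache
# for the prompt, the cache is transferred (NIXL's role in llm-d), and a
# decode worker then generates tokens against that cache.

@dataclass
class KVCache:
    # One entry per token; in a real system this is GPU memory holding
    # key/value tensors per layer.
    entries: list = field(default_factory=list)

def prefill(prompt_tokens):
    """Compute-bound phase: process the whole prompt once, filling the cache."""
    return KVCache(entries=[f"kv({t})" for t in prompt_tokens])

def transfer(cache, src="prefill-gpu", dst="decode-gpu"):
    """Stand-in for NIXL's asynchronous point-to-point KV cache transfer."""
    return cache  # in practice: an RDMA copy between the workers' memory tiers

def decode(cache, max_new_tokens):
    """Memory-bound phase: generate one token at a time, appending to the cache."""
    out = []
    for i in range(max_new_tokens):
        tok = f"tok{i}"                  # placeholder for sampling from the model
        cache.entries.append(f"kv({tok})")
        out.append(tok)
    return out

prompt = ["Hello", ",", "world"]
cache = transfer(prefill(prompt))        # handoff from prefill to decode worker
tokens = decode(cache, max_new_tokens=2)
print(tokens)                            # ['tok0', 'tok1']
print(len(cache.entries))                # 5: prompt entries plus one per new token
```

Because the two phases touch the cache only through this handoff, each pool can be sized and scheduled independently, which is the point of disaggregation.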
Traditional autoscaling methods based on metrics like queries per second (QPS) fail to accurately predict resource needs or balance GPU loads in disaggregated setups. Recognizing this, NVIDIA is set to integrate the NVIDIA Dynamo Planner into the llm-d Variant Autoscaler component. The Dynamo Planner is a specialized planning engine that understands LLM-specific inference patterns, making intelligent scaling decisions and optimizing GPU resource allocation.

KV Cache Offloading

The high cost of storing large volumes of KV cache in GPU memory is a significant concern in AI inference. To mitigate this, NVIDIA and the community will implement the NVIDIA Dynamo KV Cache Manager in the llm-d KV Cache subsystem. This manager offloads less frequently accessed KV cache to more cost-effective storage solutions such as CPU host memory, SSDs, or networked storage. Leveraging NIXL for seamless KV cache tiering, the Dynamo KV Cache Manager aims to reduce storage costs while maintaining performance.

Commercial Support and Deployment

NVIDIA NIM for Optimized AI Inference

For enterprises requiring robust, secure, and reliable AI inference solutions, NVIDIA NIM (NVIDIA Inference Microservices) integrates leading inference technologies, including SGLang and NVIDIA TensorRT-LLM, with support for Dynamo components forthcoming. NIM is designed as a set of microservices for secure, high-performance deployment of AI models across various environments, including clouds, data centers, and workstations. It is supported through the NVIDIA AI Enterprise commercial license on Red Hat OpenShift AI, simplifying deployment and management.

Collaboration with Red Hat

NVIDIA and Red Hat have a longstanding partnership in supporting Red Hat OpenShift and OpenShift AI on NVIDIA accelerated computing platforms. The certification of the NVIDIA GPU Operator, Network Operator, and NIM Operator on Red Hat OpenShift ensures compatibility and streamlined deployment of AI workloads.
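The KV cache offloading idea described above, keeping hot blocks in scarce GPU memory and moving cold blocks to a cheaper tier, can be sketched with a small LRU policy. This is an illustrative sketch only; the Dynamo KV Cache Manager's actual interface and eviction policy are not shown here, and the class and tier names are hypothetical.

```python
from collections import OrderedDict

# Illustrative sketch only -- not the Dynamo KV Cache Manager API. It shows
# the tiering idea: keep recently used KV blocks in a bounded "GPU" tier and
# offload the least recently used blocks to a cheaper "host" tier (standing
# in for CPU memory, SSD, or networked storage).

class TieredKVCache:
    def __init__(self, gpu_capacity):
        self.gpu = OrderedDict()   # hot tier, strictly bounded
        self.host = {}             # cold tier, assumed much larger
        self.gpu_capacity = gpu_capacity

    def put(self, block_id, block):
        self.gpu[block_id] = block
        self.gpu.move_to_end(block_id)             # mark as most recently used
        while len(self.gpu) > self.gpu_capacity:
            cold_id, cold_block = self.gpu.popitem(last=False)  # evict LRU block
            self.host[cold_id] = cold_block        # offload rather than discard

    def get(self, block_id):
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)         # refresh recency
            return self.gpu[block_id]
        block = self.host.pop(block_id)            # cache miss in the hot tier:
        self.put(block_id, block)                  # promote block back to GPU
        return block

cache = TieredKVCache(gpu_capacity=2)
for i in range(3):
    cache.put(f"blk{i}", f"kv-data-{i}")
print(sorted(cache.gpu))   # ['blk1', 'blk2'] -- blk0 was offloaded to host
print(sorted(cache.host))  # ['blk0']
cache.get("blk0")          # touching blk0 promotes it back, evicting blk1
print(sorted(cache.gpu))   # ['blk0', 'blk2']
```

In a real deployment the promotion and offload copies are exactly where a transfer library like NIXL does the work, which is why the article pairs the KV Cache Manager with NIXL-based tiering.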
Red Hat has also integrated NVIDIA NIM into its OpenShift AI application catalog, facilitating support on any NVIDIA-certified system. Red Hat is currently validating support on NVIDIA GB200 NVL72 systems, further expanding the range of supported hardware.

Getting Started with NVIDIA Dynamo

At NVIDIA GTC 2025, the company announced the release of NVIDIA Dynamo, an open-source inference serving framework for deploying generative AI and reasoning models in large-scale distributed environments. The v0.2 release of Dynamo introduces several key features aimed at enhancing AI inference:

GPU Autoscaling for Disaggregated Serving Workloads

Effective autoscaling in LLM-serving environments is crucial for cost efficiency and operational flexibility. Traditional metrics like QPS are insufficient due to the variability in inference requests. The NVIDIA Dynamo Planner, introduced in the v0.2 release, uses LLM-specific metrics to make more accurate scaling decisions. It monitors prefill and decode workload patterns and dynamically manages GPU resources, ensuring optimal utilization and reducing inference costs.

Simplified Production Deployment

Transitioning LLMs from local development to production in Kubernetes environments can be complex and error-prone. The v0.2 release of Dynamo includes the NVIDIA Dynamo Kubernetes Operator, which automates this process. The operator handles image building, graph management, and resource provisioning, allowing developers to move from a prototype on a desktop GPU to a scalable, data-center-scale deployment with a single command. This automation drastically reduces development time and improves the reliability and scalability of LLM deployments.

Optimized KV Cache Transfers on AWS

KV cache management is a critical factor in AI inference costs. Dynamo's NIXL library supports low-latency data transfer in multinode setups, integrating seamlessly with AWS Elastic Fabric Adapter (EFA) in the v0.2 release.
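The planner-style autoscaling described above, using LLM-specific signals rather than raw QPS, can be sketched as a simple sizing calculation. This is a hypothetical sketch, not the NVIDIA Dynamo Planner's algorithm or API: it assumes prefill load tracks queued prompt tokens (compute-bound) while decode load tracks KV cache occupancy of in-flight sequences (memory-bound), so two workloads at the same request rate can need very different GPU counts.

```python
import math

# Hypothetical sketch -- not the NVIDIA Dynamo Planner. It illustrates why
# LLM-aware signals beat QPS for disaggregated serving: prefill replicas are
# sized by queued prompt tokens, decode replicas by KV cache occupancy.

def desired_replicas(queued_prompt_tokens, prefill_tokens_per_gpu_s,
                     active_kv_tokens, kv_tokens_per_gpu, target_latency_s=1.0):
    # Prefill: enough GPUs to drain the queued prompt tokens within the target.
    prefill_gpus = math.ceil(
        queued_prompt_tokens / (prefill_tokens_per_gpu_s * target_latency_s))
    # Decode: enough GPUs to hold the KV cache of all in-flight sequences.
    decode_gpus = math.ceil(active_kv_tokens / kv_tokens_per_gpu)
    return max(prefill_gpus, 1), max(decode_gpus, 1)

# Same request rate, very different request shapes (all numbers invented):
short_ctx = desired_replicas(queued_prompt_tokens=2_000,
                             prefill_tokens_per_gpu_s=20_000,
                             active_kv_tokens=50_000,
                             kv_tokens_per_gpu=400_000)
long_ctx = desired_replicas(queued_prompt_tokens=600_000,
                            prefill_tokens_per_gpu_s=20_000,
                            active_kv_tokens=1_600_000,
                            kv_tokens_per_gpu=400_000)
print(short_ctx)  # (1, 1)
print(long_ctx)   # (30, 4)
```

A QPS-based autoscaler would treat both workloads identically; a planner that watches token-level prefill and decode signals scales each pool independently, which is the behavior the article attributes to the Dynamo Planner.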
This allows AI service providers deploying LLMs on NVIDIA-powered Amazon EC2 instances, such as the P5 and P6 families, to benefit from distributed and disaggregated serving capabilities, enhancing performance and reducing costs.

Industry Insights and Community Engagement

Experts from Google, Neural Magic, NVIDIA, and Red Hat emphasize that the llm-d project and NVIDIA Dynamo will foster innovation and improve the efficiency of AI inference in open source ecosystems. The dynamic and intelligent resource management provided by Dynamo makes it a powerful tool for enterprises, aligning with the growing demand for scalable and cost-effective AI solutions.

The llm-d community and NVIDIA are committed to ongoing collaboration, with the aim of continuously improving the framework and addressing emerging challenges. Developers and researchers are encouraged to join the llm-d and NVIDIA Dynamo projects on GitHub, contributing to the development and shaping the future of open source inference.

To further engage the community, NVIDIA hosts regular user meetups. The first in-person user meetup is scheduled for June 5 in San Francisco, focusing on the v0.2 release and the Dynamo roadmap. Attendees will have the opportunity to gain deeper insights into the latest features and provide feedback to drive the project's evolution.

In summary, the llm-d community and NVIDIA Dynamo represent significant strides in optimizing AI inference, reducing costs, and enhancing performance. Through collaborative efforts and innovative technologies, these projects are poised to change how AI models are deployed and managed in both open source and enterprise environments.
