NVIDIA Launches TensorRT LLM AutoDeploy to Automate Inference Optimization for LLMs, Enabling Rapid Deployment of Diverse Models with High Performance
NVIDIA has introduced AutoDeploy as a beta feature in TensorRT LLM, a major step toward automating inference optimization for large language models. Traditionally, deploying a new LLM architecture required extensive manual work to implement KV cache management, GPU sharding, kernel fusion, and runtime integration. AutoDeploy removes this burden by automatically converting off-the-shelf PyTorch models into inference-optimized graphs, so model authors can deploy quickly without rewriting inference logic.

AutoDeploy works by capturing the model's computation graph with PyTorch's torch.export API and then applying a series of automated transformations. It standardizes common components, such as attention layers, RoPE, mixture-of-experts (MoE) blocks, and state space models, into canonicalized custom operators. This gives diverse models a consistent representation and simplifies downstream optimizations like caching and kernel selection. Developers can also inject their own kernels by declaring them as PyTorch custom operators, which AutoDeploy preserves without modification (a sketch of this pattern appears at the end of this article).

The system then applies performance-oriented compiler passes, including operation fusion, sharding across GPUs based on heuristics or user-provided hints, and integration with optimized kernels. It supports flexible attention mechanisms and automatically wires caching into TensorRT LLM's optimized cache manager, handling a mix of softmax attention, linear attention (DeltaNet), Mamba2 state space layers, and causal convolutions.

AutoDeploy also handles runtime integration, including advanced features such as overlap scheduling, chunked prefill, speculative decoding, and state management, without requiring model authors to manage these dependencies themselves.

To demonstrate its effectiveness, NVIDIA onboarded Nemotron 3 Nano, a hybrid MoE model. While manual tuning would typically take weeks, AutoDeploy enabled deployment within days, matching a manually optimized baseline on a single NVIDIA Blackwell GPU in a DGX B200 system. It delivered up to 350 tokens per second per user and 13,000 output tokens per second in high-throughput scenarios.

Another example is Nemotron-Flash, a research model that combines multiple token mixer types. AutoDeploy reused existing optimization passes and added support for new layers such as DeltaNet with minimal effort, reaching full performance optimization in days. Benchmarks against Qwen2.5 3B Instruct showed Nemotron-Flash outperforming the hand-tuned model on throughput and latency trade-offs.

AutoDeploy currently supports over 100 text-to-text LLMs and offers early support for vision-language models and state space models. It integrates with standard tooling such as torch.compile, CUDA Graphs, and multistream optimizations. This approach treats inference optimization as a compiler and runtime responsibility, letting model developers focus on architecture while the system handles performance. The result is rapid deployment, broader model coverage, and a clean separation between model design and inference engineering.

For developers interested in trying it out, NVIDIA provides documentation and example scripts for getting started with TensorRT LLM AutoDeploy (a minimal usage sketch follows below). The feature is actively evolving, with contributions from a dedicated team across NVIDIA.
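The custom-operator escape hatch described above can be illustrated with plain PyTorch. The sketch below is not AutoDeploy's own API; the operator name mylib::fused_rmsnorm and the toy module are illustrative. It registers a kernel as a torch.library custom op with a fake implementation so that torch.export keeps it as a single opaque node in the captured graph, which is the property that lets AutoDeploy leave user kernels untouched.

```python
import torch


@torch.library.custom_op("mylib::fused_rmsnorm", mutates_args=())
def fused_rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float) -> torch.Tensor:
    # Eager reference implementation; a real deployment would dispatch to a
    # hand-written CUDA kernel here.
    variance = x.pow(2).mean(-1, keepdim=True)
    return x * torch.rsqrt(variance + eps) * weight


@fused_rmsnorm.register_fake
def _(x: torch.Tensor, weight: torch.Tensor, eps: float) -> torch.Tensor:
    # Shape/dtype propagation only, so torch.export can trace the op
    # without running the kernel.
    return torch.empty_like(x)


class TinyBlock(torch.nn.Module):
    """Toy module (illustrative) that calls the custom op in its forward pass."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.ops.mylib.fused_rmsnorm(x, self.weight, 1e-6)


# torch.export preserves the custom op as one node in the exported graph
# instead of tracing into its implementation.
ep = torch.export.export(TinyBlock(), (torch.randn(2, 8, 64),))
print(ep.graph)
```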
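Below is a minimal usage sketch of trying AutoDeploy through TensorRT LLM's LLM API. The import path tensorrt_llm._torch.auto_deploy, the constructor arguments, and the model name are assumptions drawn from the beta documentation and may differ in your installed version; consult the official AutoDeploy examples for the current entry point.

```python
# Minimal usage sketch. The AutoDeploy import path below is an assumption
# based on the beta documentation; check the TensorRT LLM docs for the
# current entry point in your version.
from tensorrt_llm import SamplingParams
from tensorrt_llm._torch.auto_deploy import LLM  # assumed AutoDeploy LLM entry point

# Any supported Hugging Face checkpoint; this model name is illustrative.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

params = SamplingParams(max_tokens=64, temperature=0.0)
outputs = llm.generate(["Explain KV caching in one sentence."], params)

for out in outputs:
    print(out.outputs[0].text)
```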
