
NVIDIA Launches TensorRT Edge-LLM for High-Performance LLM and VLM Inference in Automotive and Robotics

NVIDIA has introduced TensorRT Edge-LLM, a new open-source C++ framework designed to accelerate large language model (LLM) and vision-language model (VLM) inference for real-time applications in automotive and robotics. Unlike data-center-focused inference tools that prioritize high throughput and concurrent request handling, TensorRT Edge-LLM is purpose-built for embedded systems where low latency, reliability, and offline operation are critical.

The framework is optimized for the NVIDIA DRIVE AGX Thor and NVIDIA Jetson Thor platforms, offering a lightweight, minimal-dependency design that reduces resource usage while delivering high performance. It supports key features such as EAGLE-3 speculative decoding for faster response times, NVFP4 quantization to reduce memory footprint, and chunked prefill to improve efficiency when processing long inputs.

TensorRT Edge-LLM provides a complete end-to-end workflow for deploying LLMs and VLMs on edge devices. The process begins with a Python export pipeline that converts Hugging Face models into ONNX format, with support for quantization, LoRA adapters, and speculative decoding. The engine builder then generates highly optimized TensorRT engines tailored to the target hardware. Finally, the C++ runtime executes inference on the device, managing the autoregressive decoding loop for real-time, token-by-token generation.

The framework is now available as part of the JetPack 7.1 release on GitHub, enabling developers to quickly deploy models on Jetson AGX Thor DevKits. For NVIDIA DRIVE AGX Thor users, TensorRT Edge-LLM is included in the DriveOS release package, with future updates coming from the open-source repository.

Major industry partners are already adopting the framework. Bosch is using it in its AI-powered cockpit solution developed with Microsoft and NVIDIA, enabling natural voice interactions through on-device LLM inference combined with speech recognition and text-to-speech.
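The EAGLE-3 speculative decoding mentioned above follows a general draft-then-verify pattern: a cheap draft model proposes several tokens ahead, and the full target model accepts or rejects them. The sketch below is a toy illustration of that pattern only, not TensorRT Edge-LLM code; `toy_draft_model` and `toy_target_model` are invented stand-ins that emit integer "tokens" deterministically so the accept/reject logic is easy to follow.

```python
def toy_draft_model(context):
    # Cheap draft model: proposes two tokens ahead. The second guess is
    # deliberately wrong so the example exercises the rejection path.
    last = context[-1]
    return [last + 1, last + 3]

def toy_target_model(context):
    # Stand-in for the full model: continues with consecutive integers.
    return context[-1] + 1

def speculative_decode(prompt, max_new_tokens):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        for guess in toy_draft_model(tokens):    # propose tokens cheaply
            verified = toy_target_model(tokens)  # verify with the target model
            if guess == verified:
                tokens.append(guess)             # accepted: keep drafted token
            else:
                tokens.append(verified)          # rejected: keep target's token
                break                            # resume drafting from here
    return tokens[len(prompt):len(prompt) + max_new_tokens]
```

In a production engine the target model verifies a whole run of drafted tokens in one batched forward pass, so every accepted draft token saves a full-model invocation; that is the source of the latency win on token-by-token generation.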
ThunderSoft is integrating TensorRT Edge-LLM into its AIBOX platform based on DRIVE AGX Orin, delivering low-latency, on-device conversational AI within strict power and memory constraints. MediaTek is leveraging the framework for its CX1 system-on-chip, accelerating LLM and VLM inference for driver monitoring, cabin activity tracking, and advanced human-machine interaction.

The open-source nature of TensorRT Edge-LLM allows for community contributions and customization. Developers can get started by downloading JetPack 7.1, cloning the GitHub repository, and following the Quick Start Guide to convert, build, and run models. The documentation includes examples and a customization guide to help users adapt the framework to their specific use cases.

As LLMs and VLMs move from the cloud to the edge, TensorRT Edge-LLM offers a reliable, high-performance foundation for building intelligent, real-time applications in vehicles and robots. It enables developers to bring powerful AI capabilities directly to the device, ensuring faster response times, better privacy, and consistent performance even without internet connectivity. For more information, visit the NVIDIA/TensorRT-Edge-LLM GitHub repository.
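The chunked prefill mentioned earlier among the framework's optimizations can also be illustrated conceptually: rather than running one forward pass over an entire long prompt, the prompt is processed in fixed-size chunks that each extend a running KV-cache, bounding peak activation memory. Nothing below is TensorRT Edge-LLM API; the list-based `kv_cache`, `prefill_chunk`, and `CHUNK_SIZE` are illustrative stand-ins for the per-layer key/value tensors and tuning knobs a real runtime manages on the GPU.

```python
CHUNK_SIZE = 4  # illustrative; real chunk sizes are tuned per platform

def prefill_chunk(kv_cache, chunk):
    # Stand-in for one forward pass over `chunk` attending to everything
    # already in `kv_cache`; it appends this chunk's keys/values.
    kv_cache.extend(chunk)
    return kv_cache

def chunked_prefill(prompt_tokens):
    kv_cache = []
    for start in range(0, len(prompt_tokens), CHUNK_SIZE):
        chunk = prompt_tokens[start:start + CHUNK_SIZE]
        prefill_chunk(kv_cache, chunk)  # peak memory scales with chunk size,
                                        # not with total prompt length
    return kv_cache                     # cache now covers the whole prompt
```

After prefill completes, the autoregressive decoding loop starts from the fully populated cache, so long inputs cost several small passes instead of one memory-heavy pass.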
