NVIDIA Jetson AGX Thor Unlocks 7x Gen AI Speedup with Advanced Quantization and Speculative Decoding
NVIDIA has announced that the Jetson AGX Thor platform now delivers up to a 7x increase in generative AI performance through continuous software optimization and advanced inference techniques. This follows the platform's initial launch in August, which already offered a 5x performance boost over the Jetson AGX Orin. The latest improvements, driven by updates to the vLLM container and support for cutting-edge features like speculative decoding, are enabling developers to run large language models (LLMs) more efficiently at the edge.

With the most recent vLLM release, Jetson Thor achieves up to 3.5x faster inference on the same model and quantization level compared to its launch performance. Benchmarks show significant gains: Llama 3.3 70B increased from 41.5 to 122.6 output tokens per second, and DeepSeek R1 70B rose from 40.2 to 111.5 tokens per second under identical conditions. These results were measured at a sequence length of 2048 and output length of 128, with a concurrency of 8 and MAXN power mode.

A key performance driver is the integration of EAGLE-3 speculative decoding in vLLM containers. When enabled, this technique can boost throughput to 88.62 tokens per second on Llama 3.3 70B, achieving a 7x speedup from launch-day performance. Speculative decoding works by using a smaller, faster draft model to generate candidate tokens, which are then verified in bulk by the main model, reducing latency and improving throughput when acceptance rates are high.

Jetson Thor also supports advanced quantization formats critical for edge deployment. FP8 offers a near-lossless reduction in model size, enabling 70B models to run on-device with minimal accuracy drop (typically less than 1%), making it ideal for general-purpose AI tasks. For even greater efficiency, W4A16 (4-bit weights, 16-bit activations) allows multiple large models to run simultaneously on a single device, making it possible to serve models with over 175 billion parameters on a single Jetson Thor.
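The draft-and-verify mechanism behind speculative decoding can be sketched in a few lines. This is a toy illustration of the general technique only, not vLLM's EAGLE-3 implementation: both "models" here are trivial functions over integer tokens, and the draft is deliberately wrong at one point so the verification step has a mismatch to correct.

```python
# Toy sketch of speculative decoding (illustration only, not vLLM's EAGLE-3).
# A cheap "draft" model proposes k tokens; the expensive "target" model
# verifies them in bulk and keeps the longest correct prefix.

def target_next(context):
    """Expensive 'target' model: the ground-truth next token (counts mod 10)."""
    return (context[-1] + 1) % 10

def draft_propose(context, k):
    """Cheap 'draft' model: proposes k tokens autoregressively.
    It matches the target except at the wrap-around (9 -> 9 instead of 9 -> 0),
    a deliberate mistake so verification has something to reject."""
    out, last = [], context[-1]
    for _ in range(k):
        last = last + 1 if last < 9 else 9  # wrong at the wrap-around
        out.append(last)
    return out

def speculative_step(context, k):
    """One decode step: draft proposes, target verifies in bulk.
    Returns the accepted prefix plus one target-supplied token."""
    accepted, ctx = [], list(context)
    for tok in draft_propose(context, k):
        correct = target_next(ctx)
        if tok != correct:
            accepted.append(correct)  # target's token replaces the first miss
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # all k accepted: emit a bonus token too
    return accepted

tokens = [0]
while len(tokens) < 12:
    tokens.extend(speculative_step(tokens, k=5))
print(tokens[:12])  # counts 0..9 and wraps, several tokens per target pass
```

When the draft's guesses are accepted, each expensive verification pass emits multiple tokens instead of one, which is where the throughput gain comes from; when acceptance rates are low, the overhead of drafting can erase that gain.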
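The appeal of these formats comes down to simple weight-memory arithmetic. The sketch below is a back-of-envelope estimate only: it counts weight bytes alone and ignores the KV cache, activations, and runtime overhead that a real deployment must also fit in memory.

```python
# Back-of-envelope weight-memory math for quantization formats.
# Illustrative only: real deployments also need KV cache and activations.

GIB = 1024**3

def weight_gib(n_params, bits_per_weight):
    """Approximate weight memory in GiB for n_params parameters."""
    return n_params * bits_per_weight / 8 / GIB

for name, bits in [("FP16", 16), ("FP8", 8), ("W4A16", 4)]:
    print(f"70B  @ {name:5}: {weight_gib(70e9, bits):6.1f} GiB")
print(f"175B @ W4A16: {weight_gib(175e9, 4):6.1f} GiB")
```

FP8 halves the FP16 footprint of a 70B model's weights, and 4-bit weights halve it again, which is why a model with over 175 billion parameters (roughly 82 GiB of weights at W4A16) becomes servable on a single edge device.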
Developers are advised to start with W4A16 for the best balance of speed and memory efficiency. If accuracy is a concern, especially for complex reasoning or code generation, switching to FP8 provides a reliable alternative with strong performance.

To maximize results, NVIDIA recommends a structured approach: first establish a quality baseline using high-precision formats like FP16 or FP8, then progressively apply quantization while monitoring accuracy. Once the model meets quality standards, benchmark performance under real-world conditions, including concurrency, context length, and output size.

NVIDIA is simplifying the process with a standalone, monthly-updated vLLM container optimized for Jetson Thor. Developers can deploy a model with a single command, for example:

    vllm serve "RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16" --trust_remote_code --speculative-config '{"method":"eagle3","model":"yuhuili/EAGLE3-LLaMA3.3-Instruct-70B","num_speculative_tokens":5}'

This combination of quantization and speculative decoding unlocks substantial generative AI performance on edge devices. With day-0 support for new models such as gpt-oss and the NVIDIA Nemotron series, Jetson Thor lets developers experiment with the latest AI advancements immediately. For those ready to begin, the Jetson AGX Thor Developer Kit is available, paired with the latest NVIDIA JetPack 7, to accelerate development and deployment of intelligent edge applications.
