RTX 4090 vs RTX 5090 vs RTX PRO 6000: Real-World LLM Inference Benchmarks with vLLM

Curious about which GPU is best for running large language model inference? This article presents a detailed benchmark comparing the RTX 4090, RTX 5090, and RTX PRO 6000 in 1x, 2x, and 4x configurations. The tests focus on real-world performance, using the vLLM inference engine to serve LLaMA and Qwen models in production-like scenarios.

The benchmark evaluates the metrics that matter most for LLM serving: model download speed, token generation latency (especially time to first token, or TTFT), throughput, and system stability under high concurrency. When models are deployed interactively, these factors matter more than raw FLOPS.

The testing methodology has four main steps:

1. Run a system benchmark with YABS to assess CPU, memory, disk, and network performance.
2. Download models from Hugging Face to measure real-world download speeds (a download sketch appears at the end of this summary).
3. Launch a vLLM container exposing an OpenAI-compatible API.
4. Run a multi-request, high-concurrency inference test against the Qwen3-Coder-30B-A3B-Instruct model with tensor parallelism (a concurrency sketch also appears below).

A critical finding was the impact of driver versions. On the RTX 5090, performance with driver 570.86.15 was similar to that of the RTX 4090; upgrading to driver 575.57.08 brought substantial improvements across all benchmarks, underscoring the importance of up-to-date drivers.

Hardware tested included 4x RTX 4090, 4x RTX 5090, 1x RTX PRO 6000, and 2x RTX PRO 6000 configurations. All tests ran on cloud servers with 10Gbps network connectivity, though actual download speeds varied with distance from Hugging Face's servers.

Key takeaways:

- Model download speed can be a bottleneck if bandwidth or storage is limited. Setting HF_HUB_ENABLE_HF_TRANSFER=1 significantly improved download performance.
- Token generation latency, especially TTFT, varied between servers with similar GPUs due to differences in memory bandwidth, backend configuration, and software stack.
- For smaller models such as Qwen-3B or LLaMA-8B, the RTX 4090 offers strong value.
- For larger models or batch inference, the RTX PRO 6000 outperforms even multiple 4090s and 5090s. Even on the 30B model used in testing, a single PRO 6000 was faster than four 4090s or 5090s, thanks to its larger VRAM, higher memory bandwidth, and an architecture optimized for sustained workloads.
- Techniques like prefill-decode disaggregation help reduce PCIe bottlenecks on lower-VRAM GPUs, but they do not fully close the gap. In most cases, the PRO 6000 remains the superior choice for serious LLM inference.

The full benchmark code is available on GitHub. Users can clone the repository, set up the dependencies, and run their own tests with custom models and configurations. The project is designed to be flexible and extensible, and community feedback via Discord or comments is encouraged to guide future benchmarking efforts.
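To make the download step concrete, here is a minimal sketch, assuming the huggingface_hub and hf_transfer Python packages are installed; the repo ID matches the model named in the benchmark, while the timing logic is illustrative and not taken from the benchmark code.

```python
# Minimal sketch of the model-download step (assumption: huggingface_hub and
# hf_transfer are installed, e.g. `pip install huggingface_hub hf_transfer`).
# The env var must be set before importing huggingface_hub, since the flag is
# read from the environment at import time.
import os
import time

os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download

MODEL_ID = "Qwen/Qwen3-Coder-30B-A3B-Instruct"  # model used in the benchmark

start = time.time()
local_path = snapshot_download(repo_id=MODEL_ID)  # downloads all model files
elapsed = time.time() - start

print(f"Downloaded {MODEL_ID} to {local_path} in {elapsed:.1f}s")
```

Even on a 10Gbps link, pulling a 30B-parameter checkpoint can dominate setup time, which is why the article flags download speed as a potential bottleneck.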
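And a rough sketch of the high-concurrency inference test, assuming a vLLM server is already running locally with its OpenAI-compatible API (for example, launched with `vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct --tensor-parallel-size 4`); the port, prompt, concurrency level, and token limit below are illustrative assumptions, not values from the article.

```python
# Rough sketch of a concurrent TTFT/throughput probe against a vLLM
# OpenAI-compatible endpoint (assumption: server listening on localhost:8000).
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
MODEL = "Qwen/Qwen3-Coder-30B-A3B-Instruct"
CONCURRENCY = 32  # number of simultaneous requests (illustrative)


async def one_request(prompt: str) -> tuple[float | None, int]:
    """Stream one completion; return (TTFT in seconds, chunks received)."""
    start = time.perf_counter()
    ttft = None
    tokens = 0
    stream = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
        stream=True,
    )
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if ttft is None:
                ttft = time.perf_counter() - start  # time to first token
            tokens += 1  # one streamed chunk is roughly one token
    return ttft, tokens


async def main() -> None:
    start = time.perf_counter()
    results = await asyncio.gather(
        *[one_request("Write a Python function that merges two sorted lists.")
          for _ in range(CONCURRENCY)]
    )
    wall = time.perf_counter() - start
    ttfts = [r[0] for r in results if r[0] is not None]
    total_tokens = sum(r[1] for r in results)
    print(f"mean TTFT: {sum(ttfts) / len(ttfts):.3f}s")
    print(f"aggregate throughput: {total_tokens / wall:.1f} tok/s")


if __name__ == "__main__":
    asyncio.run(main())
```

Averaging TTFT across concurrent streams and dividing total generated tokens by wall-clock time yields the two numbers this kind of benchmark cares about most: interactive latency and aggregate throughput under load.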
