Alibaba's Tongyi Qwen3 Models Optimized for Production Deployment on NVIDIA GPUs with TensorRT-LLM and Other Frameworks
Alibaba recently introduced Tongyi Qwen3, a series of open-source hybrid-reasoning large language models (LLMs). The Qwen3 family includes two mixture-of-experts (MoE) models, 235B-A22B and 30B-A3B, and six dense models ranging from 0.6B to 32B parameters. These models are designed to excel at reasoning, instruction following, agent capabilities, and multilingual support, making them a significant advancement in the field of LLMs.

Efficient Integration and Deployment with NVIDIA GPUs

Developers can integrate and deploy Tongyi Qwen3 models into production applications using various high-performance frameworks, including NVIDIA TensorRT-LLM, Ollama, SGLang, and vLLM. Each framework offers benefits suited to different use cases, whether optimizing for high throughput, low latency, or a minimal GPU footprint. Here’s a detailed look at how these frameworks can be used.

TensorRT-LLM

TensorRT-LLM is a high-performance inference engine that supports advanced optimization techniques, making it an excellent choice for efficient LLM deployment on NVIDIA GPUs. Key features include:

- High-performance compute kernels designed for ultra-fast token generation.
- Custom attention kernels optimized for efficient attention computation.
- In-flight batching, which processes multiple requests simultaneously to improve throughput.
- Paged KV caching, which optimizes memory usage for the key-value cache.
- Quantization support for FP8, FP4, INT4 AWQ, and INT8 SmoothQuant to reduce computational load.
- Speculative decoding, which predicts tokens ahead of time to speed up inference.

To benchmark and deploy a Qwen3-4B model with TensorRT-LLM, developers follow these steps.

1. Prepare the benchmark test dataset:

```bash
python3 /path/to/TensorRT-LLM/benchmarks/cpp/prepare_dataset.py \
  --tokenizer=/path/to/Qwen3-4B \
  --stdout token-norm-dist --num-requests=32768 \
  --input-mean=1024 --output-mean=1024 \
  --input-stdev=0 --output-stdev=0 > /path/to/dataset.txt
```

2. Create the configuration file:

```bash
cat > /path/to/extra-llm-api-config.yml <<EOF
pytorch_backend_config:
  use_cuda_graph: true
  cuda_graph_padding_enabled: true
  cuda_graph_batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256, 384]
  print_iter_log: true
  enable_overlap_scheduler: true
EOF
```

3. Run the benchmark command:

```bash
trtllm-bench \
  --model Qwen/Qwen3-4B \
  --model_path /path/to/Qwen3-4B \
  throughput \
  --backend pytorch \
  --max_batch_size 128 \
  --max_num_tokens 16384 \
  --dataset /path/to/dataset.txt \
  --kv_cache_free_gpu_mem_fraction 0.9 \
  --extra_llm_api_options /path/to/extra-llm-api-config.yml \
  --concurrency 128 \
  --num_requests 32768 \
  --streaming
```

4. Host the model for inference:

```bash
trtllm-serve \
  /path/to/Qwen3-4B \
  --host localhost \
  --port 8000 \
  --backend pytorch \
  --max_batch_size 128 \
  --max_num_tokens 16384 \
  --kv_cache_free_gpu_memory_fraction 0.95 \
  --extra_llm_api_options /path/to/extra-llm-api-config.yml
```

5. Make inference calls:

```bash
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Qwen/Qwen3-4B",
    "max_tokens": 1024,
    "temperature": 0,
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```

Using TensorRT-LLM, developers can achieve significant speedups in inference throughput. For instance, the Qwen3-4B dense model served with TensorRT-LLM in BF16 achieved up to a 16.04x inference throughput speedup over the BF16 baseline (Figure 1).
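Because trtllm-serve exposes an OpenAI-compatible API, the same request can also be made from Python with the openai client library instead of curl. The snippet below is a minimal sketch, assuming a local server on port 8000 as configured above and a placeholder API key; the same pattern also works against the SGLang and vLLM servers described later, with only the base URL changed.

```python
# Minimal sketch: query the OpenAI-compatible endpoint exposed by trtllm-serve.
# The base URL, placeholder API key, and prompt are assumptions for illustration.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # port used by trtllm-serve above
    api_key="not-needed",                 # local servers typically ignore the key
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-4B",
    max_tokens=1024,
    temperature=0,
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```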
Other Frameworks

Ollama

Ollama is another framework that supports local execution of Qwen3 models. To run the Qwen3-4B model locally:

```bash
ollama run qwen3:4b
```

Thinking mode is enabled by default and produces more detailed responses; appending /no_think to a prompt switches to non-thinking mode:

```bash
"Write a python lambda function to add two numbers"            # thinking mode enabled
"Write a python lambda function to add two numbers /no_think"  # non-thinking mode
```

SGLang

SGLang is a versatile framework that can be installed via pip and used to launch a server for Qwen3 models:

```bash
pip install "sglang[all]"
```

Download the model:

```bash
huggingface-cli download --resume-download Qwen/Qwen3-4B --local-dir ./
```

Start the server:

```bash
python -m sglang.launch_server \
  --model-path /ssd4TB/huggingface/hub/models/ \
  --trust-remote-code \
  --device "cuda:0" \
  --port 30000 \
  --host 0.0.0.0
```

Make inference calls:

```bash
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Qwen/Qwen3-4B",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```

vLLM

vLLM is a lightweight framework that simplifies model serving:

```bash
pip install vllm
```

Launch the server:

```bash
vllm serve "Qwen/Qwen3-4B" \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.85 \
  --device "cuda:0" \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 256
```

Make inference calls:

```bash
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Qwen/Qwen3-4B",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```

Key Parameters for Framework Selection

The choice of framework for model deployment and inference depends on several key parameters:

- Performance: High-throughput, low-latency deployments may require a framework such as TensorRT-LLM with advanced optimization techniques.
- Resources: Smaller models can be deployed effectively with lightweight frameworks such as Ollama or vLLM.
- Cost: Balancing performance against resource utilization is crucial for cost-effective deployments.

By carefully evaluating these factors, developers can select the most appropriate framework to achieve optimal performance in their production environments.

Industry Evaluation and Company Profiles

Industry insiders have praised Qwen3 for its state-of-the-art accuracy and versatility, particularly in reasoning tasks and multilingual support. Alibaba Cloud, known for its robust cloud services and AI innovations, continues to lead the charge in developing cutting-edge LLMs that bridge the gap between research and practical applications. NVIDIA, a pioneer in GPU technology, provides the powerful hardware and software tools necessary to make these models accessible and efficient for a wide range of developers. The release of Qwen3 and the availability of these deployment frameworks mark a significant step forward in the democratization of AI, enabling faster and more efficient development cycles for AI-driven applications.