HyperAI

NVIDIA GB200 Systems Power LMArena's AI Model for Ranking and Routing Large Language Models

4 days ago

LMArena, an initiative from the University of California, Berkeley, is changing the way large language models (LLMs) are evaluated and ranked. With early access to NVIDIA GB200 systems and the support of Nebius AI Cloud, LMArena has developed Prompt-to-Leaderboard (P2L), a model that leverages human preferences to create detailed, task-specific leaderboards for AI models.

Wei-Lin Chiang, a co-founder of LMArena and a doctoral student at Berkeley, explained that P2L captures user preferences across tasks such as math, coding, and creative writing. By producing Bradley-Terry coefficients for each prompt, P2L identifies the top-performing model in each domain. This goes beyond a single overall ranking, giving users nuanced insight into the strengths and weaknesses of each model. Evan Frick, an LMArena senior researcher and doctoral student, highlighted the importance of these detailed rankings. "One model might excel at math but be average at writing," he noted. "A single score often hides these nuances."

P2L also supports cost-based routing: users set a budget, and the system automatically selects the best-performing model within that financial limit, ensuring optimal use of resources.

In February, LMArena began deploying P2L on the NVIDIA GB200 NVL72, a powerful AI platform hosted by Nebius via NVIDIA DGX Cloud. The collaboration involved deep coordination and a shared sandbox environment, which streamlined onboarding and allowed early adopters to test NVIDIA's Blackwell platform using runbooks and best practices. Equipped with 36 Grace CPUs and 72 Blackwell GPUs, the GB200 NVL72 offers high-bandwidth, low-latency performance backed by up to 30 TB of unified LPDDR5X and HBM3E memory, ensuring efficient resource allocation for demanding AI tasks.
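The two ideas behind P2L can be sketched in a few lines. The snippet below is a minimal illustration, not LMArena's implementation: it fits Bradley-Terry strengths from pairwise human-preference counts using the classic minorization-maximization update, then routes a query to the strongest model whose per-query cost fits a user budget. All model names and prices are hypothetical.

```python
def fit_bradley_terry(wins, models, iters=200):
    """Estimate Bradley-Terry strengths from pairwise win counts.

    wins[(a, b)] = number of times model a was preferred over model b.
    Uses the standard MM (minorization-maximization) update, which
    converges to the maximum-likelihood strengths.
    """
    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        new = {}
        for a in models:
            total_wins = 0.0
            denom = 0.0
            for b in models:
                if a == b:
                    continue
                w_ab = wins.get((a, b), 0)
                w_ba = wins.get((b, a), 0)
                n = w_ab + w_ba
                if n == 0:
                    continue
                total_wins += w_ab
                denom += n / (strength[a] + strength[b])
            new[a] = total_wins / denom if denom > 0 else strength[a]
        # Normalize so strengths sum to len(models) (fixes the scale).
        scale = len(models) / sum(new.values())
        strength = {m: v * scale for m, v in new.items()}
    return strength


def route_within_budget(strengths, costs, budget):
    """Pick the strongest model whose per-query cost fits the budget."""
    affordable = [m for m in strengths if costs[m] <= budget]
    return max(affordable, key=lambda m: strengths[m]) if affordable else None


# Hypothetical preference data for three models on one task:
wins = {("A", "B"): 8, ("B", "A"): 2,
        ("A", "C"): 6, ("C", "A"): 4,
        ("B", "C"): 5, ("C", "B"): 5}
strengths = fit_bradley_terry(wins, ["A", "B", "C"])
# Model A is strongest overall, but if the budget rules it out,
# the router falls back to the best affordable alternative:
choice = route_within_budget(strengths, {"A": 5.0, "B": 1.0, "C": 0.5},
                             budget=2.0)
```

In P2L these coefficients are produced per prompt, so the same routing logic can pick a different winner for a math query than for a creative-writing one.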
LMArena ran consecutive training sessions, first on a single node and then scaling to multiple nodes, demonstrating strong single-node throughput and efficient horizontal scalability. According to Chiang, the challenge was ensuring real-time performance while staying adaptable to continuous data feedback, which is also what made the project exciting.

The success of P2L on the GB200 NVL72 owed much to the NVIDIA DGX Cloud team, which worked closely with Nebius and LMArena to validate and compile key AI frameworks, including PyTorch, DeepSpeed, Hugging Face Transformers, Accelerate, Triton, vLLM, xFormers, torchvision, and llama.cpp. The teams also ensured compatibility and performance with emerging model frameworks such as WAN2.1 video diffusion, tailored for the Arm64 architecture and the latest CUDA versions. Paul Abruzzo, a senior engineer on the NVIDIA DGX Cloud team, emphasized the depth of this coordination: "The DGX Cloud team was able to provide the necessary open-source frameworks for this engagement, enabling rapid deployment and scale experimentation." The collaboration not only achieved its technical milestones but also produced a repeatable deployment model for future large-scale AI projects.

Andrey Korolenko, Nebius's chief product and infrastructure officer, underscored the impact of the partnership, noting that validated frameworks, onboarding guides, and deployment blueprints now simplify GB200 NVL72 adoption for new customers, whether they need full rack scale or smaller, targeted configurations. Chiang praised the result: "The GB200 NVL72's performance gave us the flexibility to experiment, iterate quickly, and deliver a real-time routing model that adapts to live user input. We're seeing improved accuracy and efficiency as a result." Industry insiders commend the innovation and efficiency demonstrated by this partnership.
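The single-node-to-multi-node progression described above rests on synchronous data parallelism, which frameworks like PyTorch and DeepSpeed provide for real workloads. The pure-Python toy below (not LMArena's code; the linear model and numbers are invented for illustration) shows the core idea: each worker computes gradients on its own data shard, the gradients are averaged as an all-reduce would do, and the shared weights take the same update no matter how many workers split the batch.

```python
def local_gradient(w, shard):
    """Mean-squared-error gradient for the toy model y ≈ w * x on one shard."""
    g = 0.0
    for x, y in shard:
        g += 2 * (w * x - y) * x
    return g / len(shard)


def all_reduce_mean(grads):
    """Stand-in for the all-reduce collective: average the workers' gradients."""
    return sum(grads) / len(grads)


def train(data, num_workers, steps=100, lr=0.05):
    """Synchronous data-parallel SGD: shard data, average gradients, update."""
    w = 0.0
    shards = [data[i::num_workers] for i in range(num_workers)]
    for _ in range(steps):
        # In a real cluster each gradient runs on its own node in parallel;
        # here the loop just simulates the workers sequentially.
        grads = [local_gradient(w, shard) for shard in shards]
        w -= lr * all_reduce_mean(grads)
    return w


data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
w_single = train(data, num_workers=1)  # one "node" sees all the data
w_multi = train(data, num_workers=4)   # four "nodes" each see a shard
```

Because the averaged shard gradients equal the full-batch gradient, both runs converge to the same weights; the payoff of adding nodes is wall-clock throughput, which is what LMArena's multi-node scaling exercised on the GB200 NVL72.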
The integration of human feedback into AI models is seen as a critical step toward more transparent and user-centric AI evaluation. Moreover, the NVIDIA GB200 NVL72's performance and scalability set new benchmarks for AI deployments, paving the way for future advancements in the field. LMArena’s work is anticipated to inspire further developments in AI model evaluation and deployment, particularly in academic and research settings. NVIDIA DGX Cloud and Nebius AI Cloud aim to continue this momentum by offering easy access to cutting-edge AI infrastructure. Interested developers and researchers can contact NVIDIA to explore how to leverage the GB200 NVL72 for their own AI projects, benefiting from reduced deployment complexity and enhanced performance.
