HyperAI초신경

Sequoia China Launches xbench to Evaluate AI’s Real-World Productivity

2 days ago

In an era dominated by advanced agents, a new benchmarking suite has emerged to evaluate their real-world productivity. Launched by Sequoia Capital China, xbench aims to quantify the capabilities of intelligent agents across various domains while keeping those evaluations current and relevant.

The first component, "xbench-ScienceQA," assesses the scientific knowledge and reasoning abilities of agents. Its evaluation set consists of high-quality questions pitched at the difficulty of rigorous academic coursework, and it tests both explicit and implicit information retrieval as well as answer verification. To preserve rigor, the xbench-ScienceQA question pool is refreshed quarterly, keeping the assessment fresh, reliable, and consistent.

The second component, "xbench-DeepSearch," evaluates deep search capabilities: an agent's ability to autonomously plan, collect information, analyze, and summarize complex data, with particular emphasis on the Chinese-language internet environment. All questions are human-generated and curated through rigorous validation, so agents must demonstrate end-to-end integration skills. This pool is updated monthly to keep pace with the latest models and standards, providing continuous insight into their performance.

Beyond these tracks, xbench also pursues the "quantification of AI system value in real-world scenarios," known as "Profession-Aligned" evaluation. Unlike traditional benchmarks built on static datasets, this approach treats AI systems as "digital employees" and assesses their performance within real business processes. The core of the evaluation is the actual outcome and commercial value produced, not the specific methods used to produce it.
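The outcome-over-method principle behind Profession-Aligned evaluation can be illustrated with a minimal grading loop. This is a hypothetical sketch, not xbench's actual harness: the names `Question`, `grade_agent`, and `run_agent` are assumptions for illustration, and the article does not disclose xbench's scoring code.

```python
# Illustrative sketch only: all names here (Question, grade_agent,
# run_agent) are hypothetical and not taken from xbench itself.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Question:
    prompt: str     # task handed to the agent
    reference: str  # human-verified gold answer
    domain: str     # e.g. "chemistry", "deep-search"

def grade_agent(run_agent: Callable[[str], str],
                pool: list[Question]) -> float:
    """Score an agent by outcome only: each answer is compared against
    the verified reference, regardless of how it was produced."""
    correct = sum(
        run_agent(q.prompt).strip().lower() == q.reference.strip().lower()
        for q in pool
    )
    return correct / len(pool)

# Usage with a trivial stand-in agent:
pool = [Question("2 + 2 = ?", "4", "math")]
echo_agent = lambda prompt: "4"
print(grade_agent(echo_agent, pool))  # 1.0
```

The point of the sketch is that nothing inside `grade_agent` inspects the agent's reasoning trace; only the delivered result is compared against the reference, mirroring the "digital employee" framing.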
According to the team, the Profession-Aligned metric addresses the need for benchmarks that match industry requirements, especially in emerging fields where applications are still maturing.

To avoid the pitfalls of static evaluation sets, which quickly become outdated and ineffective, xbench employs a "Long-lived Evaluation" (Evergreen Evaluation) mechanism that keeps content current and dynamic. For AGI Tracking evaluations, this mechanism supports ongoing alignment research and provides third-party, transparent, real-time evaluations that can be continuously expanded and improved. For Profession-Aligned evaluations, xbench maintains a system for dynamically collecting and updating questions from real business environments, with experts from various industries invited to contribute to and maintain the evaluation set so that it accurately reflects real-world needs.

Through these dynamic updates and cross-sectional comparisons, the team hopes to observe not just how models rank, but also the speed and nature of their development, and to determine whether models meet practical industry benchmarks and can effectively run existing business processes as standardized services.

In summary, xbench offers a comprehensive and dynamic way to measure the productivity and value of intelligent agents in real-world applications, bridging the gap between theoretical evaluations and practical industry needs.
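The evergreen mechanism described above (quarterly refreshes for ScienceQA, monthly for DeepSearch) amounts to serving only recently added questions so that stale, memorizable items rotate out. A minimal sketch of such a rotating pool, under the assumption that rotation is age-based (the article does not specify xbench's actual retirement policy; `EvergreenPool` is a hypothetical name):

```python
# Hypothetical sketch of a "Long-lived (Evergreen) Evaluation" pool.
# The article describes quarterly/monthly refreshes; the age-based
# retirement policy below is an assumption for illustration.
from datetime import date, timedelta

class EvergreenPool:
    def __init__(self, refresh_days: int):
        self.refresh_days = refresh_days   # ~90 for quarterly, ~30 for monthly
        self.items: dict[str, date] = {}   # question id -> date added

    def add(self, qid: str, added: date) -> None:
        self.items[qid] = added

    def active(self, today: date) -> list[str]:
        """Serve only questions added within the refresh window, so the
        benchmark never relies on stale, potentially leaked items."""
        cutoff = today - timedelta(days=self.refresh_days)
        return [q for q, d in self.items.items() if d >= cutoff]

pool = EvergreenPool(refresh_days=90)
pool.add("q-old", date(2024, 1, 1))
pool.add("q-new", date(2024, 6, 1))
print(pool.active(date(2024, 6, 15)))  # ['q-new']
```

A real deployment would also re-validate references when questions are refreshed, but the core idea is the rotation window itself: rankings are always computed against a pool the models could not have memorized.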
