
Practical Test of LLM Agents: 24% of Tasks Can Be Completed Autonomously

Every day, we interact with computers at work and at home, and many tasks can be completed entirely with access to a computer and the internet. Thanks to advances in large language models (LLMs), AI agents that interact with and act on their environment are evolving rapidly. But how well do these agents actually perform when accelerating or autonomously executing professional tasks? The answer matters both to industries looking to integrate AI into their workflows and to economic policymakers assessing the potential impact on the labor market.

To measure the progress of LLM-driven agents on real-world professional tasks, this study introduces TheAgentCompany, an extensible benchmark for evaluating AI agents that engage with their environment much like digital workers: browsing the web, writing code, running programs, and communicating with coworkers.

The benchmark is built as a self-contained environment with internal websites and data that simulates a small software company, populated with a variety of tasks that employees of such a company might typically carry out. We tested baseline agents driven by both closed-API and open-weights language models, and found that the most competitive agent could complete 24% of the tasks autonomously.

This result paints a nuanced picture of task automation with LLM agents. In a simulated workplace, a meaningful share of simpler tasks can be handled autonomously, but more complex, longer-horizon tasks remain beyond the reach of current systems. The findings highlight both the promise and the limitations of AI agents in professional settings, offering useful guidance for anyone considering their adoption.
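To make the scoring concrete, here is a minimal sketch of how a completion-rate benchmark of this kind could be wired up. It is an illustration only, assuming a toy in-memory "workspace" in place of TheAgentCompany's real simulated company; the names (`Task`, `run_benchmark`, `toy_agent`) are hypothetical, not the benchmark's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical task definition: each task carries checkpoint predicates
# that inspect the simulated workspace after the agent has run.
@dataclass
class Task:
    name: str
    instruction: str
    checkpoints: list[Callable[[dict], bool]] = field(default_factory=list)

    def is_complete(self, workspace: dict) -> bool:
        # Count a task as autonomously completed only if every
        # checkpoint passes; partial progress does not count here.
        return all(check(workspace) for check in self.checkpoints)

def run_benchmark(agent_step: Callable[[str, dict], None],
                  tasks: list[Task]) -> float:
    """Run each task in a fresh environment; return the fraction fully completed.

    `agent_step` stands in for an LLM-driven agent loop that mutates the
    workspace (files, web state, chat messages) while pursuing the instruction.
    """
    completed = 0
    for task in tasks:
        workspace: dict = {}  # fresh simulated environment per task
        agent_step(task.instruction, workspace)
        if task.is_complete(workspace):
            completed += 1
    return completed / len(tasks)

# Toy agent: "writes" a file when the instruction mentions a report,
# but has no idea how to fix a build.
def toy_agent(instruction: str, workspace: dict) -> None:
    if "report" in instruction:
        workspace["report.md"] = "# Q3 report"

tasks = [
    Task("write-report", "Write the Q3 report to report.md",
         [lambda ws: "report.md" in ws]),
    Task("fix-build", "Fix the failing CI build",
         [lambda ws: ws.get("build") == "passing"]),
]

print(f"completion rate: {run_benchmark(toy_agent, tasks):.0%}")  # -> 50%
```

Note that in this sketch only fully completed tasks count toward the rate, which is why a headline figure like 24% is strict by construction: an agent that gets most of the way through a long task still scores zero on it.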
