HyperAI

ChatGPT Agent Delivers Mixed Results: One Success Among Eight Tests Highlights Potential and Limitations

12 days ago

OpenAI recently unveiled Agent, an advanced AI tool that combines the capabilities of Deep Research and Operator, allowing it to navigate and interact with applications and the web. Agent is currently available only to Pro tier subscribers, who pay $200 per month and receive 400 agent interactions. Plus tier subscribers, who pay $20 per month, will soon gain access with 40 interactions per month. Despite Agent's promising features, initial tests reveal mixed results, raising questions about its reliability and practicality.

David Gewirtz, a tech expert, conducted eight comprehensive tests to evaluate Agent's performance, covering tasks such as online shopping, price comparison, presentation creation, and legal-document analysis.

1. Selecting Products on Amazon
Objective: Find three configurations of Power-over-Ethernet cable tools: budget, mid-tier, and premium.
Outcome: Agent found a budget kit accurately and explained its reasoning, but paired it with the wrong image. For the mid-tier and premium options, the links were broken, and some of the recommended products did not exist on Amazon. Agent also ignored the instruction to search exclusively on Amazon, fetching data from other sources instead.

2. Comparing Egg Prices
Objective: Compare egg prices at local grocery stores using Instacart.
Outcome: Agent visited 21 stores within a 47-mile radius and ranked the eggs by price, but it often chose more expensive options. It understood the task yet lacked the precision to consistently select the cheapest items.

3. Creating a PowerPoint Slide
Objective: Add a new slide to an existing deck to update Bitcoin investment data.
Outcome: Agent understood the task and moved existing data points to accommodate the new node. However, it failed to adjust the scale, reproduce the fonts, or place text blocks correctly, pushing the entire graphic off-center.

4. Article Categorization
Objective: Scroll through and categorize the articles in a newsletter archive.
Outcome: Agent struggled to scroll through the article list using JavaScript and hit the end of its browsing-session limit, collecting only partial data. This highlights its limitations in handling large-scale tasks and time constraints.

5. Extracting Text from Video
Objective: Transcribe a specific segment from a video.
Outcome: On its first attempt, Agent returned a mix of transcript and its own analysis. After a second, more specific request, it produced an accurate transcript, albeit through a slow and repetitive process.

6. Creating a Trend Analysis Presentation
Objective: Prepare a comprehensive trend analysis on remote work for a management team.
Outcome: Agent produced a well-organized 17-slide deck, but with poor graphic quality. Many of the presentation's claims could not be verified: only five of 17 data points were fully confirmed, underscoring the AI's tendency to "hallucinate," or fabricate information.

7. Vetting a Presentation for Accuracy
Objective: Validate the accuracy of the remote-work trends presentation.
Outcome: Agent provided a detailed analysis, confirming only five of the 17 claims. This contrasts sharply with GPT-4o, which confirmed all of them, indicating that Agent's data-verification capabilities need further refinement.

8. Analyzing Building Code for Fence Installation
Objective: Review and interpret local building codes for installing a fence.
Outcome: Agent excelled here, delivering a detailed and accurate analysis within four minutes. It even created working diagrams, demonstrating strong potential for complex, well-defined tasks.

Evaluation and Industry Insight
Industry experts, including Gewirtz, note that while Agent shows promise, it is far from reliable in its current form. Its tendency to produce "alternative facts" and its limitations in handling large-scale tasks and detailed graphic design work are significant drawbacks.
Gewirtz suggests that in its current state, Agent is unsuitable for professional use without extensive human oversight and verification. However, the successful fence-installation test shows that when applied to specific, well-contained tasks, Agent can produce highly useful and accurate results. With further development and refinement, the tool could become a valuable asset in many industries.

Company Profiles
- OpenAI: A leading AI research lab known for developing advanced AI models such as ChatGPT and DALL-E. Its recent focus on practical AI tools aims to bridge the gap between cutting-edge research and real-world applications.
- Meta: A prominent tech giant with significant investments in AI, particularly in large language models and superintelligent systems. Meta's partnership with Scale AI signals a strategic push to enhance its AI capabilities and compete with industry leaders.