AIM-Bench: Benchmark for AI-Powered Stock Management Decisions

Researchers from Nanjing University, led by Professor Sun Yuxiang and his collaborators, have developed AIM-Bench, the first benchmark platform specifically designed to evaluate the decision-making behaviors and biases of large language model agents in inventory management. The platform features five distinct supply chain environments of varying complexity: the Newsvendor Problem, multi-period replenishment, the Beer Game, a two-tier warehouse network, and a full supply chain network. Each environment incorporates one or more sources of uncertainty, such as stochastic demand, variable lead times, and unpredictable partner behavior.

The study uncovered several key findings. First, large language models exhibit decision-making biases similar to those observed in humans. In the Newsvendor Problem, most models demonstrate a "pull-to-center" effect: under low-profit scenarios, average order quantities exceed optimal levels, while under high-profit scenarios, they fall short. This deviation is attributed to the anchoring-and-insufficient-adjustment heuristic, with mean demand serving as the dominant anchor. Interestingly, previous demand realizations, another potential anchor, showed little influence on model decisions.

In multi-period replenishment tasks, models displayed "bracing behavior," characterized by overestimating the likelihood of negative events. Faced with uncertain demand and lead times, models tend to over-order in reaction to perceived volatility.

In the Beer Game, all tested models exhibited a pronounced bullwhip effect, in which demand variability amplifies as it moves upstream through the supply chain. This demonstrates that even advanced AI agents are susceptible to the same systemic distortions that plague human decision-makers. Surprisingly, the framing effect, where decision outcomes change based on how a problem is presented, was not significant in the inventory context.
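Both the pull-to-center benchmark and the bullwhip effect have standard textbook formulations that make these findings concrete. The sketch below is a minimal illustration in Python; the demand distribution, prices, and order series are invented for the example, not taken from the paper, and the variance-ratio bullwhip index is a common formulation that may differ from AIM-Bench's exact definition.

```python
from statistics import NormalDist, variance

def optimal_order(mean, std, price, cost, salvage=0.0):
    """Newsvendor optimum: order the critical-fractile quantile of demand."""
    underage = price - cost                       # profit lost per unit of unmet demand
    overage = cost - salvage                      # loss per unsold unit
    fractile = underage / (underage + overage)    # critical fractile
    return NormalDist(mean, std).inv_cdf(fractile)

def bullwhip_index(orders, demand):
    """Common bullwhip measure: variance amplification of orders over demand.
    Values above 1 mean a stage amplifies variability as it moves upstream."""
    return variance(orders) / variance(demand)

# Pull-to-center reference points for demand ~ Normal(100, 30) (illustrative):
high_profit = optimal_order(100, 30, price=12, cost=3)  # optimum above the mean
low_profit = optimal_order(100, 30, price=4, cost=3)    # optimum below the mean
# Biased agents order between the mean (100) and these optima: too little
# when profit is high, too much when profit is low.

# Bullwhip: an over-reacting policy turns mild demand swings into large order swings.
demand = [10, 12, 9, 11, 10, 13, 8, 12]
orders = [10, 16, 3, 15, 8, 19, 0, 20]
amplification = bullwhip_index(orders, demand)  # greater than 1
```

In this toy setup the high-profit optimum sits above mean demand and the low-profit optimum below it, so the "pull" toward the mean produces exactly the over- and under-ordering pattern the study reports.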
Despite prior evidence of risk preference reversals (such as loss aversion) in domains like healthcare and finance, altering the description of the Newsvendor Problem to emphasize gains versus losses did not significantly affect model behavior. This suggests that AI decision-making biases are highly task-dependent and cannot simply be extrapolated from human behavioral theories.

Information sharing was found to significantly mitigate the bullwhip effect. When models were given access to upstream and downstream inventory and order data, the bullwhip effect index (BWE) dropped by approximately 60% on average; Qwen-2.5's BWE, for example, decreased from 23.07 to 10.73. This highlights the potential of transparency and coordination to improve AI-driven supply chain decisions.

Another critical insight is that process-level metrics outperform outcome-level metrics in assessing decision quality. Inspired by dynamic programming, the researchers computed the distance between model decisions and the optimal replenishment policy over time. These fine-grained process indicators proved more sensitive than traditional outcome measures such as inventory cost or stockout rate. For instance, while GPT-4.1 and Qwen-2.5 had similar stockout rates, GPT-4.1's decisions tracked the optimal path more closely, indicating superior decision-making fidelity.

The research has several promising applications. First, AIM-Bench can be used to train or select more reliable large models for automated replenishment systems, especially in volatile sectors such as e-commerce, fast-moving consumer goods, and manufacturing. Second, it can serve as a training and simulation platform for supply chain professionals, enabling them to compare AI and human decisions and identify cognitive biases in their own behavior. Third, it can form the basis of AI bias detection and correction tools integrated into ERP or SCM systems, capable of flagging anomalous decisions and triggering corrective actions such as information sharing or multi-model consensus. Fourth, it supports human-AI collaboration frameworks that combine the speed of AI with human judgment to enhance decision robustness in complex, uncertain settings.

The researchers note that while large models have excelled at tasks like mathematical reasoning, code generation, and long-term planning, their performance in real-world operational contexts, especially under uncertainty, remains underexplored. Inventory management, a core function affecting both cost and customer satisfaction, is particularly sensitive to decision errors. Prior studies have tested models in isolated settings, but few have systematically evaluated how models perform across multiple uncertainty types and multi-agent dynamics.

Sun Yuxiang and Professor Chen Caihua, along with master's students Zhao Xuhua and Xie Yuxuan, designed AIM-Bench to address this gap. Their goals were threefold: to build a comprehensive benchmark, to assess model decision-making under uncertainty, and to investigate whether strategies like cognitive reflection and information sharing can reduce bias.

One striking observation occurred during development of the multi-agent Beer Game. The team noticed that GPT-4o, when exposed to full information, exhibited "action chasing": it imitated its partners' orders so closely that the bullwhip effect fell to near zero, but it also lost the ability to explore better strategies. This revealed a critical trade-off: information sharing is not a universal solution, and designing "moderate" information-sharing mechanisms that prevent overfitting to others' behavior remains a key challenge. Another unexpected finding was the absence of risk reversal in the Newsvendor Problem.
After rigorous validation and consultation with behavioral economists, the team concluded that AI biases are context-specific, challenging the assumption that human behavioral theories can be applied directly to AI.

Looking ahead, the team plans to expand AIM-Bench with more realistic factors, such as transportation losses, replenishment costs, supplier reliability, and multi-product coordination. They also aim to integrate reinforcement learning with large models to train agents that can self-correct over time, and they are developing explainability tools that visualize AI decision processes so human users can understand why a model made a particular choice.

The researchers are also pursuing open-source collaboration, planning to release AIM-Bench to the academic and industrial community to foster shared development of evaluation standards. They are currently piloting a product based on reinforcement learning and large models for intelligent scheduling and inventory optimization; early trials in manufacturing and retail show promising results in reducing inventory costs and improving responsiveness.

Sun Yuxiang emphasized: "AI is not meant to replace human decision-makers. Instead, our work aims to understand AI's limitations and biases so we can build more trustworthy, responsible, and effective human-AI systems. This research is not just about evaluating AI; it's about reimagining how we design intelligent systems that enhance, rather than replace, human judgment. We hope to partner with industry leaders to turn these academic insights into real-world solutions that strengthen supply chain resilience."
