Beyond Static Tools: Test-Time Tool Evolution for Scientific Reasoning
Abstract
The central challenge for AI in science is not reasoning itself, but the ability to create computational methods in an open-ended scientific world. Current large language model (LLM) agents rely on static, pre-defined tool libraries, a paradigm that fundamentally fails in scientific domains characterized by tool sparsity, heterogeneity, and overlap, and that are essentially unbounded in nature. In this paper, we propose Test-Time Tool Evolution (TTE), a new paradigm that enables agents to synthesize, verify, and evolve executable tools during inference. By transforming tools from static resources into artifacts created in response to problems, TTE overcomes the limitations of fixed tool libraries and the long-tail distribution problem. To enable rigorous evaluation, we introduce SciEvo, a benchmark containing 1,590 scientific reasoning tasks supported by 925 automatically evolved tools. Extensive experiments demonstrate state-of-the-art performance in accuracy and tool efficiency, while enabling effective cross-domain adaptation of computational methods. Code and benchmark are available at: https://github.com/lujiaxuan0520/Test-Time-Tool-Evol.
One-sentence Summary
The authors from Shanghai AI Lab, Fudan University, and collaborating institutions propose Test-Time Tool Evolution (TTE), a novel paradigm enabling LLM agents to dynamically synthesize, verify, and evolve executable tools during inference, overcoming the limitations of static tool libraries in scientific AI by treating tools as problem-driven artifacts, achieving state-of-the-art performance on SciEvo, a benchmark with 1,590 tasks and 925 evolved tools, with broad cross-domain applicability.
Key Contributions
- Scientific reasoning with LLMs is limited by static, pre-defined tool libraries that cannot adapt to the sparse, heterogeneous, and open-ended nature of real-world scientific problems, restricting agents to passive tool selection rather than active discovery.
- The paper introduces Test-Time Tool Evolution (TTE), a novel paradigm that enables LLM agents to dynamically synthesize, verify, and evolve executable tools during inference, transforming tools into problem-driven artifacts rather than fixed resources.
- Evaluated on SciEvo, a benchmark with 1,590 scientific tasks and 925 evolved tools, TTE achieves state-of-the-art performance in accuracy and tool efficiency, demonstrating strong cross-domain adaptability and effective on-demand tool generation.
Introduction
The authors address a critical gap in AI for scientific reasoning: existing large language model (LLM) agents rely on static, pre-defined tool libraries that are ill-suited for the open-ended, evolving nature of scientific discovery. These static libraries suffer from sparsity, heterogeneity, and an inability to generate novel computational primitives on demand, limiting agents to passive tool selection rather than active problem-solving. To overcome this, the authors introduce Test-Time Tool Evolution (TTE), a paradigm that enables agents to dynamically synthesize, verify, and evolve executable tools during inference. This shift transforms tools from fixed resources into problem-driven artifacts, allowing agents to adapt to unseen scientific challenges in real time. The framework is evaluated on SciEvo, a novel benchmark with 1,590 scientific tasks and 925 evolved tools, demonstrating state-of-the-art performance in both accuracy and tool efficiency, with strong cross-domain adaptability.
Dataset
- The SciEvo benchmark is constructed through an evolutionary process using the TTE-Zero framework, where tools are generated from scratch rather than sourced from static codebases, ensuring alignment with real scientific reasoning tasks.
- The seed data comprises 1,590 high-quality scientific problems drawn from three sources: SciEval (Sun et al., 2024), SciBench (Wang et al., 2024), and a proprietary materials science dataset focused on domain-specific calculations.
- Only computational problems requiring multi-step reasoning and precise numerical solutions are retained; purely knowledge-based queries are filtered out.
- To ensure diversity, candidate questions are embedded using a sentence embedding model (Reimers and Gurevych, 2019), clustered via K-Means, and uniformly sampled from each cluster to form a balanced seed set.
- The seed set provides problem contexts (Q) and ground-truth answers (A) for validating tool correctness during synthesis.
- Using the TTE-Zero framework, the agent starts with an empty tool library and iteratively generates, executes, and validates Python functions in response to seed questions. Only atomic functions that successfully contribute to correct answers are retained.
- This process yields a final verified library of 925 atomic tools, fully aligned with the problem space.
- Tools are organized into a hierarchical taxonomy of 25 sub-disciplines across four major fields: Physics (499 tools), Chemistry (192), Mathematics (171), and Materials Science (63), established through PCA on tool description embeddings and refined by PhD-level experts.
- The dataset is used in experiments to evaluate both problem-solving accuracy and tool evolution efficiency, with SciEvo serving as the primary benchmark alongside SciBench and SciEval.
- In training and evaluation, the model uses a mixture of problems from the curated SciEvo dataset, with tool usage dynamically adapted during inference.
- No additional manual pruning is applied; instead, the dataset is processed through semantic clustering to ensure balanced representation across disciplines.
- Metadata for each tool includes its domain, sub-discipline, function signature, and validation status, constructed during the evolution process to support traceability and analysis.
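The diversity step above (embed, cluster with K-Means, sample uniformly per cluster) can be sketched as follows. The embedding and clustering themselves (e.g. a sentence-embedding model plus scikit-learn's KMeans) are assumed to run upstream; this hypothetical helper only shows the balanced sampling over the resulting cluster labels.

```python
import random
from collections import defaultdict

def balanced_sample(questions, cluster_labels, per_cluster, seed=0):
    """Sample the same number of questions from each cluster.

    `questions` and `cluster_labels` are parallel lists; the labels come
    from K-Means over sentence embeddings, computed elsewhere.
    """
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for q, c in zip(questions, cluster_labels):
        by_cluster[c].append(q)
    sample = []
    for c in sorted(by_cluster):
        pool = by_cluster[c]
        # Uniformly draw up to `per_cluster` questions from this cluster.
        sample.extend(rng.sample(pool, min(per_cluster, len(pool))))
    return sample
```

Clusters smaller than `per_cluster` contribute all of their questions, so the seed set stays balanced without discarding rare sub-disciplines.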
Method
The authors leverage a closed-loop evolutionary framework for Test-Time Tool Evolution (TTE), which fundamentally shifts from static tool paradigms by enabling tools to be generated and refined during problem-solving. The overall architecture, as illustrated in the framework diagram, consists of five integrated modules that operate in a continuous cycle: Structured Task Decomposition, Dynamic Tool Retrieval, Generative Tool Synthesis, Atomic Tool Refinement, and Runtime Execution. This workflow begins with the Problem Analyzer, which decomposes a complex scientific query into a sequence of executable sub-goals, each requiring specific computational operations. The system then proceeds to the Dynamic Tool Retrieval stage, where it queries the Dynamic Tool Registry for existing tools using semantic similarity between the sub-goal description and tool metadata. The retrieval process is governed by a threshold-based decision: if a retrieved tool's similarity score exceeds a predefined threshold, it is selected; otherwise, the system triggers the Generative Tool Synthesis module to create a new tool on-demand.
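The threshold-based retrieval decision can be illustrated with a minimal sketch. The cosine-similarity gating follows the description above, but the function name, dense-vector interface, and the 0.8 default threshold are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def retrieve_or_synthesize(subgoal_vec, tool_vecs, tool_names, threshold=0.8):
    """Return the best-matching tool if its cosine similarity to the
    sub-goal embedding clears `threshold`; otherwise return None to
    signal that Generative Tool Synthesis should create a new tool."""
    if not tool_names:
        return None  # empty library: always synthesize
    sims = tool_vecs @ subgoal_vec / (
        np.linalg.norm(tool_vecs, axis=1) * np.linalg.norm(subgoal_vec)
    )
    best = int(np.argmax(sims))
    return tool_names[best] if sims[best] >= threshold else None
```

Returning `None` here stands in for triggering the synthesis module, so the same gate drives both the reuse path and the evolution path.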

The Generative Tool Synthesis module employs a chain-of-thought reasoning process to propose a new tool, which is then rigorously validated by the Tool Verifier through syntax checking, execution testing, and domain validation. Only tools that pass all checks are passed to the Atomic Tool Refinement stage. Here, the Atomic Decomposer breaks down the complex tool into fundamental atomic units, maximizing the expected reuse improvement by enabling partial reusability of sub-functions. The Redundancy Checker then compares these new atomic tools against the existing library using semantic similarity; a new tool is only registered if it is sufficiently dissimilar to all existing entries. Concurrently, the library is pruned to maintain efficiency, removing low-usage tools when the capacity is exceeded. The final stage, the Runtime Execution Engine, integrates the retrieved or generated tools into a sequence to synthesize the final answer, closing the loop and applying the evolved library capabilities to the user query. This entire process is designed to be robust, with a fallback mechanism that degrades gracefully in the event of tool synthesis failure.
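The Redundancy Checker and capacity pruning described above can be sketched together. The similarity cutoff, the usage-count pruning rule, and the `name -> (vector, usage_count)` library layout are assumptions for illustration; the paper does not specify these details.

```python
import numpy as np

def register_tool(new_vec, new_name, library, max_sim=0.9, capacity=500):
    """Register an atomic tool only if it is sufficiently dissimilar to
    every existing entry; evict the least-used tool when the library is
    at capacity. `library` maps name -> (embedding, usage_count)."""
    for name, (vec, _) in library.items():
        sim = float(vec @ new_vec /
                    (np.linalg.norm(vec) * np.linalg.norm(new_vec)))
        if sim >= max_sim:
            return False  # near-duplicate: reuse the existing tool instead
    if len(library) >= capacity:
        # Prune the tool with the lowest usage count to stay within budget.
        least_used = min(library, key=lambda n: library[n][1])
        del library[least_used]
    library[new_name] = (new_vec, 0)
    return True
```

Gating on dissimilarity keeps the library compact, which matters later for the "Tool Overload Phenomenon" the experiments report at larger library sizes.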
Experiment
- TTE-Zero evaluates tool evolution from scratch with a maximum library capacity of 500. On SciBench, it achieves 0.45 accuracy, surpassing KTCE (0.37) and CheMatAgent (0.34). On SciEvo, it reaches 0.62 accuracy, outperforming CheMatAgent (0.56) and KTCE (0.55), validating the advantage of dynamic tool synthesis over static or retrieval-based methods.
- TTE-Zero demonstrates high tool reusability: TRR@1 reaches 0.99 on SciEvo, indicating near-complete utilization of generated tools, while TRR@10 remains at 0.41, confirming the emergence of reusable scientific primitives, in contrast to baselines like Creator (TRR@10 = 0.02).
- Ablation study shows sub-goal decomposition ("S+Tools") significantly improves accuracy over direct query-based retrieval ("Q+Tools"), with gains up to 0.364 vs. 0.313 on Qwen2.5-7B, highlighting the importance of structured decomposition for precise tool retrieval.
- TTE-Adapt enables cross-domain adaptation by balancing prior knowledge retention and new knowledge consolidation. On Chemistry and Physics, it reduces TRR_trans@1 from 0.26 to 0.23 while increasing TRR_evol@1 to 0.24 and 0.32, respectively, demonstrating effective mitigation of negative transfer and successful knowledge substitution.
- Tool reuse analysis reveals TTE shifts tool utilization toward higher reuse frequencies (10–50+), indicating a transition from disposable scripts to generalized primitives, while baselines exhibit left-skewed distributions with most tools used only once or twice.
- The "Tool Overload Phenomenon" is observed: increasing library size from 100 to 500 degrades performance in query-to-tool matching due to retrieval collisions and contextual interference, underscoring the need for advanced retrieval architectures beyond flat similarity search.
- Case studies confirm TTE's ability to autonomously synthesize missing primitives: in molar mass estimation, it evolves a dedicated calculate_molar_volume function, achieving the exact solution (169 g/mol); in electroplating stoichiometry, it generates calculate_moles_of_electrons and calculate_area tools, yielding precise results (31.6 g, 1283 cm²).
The authors use a bar chart to compare the accuracy of different models under three tool library sizes (100, 250, and 500 tools) and three settings: no tool call, Q+Tools (using the original query), and S+Tools (using sub-goal decomposition). Results show that the S+Tools method consistently achieves the highest accuracy across all models and library sizes, with the largest gains observed in the 500-tool setting. The Q+Tools method generally outperforms the no tool call baseline, but the S+Tools approach demonstrates a clear advantage, particularly for models like GPT-3.5-turbo and GPT-4o, indicating that sub-goal decomposition significantly improves tool utilization and problem-solving performance.

The authors use the provided table to illustrate the adaptive execution process of the TTE framework, where the system dynamically decides whether to retrieve existing tools or evolve new ones. Results show that the system successfully retrieves standard tools for common operations like charge calculation and mass conversion, but evolves new primitives for missing functions such as calculating moles of electrons and tray area, enabling accurate problem-solving through targeted tool synthesis.

The authors use a step-by-step execution trace to demonstrate how the system handles a complex scientific problem by combining retrieved tools with newly evolved primitives. In Step 3, the system identifies a missing computational primitive for calculating molar volume and autonomously synthesizes a new tool, which is then successfully executed to produce the correct intermediate result, enabling accurate final computation. This adaptive behavior highlights the system's ability to bridge gaps in its tool library through on-demand synthesis.
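A hypothetical reconstruction of the evolved primitive named in this trace might look like the following; the paper shows only the tool's name, so the signature, units, and body here are assumptions based on the standard relation between molar mass and density.

```python
def calculate_molar_volume(molar_mass_g_per_mol, density_g_per_cm3):
    """Molar volume in cm^3/mol, via V_m = M / rho.

    Illustrative sketch of the atomic tool synthesized in Step 3 of the
    trace; the actual evolved code is not shown in the paper.
    """
    return molar_mass_g_per_mol / density_g_per_cm3
```

For example, water (M ≈ 18 g/mol, ρ ≈ 1 g/cm³) gives a molar volume of about 18 cm³/mol.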

Results show that TTE-Adapt achieves higher accuracy than the "No Tool" and "Source Only" baselines in both cross-domain settings, with performance gains driven by an adaptive substitution mechanism. The system reduces reliance on pre-existing tools (lower TRR_trans) while effectively consolidating new knowledge into reusable primitives (higher TRR_evol), indicating successful adaptation to the target domain.

The authors use the Tool Reuse Rate (TRR@k) to evaluate the quality of tool evolution in the TTE-Zero setting, where the system synthesizes a tool library from scratch. Results show that TTE-Zero achieves significantly higher reuse rates across all thresholds compared to baselines, with TRR@1 reaching 0.89 on SciBench and 0.99 on SciEvo, indicating near-complete utilization of generated tools and effective consolidation of reusable scientific primitives.
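One plausible reading of TRR@k, consistent with the interpretation above (TRR@1 near 1.0 meaning almost every generated tool is used at least once), is the fraction of library tools invoked at least k times. The paper's exact definition may differ in detail, so this is a sketch under that assumption.

```python
def trr_at_k(usage_counts, k):
    """Tool Reuse Rate at threshold k: the fraction of tools in the
    evolved library that were invoked at least k times."""
    if not usage_counts:
        return 0.0
    return sum(c >= k for c in usage_counts) / len(usage_counts)
```

Under this reading, a high TRR@10 indicates that a large share of tools became generalized primitives rather than disposable one-off scripts.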
