Command Palette
Search for a command to run...
ARIS: البحث الذاتي من خلال التعاون متعدد الوكلاء في إطار المعارك
ARIS: البحث الذاتي من خلال التعاون متعدد الوكلاء في إطار المعارك
Ruofeng Yang Yongcan Li Shuai Li
الملخص
يصف هذا التقرير نظام ARIS (البحث التلقائي أثناء النوم)، وهو إطار عمل مفتوح المصدر للبحث الذاتي، بما في ذلك هندسته المعمارية، وآليات ضمان المصداقية، وخبرات النشر الأولية. تعتمد أداء أنظمة الـ agents المبنية على نماذج اللغات الكبيرة (LLMs) على كل من أوزان النموذج والإطار المحيط به، الذي يقرر المعلومات التي سيتم تخزينها واسترجاعها وعرضها على النموذج. في سير العمل البحثي طويل الأمد، لا تمثل الأعطال الظاهرة الوضع المركزي للفشل، بل النجاح المقنع غير المدعوم بأدلة كافية: حيث يمكن لنظام الـ agent الذي يعمل لفترة طويلة أن ينتج ادعاءات تكون أدلتها غير مكتملة، أو مضللة، أو موروثة صامتاً من صياغة المنفذ (executor). لذلك، نقدم ARIS كإطار عمل للبحث ينسق سير عمل البحث في مجال تعلم الآلة من خلال التعاون العدائي عبر النماذج كإعداد افتراضي: حيث يقود نموذج منفذ (executor model) التقدم، ويُستحسن وجود مراجع من عائلة نماذج مختلفة لنقد المخرجات الوسيطة وطلب المراجعات.يتكون ARIS من ثلاث طبقات معمارية. توفر طبقة التنفيذ أكثر من 65 مهارة قابلة لإعادة التعريف عبر Markdown، ودمجاً للنماذج عبر MCP، وموسوعة بحثية دائمة لإعادة الاستخدام التكراري للنتائج السابقة، وتوليداً حتمياً للأشكال البيانية. تنسق طبقة الترتيب خمس سير عمل متكاملة من البداية للنهاية مع إعدادات جهد قابلة للضبط وتوجيه قابل للتكوين نحو نماذج المراجعة. وتشمل طبقة الضمان عملية من ثلاث مراحل لفحص ما إذا كانت الادعاءات التجريبية مدعومة بالأدلة: التحقق من السلامة، وخريطة ربط النتائج بالادعاءات، ومراجعة الادعاءات التي تقارن عبارات المخطوطة بسجل الادعاءات والأدلة الخام، بالإضافة إلى مخطط تحرير علمي ذو خمس مراحل، وفحوصات للبراهين الرياضية، والتفتيش البصري لملف PDF المُنشأ. وتسجل حلقة التحسين الذاتي النموذجية آثار البحث وتقترح تحسينات على الإطار، لا يتم اعتمادها إلا بعد موافقة المراجع.
One-sentence Summary
Addressing the risk of plausible unsupported success in long-horizon workflows, ARIS is an open-source research harness that coordinates autonomous machine learning research through cross-model adversarial collaboration, pairing an executor model that drives progress with a reviewer from a different model family that critiques intermediate artifacts, while its execution layer supports iterative discovery via a persistent research wiki, model integrations via MCP, deterministic figure generation, and more than 65 Markdown-defined skills.
Key Contributions
- ARIS is an open-source research harness that coordinates autonomous machine learning workflows through a three-layer architecture featuring over 65 reusable Markdown-defined skills, model integrations via MCP, a persistent research wiki, and deterministic figure generation.
- The framework implements a cross-model adversarial collaboration mechanism where an executor model drives progress while a reviewer from a distinct model family critiques intermediate artifacts and requests revisions, effectively breaking self-review blind spots without the coordination overhead of larger multi-agent committees.
- The system incorporates an explicit assurance stack for integrity auditing and cross-platform portability, supporting reproducible end-to-end research workflows that span idea generation, experimentation, and manuscript preparation across multiple host environments.
Introduction
Automating the end-to-end machine learning research pipeline holds significant promise for accelerating scientific discovery, yet long-horizon autonomous tasks remain highly susceptible to hallucinations, correlated errors, and misreported results. Prior autonomous research agents typically rely on same-model self-refinement and tightly coupled workflows, which fail to catch shared inductive biases, lack modular resumption capabilities, and offer minimal system-level verification of experimental integrity. To address these gaps, the authors propose ARIS, a framework that treats single-agent execution as inherently unreliable and decomposes the research process into modular, state-preserving stages. The authors leverage an adversarial cross-family executor-reviewer pairing with explicit assurance checks at critical milestones, ensuring that each research artifact is independently validated by a distinct model family before advancing to the next workflow phase.
Dataset
- Dataset composition and sources: The authors compile a skill inventory that catalogs core framework skills from a current release, as outlined in Table 5.
- Subset details: The provided excerpt does not define distinct subsets, so specific sizes, origins, or filtering criteria are not documented.
- Usage and training configuration: The authors reference the inventory to establish a baseline skill framework. The text does not outline training splits, mixture ratios, or data blending strategies.
- Processing and metadata: No cropping methods, metadata generation steps, or additional preprocessing pipelines are described in the available content.
Method
The ARIS framework is structured around three primary architectural layers—execution, orchestration, and assurance—that collectively address the challenges of long-horizon, autonomous machine learning research. The execution layer provides a modular foundation through more than 65 reusable skills, each defined as a plain-text Markdown file (SKILL.md) that specifies inputs, outputs, procedural steps, and quality gates. These skills are coordinated via versionable artifact contracts, enabling checkpoint-based recovery and auditability. The orchestration layer manages five end-to-end workflows—idea discovery, experiment bridge, auto-review loop, paper writing, and rebuttal—chained through these contracts and grouped into four research phases: Discovery, Experimentation, Manuscript, and Post-Submission. This layer supports adjustable effort settings and configurable routing to reviewer models, allowing users to scale depth and breadth while maintaining core review invariants. The assurance layer implements a multi-stage process to detect and mitigate plausible unsupported success, including evidence integrity verification, result-to-claim mapping, and claim auditing against raw evidence and a claim ledger. It also includes a five-pass scientific-editing pipeline, mathematical-proof checks, and visual inspection of rendered PDFs. A prototype meta-optimization loop records research traces and proposes harness improvements that are only adopted after reviewer approval, enabling iterative refinement of the system itself. The overall architecture is designed to enforce independent assurance by default, leveraging cross-model adversarial collaboration between an executor and a reviewer drawn from different model families, thereby reducing shared inductive biases and enhancing critical evaluation.
Experiment
The evaluation relies on observational deployment tracking and a single overnight operational run to assess ARIS under realistic conditions. These experiments validate the system's practical capability to autonomously prune unsupported claims and iteratively refine manuscripts through automated review cycles, alongside demonstrating substantial ecosystem expansion across multiple technical domains. Because the reported outcomes remain observational, they confirm operational feasibility rather than establishing causal advantages for specific reviewer architectures or model configurations. A future controlled benchmark protocol will be required to isolate the impact of algorithmic design from external variables like researcher expertise and task difficulty.
The authors present a comparison of different systems, including their own, across several capabilities such as cross-family review, adversarial review, composability, E2E research workflows, assurance stack, and cross-platform portability. The system developed by the authors, ARIS, demonstrates a comprehensive set of features, particularly in composability and E2E research workflows, and supports cross-platform portability. ARIS supports cross-platform portability and E2E research workflows, which are not supported by other systems. ARIS includes composable skills and an assurance stack, features absent in most compared systems. ARIS implements a default cross-family policy, whereas other systems either lack this capability or use partial or none policies.
The evaluation compares ARIS against existing systems across multiple capability dimensions, validating its performance in cross-family and adversarial reviews, composability, end-to-end research workflows, assurance stacks, and cross-platform portability. ARIS demonstrates superior integration by uniquely supporting full end-to-end workflows and cross-platform deployment while introducing composable skills and a dedicated assurance stack absent in competing tools. Additionally, it establishes a comprehensive default cross-family policy that addresses the partial or missing coverage found in alternative approaches.