ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration

Ruofeng Yang Yongcan Li Shuai Li

Abstract

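This report describes the architecture, assurance mechanisms, and early deployment experience of ARIS (Auto-Research-in-sleep), an open-source framework for autonomous research. The performance of agent systems built on large language models (LLMs) depends not only on the model weights but also on the harness around the model, which governs what information is stored, retrieved, and presented to it. In long-horizon research workflows, the central failure mode is not a visible error but "plausible unsupported success": claims that are insufficiently supported, misreported, or silently inherited from the executor's context. The report therefore presents ARIS as a research harness that, in its default configuration, coordinates machine learning research workflows through adversarial collaboration between different model families. Concretely, an executor agent from one model family drives progress while a reviewer from a different model family critiques intermediate artifacts and requests revisions. ARIS has three architectural layers. The execution layer provides more than 65 reusable Markdown-defined skills, model integrations via MCP, a persistent research wiki for iterative reuse of prior findings, and deterministic figure generation. The orchestration layer coordinates five end-to-end workflows with adjustable effort settings and configurable routing to reviewer models. The assurance layer includes a three-stage process for checking whether experimental claims are supported by evidence: integrity verification, result-to-claim mapping, and claim auditing, which cross-checks statements in the manuscript against the claim ledger and the raw evidence. It also includes a five-pass scientific-editing pipeline, mathematical-proof checks, and visual inspection of rendered PDFs. A prototype self-improvement loop records research traces and proposes harness improvements, which are adopted only after reviewer approval.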

One-sentence Summary

Addressing the risk of plausible unsupported success in long-horizon workflows, ARIS is an open-source research harness that coordinates autonomous machine learning research through cross-model adversarial collaboration, pairing an executor model that drives progress with a reviewer from a different model family that critiques intermediate artifacts, while its execution layer supports iterative discovery via a persistent research wiki, model integrations via MCP, deterministic figure generation, and more than 65 Markdown-defined skills.

Key Contributions

  • ARIS is an open-source research harness that coordinates autonomous machine learning workflows through a three-layer architecture featuring over 65 reusable Markdown-defined skills, model integrations via MCP, a persistent research wiki, and deterministic figure generation.
  • The framework implements a cross-model adversarial collaboration mechanism where an executor model drives progress while a reviewer from a distinct model family critiques intermediate artifacts and requests revisions, effectively breaking self-review blind spots without the coordination overhead of larger multi-agent committees.
  • The system incorporates an explicit assurance stack for integrity auditing and cross-platform portability, supporting reproducible end-to-end research workflows that span idea generation, experimentation, and manuscript preparation across multiple host environments.

Introduction

Automating the end-to-end machine learning research pipeline holds significant promise for accelerating scientific discovery, yet long-horizon autonomous tasks remain highly susceptible to hallucinations, correlated errors, and misreported results. Prior autonomous research agents typically rely on same-model self-refinement and tightly coupled workflows, which fail to catch shared inductive biases, lack modular resumption capabilities, and offer minimal system-level verification of experimental integrity. To address these gaps, the authors propose ARIS, a framework that treats single-agent execution as inherently unreliable and decomposes the research process into modular, state-preserving stages. The authors leverage an adversarial cross-family executor-reviewer pairing with explicit assurance checks at critical milestones, ensuring that each research artifact is independently validated by a distinct model family before advancing to the next workflow phase.
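The executor-reviewer pairing described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not the ARIS implementation: the `Agent`, `Review`, and `Outcome` types, the `run_with_review` function, the family labels, and the stub executor and reviewer (which stand in for real model calls) are all hypothetical. The sketch only shows the two properties the summary names: the reviewer must come from a different model family, and an artifact advances only after the reviewer approves it.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Review:
    approved: bool
    feedback: str


@dataclass(frozen=True)
class Agent:
    # Hypothetical wrapper: `family` is a free-form label for the model family,
    # `call` stands in for a real model invocation.
    name: str
    family: str
    call: Callable


@dataclass(frozen=True)
class Outcome:
    artifact: str
    approved: bool
    rounds: int


def run_with_review(executor: Agent, reviewer: Agent, task: str,
                    max_rounds: int = 3) -> Outcome:
    """Let the executor draft and the reviewer critique until approval."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    # Cross-family rule: refuse to run if both agents share a model family.
    if executor.family == reviewer.family:
        raise ValueError("executor and reviewer must use different model families")
    feedback = ""
    for round_no in range(1, max_rounds + 1):
        artifact = executor.call(task, feedback)
        review = reviewer.call(artifact)
        if review.approved:
            return Outcome(artifact, True, round_no)
        feedback = review.feedback
    # Never approved: return the last draft, flagged as unapproved.
    return Outcome(artifact, False, max_rounds)


def stub_executor(task: str, feedback: str) -> str:
    # Stub: adds an evidence pointer only after the reviewer asks for one.
    if "evidence" in feedback:
        return f"{task}: accuracy improves [evidence: run_01.json]"
    return f"{task}: accuracy improves"


def stub_reviewer(artifact: str) -> Review:
    # Stub: rejects any artifact that makes a claim without an evidence pointer.
    if "[evidence:" in artifact:
        return Review(True, "")
    return Review(False, "cite evidence for the claim")


EXECUTOR = Agent("executor", "family-a", stub_executor)
REVIEWER = Agent("reviewer", "family-b", stub_reviewer)

outcome = run_with_review(EXECUTOR, REVIEWER, "ablation study")
print(outcome.approved, outcome.rounds)  # prints: True 2
```

In this toy run the first draft is rejected for lacking evidence and the second is approved. A real harness would replace the stubs with model calls and would need a policy for drafts that are never approved.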

Dataset

  • Dataset composition and sources: The authors compile a skill inventory that catalogs core framework skills from a current release, as outlined in Table 5.
  • Subset details: The provided excerpt does not define distinct subsets, so specific sizes, origins, or filtering criteria are not documented.
  • Usage and training configuration: The authors reference the inventory to establish a baseline skill framework. The text does not outline training splits, mixture ratios, or data blending strategies.
  • Processing and metadata: No cropping methods, metadata generation steps, or additional preprocessing pipelines are described in the available content.

Method

The ARIS framework is structured around three primary architectural layers—execution, orchestration, and assurance—that collectively address the challenges of long-horizon, autonomous machine learning research. The execution layer provides a modular foundation through more than 65 reusable skills, each defined as a plain-text Markdown file (SKILL.md) that specifies inputs, outputs, procedural steps, and quality gates. These skills are coordinated via versionable artifact contracts, enabling checkpoint-based recovery and auditability. The orchestration layer manages five end-to-end workflows—idea discovery, experiment bridge, auto-review loop, paper writing, and rebuttal—chained through these contracts and grouped into four research phases: Discovery, Experimentation, Manuscript, and Post-Submission. This layer supports adjustable effort settings and configurable routing to reviewer models, allowing users to scale depth and breadth while maintaining core review invariants. The assurance layer implements a multi-stage process to detect and mitigate plausible unsupported success, including evidence integrity verification, result-to-claim mapping, and claim auditing against raw evidence and a claim ledger. It also includes a five-pass scientific-editing pipeline, mathematical-proof checks, and visual inspection of rendered PDFs. A prototype meta-optimization loop records research traces and proposes harness improvements that are only adopted after reviewer approval, enabling iterative refinement of the system itself. The overall architecture is designed to enforce independent assurance by default, leveraging cross-model adversarial collaboration between an executor and a reviewer drawn from different model families, thereby reducing shared inductive biases and enhancing critical evaluation.
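The claim-auditing step of the assurance layer can be sketched as a cross-check of manuscript statements against a claim ledger and raw evidence. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `LedgerEntry` and `Statement` types, the `audit_claims` function, the dictionary layout of the ledger and evidence, the tolerance value, and all sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LedgerEntry:
    # Hypothetical ledger record: which metric a claim is about and which
    # raw-evidence record is supposed to back it.
    claim_id: str
    metric: str
    evidence_key: str


@dataclass(frozen=True)
class Statement:
    # A numeric statement as it appears in the manuscript.
    claim_id: str
    value: float


def audit_claims(statements, ledger, raw_evidence, tol=1e-3):
    """Return (claim_id, problem) pairs for statements that lack support."""
    findings = []
    for stmt in statements:
        entry = ledger.get(stmt.claim_id)
        if entry is None:
            findings.append((stmt.claim_id, "not in ledger"))
            continue
        record = raw_evidence.get(entry.evidence_key)
        if record is None or entry.metric not in record:
            findings.append((stmt.claim_id, "missing evidence"))
            continue
        # The manuscript value must match the raw value within tolerance.
        if abs(record[entry.metric] - stmt.value) > tol:
            findings.append((stmt.claim_id, "value mismatch"))
    return findings


# Illustrative data only; none of these numbers come from the paper.
LEDGER = {
    "C1": LedgerEntry("C1", "accuracy", "run_01"),
    "C2": LedgerEntry("C2", "accuracy", "run_02"),
}
RAW_EVIDENCE = {"run_01": {"accuracy": 0.912}}
STATEMENTS = [
    Statement("C1", 0.912),  # supported
    Statement("C2", 0.95),   # ledger entry exists, evidence record does not
    Statement("C3", 0.99),   # never registered in the ledger
    Statement("C1", 0.93),   # disagrees with the raw value
]

findings = audit_claims(STATEMENTS, LEDGER, RAW_EVIDENCE)
for claim_id, problem in findings:
    print(claim_id, problem)
```

The three finding types correspond to the failure modes the summary groups under plausible unsupported success: a claim with no ledger entry, a ledger entry with no backing evidence, and a reported value that disagrees with the raw result.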

Experiment

The evaluation relies on observational deployment tracking and a single overnight operational run to assess ARIS under realistic conditions. These observations indicate that the system can autonomously prune unsupported claims and iteratively refine manuscripts through automated review cycles, and they document growth of the surrounding ecosystem across multiple technical domains. Because the reported outcomes are observational, they show operational feasibility rather than causal advantages for specific reviewer architectures or model configurations. A controlled benchmark protocol is still needed to isolate the impact of algorithmic design from external variables such as researcher expertise and task difficulty.

The authors also compare ARIS with existing systems along six capability dimensions: cross-family review, adversarial review, composability, end-to-end (E2E) research workflows, an assurance stack, and cross-platform portability. In this comparison, ARIS is the only system reported to support both full E2E research workflows and cross-platform portability. Composable skills and a dedicated assurance stack are absent from most of the compared systems. ARIS also applies cross-family review as its default policy, whereas the other systems provide it only partially or not at all. The comparison covers which features each system offers; it does not measure performance.

