HyperAIHyperAI

Command Palette

Search for a command to run...

MiroEval: 프로세스 및 산출물에 대한 멀티모달 딥 리서치 에이전트 벤치마킹

초록

심층 연구 시스템의 최근 발전은 매우 인상적이지만, 평가 체계는 여전히 실제 사용자 요구를 따라가지 못하고 있습니다. 기존 벤치마크는 대부분 고정된 채점 기준을 사용하여 최종 보고서만을 평가할 뿐, 그 이면의 연구 과정을 평가하지 못합니다. 또한 대부분의 벤치마크는 다중 모달 커버리지가 제한적이고, 실제 세계의 쿼리 복잡성을 반영하지 않는 합성 작업에 의존하며, 지식이 진화함에 따라 업데이트할 수 없다는 한계를 지닙니다. 이러한 격차를 해소하기 위해 우리는 심층 연구 시스템을 위한 벤치마크이자 평가 프레임워크인 MiroEval 을 제안합니다. 본 벤치마크는 100 개의 작업으로 구성되며(70 개는 텍스트 전용, 30 개는 다중 모달), 모든 작업은 실제 사용자 요구에 기반하여 구축되었고, 주기적 업데이트를 지원하는 이중 경로 파이프라인을 통해 실시간으로 진화하는 환경을 구현합니다.제안된 평가suite는 심층 연구 시스템을 세 가지 상호보완적 차원에서 평가합니다: 작업별 채점 기준을 활용한 적응형 종합 품질 평가, 웹 소스 및 다중 모달 첨부 자료에 대한 능동적 검색과 추론을 통한 에이전트 사실성 검증, 그리고 시스템이 조사를 수행하는 전 과정에 걸친 검색, 추론, 정제 방식을 점검하는 과정 중심 평가 감사입니다. 13 개 시스템을 대상으로 한 평가 결과 세 가지 주요 결론이 도출되었습니다. 첫째, 세 가지 평가 차원은 시스템 능력의 상호보완적 측면을 포착하며, 각 차원은 시스템 간 뚜렷한 강점과 약점을 드러냅니다. 둘째, 과정의 품질은 전체 결과의 신뢰할 수 있는 예측 변수로 작용할 뿐만 아니라, 출력 수준 지표에서는 보이지 않는 약점을 드러냅니다. 셋째, 다중 모달 작업은 훨씬 더 큰 도전을 제기하며, 대부분의 시스템이 3~10 점 하락하는 경향을 보입니다.MiroThinker 시리즈는 가장 균형 잡힌 성능을 보였으며, 특히 MiroThinker-H1 은 두 가지 설정 모두에서 전체적으로 가장 높은 순위를 차지했습니다. 인간 검증 및 견고성 실험 결과는 본 벤치마크와 평가 프레임워크의 신뢰성을 확인시켜 주었습니다. MiroEval 은 차세대 심층 연구 에이전트를 위한 종합적인 진단 도구로서 기능할 것입니다.

One-sentence Summary

The MiroMind Team introduces MiroEval, a dynamic benchmark for deep research agents that uniquely evaluates adaptive synthesis, agentic factuality, and research processes across text and multimodal tasks, revealing that process quality predicts outcomes while exposing significant challenges in multimodal reasoning.

Key Contributions

  • The paper introduces MiroEval, a benchmark comprising 100 tasks grounded in real user needs and constructed through curated authentic queries and an automated pipeline based on real-time web trends to ensure temporal relevance.
  • A multi-layered evaluation framework is presented that assesses deep research agents through adaptive synthesis quality rubrics, agentic factuality verification against live sources, and process-centric audits of research trajectories across five intrinsic dimensions.
  • Experiments across 13 leading systems demonstrate that process quality serves as a reliable predictor of overall outcomes while revealing weaknesses invisible to output-level metrics, such as insufficient analytical depth and significant traceability gaps.

Introduction

The rapid shift from passive text generation to agentic systems capable of autonomous deep research has created a critical need for reliable evaluation in high-stakes domains like finance and healthcare. Current benchmarks fall short because they primarily assess final reports without auditing the underlying research process, lack robust multimodal support, and rely on synthetic queries that fail to capture real-world complexity. To address these gaps, the authors introduce MiroEval, a dynamic benchmark featuring 100 real-world tasks that evaluates systems across three layers: adaptive synthesis quality, agentic factuality verification against live sources, and a process-centric audit of research trajectories.

Dataset

  • Dataset Composition and Sources The authors introduce MiroEval, a benchmark of 100 deep research tasks grounded in real user needs. The dataset is constructed via a dual-path pipeline to ensure diversity and temporal relevance, comprising 70 text-only queries and 30 multimodal queries that span 12 domains and 10 task types.

  • Key Details for Each Subset

    • User-Derived Subset (65 queries): This set includes 35 text-only and 30 multimodal tasks inspired by patterns from internal system testing. It covers all 8 evaluation features with balanced difficulty tiers (Easy, Medium, Hard) and requires handling attachments like images, PDFs, and spreadsheets.
    • Automated Subset (35 queries): This set consists entirely of text-only tasks generated through a trend-grounded pipeline using real-time web data. It targets 12 topics and 36 subtopics to ensure the queries reflect current events and require external investigation beyond parametric knowledge.
  • Data Usage and Processing The benchmark serves as a holistic evaluation framework rather than a training set, assessing 13 systems across three dimensions: adaptive synthesis quality, agentic factuality verification, and process-centric auditing. The authors employ a three-stage filtering process for the automated subset, including search validation, deep-research necessity checks, and inverse quality assessment to ensure queries cannot be answered by the model alone.

  • Privacy, Metadata, and Construction Strategies

    • Privacy-Preserving Rewriting: No original user queries are used directly. The authors apply strict anonymization to replace all named entities with realistic substitutes and filter out sensitive content before rewriting.
    • Metadata Construction: Each query is annotated with domain labels, task types, and source-specific metadata such as feature vectors, difficulty tiers, and baseline quality scores.
    • Temporal Refresh: The dual-path design allows for periodic re-execution, enabling the benchmark to incorporate new user patterns and latest web trends to prevent staleness.

Method

The MiroEval framework establishes a multi-layered, agentic evaluation pipeline to provide a rigorous diagnostic of deep research systems. This methodology decouples the research artifact from the underlying investigative procedure, allowing for a holistic assessment across three critical dimensions. The framework dynamically constructs evaluation rubrics tailored to the specific constraints and modalities of each task.

The evaluation pipeline is structured into three main components as illustrated in the figure below:

Comprehensive Adaptive Synthesis Quality Evaluation Deep research systems generate long-form reports through multi-step retrieval and reasoning. To capture synthesis quality across varying domains and modalities, the framework employs an adaptive evaluation dimension space D=DfixedDdynamic(Q)D = D_{\text{fixed}} \cup D_{\text{dynamic}}(Q)D=DfixedDdynamic(Q). The fixed component includes universal aspects such as Coverage, Insight, and Clarity. The dynamic component adapts to the query type. For text-only queries, an LLM generates 1–3 task-specific expertise dimensions. For attachment-augmented queries, a Grounding dimension is added to assess whether reports faithfully leverage provided materials. An upstream module extracts key facts from attachments to form verifiable factual anchors, which guide the generation of precise grounding criteria. The evaluator derives dimension-level weights WdW_{d}Wd and criterion-level weights wd,cw_{d,c}wd,c to compute the final quality score: Squality=dDWdcwd,csd,cS_{\text{quality}} = \sum_{d \in D} W_{d} \sum_{c} w_{d,c} s_{d,c}Squality=dDWdcwd,csd,c where sd,cs_{d,c}sd,c is the score assigned by the LLM for a specific criterion.

Agentic Factuality Evaluation This component assesses whether claims in the generated report are supported by reliable evidence from heterogeneous sources. The system decomposes the report into a set of verifiable statements S(Q,R)S(Q, R)S(Q,R). For each statement, an evaluation agent retrieves supporting or refuting evidence from external web resources and task-provided attachments. The framework supports multimodal attachment querying through Native Multimodal Processing for directly interpretable formats and Retrieval-Augmented Processing for formats requiring segmentation. The agent evaluates the consistency between each statement and its evidence set, assigning a factuality label ψ(s){RIGHT,WRONG,CONFLICT,UNKNOWN}\psi(s) \in \{\text{RIGHT}, \text{WRONG}, \text{CONFLICT}, \text{UNKNOWN}\}ψ(s){RIGHT,WRONG,CONFLICT,UNKNOWN}. The CONFLICT label explicitly captures cases where evidence from different sources leads to inconsistent conclusions.

Process-Centric Evaluation Beyond the final artifact, the framework evaluates the quality of the underlying research process. The raw process record is transformed into a structured representation of atomic units, such as information acquisition and planning. Intrinsic process quality is evaluated along dimensions including Search Breadth, Analytical Depth, Progressive Refinement, Critical Thinking, and Efficiency. Furthermore, the framework evaluates the alignment between process-level key findings and report-level key findings. This includes Process→Report (P→R) checks to ensure findings are realized in the report, Report→Process (R→P) checks to verify report conclusions are supported by the process, and Contradiction Detection to assess how conflicts are handled. The overall process score is defined as: Sprocess=αSintrinsic(P)+(1α)Salign(P,R)S_{\text{process}} = \alpha S_{\text{intrinsic}}(P) + (1 - \alpha) S_{\text{align}}(P, R)Sprocess=αSintrinsic(P)+(1α)Salign(P,R) where SintrinsicS_{\text{intrinsic}}Sintrinsic denotes the intrinsic process quality score and SalignS_{\text{align}}Salign denotes the alignment score.

Experiment

  • Evaluated 13 deep research systems across text-only and multimodal settings to validate that process quality reliably predicts overall outcome, revealing that strong research processes correlate with better synthesis and factuality.
  • Demonstrated that synthesis quality and factuality are distinct capabilities, showing that polished reports do not guarantee factual accuracy and that systems often trade analytical depth for factual precision or vice versa.
  • Identified that multimodal tasks significantly degrade performance, particularly in synthesis and process dimensions, while factual precision remains relatively stable, highlighting visual understanding as a primary bottleneck.
  • Revealed that current systems struggle with analytical depth and efficiency, often retrieving broadly but failing to investigate deeply, and exhibit a traceability gap where report content frequently cannot be traced back to the research process.
  • Confirmed that the MiroThinker series achieves consistent competitiveness across all dimensions by balancing high claim volume with low error rates and maintaining robust performance in both text-only and multimodal environments.
  • Validated the evaluation framework through robustness checks and human studies, confirming that automated rankings align with expert judgment and remain stable across different judge models and prompt configurations.

AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩
바로 사용 가능한 GPU
최적의 가격

HyperAI Newsletters

최신 정보 구독하기
한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다
이메일 서비스 제공: MailChimp