HyperAIHyperAI

Command Palette

Search for a command to run...

이질적인 과학 기반 모델 협력

Zihao Li Jiaru Zou Feihao Fang Xuying Ning Mengting Ai Tianxin Wei Sirui Chen Xiyuan Yang Jingrui He

초록

에이전틱 대규모 언어 모델 시스템은 뚜렷한 능력을 입증해 왔다. 그러나 이러한 시스템이 언어를 범용 인터페이스로 의존한다는 점은, 특히 자연어를 넘어선 전문적 작업을 처리하기 위해 도메인 특화 파운데이션 모델들이 개발된 과학 분야를 비롯해 많은 현실 세계 문제들에 대한 적용성을 근본적으로 제한한다. 본 연구에서는 언어 중심 시스템을 보다 광범위한 과학 파운데이션 모델군으로 확장하도록 설계된 이종 에이전틱 프레임워크인 ‘Eywa’를 제안한다. Eywa의 핵심 아이디어는 언어 모델 기반의 추론 인터페이스를 도메인 특화 파운데이션 모델에 추가하여, 언어 모델이 비언어적 데이터 모달리티에 대한 추론을 유도할 수 있도록 하는 것이다. 이러한 설계는 일반적으로 특수화된 데이터와 작업에 최적화된 예측형 파운데이션 모델이 에이전틱 시스템 내 고수준의 추론 및 의사결정 과정에 참여할 수 있게 한다. Eywa는 단일 에이전트 파이프라인에서 기존 구성요소를 대체하여 사용할 수 있는 단일 에이전트 버전인 ‘EywaAgent’ 또는 기존 다중 에이전트 시스템에서 전통적 에이전트를 전문 에이전트로 교체하여 통합할 수 있는 ‘EywaMAS’ 형태로 제공된다. 또한, 우리는 계획 기반 오케스트레이션 프레임워크인 ‘EywaOrchestra’를 조사하는데, 이는 계획자(planner)가 이종 데이터 모달리티에 걸친 복잡 작업을 해결하기 위해 전통적 에이전트와 Eywa 에이전트를 동적으로 조정한다. 우리는 물리학, 생명과학, 사회과학 등 다양한 과학 분야를 아우르는 광범위한 평가에서 Eywa의 성능을 검증하였다. 실험 결과는 Eywa가 구조화되고 도메인 특화된 데이터를 다루는 작업에서 성능을 향상시키며, 전문 파운데이션 모델과의 효과적인 협력을 통해 언어 기반 추론에 대한 의존도를 감소시킴을 보여준다.

One-sentence Summary

This work introduces Eywa, a heterogeneous agentic framework that extends language-centric systems to a broader class of scientific foundation models by augmenting domain-specific models with a language-model-based reasoning interface, enabling predictive foundation models to participate in higher-level reasoning and decision-making across physical, life, and social sciences while improving performance on structured and domain-specific data and reducing reliance on language-based reasoning.

Key Contributions

  • The paper introduces Eywa, a heterogeneous agentic framework that augments domain-specific foundation models with a language-model-based reasoning interface to enable inference over non-linguistic data modalities. This design allows predictive foundation models optimized for specialized tasks to participate in higher-level reasoning and decision-making processes within agentic systems.
  • The framework functions as a drop-in replacement for single-agent pipelines via EywaAgent or integrates into existing multi-agent systems as EywaMAS. A planning-based orchestration framework named EywaOrchestra dynamically coordinates traditional agents and Eywa agents to solve complex tasks across heterogeneous data modalities.
  • Experimental results demonstrate that Eywa improves performance on tasks involving structured and domain-specific data across physical, life, and social sciences. The evaluation confirms reduced reliance on language-based reasoning through effective collaboration with specialized foundation models.

Introduction

Agentic large language model systems excel at reasoning but rely heavily on natural language, which limits their effectiveness in scientific domains requiring structured or non-linguistic data. While domain-specific foundation models exist for tasks like protein folding or weather forecasting, they typically lack native language interfaces, hindering their direct integration into agentic workflows. The authors introduce Eywa, a heterogeneous framework that augments these specialized models with language-based reasoning interfaces to enable modality-native collaboration across single-agent, multi-agent, and orchestration modes. Experiments across physical, life, and social sciences demonstrate that Eywa improves task utility while reducing token consumption compared to language-only baselines.

Dataset

Dataset Composition and Sources

  • The authors construct EywaBench from four primary datasets: DeepPrinciple, MMLU-Pro, fev-bench, and TabArena.
  • These sources cover three modalities including natural language, time series, and tabular data.
  • The taxonomy spans three parent domains (physical, life, and social science) and nine sub-domains.
  • Underlying the four main benchmarks are 67 distinct source datasets to ensure high diversity and avoid domain collapse.

Subset Details and Coverage

  • The released EywaBench-V1 split contains 200 task instances selected for balanced coverage.
  • DeepPrinciple contributes 1,125 questions across chemistry, materials, biology, and physics.
  • MMLU-Pro provides 6,978 scientific questions converted from multiple-choice to open-ended formats.
  • fev-bench includes 96 time-series datasets with up to 30,000 covariate series each.
  • TabArena offers 51 tabular datasets for classification and regression tasks with up to 150,000 rows.
  • Domain distribution is near-uniform with physical, life, and social sciences represented at 32.0%, 30.0%, and 38.0% respectively.
  • Modality distribution covers natural language, time series, and tabular data at 41.0%, 39.0%, and 20.0% respectively.

Data Usage and Evaluation

  • The benchmark is used exclusively for evaluation rather than model training.
  • The 200-sample subset allows for manual configuration and validation of agentic systems without prohibitive cost.
  • Performance is measured using a unified utility score between 0 and 1 for all modalities.
  • Results are aggregated as unweighted means per domain and across all tasks.
  • The pipeline is parametric and scalable, allowing for additional instances via resampling without changing the schema.

Processing and Metadata Construction

  • Data is stored as a dictionary-encoded Parquet file with Snappy compression.
  • Financial time series inputs undergo timestamp anonymization to prevent model memorization of historical trends.
  • MMLU-Pro questions are reformatted to require direct open-ended answers instead of multiple-choice selection.
  • Time series and tabular tasks support scalable instance generation by sampling different windows or data subsets.
  • Metrics include soft-match scores for text and normalized error combinations (sMAPE and MAAPE) for numerical predictions.

Method

The Eywa framework addresses the inherent trade-offs between Large Language Models (LLMs) and Domain-Specific Foundation Models (FMs). While LLMs possess strong general-purpose reasoning capabilities, they often lack the specialized knowledge required for scientific or domain-specific tasks. Conversely, FMs provide faithful domain-specific computation but typically lack a native language interface. The authors propose a unified architecture that integrates these components through a modular design. Refer to the framework diagram below, which illustrates the conceptual analogy drawn from the Avatar universe and its technical realization within the Hugging Face ecosystem.

The architecture is organized into three hierarchical levels. The top level conceptualizes the relationship between Na'vi (representing LLM agents) and specialized species (representing FMs) connected via a neural bond. The bottom level translates this into three technical modules: EywaAgent for reasoning-augmented domain models, EywaMAS for collaborative multi-agent systems, and EywaOrchestra for unified orchestration among heterogeneous experts.

The core unit of this framework is the EywaAgent, which augments a foundation model with a language-based reasoning interface. The mechanism enabling this integration is the "Tsaheylu" bond, designed to establish a robust communication channel between the language model and the specialist foundation model. As shown in the figure below, this interface ensures that the LLM can correctly configure the control input conditioned on the task state while delegating specialized computation to the foundation model.

Formally, the Tsaheylu interface is defined as a bidirectional communication pair (ϕk,ψk)(\phi_k, \psi_k)(ϕk,ψk) for a domain kkk. The query compiler ϕk\phi_kϕk translates the task state from the LLM into a structured FM invocation, while the response adapter ψk\psi_kψk converts the specialist output into a planner-consumable representation. This pipeline allows the system to dynamically decide whether to invoke the foundation model based on task requirements, effectively subsuming language-only agents as a special case while strictly expanding the class of computable functions.

Building upon the single-agent unit, the framework extends to multi-agent settings to enable complex and heterogeneous collaborations. EywaMAS generalizes the EywaAgent to a distributed multi-agent setting, allowing multiple specialized agents to interact. Unlike traditional multi-agent systems that rely solely on language-based agents, EywaMAS enables seamless integration of both language-only agents and EywaAgents within a unified framework. Refer to the comparison diagram below to see how EywaMAS generalizes existing multi-agent topologies.

The system supports flexible composition across different language models, different foundation models across domains, and mixed agent types. This plug-and-play property allows domain-specific foundation models to be incorporated into existing agentic systems without redesigning communication protocols. The communication topology G\mathcal{G}G governs the interactions, ensuring that the specialized capabilities of the EywaAgents propagate through the network to solve complex tasks.

To handle the diversity of real-world tasks, the framework introduces EywaOrchestra, a dynamic orchestration mechanism. A conductor, instantiated as a large language model, maps the input task to a system configuration. This involves selecting the role and type of each agent, the backbone language model, the attached foundation model, and the communication topology. To support this heterogeneous interaction, the authors adopt a unified prompt template that specifies the task role, tool context, structured input field, and expected output format. The structure of this general prompt is detailed in the figure below.

The template includes placeholders for task identity, model descriptions, and modality-specific instructions. This design allows different task types to share the same high-level prompting interface while preserving modality-specific instructions through specialized input tags and output constraints. By combining adaptive orchestration with a standardized interface, the framework achieves enhanced expressivity and improved task performance across various scientific domains.

Experiment

Experiments on EywaBench evaluate Eywa variants against single-agent and multi-agent language model baselines across diverse scientific domains. Results demonstrate that integrating domain-specific foundation models significantly enhances utility and efficiency, as language-only agents often fail to perform accurate numerical computations on structured data. Additionally, adaptive orchestration in EywaOrchestra achieves performance comparable to expert-designed systems while reducing costs, confirming that cross-modality heterogeneity is more critical than combining heterogeneous language models alone.

The authors evaluate EywaAgent, EywaMAS, and EywaOrchestra across diverse scientific domains using different LLM backbones. Results demonstrate that integrating domain-specific foundation models significantly enhances utility and efficiency compared to language-only baselines. EywaMAS achieves the strongest overall performance, while EywaOrchestra provides a cost-effective alternative by dynamically adapting the system configuration. EywaMAS achieves the highest overall utility across physical, life, and social science domains compared to single-agent and homogeneous multi-agent baselines. EywaOrchestra matches the performance of the fixed multi-agent system while substantially reducing token consumption and inference latency through dynamic configuration selection. Upgrading the LLM backbone from gpt-4.1-nano to gpt-5-nano yields significant performance improvements, whereas further scaling to gpt-5-mini shows diminishing returns in several sub-domains.

The authors evaluate the EywaAgent system using three distinct LLM backends to analyze the impact of model capability on performance. Results show a clear upward trend in overall utility as the backbone strength increases, with the strongest model achieving the best scores. While the largest model offers the highest utility, it demonstrates a trade-off where token consumption is lower than the mid-tier model but inference time is slightly higher. Increasing the capability of the LLM backbone leads to consistent improvements in overall utility across all scientific domains. The strongest model configuration achieves the highest utility while consuming fewer tokens than the mid-tier configuration. The smallest backbone model is the most efficient in terms of time and token usage but yields the lowest utility.

The authors evaluate their proposed methods against various single-agent and multi-agent baselines across scientific domains using a unified protocol. Results demonstrate that integrating domain-specific foundation models significantly improves task utility and computational efficiency compared to language-only approaches. EywaAgent improves average utility and reduces token consumption compared to the single-agent baseline. EywaMAS achieves the highest overall utility while using fewer tokens than other multi-agent baselines. EywaOrchestra reaches utility levels similar to expert-designed systems but with lower latency and token costs.

The authors evaluate EywaAgent, EywaMAS, and EywaOrchestra across diverse scientific domains using various LLM backbones to compare performance against single-agent and language-only baselines. Results indicate that integrating domain-specific foundation models significantly enhances utility and efficiency, with EywaMAS achieving the highest overall performance and EywaOrchestra providing a cost-effective alternative through dynamic configuration. Additionally, upgrading the LLM backbone yields significant performance improvements, whereas further scaling shows diminishing returns in several sub-domains.


AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩
바로 사용 가능한 GPU
최적의 가격

HyperAI Newsletters

최신 정보 구독하기
한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다
이메일 서비스 제공: MailChimp