2달 전

Zonglin Yang Lidong Bing

초록

대규모 언어 모델(LLM) 이 과학적 발견에 큰 잠재력을 보여주지만, 기존 연구는 주로 추론(inference) 또는 피드백 기반 학습에 집중하고 있어, 발견의 생성적 추론 과정인 $P(\text{hypothesis}|\text{background})$ ( $P(h|b)$ ) 을 직접 모델링하는 접근은 아직 탐구되지 않은 채 남아 있습니다. 우리는 방대한 지식베이스에서 영감을 검색하고 조합하는 과정에서 내재된 조합적 복잡성( $O(N^k)$ ) 으로 인해 $P(h|b)$ 를 직접 학습하는 것이 수학적으로 처리 불가능(intractable) 함을 규명했습니다. 이러한 한계를 극복하기 위해 우리는 MOOSE-Star 를 제안합니다. 이는 처리 가능한 학습과 확장 가능한 추론을 가능하게 하는 통합 프레임워크입니다. 최적의 경우 MOOSE-Star 는 (1) 발견의 확률 방정식에서 유도된 분해된 서브태스크를 기반으로 학습하고, (2) 동기 부여에 기반한 계층적 탐색을 적용하여 로그 시간( $O(\log N)$ ) 내 검색을 실현하고 관련 없는 부분 공간을 가지치기하며, (3) 제한된 조합(bounded composition) 을 활용하여 검색 노이즈에 대한 강건성을 확보함으로써 복잡성을 지수적 수준에서 로그 수준으로 낮춥니다. 이를 지원하기 위해 108,717 개의 분해된 논문을 포함한 TOMATO-Star 데이터셋을 공개하였으며, 해당 데이터셋 학습에는 총 38,400 GPU 시간이 소요되었습니다. 또한, 무차별 대입 샘플링은 '복잡성의 벽(complexity wall)'에 직면하는 반면, MOOSE-Star 는 추론 시에도 지속적인 확장성(test-time scaling) 을 보임을 입증했습니다.

One-sentence Summary

Researchers from the MOOSE-STAR project introduce MOOSE-STAR, a unified framework that enables tractable training for scientific discovery by reducing computational complexity from exponential to logarithmic through hierarchical search and bounded composition, facilitating continuous test-time scaling where brute-force methods fail.

Key Contributions

Existing research on LLMs for scientific discovery overlooks the direct modeling of the generative reasoning process $P(h|b)$ because retrieving and composing inspirations from a vast knowledge base creates mathematically intractable combinatorial complexity.
The proposed MOOSE-STAR framework overcomes this barrier by decomposing the objective into subtasks, employing motivation-guided hierarchical search, and utilizing bounded composition to reduce complexity from exponential to logarithmic.
To support this approach, the authors release the TOMADO-STAR dataset containing over 108,000 decomposed papers and demonstrate that the method enables continuous test-time scaling where brute-force sampling fails.

Introduction

Large language models hold significant potential for scientific discovery, yet current research primarily relies on inference strategies or feedback-driven training rather than directly modeling the core generative reasoning process. Existing approaches struggle because they depend on external feedback to refine hypotheses instead of learning to generate high-quality ideas directly from research backgrounds, and a theoretical analysis reveals that directly training this probability is mathematically intractable due to exponential combinatorial complexity. To overcome this barrier, the authors introduce MOOSE-STAR, a unified framework that enables tractable training and scalable inference by decomposing tasks, employing motivation-guided hierarchical search to reduce complexity to logarithmic levels, and utilizing bounded composition for robustness.

Dataset

Dataset Composition and Sources: The authors construct TOMATO-Star, a large-scale dataset derived from 108,717 open-access scientific papers sourced from the NCBI database. The corpus spans biology, chemistry, and cognitive science, covering publications from January 2020 to October 2025.
Key Details for Each Subset:
- Training Set: Includes papers published between January 2020 and September 2025.
- Test Set: Consists of papers published in October 2025 to ensure a strict temporal split and prevent data contamination.
- Filtering Rules: Every sample undergoes four automated quality checks to verify information necessity, sufficiency, disjointness between background and inspirations, and non-redundancy of extracted inspirations.
Model Usage and Processing:
- Preprocessing: Raw PDF documents are converted to Markdown using MinerU.
- Decomposition: Locally deployed reasoning models (DeepSeek-R1 and R1-distilled-Qwen-32b) decompose each paper into a structured tuple of Research Background, Hypothesis, and Inspirations.
- Hypothesis Structure: Hypotheses are formatted as "Delta Hypotheses" where each inspiration maps to a specific delta containing Motivation, Mechanism, and Methodology levels.
- Inspirations: Ground-truth inspirations are identified from source citations and augmented with full titles and abstracts retrieved via Semantic Scholar.
Additional Processing Details: The pipeline enforces a strict one-to-one mapping between inspirations and hypothesis deltas. The authors ensure the background section remains strictly independent of the inspirations and hypothesis to maintain logical integrity during training.

Method

The authors address the computational intractability of directly modeling the marginal likelihood $P(h \mid b)$ , which scales exponentially as $O(N^k)$ due to the combinatorial search over the global knowledge base $\mathcal{I}$ . To resolve this, they introduce the MOOSE-STAR framework, which operationalizes a probabilistic decomposition theory. This approach transforms the monolithic generation task into a sequence of $k$ manageable subtasks: Inspiration Retrieval and Hypothesis Composition. By decoupling the search from the composition, the complexity is reduced from exponential to linear $O(k \times N)$ .

To further optimize the linear retrieval term $O(N)$ , the framework employs Bounded Composition. Instead of requiring the model to retrieve the exact ground-truth inspiration $i^*$ from the entire database, the authors introduce a semantic tolerance space. This allows the composition module to function robustly even when provided with a proxy inspiration $i$ that is semantically similar to $i^*$ .

As illustrated in the figure above, the global knowledge space $|I| = N$ contains the exact inspiration $i^*$ . The effective bounded space $|I_{i^*}| = M$ represents a semantic neighborhood centered on $i^*$ , defined by concentric similarity thresholds. By training the model to compose hypotheses using inspirations within this bounded window, the retrieval complexity is effectively reduced to $O(N/M)$ , while the composition cost increases only linearly with $M$ . Since $N \gg M$ , this trade-off yields a significant net reduction in total complexity.

Building on this, the authors implement Hierarchical Search to replace the linear scan of the knowledge base. They construct a semantic search tree via bottom-up clustering of paper embeddings. During inference, a Best-First Search strategy navigates this tree, pruning irrelevant branches and achieving logarithmic complexity $O(\log N)$ in the best-case scenario. Finally, Motivation Planning is introduced as a high-level generative root. By appending a motivation variable $m$ to the research background, the search is guided towards specific semantic subspaces, further reducing the effective search space to $N_m < N$ . The entire framework is trained using a teacher-based Rejection Sampling Fine-Tuning pipeline on the TOMADO-STAR dataset, which contains over 100,000 processed scientific papers.

Experiment

Decomposed Sequential Training validates that fine-tuning specialized models for inspiration retrieval and hypothesis composition significantly outperforms baselines, with exposure to bounded training data further enhancing reasoning robustness against noisy inputs.
Bounded Composition experiments confirm that incorporating data generated from noisy inspirations improves hypothesis quality across all levels of semantic similarity to ground truth.
Hierarchical Search demonstrates superior efficiency over exhaustive baselines by reducing inference calls by approximately three times while maintaining high retrieval accuracy through effective pruning of irrelevant branches.
Motivation Planning analysis shows that detailed, strategic directives derived from delta hypotheses significantly improve search efficiency compared to simple requirement translations.
Scaling studies reveal that decomposing the task into retrieval and composition sub-problems overcomes the training deadlock inherent in end-to-end brute-force sampling, enabling high success rates even for complex multi-step discoveries.
Data scaling experiments indicate that while retrieval models improve log-linearly, hypothesis composition requires a minimum data threshold to achieve significant gains, yet both tasks support scalable training paradigms.
Test-time scaling results highlight that the guided, structured approach of the proposed method achieves near-perfect coverage as compute increases, whereas unguided brute-force sampling fails catastrophically as problem complexity rises due to combinatorial explosion.

소스 PDF 코드 보기

AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩

바로 사용 가능한 GPU

최적의 가격

시작하기 가격 보기

HyperAI Newsletters

최신 정보 구독하기

한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다

이메일 서비스 제공: MailChimp

2달 전

Zonglin Yang Lidong Bing

초록

One-sentence Summary

Key Contributions

Existing research on LLMs for scientific discovery overlooks the direct modeling of the generative reasoning process $P(h|b)$ because retrieving and composing inspirations from a vast knowledge base creates mathematically intractable combinatorial complexity.
The proposed MOOSE-STAR framework overcomes this barrier by decomposing the objective into subtasks, employing motivation-guided hierarchical search, and utilizing bounded composition to reduce complexity from exponential to logarithmic.
To support this approach, the authors release the TOMADO-STAR dataset containing over 108,000 decomposed papers and demonstrate that the method enables continuous test-time scaling where brute-force sampling fails.

Introduction

Dataset

Dataset Composition and Sources: The authors construct TOMATO-Star, a large-scale dataset derived from 108,717 open-access scientific papers sourced from the NCBI database. The corpus spans biology, chemistry, and cognitive science, covering publications from January 2020 to October 2025.
Key Details for Each Subset:
- Training Set: Includes papers published between January 2020 and September 2025.
- Test Set: Consists of papers published in October 2025 to ensure a strict temporal split and prevent data contamination.
- Filtering Rules: Every sample undergoes four automated quality checks to verify information necessity, sufficiency, disjointness between background and inspirations, and non-redundancy of extracted inspirations.
Model Usage and Processing:
- Preprocessing: Raw PDF documents are converted to Markdown using MinerU.
- Decomposition: Locally deployed reasoning models (DeepSeek-R1 and R1-distilled-Qwen-32b) decompose each paper into a structured tuple of Research Background, Hypothesis, and Inspirations.
- Hypothesis Structure: Hypotheses are formatted as "Delta Hypotheses" where each inspiration maps to a specific delta containing Motivation, Mechanism, and Methodology levels.
- Inspirations: Ground-truth inspirations are identified from source citations and augmented with full titles and abstracts retrieved via Semantic Scholar.
Additional Processing Details: The pipeline enforces a strict one-to-one mapping between inspirations and hypothesis deltas. The authors ensure the background section remains strictly independent of the inspirations and hypothesis to maintain logical integrity during training.

Method

Experiment

Decomposed Sequential Training validates that fine-tuning specialized models for inspiration retrieval and hypothesis composition significantly outperforms baselines, with exposure to bounded training data further enhancing reasoning robustness against noisy inputs.
Bounded Composition experiments confirm that incorporating data generated from noisy inspirations improves hypothesis quality across all levels of semantic similarity to ground truth.
Hierarchical Search demonstrates superior efficiency over exhaustive baselines by reducing inference calls by approximately three times while maintaining high retrieval accuracy through effective pruning of irrelevant branches.
Motivation Planning analysis shows that detailed, strategic directives derived from delta hypotheses significantly improve search efficiency compared to simple requirement translations.
Scaling studies reveal that decomposing the task into retrieval and composition sub-problems overcomes the training deadlock inherent in end-to-end brute-force sampling, enabling high success rates even for complex multi-step discoveries.
Data scaling experiments indicate that while retrieval models improve log-linearly, hypothesis composition requires a minimum data threshold to achieve significant gains, yet both tasks support scalable training paradigms.
Test-time scaling results highlight that the guided, structured approach of the proposed method achieves near-perfect coverage as compute increases, whereas unguided brute-force sampling fails catastrophically as problem complexity rises due to combinatorial explosion.

소스 PDF 코드 보기

AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩

바로 사용 가능한 GPU

최적의 가격

시작하기 가격 보기

HyperAI Newsletters

최신 정보 구독하기

한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다

이메일 서비스 제공: MailChimp

Command Palette

MOOSE-Star: 복잡성 장벽을 극복하여 과학적 발견을 위한 처리 가능한 학습을 열어가는 방법

Zonglin Yang Lidong Bing

초록

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AI로 AI 구축

HyperAI Newsletters

Command Palette

MOOSE-Star: 복잡성 장벽을 극복하여 과학적 발견을 위한 처리 가능한 학습을 열어가는 방법

Zonglin Yang Lidong Bing

초록

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AI로 AI 구축

HyperAI Newsletters

Command Palette

MOOSE-Star: 복잡성 장벽을 극복하여 과학적 발견을 위한 처리 가능한 학습을 열어가는 방법

Zonglin Yang Lidong Bing

초록

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AI로 AI 구축

HyperAI Newsletters