PaperSearchQA: Learning to Search and Reason over Scientific Papers with RLVR

James Burgess Jan N. Hansen Duo Peng Yuhui Zhang Alejandro Lozano Min Woo Sun Emma Lundberg Serena Yeung-Levy

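Abstract

Search agents are systems in which a language model (LM) reasons and searches a knowledge base or the web to answer questions. Recent methods supervise only final-answer accuracy, using reinforcement learning with verifiable rewards (RLVR). Most RLVR search agents target general-domain question answering (QA), which limits their applicability to technical AI systems in science, engineering, and medicine. In this work, we propose training agents to search and reason over scientific papers. This tests technical question-answering ability, is directly relevant to working scientists, and will be a core capability of future AI Scientist systems. Concretely, we release a search corpus of 16 million biomedical paper abstracts and construct PaperSearchQA, a challenging factoid QA dataset of 60k samples answerable from that corpus, together with associated benchmarks. Search agents trained in this environment outperform non-RL retrieval baselines, and further quantitative analysis reveals interesting agent behaviors such as planning, reasoning, and self-verification. Our corpus, datasets, and benchmarks are compatible with the Search-R1 codebase, which is widely used for RLVR training, and are released at https://huggingface.co/collections/jmhb/papersearchqa. Finally, our data creation method is scalable and easily extended to other scientific domains.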

One-sentence Summary

Stanford and Chan Zuckerberg Biohub researchers propose PaperSearchQA, a 60k-sample biomedical QA dataset with a 16M-abstract search corpus, on which RLVR-trained search agents outperform non-RL retrieval baselines and exhibit planning and self-verification, advancing AI Scientist systems for technical domains.

Key Contributions

  • We introduce a new scientific QA environment for training search agents, featuring a corpus of 16 million biomedical abstracts and a 60k-sample factoid dataset (PaperSearchQA), designed to test technical reasoning and support real-world scientific workflows.
  • We demonstrate that RLVR-trained agents outperform non-RL baselines on this task, while also revealing emergent behaviors such as planning, self-verification, and strategic query rewriting through quantitative and qualitative analysis.
  • Our datasets and benchmarks are compatible with the Search-R1 codebase and publicly released on Hugging Face, with scalable data construction methods extendable to other scientific domains like chemistry or materials science.

Introduction

The authors leverage reinforcement learning with verifiable rewards (RLVR) to train language models that search and reason over scientific papers, a capability critical for AI systems in science, engineering, and medicine. Prior RLVR work focused on general-domain trivia, which lacks the technical depth needed for real scientific workflows. Existing scientific QA systems rely on scaffolding or supervised fine-tuning, limiting their ability to generalize. The authors' main contribution is a new training environment comprising a 16M-abstract biomedical corpus, a 60k-sample factoid QA dataset (PaperSearchQA), and benchmarks, all compatible with the Search-R1 codebase. In this environment, RL-trained agents outperform non-RL baselines and exhibit emergent behaviors like planning and self-verification.

Dataset

  • The authors use PaperSearchQA, a biomedical QA dataset built from 16M PubMed abstracts, to train and evaluate retrieval-augmented agents. It contains 54,907 training and 5,000 test samples, each with a factoid answer, category label, and paraphrase flag.

  • Questions are generated via GPT-4.1 using a structured prompt that enforces unambiguous, single-entity answers and avoids acronyms or document-specific phrasing. Each abstract yields 3 QAs. Half are paraphrased via LLM to reduce keyword matching bias.

  • Ten expert-defined categories guide QA generation, including “Experimental & computational methods” (27%) and “Therapeutics, indications & clinical evidence.” Categories were derived by synthesizing human brainstorming and LLM analysis of BioASQ questions.

  • Synonyms for each ground truth answer are generated using GPT-4.1 to support exact-match reward modeling. All samples include PubMed ID, category, and paraphrase status. The dataset is CC-BY-4.0 licensed and available on Hugging Face.

  • For evaluation, the authors use BioASQ’s factoid subset (1,609 samples), which they augment with answer synonyms using the same LLM method. BioASQ questions are human-written and cover broader question types, but only factoid QAs are used in this work.

  • The retrieval corpus consists of 16M PubMed abstracts (mean 245 words), indexed with BM25 (2.6GB) and e5 (93GB). At inference, e5 requires two A100 GPUs. The corpus and BioASQ data inherit CC-BY-2.5 licensing from BioASQ.

  • The full pipeline cost ~$600 via OpenRouter. Prompts and code are publicly available; the dataset is designed for RLVR training where reward models verify exact match or synonym match without requiring reasoning traces or retrieved document annotations.
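The exact-match-or-synonym reward described in the list above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the normalization steps (lowercasing, stripping punctuation and English articles) are assumptions borrowed from common QA evaluation practice.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace (assumed scheme)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_reward(prediction: str, gold: str, synonyms: list[str]) -> float:
    """Return 1.0 if the prediction matches the gold answer or any synonym, else 0.0."""
    accepted = {normalize(gold)} | {normalize(s) for s in synonyms}
    return 1.0 if normalize(prediction) in accepted else 0.0


# Hypothetical sample: the synonym list plays the role of the LLM-generated synonyms.
print(exact_match_reward("Tumor protein p53.", "TP53", ["p53", "tumor protein p53"]))
```

Because the reward only inspects the final answer string, no reasoning traces or retrieved-document annotations are needed to compute it.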

Method

The authors leverage a reinforcement learning framework with verifiable rewards (RLVR) to train search-capable language models, enabling agents to iteratively reason, issue search queries, and synthesize answers based on retrieved evidence. The training pipeline begins with constructing domain-specific QA datasets, followed by indexing a search corpus, and culminates in policy optimization via RLVR. The core interaction loop is governed by a minimal system prompt that instructs the model to encapsulate reasoning within thinking tags, issue search queries within search tags, and deliver final answers within answer tags. This design intentionally avoids prescribing detailed reasoning strategies, allowing the agent to discover effective behaviors through reward-driven exploration.
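The interaction loop can be sketched roughly as below. This is a sketch under stated assumptions: the tag names `<search>`, `<answer>`, and `<information>` follow a Search-R1-like convention and are not confirmed by this page, and `generate` and `retrieve` are hypothetical callables standing in for the policy LLM and the retriever.

```python
import re
from typing import Callable, Optional

# Assumed tag names; the actual prompt may use different ones.
SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def run_episode(
    question: str,
    generate: Callable[[str], str],        # hypothetical: context -> next model segment
    retrieve: Callable[[str], list[str]],  # hypothetical: query -> retrieved documents
    max_turns: int = 4,
) -> tuple[Optional[str], str]:
    """Alternate generation and retrieval until an answer tag appears."""
    context = question
    for _ in range(max_turns):
        step = generate(context)
        context += step
        answer = ANSWER_RE.search(step)
        if answer:
            return answer.group(1).strip(), context
        query = SEARCH_RE.search(step)
        if query is None:
            break  # neither a search nor an answer: stop the episode
        docs = retrieve(query.group(1).strip())
        context += "<information>" + "\n".join(docs) + "</information>"
    return None, context
```

The loop imposes no reasoning strategy of its own; it only routes search queries to the retriever and stops when an answer is produced.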

Refer to the framework diagram, which illustrates a typical agent trajectory: the model first performs internal reasoning to identify key components of the question, then issues a search query to retrieve relevant documents, integrates the retrieved information into its reasoning trace, and finally produces a verified answer. The retrieved documents are appended to the context but excluded from gradient computation during training, ensuring the policy learns to generate useful queries and synthesize answers rather than memorize retrieval outputs.
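Excluding retrieved tokens from gradient computation is usually done with a per-token loss mask. The following is a minimal pure-Python sketch of that idea. The segment labels and function names are illustrative assumptions, and a real trainer would apply the same mask to tensors of per-token losses.

```python
def loss_mask(segments: list[tuple[str, int]]) -> list[float]:
    """Build a per-token mask: 1.0 for policy-generated tokens, 0.0 for retrieved ones.

    `segments` is a list of (source, n_tokens) pairs, where source is
    'policy' or 'retrieved' (hypothetical labels for this sketch).
    """
    mask: list[float] = []
    for source, n_tokens in segments:
        mask.extend([1.0 if source == "policy" else 0.0] * n_tokens)
    return mask


def masked_mean(per_token_loss: list[float], mask: list[float]) -> float:
    """Average the loss over unmasked (policy-generated) tokens only."""
    total = sum(loss * m for loss, m in zip(per_token_loss, mask))
    count = sum(mask)
    return total / count if count else 0.0


# Two reasoning tokens, three retrieved tokens, one answer token.
mask = loss_mask([("policy", 2), ("retrieved", 3), ("policy", 1)])
print(masked_mean([1.0, 2.0, 9.0, 9.0, 9.0, 3.0], mask))
```

With the mask in place, the large loss values on retrieved tokens do not contribute to the update, so the policy is not trained to reproduce retrieval outputs.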

The training objective maximizes the expected reward over trajectories generated by the policy LLM, $\pi_{\theta}$, conditioned on a retriever $\mathcal{R}$ and a QA dataset $\mathcal{D}$:

$$\max_{\pi_{\theta}} \; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta}(\cdot \mid x; \mathcal{R})} \left[ r_{\phi}(x, y) \right] - \beta \, \mathbb{D}_{\mathrm{KL}} \left[ \pi_{\theta}(y \mid x; \mathcal{R}) \,\|\, \pi_{\mathrm{ref}}(y \mid x; \mathcal{R}) \right]$$

Here, the reward model $r_{\phi}(x, y)$ extracts the final answer from the generated sequence and assigns a binary reward (1 if correct, 0 otherwise). The KL penalty term prevents excessive deviation from the reference policy, which is initialized to the pre-trained LLM state. Optimization is performed using Group Relative Policy Optimization (GRPO), which computes advantages within groups of rollouts to stabilize training. The GRPO objective incorporates clipping and group-normalized advantages to reduce variance and improve sample efficiency.
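Group-normalized advantages can be illustrated with a short sketch: each rollout's reward is standardized against the other rollouts sampled for the same question. The use of the population standard deviation and the small `eps` term are assumptions of this sketch, and implementations differ on both.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize rewards within one group of rollouts for the same question."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one question with binary rewards: two correct, two incorrect.
print(group_advantages([1.0, 0.0, 0.0, 1.0]))
```

When every rollout in a group receives the same reward, all advantages are zero, so that question contributes no policy gradient signal for that step.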

As shown in the figure below, the training data generation process involves two stages: first, the LLM proposes QA categories from an existing dataset (BioASQ), which are then refined by domain experts; second, the model generates new QA pairs from sampled scientific papers, which are paraphrased and stored in the final dataset. This synthetic data pipeline ensures coverage of diverse scientific domains while maintaining factual grounding.
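The second stage of this pipeline can be sketched as below, using the figures given in the Dataset section (three QA pairs per abstract, about half of them paraphrased). The callables `generate_qas` and `paraphrase` are hypothetical stand-ins for the LLM calls, and the record fields and random paraphrase selection are assumptions of this sketch rather than the authors' exact procedure.

```python
import random
from typing import Callable


def build_samples(
    abstracts: dict[str, str],                       # PubMed ID -> abstract text
    generate_qas: Callable[[str, int], list[dict]],  # hypothetical LLM call
    paraphrase: Callable[[str], str],                # hypothetical LLM call
    qas_per_abstract: int = 3,
    paraphrase_rate: float = 0.5,
    seed: int = 0,
) -> list[dict]:
    """Generate QA pairs per abstract and paraphrase roughly half of the questions."""
    rng = random.Random(seed)
    samples = []
    for pmid, text in abstracts.items():
        for qa in generate_qas(text, qas_per_abstract):
            paraphrased = rng.random() < paraphrase_rate
            question = paraphrase(qa["question"]) if paraphrased else qa["question"]
            samples.append({
                "pmid": pmid,
                "question": question,
                "answer": qa["answer"],
                "category": qa["category"],
                "paraphrased": paraphrased,
            })
    return samples
```

Keeping the paraphrase flag on each record makes it possible to compare accuracy on paraphrased and non-paraphrased questions later.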

The agent’s behavior during training is characterized by three distinct reasoning modes: explicit planning and keyword extraction, reasoning before search, and verification of in-parameter knowledge. In the first mode, the agent decomposes the question into subtasks and issues targeted search queries. In the second, it performs preliminary reasoning to narrow the scope before retrieving external information. In the third, it leverages internal knowledge to hypothesize an answer and uses search only for verification. These modes are not hard-coded but emerge from the RLVR training process, as the agent learns to allocate effort between internal reasoning and external retrieval based on reward feedback.

Experiment

  • RLVR training significantly boosts performance on scientific QA tasks compared to baseline methods like direct inference, chain-of-thought, and RAG, especially for factoid questions.
  • RLVR-trained models (Search-R1) outperform RAG by 9.6–14.5 points on PaperSearchQA and 5.5–9.3 points on BioASQ, with gains increasing with model size.
  • Retrieval method (BM25 vs e5) shows minimal performance difference, suggesting keyword-based retrieval suffices in scientific domains due to technical terminology.
  • LLMs retain substantial parametric knowledge of scientific facts, but retrieval remains essential as memorization is incomplete.
  • Paraphrasing questions during dataset construction increases difficulty and better tests generalization, as non-paraphrased questions are easier to answer.
  • Training dynamics mirror general QA settings; base models require more time to converge and are more stable than instruct models under GRPO.
  • Qualitatively, trained agents favor explicit keyword extraction and search planning; early reasoning before search and verification of known answers also occur but diminish with training.
  • After retrieval, models typically answer immediately with little explicit reasoning, possibly due to simplified comprehension needs or RL-induced parameter tuning.
  • Performance varies by category, with “Biomarkers & diagnostics” and “Protein function & signalling” being easiest and “Genetic mutations” most challenging.
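To make the keyword-retrieval point above concrete, here is a minimal pure-Python sketch of BM25 scoring. It is not the index used in the paper: whitespace tokenization, the parameter values `k1=1.5` and `b=0.75`, and the example documents are all assumptions, and the document list must be non-empty.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with Okapi BM25 (whitespace tokens)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "BRCA1 mutations increase breast cancer risk",
    "insulin regulates blood glucose",
    "the cell cycle is regulated by cyclins",
]
scores = bm25_scores("BRCA1 breast cancer", docs)
print(scores.index(max(scores)))
```

Rare technical terms such as gene names appear in few documents and therefore receive a high inverse document frequency, which is one plausible reason lexical matching performs well on this kind of corpus.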

The authors use reinforcement learning with verifiable rewards (RLVR) to train LLMs on scientific question answering, and results show this approach consistently outperforms baseline methods including direct inference, chain-of-thought, and retrieval-augmented generation. Performance gains are more pronounced with larger models, and the method proves effective across both in-domain and external benchmarks. The improvements suggest RLVR enhances the model’s ability to leverage parametric knowledge rather than relying solely on retrieval or reasoning scaffolds.

The authors use RLVR training to improve LLM performance on scientific question answering, showing consistent gains over baseline methods like RAG and chain-of-thought across both 3B and 7B models. Results show that retrieval-augmented approaches significantly outperform retrieval-free ones, and model size correlates with better performance, suggesting parametric knowledge plays a key role. Per-category analysis reveals that performance varies by domain, with “Bioinformatics databases” being easiest and “Genetic mutations” most challenging, while RLVR training consistently delivers the strongest results across categories.

