HyperAIHyperAI

Command Palette

Search for a command to run...

DR-Venus: 단 10K개의 오픈 데이터만으로 구현하는 프런티어 엣지 규모(Edge-Scale)의 딥 리서치 agent를 향하여

초록

소형 언어 모델(Small Language Models) 기반의 에지 규모(Edge-scale) 딥 리서치 agent는 비용, 지연 시간(latency), 그리고 개인정보 보호 측면의 이점 덕분에 실제 환경 배포에 매우 매력적입니다. 본 연구에서는 데이터 품질과 데이터 활용도를 모두 개선함으로써, 제한된 오픈 데이터 환경에서 강력한 소형 딥 리서치 agent를 학습시키는 방법을 연구합니다.우리는 전체를 오픈 데이터로 구축한 에지 규모 배포용 최첨단 4B 딥 리서치 agent인 DR-Venus를 제안합니다. 우리의 학습 레시피는 두 단계로 구성됩니다. 첫 번째 단계에서는 agentic supervised fine-tuning (SFT)을 사용하여 기본적인 agentic 능력을 구축하며, 엄격한 데이터 정제와 장기 궤적(long-horizon trajectories)의 재샘플링을 결합하여 데이터 품질과 활용도를 높입니다. 두 번째 단계에서는 agentic reinforcement learning (RL)을 적용하여 장기적인 딥 리서치 작업에서의 실행 신뢰도를 더욱 향상시킵니다.이러한 환경에서 소형 agent를 위한 RL을 효과적으로 만들기 위해, 우리는 IGPO를 기반으로 정보 이득(information gain)과 포맷 인식 규제화(format-aware regularization)에 기반한 턴 단위 보상(turn-level rewards)을 설계하였으며, 이를 통해 감독 밀도(supervision density)와 턴 단위 크레딧 할당(turn-level credit assignment)을 강화했습니다. 약 1만 개의 오픈 데이터로만 구축된 DR-Venus-4B는 다양한 딥 리서치 벤치마크에서 9B 파라미터 미만의 기존 agentic 모델들을 크게 능가하는 성능을 보였으며, 동시에 훨씬 더 큰 30B급 시스템과의 격차도 좁혔습니다.추가 분석 결과, 4B 규모의 agent가 이미 놀라울 정도로 강력한 성능 잠재력을 보유하고 있음을 확인했으며, 이는 소형 모델의 배포 가능성과 더불어 이 환경에서 테스트 시간 스케일링(test-time scaling)의 가치를 강조합니다. 우리는 에지 규모 딥 리서치 agent에 대한 재현 가능한 연구를 지원하기 위해 모델, 코드 및 핵심 레시피를 공개합니다.

One-sentence Summary

The Venus Team introduces DR-Venus, a frontier 4B deep research agent designed for edge-scale deployment that utilizes only 10K open data samples through a two-stage training recipe consisting of agentic supervised fine-tuning and agentic reinforcement learning based on information gain and format-aware regularization to significantly outperform prior models under 9B parameters and approach the performance of 30B-class systems.

Key Contributions

  • The paper introduces DR-Venus, a 4B parameter deep research agent designed for edge-scale deployment that is trained entirely on approximately 10K open-source data samples.
  • The researchers develop a two-stage training recipe consisting of agentic supervised fine-tuning (SFT) using cleaned and resampled long-horizon trajectories, followed by agentic reinforcement learning (RL) that utilizes turn-level rewards based on information gain and format-aware regularization.
  • Experimental results demonstrate that DR-Venus-4B significantly outperforms existing agentic models under 9B parameters on multiple deep research benchmarks and narrows the performance gap compared to much larger 30B-class systems.

Introduction

Deep research agents capable of iterative search and evidence synthesis are essential for complex information-seeking tasks, yet deploying them on edge devices requires small language models to optimize for cost, latency, and privacy. Current state-of-the-art research agents typically rely on massive parameter counts or closed-source datasets, leaving a gap in high-performance, small-scale agents trained on open data. Small models are particularly vulnerable to noisy training trajectories and face significant challenges during reinforcement learning, where sparse rewards often lead to training instability. The authors address these issues by introducing DR-Venus, a 4B parameter agent trained through a two-stage process of agentic supervised fine-tuning with trajectory resampling and agentic reinforcement learning using turn-level rewards. This approach significantly improves data utilization and supervision density, allowing the 4B model to outperform larger models under 9B parameters and narrow the performance gap with 30B-class systems.

Dataset

The authors developed the DR-Venus dataset to train edge-scale deep research agents using a highly curated selection of open-source data. The dataset details are as follows:

  • Dataset Composition and Sources: The authors start with 10,001 raw REDSearcher trajectories. After a multi-stage cleaning and refinement process, the final training set consists of 18,745 instances.
  • Data Processing and Filtering:
    • Environment Alignment: All trajectories are converted into a standardized interaction format, including specific message schemas, system prompts, and tool-call/response protocols to ensure consistency between training and online inference.
    • Tool Pruning and Deduplication: To match the deployment environment, the authors remove all tool interactions except for search and browse. They prune disallowed tool calls at the turn level rather than discarding entire trajectories. They also remove 15,728 duplicate tool interactions, primarily redundant browse events. This stage leaves 10,000 valid trajectories.
    • Correctness Filtering: The authors use Qwen3-235B-A22B-Instruct-2507 as a judge model to verify the accuracy of the final answers. This step retains 9,365 high-quality trajectories.
  • Training Strategy and Resampling: To emphasize the long-horizon planning required for deep research, the authors employ a turn-aware resampling strategy during SFT. They assign sampling weights of 1x for trajectories with 0 to 50 turns, 2x for 51 to 100 turns, and 5x for trajectories exceeding 100 turns. This process expands the final training set to 18,745 instances and significantly increases the proportion of long-horizon trajectories (those with over 100 turns) from 13.29% to 33.21%.

Method

The authors leverage a two-stage training framework to develop DR-Venus, a deep research agent capable of solving complex information-seeking tasks through long-horizon interaction with external environments. The overall approach is structured around two complementary objectives: improving training data quality and enhancing data utilization. The framework begins with a formulation of the deep research task as a long-horizon reasoning-and-acting problem, where the agent must iteratively reason, invoke tools, gather evidence, and produce a final answer. This process is grounded in a formal environment E\mathcal{E}E equipped with executable actions A\mathcal{A}A, including search, browse, and answer actions. At each turn ttt, the agent generates a turn output ut=(τt,at)u_t = (\tau_t, a_t)ut=(τt,at), where τt\tau_tτt represents the intermediate reasoning and ata_tat is the corresponding tool or answer action. The interaction history h<th_{<t}h<t is updated based on the agent's output and the environment's response, forming a trajectory HHH that captures the full sequence of reasoning, actions, and observations.

The first stage of the training pipeline is agentic supervised fine-tuning (SFT), which initializes the model with basic agentic capabilities. The authors use trajectories from REDSearcher, a previously established system, but apply a multi-stage data filtering and construction pipeline to mitigate redundancy, structural mismatches, and noisy supervision. The cleaned trajectories are serialized into autoregressive sequences, and the model is trained using a standard next-token prediction objective, with the loss computed only over the agent-generated tokens—specifically, the reasoning traces τt\tau_tτt and actions ata_tat—while masking out environment observations oto_tot. This ensures that the model learns the structured interaction pattern of reasoning, tool use, and answer generation without being distracted by irrelevant or noisy environmental data. The SFT stage provides a stable foundation for long-horizon interaction, enabling the model to effectively utilize limited open-data supervision.

The second stage employs agentic reinforcement learning (RL) to refine the model’s performance and address residual failure modes such as formatting errors, redundant reasoning, and inefficient tool use. To overcome the scarcity of high-quality open-source RL data, the authors adopt Information Gain-based Policy Optimization (IGPO), an algorithm that constructs dense turn-level reward signals to improve data efficiency. The core of this approach is the information gain (IG) reward, which evaluates each turn based on how much it increases the model's probability of generating the ground truth answer. Formally, the IG reward for turn ttt is defined as the difference in the log probability of the ground truth sequence under the current policy before and after the turn, capturing the extent to which the turn contributes to improved confidence in the correct answer. This reward is computed only for non-answer turns, with the answer turn relying on outcome rewards based on correctness.

To further enhance the reward design, the authors introduce a browse-aware IG assignment strategy, which computes IG rewards exclusively on browse turns and propagates them to all preceding search turns since the last browse action. This reflects the differing informational value of browse and search actions in deep research tasks. Additionally, a turn-level format penalty is applied to ensure consistent formatting across all turns, avoiding the coarse-grained penalties that can unfairly penalize correctly formatted turns in long trajectories. The format-adjusted rewards are then normalized within each rollout group to balance the scales of turn-level IG and outcome rewards, and an optional IG-Scale mechanism adaptively rescales the IG rewards to maintain balance in the presence of weak outcome supervision.

The final reward signal is a discounted cumulative reward R~i,t\tilde{R}_{i,t}R~i,t, which incorporates future reward information by applying a discount factor γ\gammaγ to the normalized and scaled turn-level rewards. This dense supervision signal is assigned to every token in the policy-generated output at turn ttt, enabling effective credit assignment across the interaction horizon. The policy optimization is performed using the IGPO objective, which combines a GRPO-style approach with turn-level credit assignment. The objective function includes a clipped surrogate term to ensure stable updates and a KL divergence penalty to prevent excessive deviation from the reference policy. This comprehensive reward and optimization framework enables DR-Venus to achieve improved execution reliability and push performance toward the frontier of long-horizon agentic tasks.

Experiment

The researchers evaluated DR-Venus across six benchmarks covering deep research, web browsing, and multi-step information seeking to validate the effectiveness of an agentic SFT and RL training pipeline. Results demonstrate that high-quality supervised fine-tuning on open data allows a small 4B model to establish a strong baseline that rivals much larger agents, while agentic RL further enhances reliability, tool-use stability, and long-horizon performance. Ultimately, the findings suggest that effective data utilization and dense reward designs can compensate for model scale, enabling edge-scale agents to achieve highly competitive research capabilities.

The authors compare DR-Venus, a small-model agent, against various baselines on two benchmarks, showing that DR-Venus-4B-SFT establishes a strong baseline and DR-Venus-4B-RL further improves performance. The results indicate that effective training on open data and reinforcement learning can enable small models to achieve competitive performance, often surpassing larger models, particularly in long-horizon tasks requiring deep research and tool use. DR-Venus-4B-SFT outperforms several 4B–9B agents and matches or exceeds larger 30B-scale models on multiple benchmarks. Agentic RL improves performance across most benchmarks, with gains linked to better formatting and more reliable tool use. DR-Venus-4B-RL achieves high performance on BrowseComp-ZH, surpassing larger models and demonstrating strong capability under test-time scaling.

The authors compare the performance of DR-Venus models with different training configurations on two benchmarks, highlighting the impact of resampling and reinforcement learning. Results show that adding resampling during supervised fine-tuning improves performance over the baseline, and that reinforcement learning with IGPO further enhances results, particularly on BrowseComp. The improvements are attributed to better tool use and more reliable long-horizon execution. Resampling during supervised fine-tuning improves performance over the baseline on both benchmarks. Reinforcement learning with IGPO leads to consistent gains, while GRPO shows little to no improvement. The model's performance is strongly correlated with effective tool use, particularly browsing, which is enhanced by reinforcement learning.

The authors analyze the tool usage behavior of their agent across multiple benchmarks, focusing on the browse ratio for correct and incorrect trajectories under supervised fine-tuning (SFT) and reinforcement learning (RL) checkpoints. Results show that successful trajectories consistently exhibit higher browse ratios than failed ones, indicating that deeper evidence inspection is crucial for task success. Reinforcement learning enhances this pattern by increasing the overall browse ratio and strengthening the alignment between browsing behavior and task correctness, particularly in scenarios where the SFT model previously showed counterintuitive trends. Correct trajectories consistently show higher browse ratios than wrong trajectories across all benchmarks, indicating deeper evidence gathering is linked to success. Reinforcement learning increases the overall browse ratio and strengthens the gap between correct and wrong trajectories, improving tool use alignment with task outcomes. RL corrects counterintuitive patterns in tool use, such as wrong trajectories browsing more than correct ones, by promoting more effective evidence gathering.

The authors present a comprehensive evaluation of DR-Venus, a small-scale deep research agent, comparing its performance against larger foundation models and trained agents across multiple benchmarks. Results show that DR-Venus-4B-SFT establishes a strong baseline, outperforming several 30B-scale models on individual benchmarks, while DR-Venus-4B-RL further improves performance, achieving competitive results with larger models and demonstrating the effectiveness of reinforcement learning in enhancing long-horizon task reliability. The analysis highlights that high-quality training data and effective data utilization can compensate for model scale limitations, and that reinforcement learning enhances tool use, particularly browsing, to improve evidence gathering and task success. DR-Venus-4B-SFT outperforms several 30B-scale agents on individual benchmarks, demonstrating that model scale is not the sole determinant of performance. DR-Venus-4B-RL improves upon the SFT baseline across most benchmarks, with gains attributed to better formatting accuracy, tool use, and execution reliability. Reinforcement learning increases the browse ratio, particularly for correct trajectories, indicating improved evidence gathering and more effective tool use.

The authors evaluate the DR-Venus small-model agent against various baselines and larger foundation models across multiple benchmarks to validate the impact of supervised fine-tuning and reinforcement learning. The results demonstrate that effective training on open data and agentic reinforcement learning allow small models to achieve competitive performance, often surpassing much larger models in long-horizon research tasks. Specifically, reinforcement learning improves task success by enhancing tool use reliability and promoting deeper evidence gathering through increased browsing behavior.


AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩
바로 사용 가능한 GPU
최적의 가격

HyperAI Newsletters

최신 정보 구독하기
한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다
이메일 서비스 제공: MailChimp