
DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval

Maojun Sun Yue Wu Yifei Xie Ruijian Han Binyan Jiang Defeng Sun Yancheng Yuan Jian Huang

Abstract

Large language model (LLM) agents can automate data science workflows, yet many rigorous statistical methods implemented in R remain underused because LLMs struggle to retrieve statistical knowledge and tools. Existing retrieval-augmented approaches focus on function-level semantics and ignore data distributions, yielding suboptimal matches. This work proposes DARE (Distribution-Aware Retrieval Embedding), a lightweight, plug-and-play retrieval model that integrates data distribution information into function representations for R package search. The main contributions are: (i) RPKB, a curated R package knowledge base built from 8,191 high-quality CRAN packages; (ii) DARE, an embedding model that fuses distribution features with function metadata to improve retrieval relevance; and (iii) RCodingAgent, an R-specific LLM agent that provides reliable R code generation and a suite of statistical analysis tasks for systematically evaluating LLM agents in realistic analysis scenarios. Empirical evaluation shows that DARE achieves 93.47% NDCG@10, outperforming state-of-the-art open-source embedding models on package retrieval by up to 17% while using far fewer parameters. Integrating DARE into RCodingAgent yields substantial gains on downstream analysis tasks. This work helps bridge the gap between LLM-driven automation and the mature R statistical ecosystem.

One-sentence Summary

Researchers from multiple institutions propose DARE, a lightweight retrieval model that uniquely integrates data distribution features with function metadata to enhance R package search. This approach significantly outperforms existing methods and powers RCodingAgent, bridging the gap between LLM automation and the mature R statistical ecosystem.

Key Contributions

  • LLM agents often fail to utilize rigorous R statistical methods because existing retrieval approaches ignore data distribution characteristics, leading to suboptimal tool matches and code generation errors.
  • The authors introduce DARE, a lightweight embedding model that fuses data distribution features with function metadata to improve retrieval relevance, alongside RPKB, a curated knowledge base of 8,191 high-quality CRAN packages.
  • Empirical results show DARE achieves an NDCG@10 of 93.47%, outperforming state-of-the-art models by up to 17% on package retrieval, while integration into the RCodingAgent boosts downstream analysis task performance by up to 56.25%.

Introduction

Large Language Model agents are increasingly used to automate data science workflows, yet they struggle to leverage the rigorous statistical methods available in the R ecosystem due to training data biases toward Python and a lack of understanding regarding statistical tool compatibility. Prior retrieval-augmented approaches fail because they rely solely on semantic similarity between queries and function descriptions, ignoring critical data distribution characteristics such as sparsity or dimensionality that determine whether a statistical method is applicable. To address this, the authors introduce DARE, a lightweight retrieval model that fuses data distribution features with function metadata to improve R package selection, alongside the RPKB knowledge base and the RCodingAgent framework for end-to-end statistical analysis.

Dataset

  • Dataset Composition and Sources The authors constructed a specialized knowledge base called RPKB by curating R packages from the Comprehensive R Archive Network (CRAN). The final repository contains 8,191 high-quality R functions indexed in ChromaDB, focusing strictly on core statistical primitives and computational algorithms while excluding generic utility functions or those with vague descriptions.

  • Key Details for Each Subset The dataset is organized at the function level and enriched with synthesized metadata. Each entry includes granular details such as function descriptions, usage, arguments, and return values. Crucially, the authors used an LLM to generate a "data profile" for every function, inferring attributes like data modality, distribution assumptions, dimensionality, and specific constraints (e.g., handling missing values or data types).

  • Model Usage and Training Strategy The authors utilize this corpus to train a semantic search engine for statistical programming. They employed a data augmentation strategy where an LLM generated 30 diverse user-style search queries for each function. These prompts were designed to describe data problems and constraints without revealing the function or package names, ensuring the model learns to retrieve tools based on analytical intent rather than keyword matching.

  • Processing and Evaluation Framework For evaluation, the team created a benchmark of 16 representative statistical analysis tasks covering domains like hypothesis testing and survival analysis. They extracted real R scripts from the repository, executed them to verify ground-truth outputs, and then prompted LLMs to generate natural language queries paired with these verified results. The evaluation queries enforce strict constraints, such as requiring a specific random seed and mandating the printing of a specific ground-truth metric to ensure reproducibility and accurate assessment of agent performance.
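The function-level entries described above can be pictured concretely. Below is a minimal sketch of what one RPKB entry might look like; the field names, the example function, and the profile values are illustrative assumptions based on the description above, not the authors' exact schema (the paper generates 30 queries per function; only two are shown here):

```python
# Illustrative sketch of a single RPKB function entry. Field names and
# values are assumptions, not the authors' published schema.
rpkb_entry = {
    "package": "survival",
    "function": "coxph",
    "description": "Fits a Cox proportional hazards regression model.",
    "usage": "coxph(formula, data, na.action, ...)",
    "arguments": {
        "formula": "a Surv object on the left of a ~ operator",
        "data": "a data.frame containing the model variables",
    },
    "returns": "an object of class 'coxph'",
    # LLM-synthesized "data profile" capturing distributional applicability
    "data_profile": {
        "modality": "tabular",
        "distribution_assumptions": ["proportional hazards"],
        "dimensionality": "n observations, moderate p",
        "constraints": {"missing_values": "handled via na.action",
                        "response_type": "censored survival times"},
    },
    # User-style queries that describe the data problem without naming
    # the function or package (2 of the 30 generated per function)
    "synthesized_queries": [
        "model time-to-event data with right censoring and covariates",
        "regression for survival outcomes without assuming a baseline hazard",
    ],
}
```

Entries like this are what get embedded and indexed for retrieval; the data profile is what lets the retriever discriminate between statistically similar but distributionally incompatible tools.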

Method

The proposed framework, DARE, addresses the limitations of standard semantic retrieval by incorporating structured data profiles alongside natural language descriptions. As illustrated in the framework diagram, the system contrasts with traditional approaches that rely solely on function descriptions. Instead, DARE conditions the retrieval process on both the user's natural language query and a structured data profile derived from the dataset characteristics.

The core of the method utilizes a bi-encoder architecture with shared weights, initialized from a pre-trained sentence transformer. The authors define the shared encoder network as $\varepsilon(\cdot)$, which maps input texts into a shared vector space. On the query side, the system concatenates the natural language request $q$ with the query-side data profile $c_q$ to produce a query embedding $\mathbf{e}_q = \varepsilon([q; c_q])$. Similarly, each candidate function in the database is represented by its documentation $d$ and its inherent data profile $c_d$, giving a function embedding $\mathbf{e}_f = \varepsilon([d; c_d])$. The relevance score is computed as the cosine similarity between these representations:
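The shared-weight bi-encoder can be sketched as follows. This is a toy stand-in, assuming a deterministic hashing bag-of-words embedder in place of the real pre-trained sentence transformer; what it illustrates is the structure (one shared encoder, profile-augmented inputs on both sides, cosine scoring), not the actual model:

```python
import numpy as np

def toy_encoder(text: str, dim: int = 64) -> np.ndarray:
    """Stand-in for the shared encoder ε(·): a bag-of-words hashing
    embedder. The real system uses a pre-trained sentence transformer."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    return v

def embed_query(q: str, c_q: str) -> np.ndarray:
    # e_q = ε([q; c_q]): concatenate query and data profile before encoding
    return toy_encoder(q + " ; " + c_q)

def embed_function(d: str, c_d: str) -> np.ndarray:
    # e_f = ε([d; c_d]): same shared encoder applied to docs + profile
    return toy_encoder(d + " ; " + c_d)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Usage: `cosine(embed_query(q, c_q), embed_function(d, c_d))` gives the relevance score $s(\mathbf{e}_q, \mathbf{e}_f)$; note that both sides pass through the same encoder, which is what "shared weights" means here.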

$$
s\big(\mathbf{e}_q, \mathbf{e}_f\big) = \cos(\mathbf{e}_q, \mathbf{e}_f) = \frac{\mathbf{e}_q^{\top} \mathbf{e}_f}{\|\mathbf{e}_q\|_2 \, \|\mathbf{e}_f\|_2}.
$$

This factorization enables efficient retrieval via Maximum Inner Product Search over precomputed function embeddings. To train the model, the authors employ the InfoNCE objective with in-batch negatives. Given a mini-batch of size $N$, the loss for the $i$-th sample treats the paired function as the positive sample and all other functions in the batch as negatives:
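Because function embeddings can be precomputed and L2-normalized, retrieval reduces to one matrix-vector product followed by a top-k selection (with unit norms, maximum inner product equals maximum cosine similarity). A minimal sketch, assuming a toy random corpus in place of the real RPKB index, and brute-force search where a production system would likely use an approximate nearest-neighbor index:

```python
import numpy as np

# Precomputed, L2-normalized function embeddings (one row per function).
rng = np.random.default_rng(0)
F = rng.normal(size=(1000, 64))
F /= np.linalg.norm(F, axis=1, keepdims=True)

def retrieve_top_k(e_q: np.ndarray, k: int = 10) -> np.ndarray:
    """Return indices of the k functions with highest cosine similarity."""
    q = e_q / np.linalg.norm(e_q)
    scores = F @ q                   # one matrix-vector product over the index
    return np.argsort(-scores)[:k]   # brute-force top-k by inner product
```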

$$
\mathcal{L}_i = -\log \frac{\exp\left(\cos(\mathbf{e}_{q_i}, \mathbf{e}_{f_i}) / \tau\right)}{\sum_{j=1}^{N} \exp\left(\cos(\mathbf{e}_{q_i}, \mathbf{e}_{f_j}) / \tau\right)},
$$

where $\tau$ is a learnable temperature parameter. As shown in the figure below, this process involves encoding both user queries and function documentation into embeddings, calculating an in-batch similarity matrix, and optimizing the objective to maximize similarity for matched pairs while minimizing it for unmatched pairs.
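The in-batch InfoNCE objective can be sketched in NumPy as below. This is a forward-pass sketch with a fixed temperature; in the actual training setup $\tau$ is learnable and the loss is backpropagated through the shared encoder:

```python
import numpy as np

def info_nce_loss(E_q: np.ndarray, E_f: np.ndarray, tau: float = 0.05) -> float:
    """In-batch InfoNCE: row i of E_q pairs with row i of E_f as the
    positive; the other N-1 rows of E_f act as in-batch negatives."""
    # Normalize rows so the inner product equals cosine similarity
    Q = E_q / np.linalg.norm(E_q, axis=1, keepdims=True)
    F = E_f / np.linalg.norm(E_f, axis=1, keepdims=True)
    S = (Q @ F.T) / tau                        # N x N similarity matrix
    S -= S.max(axis=1, keepdims=True)          # numerical stability
    log_probs = S - np.log(np.exp(S).sum(axis=1, keepdims=True))
    return float(-np.diag(log_probs).mean())   # average L_i over the batch
```

A sanity check on the design: when query and function embeddings line up row-for-row, the diagonal of the similarity matrix dominates and the loss approaches zero; any mismatch between a query and its paired function drives the loss up.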

To facilitate end-to-end statistical analysis, the retrieval module is integrated into an LLM-based agent named RCodingAgent. The agent operates by first invoking DARE to retrieve candidate R packages and functions that satisfy both analytical intent and data compatibility constraints. The retrieved functions are returned with structured metadata, including argument specifications and usage examples, which are injected into the LLM context. This enables the agent to perform iterative reasoning, tool retrieval, and R code generation. The workflow demonstrates how the agent processes a user query, retrieves the top relevant functions, and executes coding steps to produce final observations.
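The agent loop described above can be sketched schematically. Everything here is a hypothetical stub: `retrieve`, `llm_generate`, the metadata fields, and the example R snippet stand in for DARE retrieval, the LLM, and an R execution sandbox; the real RCodingAgent interfaces are not specified in this summary:

```python
# Minimal sketch of an RCodingAgent-style step: retrieve tools, inject
# their metadata into the LLM context, and get R code back. All names
# and return shapes are illustrative assumptions.

def retrieve(query: str, data_profile: str, k: int = 5) -> list[dict]:
    # Stub: DARE would return the top-k functions with structured metadata
    # (argument specifications and usage examples).
    return [{"function": "stats::wilcox.test",
             "args": "x, y, alternative",
             "example": "wilcox.test(x, y)"}]

def llm_generate(context: str) -> str:
    # Stub: the LLM writes R code given the query plus retrieved metadata.
    return 'set.seed(42); print(wilcox.test(x, y)$p.value)'

def agent_step(user_query: str, data_profile: str) -> str:
    tools = retrieve(user_query, data_profile)
    context = user_query + "\nAvailable tools:\n" + "\n".join(
        f'{t["function"]}({t["args"]})  e.g. {t["example"]}' for t in tools)
    return llm_generate(context)   # R code for the executor to run next
```

In the full agent this step repeats: the executor's observations are fed back into the context, allowing iterative reasoning, further retrieval, and code revision until the analysis completes.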

The system is evaluated using a benchmark pipeline that collects R packages and synthesizes queries to test retrieval and execution capabilities. The pipeline involves collecting functions from a database, executing code to generate ground truth outputs, and filtering results based on criteria such as dataset provision, answer consistency, and numerical priority. The benchmark covers a diverse range of statistical tasks, including hypothesis testing, density estimation, and data transformation, ensuring comprehensive evaluation of the retrieval system's ability to handle complex data science workflows.

Experiment

  • Synthetic query generation and retrieval benchmarking validate that DARE achieves state-of-the-art performance in identifying and ranking statistical functions, significantly outperforming larger general-purpose embedding models by effectively distinguishing between statistically similar yet distributionally distinct tools.
  • Efficiency analysis demonstrates that DARE delivers superior throughput and ultra-low latency compared to heavy baselines, confirming its suitability for real-time agentic workflows where rapid tool retrieval is critical.
  • Integration experiments across diverse LLM agents on statistical analysis tasks reveal that DARE substantially improves end-to-end success rates, effectively bridging the gap in tool utilization for both lightweight and frontier models by providing precise, distribution-aware retrieval signals.
