RecGPT-V2 Technical Report

Abstract

Large language models (LLMs) have shown remarkable potential for transforming recommendation systems from implicit behavioral pattern matching to explicit intent reasoning. RecGPT-V1 pioneered this paradigm by integrating LLM-based reasoning into user interest extraction and item tag prediction, but it suffers from four fundamental limitations: (1) computational inefficiency and cognitive redundancy across multiple reasoning routes; (2) limited explanation diversity caused by fixed-template generation; (3) restricted generalization under a supervised learning framework; and (4) simplistic outcome-oriented evaluation misaligned with human standards. To address these challenges, this work proposes RecGPT-V2 with four key innovations. First, a hierarchical multi-agent system restructures intent reasoning through coordinated collaboration, eliminating cognitive redundancy while covering diverse intents; combined with hybrid representation inference that compresses user behavioral context, it reduces GPU consumption by 60% and improves exclusive recall from 9.39% to 10.99%. Second, a meta-prompting framework generates contextually adaptive prompts, improving explanation diversity by +7.3%. Third, constrained reinforcement learning mitigates conflicts among multiple rewards, achieving +24.1% gains in tag prediction accuracy and +13.0% in explanation acceptance rate. Fourth, an agent-as-a-judge framework decomposes evaluation into multi-step reasoning, improving alignment with human preferences. Online A/B tests on Taobao confirm significant improvements of +2.98% CTR, +3.71% IPV, +2.19% TV, and +11.46% NER. RecGPT-V2 establishes the technical feasibility and commercial viability of deploying LLM-powered intent reasoning at scale, bridging the gap between cognitive exploration and industrial practicality.

One-sentence Summary

The authors propose RecGPT-V2, which overcomes RecGPT-V1's limitations through a Hierarchical Multi-Agent System reducing GPU consumption by 60% while improving exclusive recall from 9.39% to 10.99%, Meta-Prompting for +7.3% explanation diversity, constrained reinforcement learning achieving +24.1% tag prediction gains, and Agent-as-a-Judge evaluation, demonstrating significant online improvements including +2.98% CTR and +3.71% IPV in Taobao deployments.

Key Contributions

  • RecGPT-V2 addresses computational inefficiency and redundant reasoning in multi-route architectures by introducing a Hierarchical Multi-Agent System that coordinates intent reasoning and Hybrid Representation Inference to compress user-behavior contexts, reducing GPU consumption by 60% and improving exclusive recall from 9.39% to 10.99%.
  • To overcome homogenized explanations and weak temporal adaptation from static templates, the framework implements Meta-Prompting for dynamically generating context-aware prompts and preference-aware reinforcement learning, boosting explanation diversity by 7.3% and capturing seasonal trends like Halloween and winter products in live deployments.
  • The Agent-as-a-Judge framework replaces simplistic evaluation with multi-step reasoning for human-aligned quality assessment, resolving reward conflicts through constrained reinforcement learning and achieving significant online improvements including +24.1% tag prediction accuracy, +13.0% explanation acceptance, and +2.98% CTR in Taobao A/B tests.

Introduction

Recommendation systems need personalized explanations to boost user engagement with suggested items, but prior template-based approaches such as RecGPT-V1 suffer from critical flaws: low information density with repetitive generic phrases, an inability to adapt to seasonal trends or context, and monotonous stylistic outputs caused by static prompt templates and insufficient evaluation frameworks. The authors address these issues with Meta-Prompting, which dynamically synthesizes context-aware templates, and with preference-aware reinforcement learning, which optimizes generation through multi-reward modeling. Together, these innovations shift explanation generation from rigid templating to adaptive reasoning, significantly improving engagement and satisfaction.

Dataset

The source material does not describe the dataset composition, sources, subsets, or preprocessing steps; it only contains a system instruction for generating verification questions about product title embeddings, so no dataset characteristics or usage methodology can be summarized here.

Method

The authors leverage a comprehensive, multi-component architecture in RecGPT-V2 to overcome the computational inefficiency, cognitive redundancy, and evaluation limitations of its predecessor. The system is structured around three core innovations: Agentic Intent Reasoning, Dynamic Explanation Generation, and an Agentic Judge Framework, all operating on a compressed hybrid representation of user context.

The foundational layer of the system is the Hybrid Representation Inference, which addresses the token explosion inherent in processing long user behavior sequences. Instead of feeding raw text into the LLM, RecGPT-V2 employs Atomized Entity Compression. This technique encodes item descriptions and query histories into dense vector representations using pretrained embedding models (e.g., BGE, Qwen3-Embedding). These vectors are then projected into the LLM’s input space via a lightweight, trainable adaptor network, replacing multi-token descriptions with a single atomic token, denoted as [entity]. As shown in the figure below, this process reduces a 21,349-token user profile to a 5,158-token hybrid context, achieving a 76% token reduction while preserving user attributes and temporal metadata in natural language. This compression is critical for enabling efficient inference.
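To make the compression step concrete, below is a minimal sketch of Atomized Entity Compression, assuming a frozen pretrained embedding model and a small trainable MLP adaptor. The class names, dimensions, and placeholder handling are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of Atomized Entity Compression: each item/query description is
# encoded once by a frozen embedding model, then a small trainable adaptor maps
# that vector into the LLM's input-embedding space so it occupies a single
# "[entity]" slot instead of dozens of text tokens.
# All class names and dimensions here are illustrative assumptions.
import torch
import torch.nn as nn


class EntityAdaptor(nn.Module):
    """Trainable projection from the embedding-model space to the LLM hidden size."""

    def __init__(self, emb_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(emb_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, entity_emb: torch.Tensor) -> torch.Tensor:
        # (num_entities, emb_dim) -> (num_entities, llm_dim)
        return self.proj(entity_emb)


def build_hybrid_context(text_token_embs: torch.Tensor,
                         entity_embs: torch.Tensor,
                         entity_positions: list[int],
                         adaptor: EntityAdaptor) -> torch.Tensor:
    """Splice one projected vector per entity into the token-embedding sequence.

    text_token_embs: (seq_len, llm_dim) embeddings of the natural-language part
                     (user attributes, temporal metadata, "[entity]" placeholders).
    entity_embs:     (num_entities, emb_dim) frozen embeddings of item/query texts.
    entity_positions: index of each "[entity]" placeholder token in the sequence.
    """
    atomic_tokens = adaptor(entity_embs)       # one vector per entity
    hybrid = text_token_embs.clone()
    for slot, pos in enumerate(entity_positions):
        hybrid[pos] = atomic_tokens[slot]      # replace the placeholder embedding
    return hybrid
```

In this sketch, only the adaptor is trained; the natural-language portion of the profile keeps its ordinary token embeddings, which is what allows attributes and temporal metadata to stay readable to the LLM while behavior sequences are compressed.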

Building upon this efficient representation, the Agentic Intent Reasoning module restructures the intent decomposition process. RecGPT-V1’s parallel, isolated LLM routes are replaced by a Hierarchical Multi-Agent System (HMAS) comprising a Global Planner, Distributed Experts, and a Decision Arbiter. The Global Planner receives the compressed hybrid context—comprising user behavior, profile, and real-time environmental signals (e.g., weather, trends)—and performs a single, holistic analysis to decompose the user’s intent into a set of specialized personas. This eliminates the redundant full-context encoding performed by each route in RecGPT-V1. Each persona is then assigned to a dedicated Expert agent, which operates under that persona to predict a set of item tags. The Decision Arbiter synthesizes the outputs from all experts, performing joint reasoning over the entire candidate pool to select the final, non-redundant set of tags for downstream retrieval. This coordinated, three-tier architecture is illustrated in the figure below, contrasting the isolated routes of RecGPT-V1 with the collaborative flow of RecGPT-V2.
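The three-tier coordination can be sketched as a simple control flow, as below. The prompt wording and the `call_llm` helper are hypothetical, standing in for whatever LLM endpoint and prompt templates the production system actually uses.

```python
# Minimal sketch of the Hierarchical Multi-Agent System control flow:
# one Global Planner call decomposes the compressed context into personas,
# one Expert call per persona predicts candidate tags, and a single Decision
# Arbiter call selects a non-redundant final tag set.
import json


def call_llm(prompt: str) -> str:
    """Hypothetical helper wrapping an LLM endpoint; assumed to return JSON text."""
    raise NotImplementedError


def agentic_intent_reasoning(hybrid_context: str, env_signals: str) -> list[str]:
    # 1. Global Planner: one holistic pass over the compressed hybrid context.
    planner_out = call_llm(
        "Analyze the user context and environment, and decompose the user's "
        "intent into a small set of specialized personas (JSON list).\n"
        f"Context: {hybrid_context}\nEnvironment: {env_signals}"
    )
    personas = json.loads(planner_out)

    # 2. Distributed Experts: each persona predicts its own tag candidates.
    candidate_pool = {}
    for persona in personas:
        expert_out = call_llm(
            f"Acting as the persona '{persona}', predict item tags the user "
            f"is likely to engage with (JSON list).\nContext: {hybrid_context}"
        )
        candidate_pool[persona] = json.loads(expert_out)

    # 3. Decision Arbiter: joint reasoning over all candidates, deduplicated.
    arbiter_out = call_llm(
        "Given per-persona tag candidates, select a final non-redundant tag set "
        f"for retrieval (JSON list).\nCandidates: {json.dumps(candidate_pool)}"
    )
    return json.loads(arbiter_out)
```

The key contrast with RecGPT-V1 is visible in the structure itself: the full context is encoded once by the planner rather than once per route, and deduplication happens in a single arbiter pass over the whole candidate pool.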

To generate personalized and contextually adaptive explanations, RecGPT-V2 introduces a Meta-Prompting framework. This two-stage process first synthesizes a stylistic guideline based on user interests, item attributes, and situational signals. The guideline specifies the desired tone, rhetorical devices, and emotional resonance. In the second stage, the model generates the final explanation conditioned on this guideline, enabling role-playing across diverse stylistic personas. This approach moves beyond RecGPT-V1’s fixed templates, significantly improving explanation diversity.
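A minimal sketch of this two-stage flow is shown below. The prompt wording is illustrative rather than the authors' templates, and `call_llm` is any hypothetical text-in/text-out LLM client supplied by the caller.

```python
# Minimal sketch of two-stage Meta-Prompting: stage one synthesizes a
# context-aware stylistic guideline, stage two generates the explanation
# conditioned on that guideline.
from typing import Callable


def generate_explanation(call_llm: Callable[[str], str],
                         user_interests: str,
                         item_attrs: str,
                         situation: str) -> str:
    # Stage 1: synthesize the stylistic guideline (the "meta-prompt").
    guideline = call_llm(
        "Write a short stylistic guideline (tone, rhetorical devices, emotional "
        "resonance) for a recommendation explanation.\n"
        f"User interests: {user_interests}\n"
        f"Item attributes: {item_attrs}\n"
        f"Situational signals: {situation}"
    )

    # Stage 2: role-play under the guideline to produce the final explanation.
    return call_llm(
        f"Following this guideline:\n{guideline}\n"
        "Write a one-sentence recommendation explanation for the item below.\n"
        f"Item attributes: {item_attrs}"
    )
```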

Finally, the Agentic Judge Framework addresses the limitations of outcome-focused evaluation. Instead of a single LLM predicting a quality score, RecGPT-V2 employs a multi-agent evaluation system. A set of specialized sub-evaluators assesses the generated content across multiple dimensions (e.g., Relevance, Timeliness, Factuality). A Senior Reviewer Agent then aggregates these dimension-specific scores into a holistic judgment using a three-tier S-A-B scheme (Superior, Average, Bad). This process mirrors human cognitive evaluation and provides more nuanced, interpretable feedback. The figure below illustrates this multi-dimension sub-evaluator and three-tier judgment process for both item tag prediction and explanation generation.
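The following sketch illustrates the judge structure with the dimensions named in the text; the prompt wording, the `call_llm` argument, and the aggregation logic are assumptions for illustration only.

```python
# Minimal sketch of the Agentic Judge Framework: specialized sub-evaluators
# score one quality dimension each, and a Senior Reviewer aggregates their
# reports into a three-tier S/A/B verdict.
from typing import Callable

DIMENSIONS = ["Relevance", "Timeliness", "Factuality"]


def judge(call_llm: Callable[[str], str], content: str, user_context: str) -> str:
    # Each sub-evaluator reasons over a single quality dimension.
    dimension_reports = {}
    for dim in DIMENSIONS:
        dimension_reports[dim] = call_llm(
            f"Evaluate the {dim} of the following recommendation content for "
            "this user. Give a short rationale and a score in [0, 1].\n"
            f"User context: {user_context}\nContent: {content}"
        )

    # The Senior Reviewer aggregates dimension-level findings into S/A/B.
    verdict = call_llm(
        "You are a senior reviewer. Given the dimension-level reports below, "
        "output a single overall grade: S (Superior), A (Average), or B (Bad).\n"
        f"Reports: {dimension_reports}"
    )
    return verdict.strip()
```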

To enable continuous improvement, the system incorporates a Judge-as-a-Reward component. This distills the discrete S-A-B judgments from the Agent-as-a-Judge into a continuous, differentiable reward signal using a listwise learning-to-rank approach. This dense reward signal is then used to optimize the policy model via reinforcement learning, creating a self-reinforcing flywheel effect that aligns model behavior with human quality standards without requiring recurring human annotation.
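Below is a minimal sketch of how the discrete S/A/B grades could be distilled into a dense, listwise reward signal. The grade-to-score mapping and the ListNet-style KL objective are illustrative assumptions, not the authors' exact formulation.

```python
# Minimal sketch of Judge-as-a-Reward: S/A/B grades over a list of candidate
# generations define a target preference distribution, and a reward model is
# trained with a listwise (ListNet-style) objective so it emits a continuous,
# differentiable reward usable by the RL policy update.
import torch
import torch.nn.functional as F

GRADE_SCORE = {"S": 2.0, "A": 1.0, "B": 0.0}  # assumed ordinal mapping


def listwise_reward_loss(reward_scores: torch.Tensor, grades: list[str]) -> torch.Tensor:
    """Align the reward model's ranking with the judge's grades for one candidate list.

    reward_scores: (num_candidates,) scalar outputs of the reward model.
    grades:        S/A/B verdicts from the Agent-as-a-Judge for the same candidates.
    """
    target = torch.tensor([GRADE_SCORE[g] for g in grades], dtype=reward_scores.dtype)
    # Divergence between the top-one distributions induced by the judge grades
    # and by the reward model's scores (gradients match the ListNet cross-entropy).
    return F.kl_div(
        F.log_softmax(reward_scores, dim=0),
        F.softmax(target, dim=0),
        reduction="sum",
    )


# Usage: the trained reward model then scores RL rollouts (e.g., inside GRPO),
# replacing recurring human annotation with judge-derived dense rewards.
example_scores = torch.tensor([1.3, 0.2, -0.5], requires_grad=True)
loss = listwise_reward_loss(example_scores, ["S", "A", "B"])
loss.backward()
```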

Experiment

  • Conducted two-week online A/B test on Taobao's "Guess What You Like" scenario, comparing RecGPT-V2 against RecGPT-V1 with 1% traffic allocation per group across item and feed recommendation scenarios.
  • Validated significant improvements in short-term engagement: item scenario achieved +3.26% IPV, +3.01% CTR, +2.11% TV, +3.39% GMV, and +3.47% ATC; feed scenario showed +1.50% CTR and +1.53% GMV gains.
  • Confirmed enhanced long-term retention with +11.46% novelty exposure rate (NER) in item scenario and +4.49% in feed scenario, validating reduced filter bubble effects, alongside improved 14-day (+0.04%) and 30-day (+0.05%) retention rates.
  • Demonstrated a 60% reduction in GPU consumption while maintaining superior generation quality during large-scale deployment.

The authors deploy RecGPT-V2 in a two-week A/B test on Taobao, comparing it against RecGPT-V1 across item and feed recommendation scenarios. Results show consistent improvements in both short-term engagement and long-term retention, with the item scenario achieving +3.64% IPV and +11.46% NER, and the feed scenario showing +1.50% CTR and +4.49% NER, alongside modest but meaningful gains in 14- and 30-day retention.

The authors use two variants of RecGPT-V2 with different reward modeling approaches to evaluate tag prediction and explanation quality against the RecGPT-V1 baseline. Results show that RecGPT-V2 with list-wise reward modeling achieves the highest performance, improving HR@30 (Tag) to 32.60% and Quality (Explanation) to 40.73%, outperforming both the baseline and the point-wise variant.

The authors evaluate RecGPT-V2 against V1 across item tag prediction and explanation generation tasks, showing consistent improvements in both accuracy and F1 score for all tested models. For item tag prediction, Qwen3-SFT achieves the highest gains, with accuracy rising from 0.8210 to 0.8248 and F1 from 0.8095 to 0.8228. In explanation generation, Qwen3-SFT also leads, improving accuracy from 0.6885 to 0.7006 and F1 from 0.6787 to 0.7307, indicating enhanced generation quality in V2.

The authors use RecGPT-V2 to improve recommendation diversity and generation quality compared to RecGPT-V1, as shown in the table. Results show RecGPT-V2 achieves higher diversity (0.677 vs. 0.631) and quality (40.73% vs. 36.03%), indicating enhanced recommendation effectiveness and output reliability.

The authors compare HR@30 across model configurations, evaluating RecGPT-V1 against RecGPT-V2 variants including Base, SFT, GRPO (SUM), and GRPO (CRS). Results show that RecGPT-V2 with GRPO (CRS) achieves the highest HR@30 at 32.60%, outperforming RecGPT-V1 (26.29%) and all other V2 variants, indicating that the CRS-based reinforcement learning strategy is the most effective configuration among those tested.

