
RecGPT-V2 Technical Report

Abstract

Large language models (LLMs) have demonstrated remarkable potential for transforming recommendation systems, shifting them from implicit behavioral-pattern learning to explicit intent reasoning. Although RecGPT-V1 pioneered this paradigm by integrating LLM-based reasoning into user-interest mining and item-tag prediction, it exhibits four fundamental limitations: (1) computational inefficiency and cognitive redundancy across multiple reasoning routes; (2) insufficient diversity of explanations generated from fixed templates; (3) limited generalization under supervised training; and (4) overly simplistic, outcome-focused evaluation that fails to match human standards. To address these challenges, we present RecGPT-V2, built on four key innovations. First, a hierarchical multi-agent system restructures intent reasoning through coordinated collaboration, eliminating cognitive duplication while ensuring diverse intent coverage. Combined with hybrid representation inference, which compresses user-behavior contexts, our framework reduces GPU consumption by 60% and improves exclusive recall from 9.39% to 10.99%. Second, a meta-prompting framework dynamically generates context-aware prompts, increasing explanation diversity by +7.3%. Third, constrained reinforcement learning mitigates conflicts among multiple rewards, yielding a +24.1% improvement in tag prediction and +13.0% in explanation acceptance. Fourth, an Agent-as-a-Judge framework decomposes evaluation into multi-step reasoning, improving alignment with human preferences. Online A/B tests on Taobao show significant improvements: +2.98% CTR, +3.71% IPV, +2.19% TV, and +11.46% NER. RecGPT-V2 establishes both the technical feasibility and the commercial viability of deploying LLM-powered intent reasoning at scale, bridging the gap between cognitive exploration and industrial utility.

One-sentence Summary

The authors propose RecGPT-V2, which overcomes RecGPT-V1's limitations through a Hierarchical Multi-Agent System reducing GPU consumption by 60% while improving exclusive recall from 9.39% to 10.99%, Meta-Prompting for +7.3% explanation diversity, constrained reinforcement learning achieving +24.1% tag prediction gains, and Agent-as-a-Judge evaluation, demonstrating significant online improvements including +2.98% CTR and +3.71% IPV in Taobao deployments.

Key Contributions

  • RecGPT-V2 addresses computational inefficiency and redundant reasoning in multi-route architectures by introducing a Hierarchical Multi-Agent System that coordinates intent reasoning and Hybrid Representation Inference to compress user-behavior contexts, reducing GPU consumption by 60% and improving exclusive recall from 9.39% to 10.99%.
  • To overcome homogenized explanations and weak temporal adaptation from static templates, the framework implements Meta-Prompting for dynamically generating context-aware prompts and preference-aware reinforcement learning, boosting explanation diversity by 7.3% and capturing seasonal trends like Halloween and winter products in live deployments.
  • The Agent-as-a-Judge framework replaces simplistic outcome-only evaluation with multi-step reasoning for human-aligned quality assessment, while constrained reinforcement learning resolves conflicts among multiple rewards, yielding +24.1% in tag prediction and +13.0% in explanation acceptance offline, together with +2.98% CTR in online Taobao A/B tests.

Introduction

Recommendation systems require personalized explanations to boost user engagement with suggested items, but prior template-based approaches like RecGPT-V1 suffer from critical flaws. These include low information density with repetitive generic phrases, inability to adapt to seasonal trends or context, and monotonous stylistic outputs due to static prompt templates and insufficient evaluation frameworks. The authors address these by developing Meta-Prompting to dynamically synthesize context-aware templates and preference-aware reinforcement learning that optimizes generation through multi-reward modeling. Together, these innovations shift explanation generation from rigid templating to adaptive reasoning, significantly improving engagement and satisfaction.

Dataset

The report does not describe dataset composition, sources, subset details, or preprocessing; no dataset characteristics or data-usage methodology are specified.

Method

The authors leverage a comprehensive, multi-component architecture in RecGPT-V2 to overcome the computational inefficiency, cognitive redundancy, and evaluation limitations of its predecessor. The system is structured around three core innovations: Agentic Intent Reasoning, Dynamic Explanation Generation, and an Agentic Judge Framework, all operating on a compressed hybrid representation of user context.

The foundational layer of the system is the Hybrid Representation Inference, which addresses the token explosion inherent in processing long user behavior sequences. Instead of feeding raw text into the LLM, RecGPT-V2 employs Atomized Entity Compression. This technique encodes item descriptions and query histories into dense vector representations using pretrained embedding models (e.g., BGE, Qwen3-Embedding). These vectors are then projected into the LLM’s input space via a lightweight, trainable adaptor network, replacing multi-token descriptions with a single atomic token, denoted as [entity]. As shown in the figure below, this process reduces a 21,349-token user profile to a 5,158-token hybrid context, achieving a 76% token reduction while preserving user attributes and temporal metadata in natural language. This compression is critical for enabling efficient inference.
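To make the compression step concrete, the sketch below shows one plausible PyTorch realization of the adaptor: a frozen embedder (e.g., BGE) is assumed to have produced one vector per behavior, a small trainable network projects each vector into the LLM's hidden size, and the projected vectors overwrite the placeholder [entity] positions in the prompt embeddings. The module and helper names (`EntityAdaptor`, `splice_entities`) and all dimensions are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of Atomized Entity Compression (assumed implementation).
import torch
import torch.nn as nn

class EntityAdaptor(nn.Module):
    """Projects dense entity embeddings into the LLM input-embedding space."""
    def __init__(self, embed_dim: int = 1024, llm_hidden: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(embed_dim, llm_hidden),
            nn.GELU(),
            nn.Linear(llm_hidden, llm_hidden),
        )

    def forward(self, entity_vecs: torch.Tensor) -> torch.Tensor:
        # (num_entities, embed_dim) -> (num_entities, llm_hidden)
        return self.proj(entity_vecs)

def splice_entities(prompt_embeds: torch.Tensor,
                    entity_embeds: torch.Tensor,
                    entity_positions: torch.Tensor) -> torch.Tensor:
    """Overwrite the [entity] placeholder rows with projected entity vectors."""
    out = prompt_embeds.clone()
    out[entity_positions] = entity_embeds
    return out

# Hypothetical usage: 50 behaviors embedded by a frozen encoder, projected,
# then spliced into the token embeddings of the compressed hybrid context.
adaptor = EntityAdaptor()
behavior_vecs = torch.randn(50, 1024)     # stand-in for BGE / Qwen3-Embedding outputs
prompt_embeds = torch.randn(5158, 4096)   # token embeddings of the hybrid context
positions = torch.arange(10, 60)          # rows holding the [entity] placeholders
hybrid_inputs = splice_entities(prompt_embeds, adaptor(behavior_vecs), positions)
```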

Building upon this efficient representation, the Agentic Intent Reasoning module restructures the intent decomposition process. RecGPT-V1’s parallel, isolated LLM routes are replaced by a Hierarchical Multi-Agent System (HMAS) comprising a Global Planner, Distributed Experts, and a Decision Arbiter. The Global Planner receives the compressed hybrid context—comprising user behavior, profile, and real-time environmental signals (e.g., weather, trends)—and performs a single, holistic analysis to decompose the user’s intent into a set of specialized personas. This eliminates the redundant full-context encoding performed by each route in RecGPT-V1. Each persona is then assigned to a dedicated Expert agent, which operates under that persona to predict a set of item tags. The Decision Arbiter synthesizes the outputs from all experts, performing joint reasoning over the entire candidate pool to select the final, non-redundant set of tags for downstream retrieval. This coordinated, three-tier architecture is illustrated in the figure below, contrasting the isolated routes of RecGPT-V1 with the collaborative flow of RecGPT-V2.
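The following sketch illustrates the three-tier control flow described above: the planner encodes the compressed context once, each expert reasons only under its assigned persona, and the arbiter de-duplicates the pooled tags. The `call_llm` helper and all prompt wording are hypothetical stand-ins for the production agents.

```python
# Illustrative control flow for the Hierarchical Multi-Agent System (assumed).
from typing import Callable, List

def hierarchical_intent_reasoning(hybrid_context: str,
                                  call_llm: Callable[[str], str],
                                  num_personas: int = 4) -> List[str]:
    # 1) Global Planner: one holistic pass over the compressed context,
    #    decomposing the user's intent into specialized personas.
    personas = call_llm(
        f"Decompose this user's intent into {num_personas} distinct shopping "
        f"personas, one per line.\n{hybrid_context}"
    ).splitlines()

    # 2) Distributed Experts: each expert predicts item tags under one persona
    #    only, so no route re-encodes the whole context redundantly.
    candidate_tags: List[str] = []
    for persona in personas:
        tags = call_llm(
            f"Acting as the persona '{persona}', predict item tags this user "
            f"is likely to engage with, one per line."
        ).splitlines()
        candidate_tags.extend(tags)

    # 3) Decision Arbiter: joint reasoning over the full candidate pool to keep
    #    a diverse, non-redundant final tag set for downstream retrieval.
    final_tags = call_llm(
        "Merge the candidate tags below, removing duplicates and near-synonyms, "
        "and return the most promising ones, one per line:\n" + "\n".join(candidate_tags)
    ).splitlines()
    return final_tags
```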

To generate personalized and contextually adaptive explanations, RecGPT-V2 introduces a Meta-Prompting framework. This two-stage process first synthesizes a stylistic guideline based on user interests, item attributes, and situational signals. The guideline specifies the desired tone, rhetorical devices, and emotional resonance. In the second stage, the model generates the final explanation conditioned on this guideline, enabling role-playing across diverse stylistic personas. This approach moves beyond RecGPT-V1’s fixed templates, significantly improving explanation diversity.
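A minimal sketch of the two-stage generation is given below, assuming a generic `call_llm` helper; the actual guideline schema and prompt wording used in RecGPT-V2 are not specified in the report.

```python
# Sketch of two-stage Meta-Prompting (prompts and helper are assumptions).
def generate_explanation(user_interests: str, item_attrs: str,
                         situational_signals: str, call_llm) -> str:
    # Stage 1: synthesize a stylistic guideline (tone, rhetorical devices,
    # emotional resonance) conditioned on user, item, and situational context.
    guideline = call_llm(
        "Write a short stylistic guideline (tone, rhetorical devices, emotional "
        f"resonance) for a recommendation explanation.\nUser interests: {user_interests}\n"
        f"Item attributes: {item_attrs}\nContext: {situational_signals}"
    )
    # Stage 2: generate the final explanation while role-playing the persona
    # implied by the guideline.
    return call_llm(
        f"Following this guideline:\n{guideline}\n"
        "Write a one-sentence recommendation explanation for the item."
    )
```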

Finally, the Agentic Judge Framework addresses the limitations of outcome-focused evaluation. Instead of a single LLM predicting a quality score, RecGPT-V2 employs a multi-agent evaluation system. A set of specialized sub-evaluators assesses the generated content across multiple dimensions (e.g., Relevance, Timeliness, Factuality). A Senior Reviewer Agent then aggregates these dimension-specific scores into a holistic judgment using a three-tier S-A-B scheme (Superior, Average, Bad). This process mirrors human cognitive evaluation and provides more nuanced, interpretable feedback. The figure below illustrates this multi-dimension sub-evaluator and three-tier judgment process for both item tag prediction and explanation generation.
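The sketch below mirrors this two-level judging flow: one focused sub-evaluation per dimension, followed by a senior-reviewer call that maps the dimension scores to an S/A/B grade. The dimension names follow the text, while the 0-10 sub-scores and prompt phrasing are illustrative assumptions.

```python
# Sketch of the multi-dimension Agentic Judge (prompts and scales are assumed).
DIMENSIONS = ["Relevance", "Timeliness", "Factuality"]

def agentic_judge(content: str, user_context: str, call_llm) -> str:
    # Specialized sub-evaluators: one focused assessment per quality dimension.
    dim_scores = {}
    for dim in DIMENSIONS:
        verdict = call_llm(
            f"Rate the {dim} of this recommendation content for the user on a "
            f"0-10 scale; answer with a number only.\nUser: {user_context}\nContent: {content}"
        )
        dim_scores[dim] = float(verdict)

    # Senior Reviewer: aggregate dimension scores into a three-tier S-A-B grade.
    summary = ", ".join(f"{d}: {s}" for d, s in dim_scores.items())
    return call_llm(
        "Given these dimension scores, output a single overall grade: "
        "S (Superior), A (Average), or B (Bad).\n" + summary
    ).strip()
```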

To enable continuous improvement, the system incorporates a Judge-as-a-Reward component. This distills the discrete S-A-B judgments from the Agent-as-a-Judge into a continuous, differentiable reward signal using a listwise learning-to-rank approach. This dense reward signal is then used to optimize the policy model via reinforcement learning, creating a self-reinforcing flywheel effect that aligns model behavior with human quality standards without requiring recurring human annotation.
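One way to realize such a listwise distillation, sketched below under the assumption of a ListNet-style objective (the report does not give the exact formulation), is to map S/A/B grades to target utilities and train a reward head so that its score distribution over a request's candidates matches the grade-induced distribution; the resulting continuous scores then serve as dense rewards during reinforcement learning.

```python
# Sketch of distilling discrete S-A-B judgments into a dense, differentiable
# reward via a listwise learning-to-rank loss (assumed ListNet-style objective).
import torch
import torch.nn.functional as F

GRADE_UTILITY = {"S": 2.0, "A": 1.0, "B": 0.0}  # assumed grade-to-utility mapping

def listwise_distill_loss(pred_scores: torch.Tensor, grades: list) -> torch.Tensor:
    """pred_scores: (n,) reward-model scores for n candidates of one request;
    grades: the judge's S/A/B label for each candidate."""
    target = torch.tensor([GRADE_UTILITY[g] for g in grades])
    # Match the score distribution to the grade-induced distribution (ListNet).
    return F.kl_div(F.log_softmax(pred_scores, dim=0),
                    F.softmax(target, dim=0), reduction="batchmean")

# Usage: scores produced by a small reward head over candidate explanations.
scores = torch.tensor([1.3, 0.2, -0.5], requires_grad=True)
loss = listwise_distill_loss(scores, ["S", "A", "B"])
loss.backward()  # dense, differentiable signal usable as an RL reward shaper
```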

Experiment

  • Conducted two-week online A/B test on Taobao's "Guess What You Like" scenario, comparing RecGPT-V2 against RecGPT-V1 with 1% traffic allocation per group across item and feed recommendation scenarios.
  • Validated significant improvements in short-term engagement: item scenario achieved +3.26% IPV, +3.01% CTR, +2.11% TV, +3.39% GMV, and +3.47% ATC; feed scenario showed +1.50% CTR and +1.53% GMV gains.
  • Confirmed enhanced long-term retention with +11.46% novelty exposure rate (NER) in item scenario and +4.49% in feed scenario, validating reduced filter bubble effects, alongside improved 14-day (+0.04%) and 30-day (+0.05%) retention rates.
  • Demonstrated a 60% reduction in GPU consumption while maintaining superior generation quality during large-scale deployment.

In the two-week Taobao A/B test comparing RecGPT-V2 against RecGPT-V1, results show consistent improvements in both short-term engagement and long-term retention: the item scenario reaches +3.64% IPV and +11.46% NER, the feed scenario +1.50% CTR and +4.49% NER, alongside modest but meaningful gains in 14- and 30-day retention.

The authors use two variants of RecGPT-V2 with different reward modeling approaches to evaluate tag prediction and explanation quality against the RecGPT-V1 baseline. Results show that RecGPT-V2 with list-wise reward modeling achieves the highest performance, improving HR@30 (Tag) to 32.60% and Quality (Explanation) to 40.73%, outperforming both the baseline and the point-wise variant.

The authors evaluate RecGPT-V2 against V1 across item tag prediction and explanation generation tasks, showing consistent improvements in both accuracy and F1 score for all tested models. For item tag prediction, Qwen3-SFT achieves the highest gains, with accuracy rising from 0.8210 to 0.8248 and F1 from 0.8095 to 0.8228. In explanation generation, Qwen3-SFT also leads, improving accuracy from 0.6885 to 0.7006 and F1 from 0.6787 to 0.7307, indicating enhanced generation quality in V2.

The authors use RecGPT-V2 to improve recommendation diversity and generation quality compared to RecGPT-V1, as shown in the table. Results show RecGPT-V2 achieves higher diversity (0.677 vs. 0.631) and quality (40.73% vs. 36.03%), indicating enhanced recommendation effectiveness and output reliability.

The authors use HR@30 to compare tag-prediction performance across model configurations, contrasting RecGPT-V1 with RecGPT-V2 variants including Base, SFT, GRPO (SUM), and GRPO (CRS). RecGPT-V2 with GRPO (CRS) achieves the highest HR@30 at 32.60%, outperforming RecGPT-V1 (26.29%) and all other V2 variants, indicating that the CRS-based reinforcement learning strategy yields the strongest tag-prediction quality.

