Technical Report on RecGPT-V2
Abstract
Large language models (LLMs) have shown remarkable potential for transforming recommender systems from implicit behavioral pattern recognition to explicit intent reasoning. Although RecGPT-V1 pioneered this new paradigm by integrating LLM-based reasoning into user-interest mining and item tag prediction, it suffers from four fundamental limitations: (1) computational inefficiency and cognitive redundancy across multiple reasoning routes; (2) limited explanation diversity under a static generation paradigm; (3) weak generalization in models trained with supervised learning; and (4) simplistic, outcome-only evaluation that is misaligned with human standards. To overcome these challenges, we introduce RecGPT-V2 with four key innovations. First, a Hierarchical Multi-Agent System restructures intent reasoning through coordinated collaboration, eliminating cognitive redundancy and enabling diverse intent coverage. Combined with Hybrid Representation Inference, which compresses user-behavior contexts, the framework reduces GPU consumption by 60% and improves exclusive recall from 9.39% to 10.99%. Second, a Meta-Prompting framework generates context-adaptive prompts, improving explanation diversity by +7.3%. Third, constrained reinforcement learning mitigates conflicts among multiple rewards, achieving +24.1% improvement in tag prediction and +13.0% in explanation acceptance. Fourth, an Agent-as-a-Judge framework decomposes evaluation into multi-step reasoning, strengthening alignment with human preferences. Online A/B tests on the Taobao platform demonstrate substantial improvements: +2.98% CTR, +3.71% IPV, +2.19% TV, and +11.46% NER. RecGPT-V2 establishes both the technical and commercial viability of deploying LLM-based reasoning at industrial scale, closing the gap between cognitive exploration and industrial value.
One-sentence Summary
The authors propose RecGPT-V2, which overcomes RecGPT-V1's limitations through a Hierarchical Multi-Agent System reducing GPU consumption by 60% while improving exclusive recall from 9.39% to 10.99%, Meta-Prompting for +7.3% explanation diversity, constrained reinforcement learning achieving +24.1% tag prediction gains, and Agent-as-a-Judge evaluation, demonstrating significant online improvements including +2.98% CTR and +3.71% IPV in Taobao deployments.
Key Contributions
- RecGPT-V2 addresses computational inefficiency and redundant reasoning in multi-route architectures by introducing a Hierarchical Multi-Agent System that coordinates intent reasoning and Hybrid Representation Inference to compress user-behavior contexts, reducing GPU consumption by 60% and improving exclusive recall from 9.39% to 10.99%.
- To overcome homogenized explanations and weak temporal adaptation from static templates, the framework implements Meta-Prompting to dynamically generate context-aware prompts, together with preference-aware reinforcement learning, boosting explanation diversity by 7.3% and capturing seasonal trends (e.g., Halloween and winter products) in live deployments.
- The Agent-as-a-Judge framework replaces simplistic evaluation with multi-step reasoning for human-aligned quality assessment, resolving reward conflicts through constrained reinforcement learning and achieving significant online improvements including +24.1% tag prediction accuracy, +13.0% explanation acceptance, and +2.98% CTR in Taobao A/B tests.
Introduction
Recommendation systems require personalized explanations to boost user engagement with suggested items, but prior template-based approaches like RecGPT-V1 suffer from critical flaws. These include low information density with repetitive generic phrases, inability to adapt to seasonal trends or context, and monotonous stylistic outputs due to static prompt templates and insufficient evaluation frameworks. The authors address these by developing Meta-Prompting to dynamically synthesize context-aware templates and preference-aware reinforcement learning that optimizes generation through multi-reward modeling. Together, these innovations shift explanation generation from rigid templating to adaptive reasoning, significantly improving engagement and satisfaction.
Method
The authors leverage a comprehensive, multi-component architecture in RecGPT-V2 to overcome the computational inefficiency, cognitive redundancy, and evaluation limitations of its predecessor. The system is structured around three core innovations: Agentic Intent Reasoning, Dynamic Explanation Generation, and an Agentic Judge Framework, all operating on a compressed hybrid representation of user context.
The foundational layer of the system is the Hybrid Representation Inference, which addresses the token explosion inherent in processing long user behavior sequences. Instead of feeding raw text into the LLM, RecGPT-V2 employs Atomized Entity Compression. This technique encodes item descriptions and query histories into dense vector representations using pretrained embedding models (e.g., BGE, Qwen3-Embedding). These vectors are then projected into the LLM’s input space via a lightweight, trainable adaptor network, replacing multi-token descriptions with a single atomic token, denoted as [entity]. As shown in the figure below, this process reduces a 21,349-token user profile to a 5,158-token hybrid context, achieving a 76% token reduction while preserving user attributes and temporal metadata in natural language. This compression is critical for enabling efficient inference.

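To make the compression mechanism concrete, below is a minimal PyTorch sketch of the atomized-entity pathway: a frozen text embedding is projected through a small trainable adaptor into the LLM's input-embedding space and spliced in at the position of an [entity] placeholder. The class and function names (EntityAdaptor, build_hybrid_context), the dimensions, and the adaptor architecture are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class EntityAdaptor(nn.Module):
    """Hypothetical lightweight adaptor: maps a frozen text embedding
    (e.g., from BGE or Qwen3-Embedding) into the LLM's input-embedding space,
    so each item description or query is carried by a single atomic [entity] token."""

    def __init__(self, embed_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(embed_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, entity_embeddings: torch.Tensor) -> torch.Tensor:
        # (num_entities, embed_dim) -> (num_entities, llm_dim)
        return self.proj(entity_embeddings)


def build_hybrid_context(token_embeds: torch.Tensor,
                         entity_embeds: torch.Tensor,
                         entity_positions: torch.Tensor,
                         adaptor: EntityAdaptor) -> torch.Tensor:
    """Splice projected entity vectors into the token-embedding sequence at the
    indices reserved for [entity] placeholders, yielding a hybrid context that mixes
    natural-language tokens (user attributes, timestamps) with compressed entities."""
    hybrid = token_embeds.clone()                     # (seq_len, llm_dim)
    hybrid[entity_positions] = adaptor(entity_embeds) # one vector per entity placeholder
    return hybrid
```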
Building upon this efficient representation, the Agentic Intent Reasoning module restructures the intent decomposition process. RecGPT-V1’s parallel, isolated LLM routes are replaced by a Hierarchical Multi-Agent System (HMAS) comprising a Global Planner, Distributed Experts, and a Decision Arbiter. The Global Planner receives the compressed hybrid context—comprising user behavior, profile, and real-time environmental signals (e.g., weather, trends)—and performs a single, holistic analysis to decompose the user’s intent into a set of specialized personas. This eliminates the redundant full-context encoding performed by each route in RecGPT-V1. Each persona is then assigned to a dedicated Expert agent, which operates under that persona to predict a set of item tags. The Decision Arbiter synthesizes the outputs from all experts, performing joint reasoning over the entire candidate pool to select the final, non-redundant set of tags for downstream retrieval. This coordinated, three-tier architecture is illustrated in the figure below, contrasting the isolated routes of RecGPT-V1 with the collaborative flow of RecGPT-V2.

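A minimal sketch of the three-tier orchestration is given below, assuming a generic llm_call stub in place of the actual serving stack; the prompt wording, the persona-per-line parsing, and the number of personas are purely illustrative.

```python
def llm_call(prompt: str) -> str:
    """Placeholder for an LLM inference call; any chat-completion backend could serve here."""
    raise NotImplementedError

def hierarchical_intent_reasoning(hybrid_context: str, env_signals: str, num_personas: int = 4) -> str:
    # 1. Global Planner: a single holistic pass over the compressed context decomposes
    #    the user's intent into specialized personas (no per-route re-encoding).
    planner_out = llm_call(
        f"Context:\n{hybrid_context}\nSignals:\n{env_signals}\n"
        f"Decompose the user's intent into {num_personas} distinct shopping personas, one per line."
    )
    personas = [p.strip() for p in planner_out.split("\n") if p.strip()]

    # 2. Distributed Experts: each persona-conditioned expert predicts candidate item tags.
    expert_tags = [
        llm_call(f"Acting as persona '{p}', predict item tags this user would engage with.")
        for p in personas
    ]

    # 3. Decision Arbiter: joint reasoning over the pooled candidates selects a final,
    #    non-redundant tag set for downstream retrieval.
    return llm_call(
        "Candidate tag sets:\n" + "\n".join(expert_tags) +
        "\nSelect a deduplicated, complementary final tag set."
    )
```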
To generate personalized and contextually adaptive explanations, RecGPT-V2 introduces a Meta-Prompting framework. This two-stage process first synthesizes a stylistic guideline based on user interests, item attributes, and situational signals. The guideline specifies the desired tone, rhetorical devices, and emotional resonance. In the second stage, the model generates the final explanation conditioned on this guideline, enabling role-playing across diverse stylistic personas. This approach moves beyond RecGPT-V1’s fixed templates, significantly improving explanation diversity.
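The two-stage generation can be sketched as follows, reusing the hypothetical llm_call stub from the sketch above; the prompt phrasing is an assumption rather than the paper's actual templates.

```python
def generate_explanation(user_interests: str, item_attrs: str, situational_signals: str) -> str:
    # Stage 1: synthesize a stylistic guideline (tone, rhetorical devices,
    # emotional resonance) from user, item, and situational context.
    guideline = llm_call(
        "Given the user's interests, the item's attributes, and the current context, "
        "write a short stylistic guideline (tone, rhetoric, emotional hook) for a "
        f"recommendation explanation.\nInterests: {user_interests}\n"
        f"Item: {item_attrs}\nContext: {situational_signals}"
    )
    # Stage 2: generate the explanation conditioned on the synthesized guideline,
    # letting the model role-play the prescribed stylistic persona.
    return llm_call(
        f"Following this guideline:\n{guideline}\n"
        f"Write a personalized explanation for recommending this item:\n{item_attrs}"
    )
```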
Finally, the Agentic Judge Framework addresses the limitations of outcome-focused evaluation. Instead of a single LLM predicting a quality score, RecGPT-V2 employs a multi-agent evaluation system. A set of specialized sub-evaluators assesses the generated content across multiple dimensions (e.g., Relevance, Timeliness, Factuality). A Senior Reviewer Agent then aggregates these dimension-specific scores into a holistic judgment using a three-tier S-A-B scheme (Superior, Average, Bad). This process mirrors human cognitive evaluation and provides more nuanced, interpretable feedback. The figure below illustrates this multi-dimension sub-evaluator and three-tier judgment process for both item tag prediction and explanation generation.

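A simplified sketch of this evaluation flow is shown below, again using the hypothetical llm_call stub; the dimension list and prompt phrasing are illustrative and not the paper's exact rubric.

```python
DIMENSIONS = ["Relevance", "Timeliness", "Factuality"]  # illustrative subset of dimensions

def agentic_judge(generated_content: str, user_context: str) -> str:
    # Each specialized sub-evaluator scores one quality dimension and explains its reasoning.
    dimension_reports = {
        dim: llm_call(
            f"As a {dim} evaluator, assess the following recommendation content for this user.\n"
            f"User: {user_context}\nContent: {generated_content}\n"
            "Give a brief rationale and a grade of S, A, or B."
        )
        for dim in DIMENSIONS
    }
    # A Senior Reviewer agent aggregates the dimension-level reports into a single
    # holistic S (Superior) / A (Average) / B (Bad) judgment.
    return llm_call(
        "Dimension reports:\n" +
        "\n".join(f"{d}: {r}" for d, r in dimension_reports.items()) +
        "\nAggregate these into one overall grade (S, A, or B) with a one-line justification."
    )
```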
To enable continuous improvement, the system incorporates a Judge-as-a-Reward component. This distills the discrete S-A-B judgments from the Agent-as-a-Judge into a continuous, differentiable reward signal using a listwise learning-to-rank approach. This dense reward signal is then used to optimize the policy model via reinforcement learning, creating a self-reinforcing flywheel effect that aligns model behavior with human quality standards without requiring recurring human annotation.
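One plausible way to realize this distillation is a ListNet-style listwise objective over a group of graded candidate generations, sketched below in PyTorch. The grade-to-value mapping and the exact loss form are assumptions consistent with the listwise learning-to-rank description, not the authors' formulation.

```python
import torch
import torch.nn.functional as F

GRADE_VALUE = {"S": 2.0, "A": 1.0, "B": 0.0}  # ordinal targets for the discrete judgments

def listwise_reward_loss(reward_scores: torch.Tensor, grades: list) -> torch.Tensor:
    """ListNet-style listwise loss: pushes the reward model's continuous scores for a
    group of candidate generations toward the ranking implied by the S/A/B grades,
    yielding a dense, differentiable signal for downstream policy optimization."""
    targets = torch.tensor([GRADE_VALUE[g] for g in grades])
    target_dist = F.softmax(targets, dim=0)           # preference distribution from grades
    model_log_dist = F.log_softmax(reward_scores, dim=0)  # distribution implied by reward scores
    return -(target_dist * model_log_dist).sum()      # cross-entropy between the two

# Usage: reward_scores would come from a trainable reward head over (context, generation) pairs.
scores = torch.tensor([1.3, 0.2, -0.5], requires_grad=True)
loss = listwise_reward_loss(scores, ["S", "A", "B"])
loss.backward()
```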
Experiment
- Conducted two-week online A/B test on Taobao's "Guess What You Like" scenario, comparing RecGPT-V2 against RecGPT-V1 with 1% traffic allocation per group across item and feed recommendation scenarios.
- Validated significant improvements in short-term engagement: item scenario achieved +3.26% IPV, +3.01% CTR, +2.11% TV, +3.39% GMV, and +3.47% ATC; feed scenario showed +1.50% CTR and +1.53% GMV gains.
- Confirmed enhanced long-term retention with +11.46% novelty exposure rate (NER) in item scenario and +4.49% in feed scenario, validating reduced filter bubble effects, alongside improved 14-day (+0.04%) and 30-day (+0.05%) retention rates.
- Demonstrated 60% GPU consumption reduction while maintaining superior generation quality during large-scale deployment, with overall online gains including +11.46% NER.
The authors use RecGPT-V2 in a two-week A/B test on Taobao, comparing it against RecGPT-V1 across item and feed recommendation scenarios. Results show consistent improvements in both short-term engagement and long-term retention, with the item scenario achieving +3.64% IPV and +11.46% NER, while the feed scenario shows +1.50% CTR and +4.49% NER, alongside modest but meaningful gains in 14- and 30-day retention.

The authors use two variants of RecGPT-V2 with different reward modeling approaches to evaluate tag prediction and explanation quality against the RecGPT-V1 baseline. Results show that RecGPT-V2 with list-wise reward modeling achieves the highest performance, improving HR@30 (Tag) to 32.60% and Quality (Explanation) to 40.73%, outperforming both the baseline and the point-wise variant.

The authors evaluate RecGPT-V2 against V1 across item tag prediction and explanation generation tasks, showing consistent improvements in both accuracy and F1 score for all tested models. For item tag prediction, Qwen3-SFT achieves the highest gains, with accuracy rising from 0.8210 to 0.8248 and F1 from 0.8095 to 0.8228. In explanation generation, Qwen3-SFT also leads, improving accuracy from 0.6885 to 0.7006 and F1 from 0.6787 to 0.7307, indicating enhanced generation quality in V2.

The authors use RecGPT-V2 to improve recommendation diversity and generation quality compared to RecGPT-V1, as shown in the table. Results show RecGPT-V2 achieves higher diversity (0.677 vs. 0.631) and quality (40.73% vs. 36.03%), indicating enhanced recommendation effectiveness and output reliability.

The authors use HR@30 to compare performance across model configurations, contrasting RecGPT-V1 with RecGPT-V2 variants including Base, SFT, GRPO (SUM), and GRPO (CRS). Results show that RecGPT-V2 with GRPO (CRS) achieves the highest HR@30 at 32.60%, outperforming RecGPT-V1 (26.29%) and all other V2 variants, indicating that the CRS-based reinforcement learning strategy is the most effective.
