Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation
Abstract
Reinforcement learning (RL), already proven effective in large language and multimodal models, has recently been extended successfully to improve 2D image generation. However, its application to 3D generation remains largely underexplored, owing to the greater spatial complexity of 3D objects, which demand globally consistent geometry together with high-resolution local textures. This makes 3D generation particularly sensitive to reward design and to the choice of RL algorithm. To address these challenges, we conduct the first systematic study of RL for autoregressive text-to-3D generation along several dimensions. (1) Reward design: We evaluate different reward dimensions and model choices, showing that alignment with human preferences is crucial and that general-purpose multimodal models provide a robust signal for 3D attributes. (2) RL algorithms: We study GRPO variants, highlighting the effectiveness of token-level optimization, and further explore the impact of training data scale and the number of iterations. (3) Datasets for 3D generation: Since existing benchmarks fail to measure the implicit reasoning capabilities of 3D generation models, we introduce MME-3DR. (4) Advanced RL paradigms: Inspired by the natural hierarchy of 3D generation, we propose Hi-GRPO, a framework that optimizes hierarchical (global-to-local) 3D generation via dedicated reward ensembles. Building on these findings, we develop AR3D-R1, the first RL-enhanced text-to-3D generation model, adept at progressively refining from coarse shape to fine texture. We hope this study offers insights into RL-driven reasoning for 3D generation. The source code is available at https://github.com/Ivan-Tang-3D/3DGen-R1.
One-sentence Summary
Researchers from Northwestern Polytechnical University, Peking University, and The Hong Kong University of Science and Technology propose AR3D-R1, the first reinforcement learning-enhanced text-to-3D autoregressive model, introducing Hi-GRPO for hierarchical global-to-local optimization and MME-3DR as a new benchmark, significantly advancing 3D generation through improved reward design and token-level RL strategies.
Key Contributions
- The paper identifies the challenge of applying reinforcement learning (RL) to text-to-3D generation due to 3D objects' high spatial complexity and the need for global consistency and local detail, and conducts the first systematic study on RL for autoregressive 3D generation, evaluating reward designs, RL algorithms, and benchmarking needs.
- It introduces Hi-GRPO, a hierarchical RL framework that leverages dedicated reward ensembles to optimize 3D generation in a coarse-to-fine manner, and develops AR3D-R1, the first RL-enhanced text-to-3D model, which improves generation by guiding shape and texture refinement through token-level optimization.
- To address the lack of reasoning-focused evaluation, the authors propose MME-3DR, a new benchmark with 249 annotated 3D objects across five reasoning-intensive categories, and validate their approach on Toys4K, showing significant improvements in both generation quality and implicit reasoning after RL training.
Introduction
Reinforcement learning (RL) has proven effective in enhancing reasoning and generation in large language and 2D image models, but its application to text-to-3D generation remains underexplored due to the increased spatial complexity and need for globally consistent geometry and fine-grained textures in 3D objects. Prior 3D generation methods rely on pre-training or fine-tuning, lacking the reasoning-driven refinement seen in 2D RL-enhanced models, while existing benchmarks fail to assess models’ implicit reasoning capabilities. The authors conduct the first systematic investigation of RL in text-to-3D autoregressive generation, evaluating reward models, RL algorithms, and training dynamics, and introduce MME-3DR—a new benchmark targeting five reasoning-intensive categories. They propose Hi-GRPO, a hierarchical RL framework that optimizes coarse-to-fine 3D generation using dedicated reward ensembles, and develop AR3D-R1, the first RL-enhanced text-to-3D model, which achieves state-of-the-art performance by improving both structural coherence and texture fidelity.
Dataset
- The authors use a combination of four datasets: Objaverse-XL, HSSD, ABO, and Toys4K, with the first three used for training and the last for evaluation.
- Objaverse-XL serves as a primary training source, containing over 10 million 3D objects collected from platforms such as GitHub, Thingiverse, Sketchfab, and Polycam. It undergoes strict deduplication and rendering validation to ensure quality and diversity across categories and attributes.
- HSSD contributes approximately 18,656 real-world object models from 211 high-quality synthetic indoor scenes, with emphasis on indoor layouts, semantic structure, and object relationships.
- ABO provides around 8,000 3D models with rich annotations, including material properties, geometry, and attributes, drawn from nearly 147,000 product listings and 400,000 catalog images of household items.
- Toys4K is used exclusively for evaluation and contains about 4,000 3D object instances across 105 categories, offering diverse shapes and significant variation in form.
- During training, prompts are sampled from Objaverse-XL, HSSD, and ABO and combined into a mixture fed to the ShapeLLM-Omni base model. Training runs for 1,200 steps with a batch size of 1 per GPU across 8 devices, using gradient accumulation over 2 steps to simulate a larger batch.
- The model is trained with a learning rate of 1e-6, a KL coefficient beta of 0.01, and a group size of 8, with a configurable loss weight λ set to 1.0 for supervising global planning via final output quality (see the configuration sketch after this list).
- Reward models are served through the vLLM API framework, supporting the training process; no cropping or explicit metadata construction beyond the original dataset curation is mentioned.
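For reference, the reported hyperparameters can be collected into a single training configuration. This is a minimal sketch; the dataclass and its field names are illustrative and are not taken from the released code.

```python
from dataclasses import dataclass

@dataclass
class HiGRPOTrainConfig:
    """Hyperparameters reported for AR3D-R1 training (field names are illustrative)."""
    base_model: str = "ShapeLLM-Omni"
    train_datasets: tuple = ("Objaverse-XL", "HSSD", "ABO")
    eval_dataset: str = "Toys4K"
    total_steps: int = 1200              # total optimization steps
    per_gpu_batch_size: int = 1          # batch size per GPU
    num_gpus: int = 8                    # training devices
    grad_accum_steps: int = 2            # effective batch = 1 * 8 * 2 = 16
    learning_rate: float = 1e-6
    kl_beta: float = 0.01                # KL regularization coefficient
    group_size: int = 8                  # rollouts per prompt for GRPO grouping
    lambda_final_to_planning: float = 1.0  # weight of the Step 2 reward on Step 1

config = HiGRPOTrainConfig()
```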
Method
The authors leverage a hierarchical reinforcement learning paradigm, Hi-GRPO, to decompose text-to-3D generation into two distinct, sequential stages: global geometric planning followed by local appearance refinement. This architecture explicitly models the coarse-to-fine nature of human 3D perception and generation, enabling the model to first establish structural integrity before enriching surface details.
In Step 1, the model receives a 3D textual prompt and a high-level semantic instruction. It generates a sequence of semantic reasoning tokens $s_i = \{s_{i,1}, \ldots, s_{i,|s_i|}\}$ that encode global planning directives, such as object subcategory identification, spatial layout of components, and disambiguation of vague terms. These tokens, along with the original prompt and a mesh start token, are fed into the 3D autoregressive model to generate a sequence of coarse 3D tokens $t_i = \{t_{i,1}, \ldots, t_{i,M}\}$, where $M$ is the number of compressed grid cells. These tokens are decoded via a VQVAE decoder into a triangular mesh $\mathcal{M}_i^{(1)}$, representing the initial geometric structure. As shown in the framework diagram, this step ensures the output adheres to global constraints, such as balanced proportions and correct component placement, before any fine details are considered.
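The data flow of Step 1 can be sketched as follows. The helper names (`generate_reasoning`, `generate_3d_tokens`, `decode`) are hypothetical stand-ins for the model's actual interfaces; this is an illustrative sketch, not the released implementation.

```python
def step1_global_planning(model, vqvae, prompt, high_level_instruction):
    """Step 1: semantic reasoning -> coarse 3D tokens -> coarse mesh (illustrative)."""
    # Autoregressively generate semantic reasoning tokens s_i: subcategory,
    # component layout, and disambiguation of vague terms in the prompt.
    s_tokens = model.generate_reasoning(prompt, high_level_instruction)

    # Condition on the prompt, the reasoning, and a mesh start token to
    # produce M coarse 3D tokens t_i over the compressed grid.
    t_tokens = model.generate_3d_tokens(prompt, s_tokens, start_token="<mesh>")

    # Decode the discrete 3D tokens into the coarse triangular mesh M_i^(1).
    coarse_mesh = vqvae.decode(t_tokens)
    return s_tokens, t_tokens, coarse_mesh
```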

Step 2 conditions on the original prompt, the previously generated semantic reasoning, and a low-level visual instruction. The model then produces visual reasoning tokens $v_i = \{v_{i,1}, \ldots, v_{i,|v_i|}\}$ that focus on local attributes: detailed textures, component interactions, symmetry, and element counts. These tokens guide the generation of refined 3D tokens $o_i = \{o_{i,1}, \ldots, o_{i,M}\}$, which are decoded into the final mesh $\mathcal{M}_i^{(2)}$. This stage explicitly refines surface properties such as color gradients, material realism, and fine geometric detail, as illustrated in the qualitative results for both mechanical and organic objects.
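Step 2 mirrors Step 1 but conditions on the earlier semantic reasoning; the same caveat applies: the method names below are hypothetical and only sketch the data flow.

```python
def step2_local_refinement(model, vqvae, prompt, s_tokens):
    """Step 2: visual reasoning -> refined 3D tokens -> final mesh (illustrative)."""
    # Visual reasoning tokens v_i target local attributes: textures,
    # component interactions, symmetry, and element counts.
    v_tokens = model.generate_reasoning(prompt, s_tokens, instruction="low-level visual")

    # Refined 3D tokens o_i share the same grid length M as the coarse tokens.
    o_tokens = model.generate_3d_tokens(prompt, s_tokens, v_tokens, start_token="<mesh>")

    # Decode into the final mesh M_i^(2) with refined surface detail.
    final_mesh = vqvae.decode(o_tokens)
    return v_tokens, o_tokens, final_mesh
```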
The training process employs a multi-expert reward ensemble designed to evaluate different aspects of 3D quality across both steps. For Step 1, rewards focus on global alignment and geometric consistency, while Step 2 emphasizes local refinement, appearance coherence, and component completeness. The ensemble includes a Human Preference Model (HPM), a Unified Reward Model for prompt alignment and aesthetic scoring, a 2D Large Multi-modal Model (Qwen2.5-VL) for multi-view consistency, and a 3D Large Multi-modal Model (ShapeLLM) for direct component detection from point clouds. Each reward is dimension-normalized to ensure balanced contribution, and the final reward from Step 2 is backpropagated to Step 1 via a configurable weight λ, allowing the final output quality to supervise the initial global planning.
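A minimal sketch of how such a reward ensemble could be aggregated, assuming each expert returns one scalar per rollout. The per-expert standardization used here is one plausible reading of "dimension-normalized"; the key names and function are illustrative, not the paper's exact formulation.

```python
import numpy as np

def aggregate_rewards(expert_scores: dict[str, np.ndarray], lam: float = 1.0):
    """Combine per-expert rewards for both steps (illustrative sketch).

    expert_scores maps keys such as "step1/hpm", "step1/unified",
    "step2/lmm2d", "step2/lmm3d" to arrays of shape (group_size,).
    """
    def normalize(x):
        # Per-expert normalization so no single reward dimension dominates.
        return (x - x.mean()) / (x.std() + 1e-8)

    step1 = sum(normalize(v) for k, v in expert_scores.items() if k.startswith("step1/"))
    step2 = sum(normalize(v) for k, v in expert_scores.items() if k.startswith("step2/"))

    # The final-quality reward from Step 2 also supervises Step 1 planning,
    # weighted by the configurable lambda (set to 1.0 in the reported setup).
    r_step1 = step1 + lam * step2
    r_step2 = step2
    return r_step1, r_step2
```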

The policy optimization follows a modified GRPO algorithm. For each step, the model computes token-level log probabilities for both reasoning and mesh tokens, and the loss is computed independently per step using a clipped surrogate objective with asymmetric clipping thresholds to encourage exploration. A KL regularization term with coefficient $\beta = 0.01$ prevents excessive deviation from a reference policy. The total loss is the sum of the losses from both steps: $\mathcal{L}_{\text{total}} = \mathcal{L}^{(1)} + \mathcal{L}^{(2)}$. Advantages are normalized within prompt groups to mitigate reward-scale variance across different prompts, ensuring stable training dynamics.
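As a concrete illustration of this objective, here is a minimal per-step sketch. The asymmetric clip thresholds are illustrative values in the spirit of decoupled clipping, and the KL term uses the standard GRPO-style estimator against a frozen reference policy; none of the exact values beyond β = 0.01 are taken from the paper.

```python
import torch

def grpo_step_loss(logp, logp_old, logp_ref, rewards,
                   clip_low=0.2, clip_high=0.28, beta=0.01):
    """Clipped surrogate with group-normalized advantages and KL regularization.

    logp, logp_old, logp_ref: per-token log-probs, shape (group_size, seq_len)
    rewards: per-rollout scalar rewards, shape (group_size,)
    """
    # Advantages are normalized within the prompt group to reduce reward-scale variance.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    adv = adv.unsqueeze(-1)                       # broadcast over tokens

    ratio = torch.exp(logp - logp_old)            # per-token importance ratio
    clipped = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high)  # asymmetric clip
    surrogate = torch.min(ratio * adv, clipped * adv)

    # Unbiased KL estimator against the frozen reference policy.
    kl = torch.exp(logp_ref - logp) - (logp_ref - logp) - 1.0

    # Token-level aggregation: average over every token in the group.
    return -(surrogate - beta * kl).mean()
```

The total loss then sums this per-step objective over both hierarchical steps, matching $\mathcal{L}_{\text{total}} = \mathcal{L}^{(1)} + \mathcal{L}^{(2)}$ above.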
Experiment
The authors use step-specific reward functions in a hierarchical RL framework, applying distinct rewards for coarse geometry (Step 1) and fine texture (Step 2). Results show that combining both step rewards yields the highest CLIP Score (29.3) and lowest KD_incep (0.156), indicating superior alignment and structural coherence. Step 1 rewards alone improve geometry but degrade texture fidelity, confirming the necessity of hierarchical reward design.
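For context, the CLIP Score here measures alignment between the prompt and renders of the generated mesh. Below is a minimal sketch of how such a score is typically computed over multiple rendered views using the Hugging Face transformers CLIP; the paper's exact checkpoint, rendering setup, and scaling are not specified here.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(rendered_views, prompt):
    """Average cosine similarity between the prompt and rendered views (PIL images)."""
    inputs = processor(text=[prompt], images=rendered_views,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    # Scale by 100 and average over views, as is common for CLIP Score reporting.
    return (100.0 * (img_emb @ txt_emb.T)).mean().item()
```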

The authors use textual reasoning as a prior step before 3D token generation, finding that this approach improves CLIP Score from 22.7 in the base model to 24.0 when reasoning is applied, outperforming the version without reasoning (23.4). Results show that incorporating textual reasoning enhances the model’s ability to generate more semantically aligned and visually coherent 3D objects under reinforcement learning.

The authors evaluate GRPO variants for 3D autoregressive generation, finding that combining Dynamic Sampling, Token-level Loss Aggregation, and Decoupled Clip yields the highest CLIP Score (26.5) and lowest KD_incep (0.210). Results show token-level strategies outperform sequence-level optimization, and removing the KL penalty degrades performance, indicating controlled exploration is essential.
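To illustrate the distinction this ablation probes, the sketch below contrasts sequence-level with token-level loss aggregation over a group of variable-length rollouts (the DAPO-style token-level averaging referenced here); the mask convention and names are illustrative.

```python
import torch

def sequence_level_loss(per_token_loss, mask):
    """Average within each rollout first, then across the group.
    Long and short rollouts contribute equally, diluting per-token signal."""
    per_seq = (per_token_loss * mask).sum(dim=1) / mask.sum(dim=1)
    return per_seq.mean()

def token_level_loss(per_token_loss, mask):
    """Average over all valid tokens in the group at once, so every token
    carries equal weight regardless of which rollout it belongs to."""
    return (per_token_loss * mask).sum() / mask.sum()
```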

The authors evaluate different reward model combinations for 3D autoregressive generation using CLIP Score and KD_incep metrics. Results show that combining HPS v2.1 with UnifiedReward and LMM_3D yields the highest CLIP Score (25.2) and lowest KD_incep (0.228), outperforming individual or partial combinations. HPS v2.1 alone provides the strongest baseline improvement, while adding specialized or LMM-based rewards further enhances performance.

The authors evaluate AR3D-R1 against prior 3D generation models on MME-3DR and Toys4K benchmarks, showing consistent improvements across CLIP Score, Inception-based metrics, and Fréchet Distance. Results indicate AR3D-R1 outperforms ShapeLLM-Omni and Trellis, particularly in structural fidelity and prompt alignment, with the largest gains observed in CLIP Score and KD_incep. These improvements reflect the effectiveness of the proposed hierarchical RL framework and reward ensemble strategy.
