SpatialEvo: Self-Evolving Spatial Intelligence via Deterministic Geometric Environments
Abstract
Spatial reasoning over three-dimensional scenes is a fundamental capability for embodied intelligence, yet continued model improvement remains hampered by the cost of geometric annotation. The self-evolving paradigm offers a promising path, but its reliance on model consensus to construct pseudo-labels reinforces the model's own geometric errors rather than correcting them. We identify a property unique to 3D spatial reasoning that circumvents this limitation: the ground truth is a deterministic consequence of the underlying geometry, computable exactly from point clouds and camera poses without any model involvement. Building on this insight, we present SpatialEvo, a self-evolution framework for 3D spatial reasoning centered on the Deterministic Geometric Environment (DGE). The DGE formalizes 16 categories of spatial reasoning tasks under explicit geometric validation rules and converts unannotated 3D scenes into noise-free interactive oracles, replacing model consensus with objective physical feedback. A single shared-parameter policy co-evolves across questioner and solver roles under DGE constraints: the questioner generates physically valid spatial questions grounded in scene observations, while the solver infers answers that are checked exactly against DGE-verified ground truth.
A task-adaptive scheduler endogenously concentrates training on the model's weakest categories, yielding a dynamic curriculum without manual design. Experiments on nine benchmarks show that SpatialEvo achieves the highest average score at both the 3B and 7B scales, with consistent gains on spatial reasoning benchmarks and no degradation of general visual understanding.
One-sentence Summary
The authors propose SpatialEvo, a self-evolving framework for 3D spatial reasoning that utilizes a Deterministic Geometric Environment to replace error-prone model consensus with objective physical feedback, enabling a shared-parameter policy to co-evolve across questioner and solver roles using zero-noise interactive oracles in unannotated 3D scenes.
Key Contributions
- The paper introduces SpatialEvo, a self-evolving framework for 3D spatial reasoning that replaces error-prone model consensus with deterministic physical feedback.
- This work develops the Deterministic Geometric Environment (DGE), which formalizes 16 spatial reasoning task categories and uses point clouds and camera poses to convert unannotated scenes into zero-noise interactive oracles.
- The method employs a single shared-parameter policy that co-evolves as both a questioner and a solver, a process that demonstrates significant performance gains across multiple spatial reasoning benchmarks.
Introduction
Effective 3D spatial reasoning is essential for embodied intelligence, yet progress is often hindered by the high cost of geometric annotations and the limitations of static datasets. Existing self-evolution methods typically rely on model consensus to generate pseudo-labels, which can reinforce a model's own geometric errors rather than correcting them. The authors leverage the deterministic nature of 3D geometry to overcome this, introducing SpatialEvo. This framework utilizes a Deterministic Geometric Environment (DGE) to compute exact ground truth from point clouds and camera poses, replacing unreliable model voting with objective physical feedback. By using a single policy that co-evolves as both a questioner and a solver, SpatialEvo creates a dynamic, task-adaptive curriculum that improves spatial reasoning without manual intervention.
Dataset

Dataset Overview
The authors utilize a pre-filtered multi-source visual context pool designed for online Reinforcement Learning (RL). The dataset is structured as follows:
Composition and Sources
- The pool consists of 4,365 total contexts derived from the training splits of ScanNet, ScanNet++, and ARKitScenes.
- Data is organized into three distinct modalities: scene-level multi-frame contexts, image-pair contexts, and single-image contexts.
Filtering and Quality Control
- Scene-level contexts: Filtered to ensure high grounded visible object counts and low zero-visibility ratios.
- Image-pair contexts: Required to contain at least three shared visible objects across frames and a minimum of five visible objects per frame.
- Single-image contexts: Required to include at least six visible objects.
Data Usage and Training Strategy
- Mixture Ratios: The context pool is balanced by modality based on the number of supported task types, resulting in an approximate 6:7:3 ratio for scene-level, image-pair, and single-image inputs.
- Sampling Logic: To prevent data redundancy, the authors sample a limited number of contexts per video, specifically no more than three per modality.
- Online Generation: During training, the policy model receives raw image contexts as input, while both question and answer generation are performed online.
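The filtering criteria above can be sketched as simple predicates over per-context visibility statistics. This is an illustrative sketch only: the function and parameter names are assumptions, and the scene-level rule is omitted because the paper states its thresholds only qualitatively ("high grounded visible object counts and low zero-visibility ratios").

```python
# Illustrative sketch of the context-filtering rules described above.
# Function and parameter names are assumptions, not the authors' schema.

def keep_image_pair(shared_objects: int, visible_per_frame: list[int]) -> bool:
    """Image-pair contexts need >= 3 shared visible objects across frames
    and >= 5 visible objects in each frame."""
    return shared_objects >= 3 and min(visible_per_frame) >= 5

def keep_single_image(visible_objects: int) -> bool:
    """Single-image contexts need >= 6 visible objects."""
    return visible_objects >= 6
```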
Method
The SpatialEvo framework, as illustrated in the figure below, introduces a novel architecture for spatial reasoning through a co-evolutionary paradigm that integrates a deterministic geometric environment with a shared vision-language policy model. The framework operates as a closed-loop system where a single policy model, parameterized by πθ, dynamically assumes two complementary roles: a Questioner and a Solver. The Questioner generates spatially grounded reasoning questions from visual observations, while the Solver predicts answers to these questions, with both roles operating under the hard constraints of geometric ground truth provided by the Deterministic Geometric Environment (DGE). This design establishes a continuous self-reinforcement loop, where the Questioner's exploration of spatial boundaries is corrected by the Solver's interaction with the DGE's absolute ground truth, thereby enabling mutual knowledge reinforcement and the emergence of robust spatial intelligence.
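The closed loop described above can be summarized in a short sketch. All names here (`co_evolution_step`, `as_questioner`/`as_solver` roles, the reward shape) are hypothetical stand-ins for the actual components, and the reward is deliberately simplified relative to the paper's full design.

```python
def co_evolution_step(policy, dge, scene, update):
    """One illustrative step of the questioner/solver closed loop.
    `policy` plays both roles via role-conditioned prompting; `dge`
    supplies deterministic verification and exact ground truth.
    All names and the simplified reward are assumptions."""
    question = policy(role="questioner", context=scene)
    verdict = dge(question, scene)            # deterministic geometric verification
    answer = policy(role="solver", context=scene, question=question)
    if verdict.valid:
        reward = float(answer == verdict.answer)   # anchored to exact ground truth
    else:
        reward = 0.0   # in the paper, an LLM judge instead scores the solver's
                       # explanation of why the question was rejected
    update(policy, reward)                    # e.g. a GRPO-style policy update
    return reward
```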

The core of this framework is the Deterministic Geometric Environment (DGE), which functions as a Geometric Oracle to provide noise-free feedback. The DGE receives natural language questions from the policy model and maps them to the underlying 3D scene assets—comprising dense point clouds and camera pose sequences—to perform objective verification and compute exact ground-truth answers. This process is implemented through a tightly coupled pipeline consisting of two primary components: task-specific geometric validation rule sets and an automated verification pipeline. The validation rule sets decompose each of the 16 spatial reasoning tasks into executable atomic criteria, ensuring that questions are valid along dimensions of premise consistency, inferential solvability, and geometric degeneracy filtering. For instance, a question about relative direction requires that the referenced frames are valid and that sufficient viewpoint disparity exists. The automated verification pipeline then executes this logic in three stages: first, it parses the free-form question using a lightweight LLM to extract structured entities; second, it validates the extracted entities against the task-specific rule set; and third, for valid questions, it performs precise geometric computation to synthesize the ground truth. This paradigm replaces unreliable model-based judgments with programmatic physical computation, ensuring that every gradient update for the policy model is anchored to objective physical laws.
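The three-stage pipeline above (parse, validate, compute) can be sketched as follows. The callables `parse_entities`, `rules`, and `compute_gt` are hypothetical placeholders for the lightweight LLM parser, the task-specific rule set, and the geometric computation respectively; none of these names come from the paper.

```python
# Hypothetical sketch of the DGE's three-stage verification pipeline:
# (1) parse the free-form question into structured entities,
# (2) validate them against the task-specific rule set,
# (3) for valid questions, compute the exact ground truth from scene geometry.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Verdict:
    valid: bool
    answer: Optional[str] = None   # exact ground truth when valid
    reason: Optional[str] = None   # structured rejection evidence when invalid

def verify(question: str,
           scene: dict,                                   # point cloud + camera poses
           parse_entities: Callable[[str], dict],         # lightweight LLM parser
           rules: Callable[[dict, dict], Optional[str]],  # failure reason, or None
           compute_gt: Callable[[dict, dict], str]) -> Verdict:
    entities = parse_entities(question)
    failure = rules(entities, scene)   # e.g. insufficient viewpoint disparity
    if failure is not None:
        return Verdict(valid=False, reason=failure)
    return Verdict(valid=True, answer=compute_gt(entities, scene))
```

Because the answer is computed programmatically rather than voted on, the same question always yields the same verdict, which is what makes the oracle noise-free.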

The co-evolution of the Questioner and Solver is driven by a spatial-grounded policy co-evolution mechanism based on the GRPO algorithm. This mechanism employs a single policy model that alternates between the two roles via role-conditioned prompting. The task scheduler, which is a lightweight component, dynamically adjusts the training curriculum by sampling tasks based on the Solver's historical performance. It first infers the feasible task set for the current scene and then assigns sampling weights inversely proportional to the historical effective accuracy of each task category, ensuring that the model focuses on its current cognitive weak spots. This creates a fully adaptive, endogenously driven curriculum. The training procedure involves the Questioner generating a batch of candidate questions, which are then verified by the DGE. Valid questions are passed to the Solver, which independently generates answers and receives rewards based on accuracy. Invalid questions also contribute to learning, as the Solver is required to generate an explanation for the rejection reason, which is scored by a lightweight LLM judge. The reward functions are carefully designed to promote high-quality, valid reasoning. For the Questioner, the reward combines format compliance with a coupled term of geometric validity and visual observation quality, which acts as a critical gating mechanism. For the Solver, the reward is structured to provide meaningful signals for both valid and invalid questions, ensuring that the model learns not only to answer correctly but also to understand the rules and constraints that define valid spatial queries.
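The scheduler's inverse-accuracy weighting can be sketched as below. The smoothing constant `eps` and the exact weighting formula are assumptions; the paper states only that sampling weights are inversely proportional to historical effective accuracy over the scene's feasible task set.

```python
import random

def sample_task(feasible_tasks: list[str],
                historical_accuracy: dict[str, float],
                eps: float = 0.05) -> str:
    """Sample a task with probability inversely proportional to its
    historical effective accuracy, so weaker categories get more training.
    `eps` (an assumption) keeps well-solved tasks from vanishing entirely."""
    weights = [1.0 / (historical_accuracy.get(t, 0.0) + eps)
               for t in feasible_tasks]
    total = sum(weights)
    probs = [w / total for w in weights]
    return random.choices(feasible_tasks, weights=probs, k=1)[0]
```

Unseen tasks default to accuracy 0.0 and thus receive maximal weight, which matches the intuition that untested categories should be probed first.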

The framework's design includes several key components to ensure robustness and interpretability. The DGE's automated verification pipeline includes a deduplication-aware statistics system that maintains a weighted count of unique semantic question signatures to preserve curriculum consistency. The questioner prompt templates are task-conditioned, with scene-level, single-image, and image-pair templates that guide the model to generate observations with a global-to-local flow. The invalid-question explanation judge prompt, which is used to score the Solver's explanations for rejected questions, is designed to prefer the simulator's authoritative failure reason over fluent but unsupported explanations. This ensures that the learning signal for invalid questions is anchored to the DGE's structured rejection evidence, teaching the model which questions should not be asked and why. All auxiliary language model calls, including entity extraction and explanation judging, are unified to a single GPT-OSS-120B backend to control system complexity and ensure consistency. This comprehensive design enables the model to develop a deep, grounded understanding of spatial relationships through continuous interaction with a physically consistent environment.
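One possible form of the deduplication-aware statistics is a weighted counter keyed by a semantic question signature. The signature representation and the 1/(1+n) style decay below are assumptions; the paper states only that a weighted count of unique semantic signatures is maintained.

```python
from collections import defaultdict

class DedupStats:
    """Maintain weighted counts of unique semantic question signatures so
    that repeated questions contribute progressively less to per-task
    statistics. The signature key and decay schedule are illustrative
    assumptions, not the paper's exact design."""

    def __init__(self):
        self.seen = defaultdict(int)        # signature -> occurrence count
        self.weighted = defaultdict(float)  # task -> weighted question count

    def signature(self, task: str, entities: tuple) -> tuple:
        return (task, entities)             # a semantic key, not raw text

    def record(self, task: str, entities: tuple) -> float:
        sig = self.signature(task, entities)
        self.seen[sig] += 1
        w = 1.0 / self.seen[sig]            # duplicates count less each time
        self.weighted[task] += w
        return w
```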
Experiment
SpatialEvo is evaluated across nine benchmarks to validate its ability to improve 3D spatial reasoning through a self-evolving reinforcement learning framework. The experiments compare the proposed method against static data tuning and existing self-supervised approaches, while ablation studies isolate the benefits of the Deterministic Geometric Environment and the adaptive task scheduler. The results demonstrate that providing exact physical feedback through programmatic verification enables superior spatial intelligence and emergent curriculum learning without degrading general visual capabilities.
The table compares training paradigms for spatial reasoning. The online reinforcement learning method outperforms static data tuning across all task categories and achieves the highest average score, with the largest margins on numerical and multiple-choice questions, where static data tuning approaches lag furthest behind.

The table lists key hyperparameters of the reinforcement learning training pipeline. Training runs for 4 epochs with gradient accumulation over 4 steps and a learning rate of 1e-6. The model employs flash attention for efficient computation, processes images with a maximum of 150,528 pixels, and uses tensor parallelism with a size of 2.
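The reported settings can be gathered into a single config fragment. Only the values come from the paper; the key names are assumptions chosen for readability.

```python
# Training hyperparameters reported above, collected as a config dict.
# Key names are assumptions; only the values are from the paper.
train_config = {
    "gradient_accumulation_steps": 4,
    "learning_rate": 1e-6,
    "attention_impl": "flash_attention",
    "max_image_pixels": 150_528,   # maximum pixels per image, as reported
    "num_epochs": 4,
    "tensor_parallel_size": 2,
}
```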

The figures illustrate SpatialEvo's training dynamics. The questioner reward quickly stabilizes near 1.0, indicating rapid learning of valid question generation, while solver accuracy rises and the invalid-response ratio declines, reflecting internalization of geometric reasoning. Meanwhile, the adaptive scheduler up-weights harder task categories and down-weights easier ones as training progresses, producing an endogenous curriculum.

The table breaks down input modalities across the three 3D scene datasets: ScanNet, ScanNet++, and ARKitScenes. ScanNet contributes the most scene-level and image-pair inputs, while ARKitScenes contributes the most single-image inputs; across all modalities the pool totals 4,365 contexts, with ScanNet the largest overall source.

Across both model scales, SpatialEvo achieves the highest average score over all evaluated benchmarks, outperforming every baseline on spatial reasoning tasks, with notable improvements on VSI-Bench and EmbSpatial, while maintaining competitive performance on general visual understanding benchmarks and showing no degradation relative to baseline models.

The evaluation compares various training paradigms and benchmarks to validate the effectiveness of the SpatialEvo framework in enhancing spatial reasoning. Results demonstrate that the online reinforcement learning method significantly outperforms static data tuning across multiple task categories, particularly in numerical and multiple-choice reasoning. Furthermore, the adaptive curriculum development successfully facilitates the internalization of geometric reasoning while maintaining competitive performance on general visual understanding tasks.