Agent eXplorative Policy Optimization for Multimodal Agentic Reasoning

Minki Kang Shizhe Diao Ryo Hachiuma Sung Ju Hwang Pavlo Molchanov Yu-Chiang Frank Wang Byung-Kwan Lee

Abstract

Vision-language models with extended reasoning solve complex problems, but many real-world problems require external tools that internal reasoning alone often cannot replace. Agentic reasoning therefore interleaves two behaviors marked by a structural asymmetry: thinking (the default, self-contained behavior) and tool use (an auxiliary, high-variance action). We call this asymmetry the Thinking-Acting Gap. Under standard reinforcement learning (RL) methods such as GRPO, the gap shows up as two diagnostic symptoms during training: tool use is attempted in only about 30% of rollouts, and when it is, the tool-using rollouts within a group are all incorrect for about 40% of questions, which removes the learning signal on exactly the tool calls that needed it. We propose AXPO (Agent eXplorative Policy Optimization): for each tool-using subgroup that is entirely incorrect, AXPO keeps the thinking prefix and resamples the tool call and its continuation, combined with uncertainty-based prefix selection. Across nine multimodal benchmarks and three sizes of Qwen3-VL-Thinking, SFT+AXPO outperforms SFT+GRPO on average (+1.8 pp on Pass@1 and +1.8 pp on Pass@4, averaged at the 8B scale); at the 8B scale, SFT+AXPO surpasses the 32B base model on Pass@4 with four times fewer parameters.

One-sentence Summary

AXPO (Agent eXplorative Policy Optimization) bridges the Thinking-Acting Gap in multimodal agentic reasoning by fixing thinking prefixes and resampling tool calls paired with uncertainty-based prefix selection, enabling an 8B Qwen3-VL-Thinking model trained with SFT + AXPO to outperform SFT + GRPO by an average of 1.8 percentage points on Pass@1 and Pass@4 across nine benchmarks and surpass a 32B base model on Pass@4 using four times fewer parameters.

Key Contributions

  • AXPO (Agent eXplorative Policy Optimization) resolves the Thinking-Acting Gap in agentic vision-language models by identifying all-wrong tool-using subgroups and resampling their tool calls and continuations while freezing the preceding reasoning prefix.
  • This mechanism operates directly at the tool-call boundary rather than after tool observation, which restores the suppressed learning signal that standard group-relative policy optimization typically misses during uniform sampling.
  • Evaluations across nine multimodal benchmarks using Qwen3-VL-Thinking demonstrate that pairing supervised fine-tuning with AXPO increases average Pass@1 and Pass@4 scores by 1.8 percentage points at the 8B scale and enables an 8B model to surpass a 32B base model on Pass@4 using four times fewer parameters.

Introduction

Vision-language models with extended reasoning have made significant strides, yet real-world applications frequently require external tools for live data retrieval, complex computation, and fine-grained visual analysis. Multimodal agentic reasoning addresses this need by interleaving internal thinking with tool execution, but standard post-training pipelines struggle with a structural asymmetry that the authors call the Thinking-Acting Gap. Under conventional reinforcement learning methods such as GRPO, tool calls are rare and high-variance: the tool-using rollouts in a group often all fail, which erases the learning signal exactly when the model needs to improve its acting behavior. To address this bottleneck, the authors propose AXPO (Agent eXplorative Policy Optimization), which keeps the thinking prefix of a failed tool-using rollout fixed and resamples only the high-variance tool call and its continuation. By concentrating exploration at the tool-call boundary and selecting prefixes by uncertainty, AXPO restores the learning signal for acting and improves results across multimodal benchmarks; at the 8B scale it surpasses a 32B base model on Pass@4.

Dataset

Dataset Composition and Sources: The authors construct their training corpus by aggregating trajectories from three established datasets: ViRL, fvqa, and PyVision-RL. Every initial trajectory is synthesized using the Qwen3-VL-32B-Thinking model as the teacher.

Subset Details:

  • Supervised Fine-Tuning collection: 64,274 trajectories in total. Roughly 25 percent incorporate at least one tool call, while the remaining 75 percent rely exclusively on internal reasoning. The authors apply a strict correctness filter, retaining only trajectories that produce the correct final answer against ground truth labels.
  • Reinforcement Learning collection: Approximately 37,000 problems. This subset combines 15,591 questions filtered from the SFT pool with 22,000 additional hard questions sourced from MMFinetReason-hard. The authors remove SFT questions that the supervised checkpoint solves perfectly across four rollouts, as well as questions the teacher model fails on all four rollouts, effectively pruning trivial and unreachable tasks.
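
The pruning rule for the SFT-derived RL questions can be sketched as follows. This is a minimal illustration, assuming per-question lists of rollout correctness flags are already available; the function name `filter_rl_questions` and the dictionary layout are hypothetical and not taken from the paper.

```python
from typing import Dict, List


def filter_rl_questions(
    student_correct: Dict[str, List[bool]],
    teacher_correct: Dict[str, List[bool]],
) -> List[str]:
    """Keep questions that are neither trivial nor unreachable.

    student_correct: correctness of the SFT checkpoint's rollouts per question.
    teacher_correct: correctness of the teacher's rollouts per question.
    (Both layouts are assumptions made for this sketch.)
    """
    kept = []
    for qid, student_runs in student_correct.items():
        teacher_runs = teacher_correct.get(qid, [])
        # Trivial: the SFT checkpoint solves every rollout.
        trivial = len(student_runs) > 0 and all(student_runs)
        # Unreachable: the teacher fails every rollout.
        unreachable = len(teacher_runs) > 0 and not any(teacher_runs)
        if not trivial and not unreachable:
            kept.append(qid)
    return kept


student = {
    "q1": [True, True, True, True],      # trivial, dropped
    "q2": [True, False, False, True],    # kept
    "q3": [False, False, False, False],  # teacher also fails, dropped
}
teacher = {
    "q1": [True, True, True, True],
    "q2": [True, True, False, True],
    "q3": [False, False, False, False],
}
kept_ids = filter_rl_questions(student, teacher)
print(kept_ids)  # ['q2']
```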

Training Usage and Processing: The cleaned SFT trajectories form the initial training split. The authors then transition to reinforcement learning, where the policy generates up to three turns per problem with a group size of eight rollouts per question. Data is routed through the verl and rllm libraries to apply policy gradient updates, utilizing asymmetric clipping ranges and a fixed reference policy for KL regularization.
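
To make "asymmetric clipping ranges" concrete, here is a per-token sketch of a PPO-style clipped surrogate with different lower and upper clip bounds. The bounds `eps_low=0.2` and `eps_high=0.28` are placeholder values chosen for illustration; the summary does not state the values the authors use, and the KL term against the reference policy is omitted.

```python
import math


def clipped_surrogate(logp_new: float, logp_old: float, advantage: float,
                      eps_low: float = 0.2, eps_high: float = 0.28) -> float:
    """Per-token clipped objective (to be maximized) with asymmetric bounds.

    eps_low and eps_high are hypothetical defaults, not the paper's settings.
    """
    ratio = math.exp(logp_new - logp_old)  # importance ratio pi_new / pi_old
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Pessimistic choice between the unclipped and clipped terms.
    return min(ratio * advantage, clipped * advantage)


# The ratio exp(0.5) exceeds 1 + eps_high, so a positive advantage is capped.
capped = clipped_surrogate(logp_new=-1.0, logp_old=-1.5, advantage=1.0)
print(round(capped, 2))  # 1.28
```

A wider upper bound lets low-probability tokens with positive advantage grow more per update than a symmetric clip would allow, which is one common motivation for this design.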

Additional Processing Notes: The provided documentation does not outline specific image cropping methods or custom metadata schemas. Instead, the authors rely on trajectory length constraints, rollout-based sampling, and answer verification to structure and validate the training data.

Method

The authors present AXPO, a reinforcement learning algorithm designed to address the Thinking-Acting Gap in group-based agentic reasoning, where tool use is under-attempted and tool-using subgroups frequently fail entirely, leading to non-positive advantage on tool-call tokens. The framework builds upon the standard Group Relative Policy Optimization (GRPO) pipeline, which trains a vision-language model (VLM) policy $\pi_\theta$ to generate sequences of thinking segments, tool calls, and observations, culminating in an answer. In GRPO, a batch of $N$ rollouts is sampled per input, and rewards are normalized within the group to compute advantages, which are then used to update the policy via a PPO-clip surrogate.
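
The following sketch shows why an all-wrong tool-using subgroup receives non-positive advantage under group normalization. It assumes binary 0/1 rewards and mean/standard-deviation normalization with a small `eps`; the helper name `group_advantages` and the example group are illustrative, not from the paper.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantage: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of N = 8 rollouts: three call a tool and all three fail,
# five think only and three of those succeed.
rewards = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0]
uses_tool = [True, True, True, False, False, False, False, False]

adv = group_advantages(rewards)
tool_adv = [a for a, t in zip(adv, uses_tool) if t]

# Every tool-using rollout sits below the group mean, so its tokens are only
# ever pushed down; no tool call in this group is reinforced.
print(all(a < 0 for a in tool_adv))  # True
```

If the whole group is wrong, every advantage is zero instead, so the tool-call tokens receive no gradient at all.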

The core innovation of AXPO is tool-call resampling, which targets the under-trained behavior of tool use by concentrating exploration on the continuation after a confirmed tool call. This is achieved by fixing a thinking prefix $t_1^{\text{src}}$ that has already crossed the tool-call boundary, ensuring that all subsequent continuations are tool-using by construction. From this fixed prefix, $K$ continuations are drawn from the policy $\pi_\theta(\cdot \mid x, t_1^{\text{src}})$, executed, and rolled forward. Each resampled trajectory shares the same prefix as the source rollout, thereby concentrating stochasticity on the tool call and its immediate aftermath. This approach provably dominates standard sampling at recovering correct tool-using rollouts, as it eliminates the waste of sampling on non-tool rollouts, a key limitation of scaling $N$ in GRPO.
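
The intuition behind the dominance claim can be illustrated with a toy model. This is not the paper's proof: it assumes independent samples, a fixed probability `p_tool` that a fresh rollout calls a tool, and a single probability `q_correct` that a tool-using continuation is correct. Both function names and the value `q_correct = 0.1` are hypothetical; `p_tool = 0.3` echoes the roughly 30% tool-use rate reported in the abstract.

```python
def coverage_standard(p_tool: float, q_correct: float, n: int) -> float:
    """P(at least one correct tool-using rollout) among n fresh rollouts."""
    return 1.0 - (1.0 - p_tool * q_correct) ** n


def coverage_resample(q_correct: float, k: int) -> float:
    """P(at least one correct continuation) among k resamples drawn from a
    prefix that has already crossed the tool-call boundary."""
    return 1.0 - (1.0 - q_correct) ** k


p_tool, q_correct, budget = 0.3, 0.1, 4
standard = coverage_standard(p_tool, q_correct, budget)
resample = coverage_resample(q_correct, budget)

# With the same number of extra samples, resampling from the fixed prefix
# never does worse, because no sample is spent on a non-tool rollout.
print(resample >= standard)  # True
```

Under these assumptions the comparison holds for any `p_tool` in [0, 1], since `p_tool * q_correct` can never exceed `q_correct`.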

As the paper's accompanying figure illustrates, AXPO triggers only on groups whose tool-using subgroup is non-empty and entirely incorrect, which the authors identify as the primary source of non-positive advantage. This ensures that resampling is applied where it delivers the largest gradient lift. To manage computational cost, AXPO caps the extra resampling budget at $r \cdot BN$ per step, with $r = 0.25$ in practice, and allocates it breadth-first across all triggered questions. Within these groups, candidate prefixes are ranked by uncertainty, measured through the mean policy probability assigned to the tool-call tokens in the source rollout: a lower mean probability indicates a more uncertain prefix. This confidence proxy, a tractable alternative to predictive entropy, lets AXPO resample the most uncertain prefixes first, as they are more likely to admit a correct continuation.
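
One way to read the budget cap, breadth-first allocation, and uncertainty ranking together is sketched below. Several details are assumptions made for illustration: `B` is taken to be the number of questions per step and `N` the group size, each selected prefix is assumed to cost `k` extra rollouts, and the order in which questions are visited within a round is arbitrary. The function name `select_prefixes` is hypothetical.

```python
from typing import Dict, List, Tuple


def select_prefixes(
    candidates: Dict[str, List[Tuple[str, float]]],
    batch_size: int,
    group_size: int,
    k: int,
    r: float = 0.25,
) -> List[Tuple[str, str]]:
    """Pick (question_id, prefix_id) pairs to resample under a capped budget.

    candidates maps each triggered question to (prefix_id, mean tool-call
    token probability) pairs; a lower probability means more uncertain.
    """
    budget = int(r * batch_size * group_size)  # extra rollouts allowed this step
    ranked = {q: sorted(p, key=lambda item: item[1]) for q, p in candidates.items()}
    selected = []
    depth = 0  # depth 0 = most uncertain prefix of each question
    while budget >= k:
        progressed = False
        for qid, prefixes in ranked.items():
            if depth < len(prefixes) and budget >= k:
                selected.append((qid, prefixes[depth][0]))
                budget -= k
                progressed = True
        if not progressed:
            break  # every question has run out of candidate prefixes
        depth += 1
    return selected


cands = {"q1": [("a", 0.9), ("b", 0.4)], "q2": [("c", 0.7)]}
# Budget = 0.25 * 4 * 8 = 8 rollouts, i.e. two prefixes at k = 4 each.
picked = select_prefixes(cands, batch_size=4, group_size=8, k=4)
print(picked)  # [('q1', 'b'), ('q2', 'c')]
```

Breadth-first allocation means every triggered question gets its most uncertain prefix resampled before any question gets a second one.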

The advantage calculation in AXPO is designed to avoid gradient conflicts arising from shared prefixes. For each selected prefix, the $K$ resampled continuations form an independent advantage group, and their per-token advantages $\hat{A}_k^{\text{res}}$ are computed based on their group-normalized rewards. These advantages are applied only to the continuation tokens, with the prefix tokens masked. The source trajectory's prefix tokens are updated using a separate recovery reward $r^{\text{prefix}}$, which is 1 if at least one resampled continuation is correct, and 0 otherwise. This recovery reward replaces the original source rollout's reward in the group's normalization, yielding a per-prefix advantage $\hat{A}^{\text{prefix}}$ that is applied to the prefix tokens. This mechanism ensures that the prefix is credited positively whenever resampling succeeds, converting the coverage gain into a reinforcing gradient signal. The final AXPO loss for a selected prefix combines the clipped surrogate losses for both the source prefix and the resampled continuations.
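
The two advantage groups described above can be sketched at the level of scalar rewards. This is a simplified reading that assumes binary 0/1 rewards and mean/standard-deviation normalization; token-level masking and the clipped loss are left out, and the names `normalize` and `axpo_advantages` are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def normalize(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-normalized advantages: (reward - mean) / (std + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(x - mu) / (sigma + eps) for x in rewards]


def axpo_advantages(
    group_rewards: List[float],
    source_idx: int,
    resampled_rewards: List[float],
) -> Tuple[List[float], float]:
    """Return (continuation advantages, prefix advantage) for one prefix."""
    # The K resampled continuations form their own advantage group; these
    # values would be applied to continuation tokens only.
    res_adv = normalize(resampled_rewards)
    # Recovery reward: 1 if any resampled continuation is correct, else 0.
    r_prefix = 1.0 if any(x > 0 for x in resampled_rewards) else 0.0
    # The recovery reward replaces the source rollout's reward in the
    # original group's normalization.
    patched = list(group_rewards)
    patched[source_idx] = r_prefix
    prefix_adv = normalize(patched)[source_idx]
    return res_adv, prefix_adv


# An all-wrong group of 8; the prefix of rollout 0 is resampled K = 4 times
# and one continuation succeeds.
res_adv, prefix_adv = axpo_advantages([0.0] * 8, 0, [0.0, 1.0, 0.0, 0.0])
print(prefix_adv > 0, res_adv[1] > 0)  # True True
```

Keeping the continuations in a separate group means the shared prefix tokens are credited once, through the recovery reward, rather than receiving conflicting signals from each continuation.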

Experiment

The evaluation employs Qwen3-VL models across nine multimodal benchmarks to assess agentic tool use in reasoning, perception, and search tasks. Main experiments validate that AXPO bridges the thinking-acting gap by dynamically resampling tool calls during reinforcement learning, which sustains tool adoption and recovers correct trajectories from initially failed reasoning prefixes. Ablations and comparative analyses confirm that these improvements stem from strategic compute allocation and precise advantage calculation rather than increased rollout budgets or reward shaping. Ultimately, the methodology enables smaller models to match or exceed larger baselines by simultaneously expanding the policy's reachable correct trajectories and enhancing conditional tool-use reliability.

The authors evaluate AXPO against several baselines across multiple model sizes and benchmarks covering multimodal reasoning, perception, and search. AXPO improves over SFT + GRPO at every model size, with the largest gains on perception tasks and on Pass@4. During training it raises tool-use frequency and lowers the rate of all-wrong tool-using subgroups, which indicates a stronger learning signal on tool calls. AXPO also outperforms alternative RL recipes, and ablations show that every component contributes to the gains.
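
The two training-time symptoms tracked here (tool-use rate and all-wrong rate among tool-using subgroups) can be computed from logged rollouts as in the sketch below. The rollout representation is an assumption, and so is the denominator of the all-wrong rate, taken here to be the questions with at least one tool-using rollout; the summary does not specify either.

```python
from typing import List, Tuple

Rollout = Tuple[bool, bool]  # (uses_tool, is_correct); assumed layout


def thinking_acting_diagnostics(groups: List[List[Rollout]]) -> Tuple[float, float]:
    """Return (tool-use rate over rollouts, all-wrong rate over tool-using groups)."""
    total = sum(len(g) for g in groups)
    tool_rollouts = sum(1 for g in groups for uses_tool, _ in g if uses_tool)
    # Groups whose tool-using subgroup is non-empty.
    with_tool = [g for g in groups if any(u for u, _ in g)]
    # Of those, groups where no tool-using rollout is correct.
    all_wrong = [g for g in with_tool if not any(c for u, c in g if u)]
    tool_rate = tool_rollouts / total if total else 0.0
    all_wrong_rate = len(all_wrong) / len(with_tool) if with_tool else 0.0
    return tool_rate, all_wrong_rate


groups = [
    [(True, False), (False, True), (False, True), (True, False)],    # tool subgroup all wrong
    [(False, True)] * 4,                                              # no tool use
    [(True, True), (False, False), (False, False), (False, False)],  # tool subgroup has a success
]
tool_rate, all_wrong_rate = thinking_acting_diagnostics(groups)
print(tool_rate, all_wrong_rate)  # 0.25 0.5
```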

The authors also compare AXPO against prior agentic VLM systems on five benchmarks. AXPO scores higher on four of the five and on the five-benchmark average, and its improvement over its own base model is larger than the improvement prior methods achieve over theirs. The advantage is most visible on math-over-image benchmarks, where prior methods have invested less.

An ablation study measures the contribution of each AXPO component. Removing any one of prefix fixing, uncertainty-based prefix selection, prefix credit, or separate advantage grouping lowers performance across all benchmarks, and the full method scores highest on every evaluated metric, so each component contributes to the overall result.

Comparing SFT+AXPO directly with SFT+GRPO, AXPO achieves higher Pass@1 and Pass@4 at all model sizes, with gains concentrated in perception tasks where tool use is critical. It improves both how often tools are used and the quality of tool-using trajectories: the tool-use rate rises and the all-wrong rate among tool-using subgroups falls during training, reversing the two symptoms of the Thinking-Acting Gap.
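
For readers unfamiliar with the metric, Pass@k is often computed with the combinatorial estimator below, which gives the probability that at least one of k samples drawn without replacement from n rollouts is correct. Whether the authors use this exact estimator is an assumption; the summary only reports Pass@1 and Pass@4 values.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k from n sampled rollouts of which c are correct."""
    if n - c < k:
        # Fewer than k incorrect rollouts: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With k = 1 the estimator reduces to the plain accuracy c / n.
print(pass_at_k(n=8, c=2, k=1))  # 0.25
```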

In summary, the authors evaluate AXPO against supervised fine-tuning followed by reinforcement learning and against prior agentic vision-language models, across multiple model sizes and benchmarks spanning multimodal reasoning, perception, and search. The experiments indicate that AXPO narrows the Thinking-Acting Gap by increasing tool use and reducing all-wrong tool-using subgroups. AXPO outperforms SFT + GRPO at every model size and prior agentic systems on four of five benchmarks, with the strongest gains in perception and math-over-image domains. Ablations confirm that removing any design component degrades performance across the evaluated benchmarks.