Command Palette
Search for a command to run...
MMSkills : Vers des compétences multimodales pour des agents visuels généraux
MMSkills : Vers des compétences multimodales pour des agents visuels généraux
Résumé
Les compétences réutilisables sont devenues un substrat fondamental pour améliorer les capacités des agents, mais la plupart des packages de compétences existants encodent le comportement réutilisable principalement sous forme d’invites textuelles, de code exécutable ou de routines apprises. Pour les agents visuels, cependant, les connaissances procédurales sont intrinsèquement multimodales : la réutilisation dépend non seulement de l’opération à effectuer, mais aussi de la reconnaissance de l’état pertinent, de l’interprétation des preuves visuelles de progrès ou d’échec, et de la décision de la prochaine action. Nous formalisons cette exigence en tant que connaissances procédurales multimodales et abordons trois défis pratiques : (I) ce qu’un package de compétences multimodal devrait contenir ; (II) d’où ces packages peuvent être dérivés à partir de l’expérience d’interaction publique ; et (III) comment les agents peuvent consulter des preuves multimodales au moment de l’inférence sans contexte d’image excessif ni ancrage excessif aux captures d’écran de référence. Nous présentons MMSkills, un cadre pour représenter, générer et utiliser des procédures multimodales réutilisables pour la prise de décision visuelle en temps d’exécution. Chaque MMSkill est un package compact, conditionné par l’état, qui couple une procédure textuelle avec des cartes d’état en temps d’exécution et des images clés multi-vues. Pour construire ces packages, nous développons un Générateur de trajectoire vers compétence basé sur un agent qui transforme des trajectoires publiques non évaluatives en compétences multimodales réutilisables grâce au regroupement de flux de travail, à l’induction de procédures, à l’ancrage visuel et à l’audit guidé par des méta-compétences. Pour les utiliser, nous introduisons un agent de compétences multimodal à chargement par branche : des cartes d’état et des images clés sélectionnées sont inspectées dans une branche temporaire, alignées avec l’environnement en direct, et condensées en un guidage structuré pour l’agent principal. Les expériences menées sur des benchmarks d’agents visuels basés sur les interfaces graphiques (GUI) et les jeux montrent que MMSkills améliorent de manière constante les agents multimodaux de pointe et les plus petits, suggérant que les connaissances procédurales multimodales externes complètent les a priori internes au modèle.
One-sentence Summary
MMSkills equips visual agents with reusable multimodal procedural knowledge by coupling textual procedures with runtime state cards and multi-view keyframes, leveraging an agentic trajectory-to-skill Generator for package construction and a branch-loaded multimodal skill agent for live-environment alignment, which consistently improves both frontier and smaller multimodal agents across GUI and game-based visual-agent benchmarks.
Key Contributions
- MMSkills is introduced as a framework that formalizes multimodal procedural knowledge by encoding reusable behaviors into compact packages coupling textual procedures with runtime state cards and multi-view keyframes.
- An agentic trajectory-to-skill Generator automatically constructs these packages from public interaction logs through workflow grouping, procedure induction, visual grounding, and meta-skill-guided auditing.
- A branch-loaded inference mechanism inspects selected evidence in a temporary branch to distill structured guidance, mitigating long-context degradation while consistently improving performance for frontier and smaller multimodal agents across GUI and game-based benchmarks.
Introduction
The authors address the growing reliance on reusable procedural knowledge to enhance multimodal AI agents operating in complex visual environments like desktop automation and interactive games. While prior skill representations effectively encode behaviors as text or code, they struggle to capture the visual state cues and conditional decision-making required when agents must interpret live screen evidence. Existing methods either produce verbose text-only instructions, rely on inflexible raw demonstrations, or overwhelm context windows by directly injecting full skill libraries, which often causes agents to anchor to reference screenshots rather than current observations. To bridge this gap, the authors introduce MMSkills, a framework that structures reusable knowledge into compact multimodal packages coupling textual procedures with runtime state cards and multi-view keyframes. They further develop an automated pipeline to distill these packages from public interaction trajectories and propose branch loading, a runtime mechanism that selectively aligns skill evidence with live observations to deliver precise, state-aware guidance without context saturation.
Dataset
- Dataset composition and sources: The authors use four visual-agent benchmarks spanning realistic graphical user interfaces and open game environments. Skill data is drawn from the OpenCUA trajectory dataset, official training splits, and multiple gameplay runs, with all source material strictly separated from evaluation sets.
- Subset details: OSWorld contains 360 Ubuntu desktop test cases covering browsers, office software, media applications, and system workflows. macOSWorld provides 143 cross-platform GUI cases focused on file management, productivity, and interface tasks. VAB-Minecraft evaluates item-acquisition tasks that require visual grounding, inventory tracking, and tool manipulation. Super Mario Bros is selected from LMGaME-Bench for its recurring visual scenarios that naturally support reusable skill extraction.
- Data usage and splits: The authors extract all multimodal skills exclusively from non-test trajectories. They evaluate frontier and smaller models across three conditions: no-skill, text-only skills, and MMSkills. Source trajectories remain completely disjoint from the final test cases to ensure unbiased evaluation and prevent data leakage.
- Processing and metadata construction: Agents plan directly from visual observations captured as desktop or game screenshots. For macOS skill extraction, raw OpenCUA trajectories undergo additional clustering and relevance filtering to align with benchmark categories. The resulting skill packages are structured into multimodal formats to guide agent decision-making.
Method
The MMSkills framework is structured around a modular architecture that enables visual agents to leverage reusable, multimodal procedural knowledge through a combination of skill representation, generation, and inference mechanisms. At its core, the framework consists of three main components: a multimodal skill package, a skill generation pipeline, and a branch-loaded multimodal skill agent. The overall system operates by first constructing a library of reusable skills from public interaction trajectories and then using a branch-loaded inference mechanism to consult these skills during task execution without directly embedding the full skill context into the main agent's reasoning.
A multimodal skill package encapsulates procedural knowledge as a state-conditioned procedure, represented as M=(D,P,S,K), where D is a compact descriptor, P is a textual procedure, S is a set of runtime state cards, and K is a set of keyframe bundles. Each state card Sj defines when a procedure should be applied or skipped, along with visible cues, verification cues, and available views, enabling the agent to make informed decisions about skill use. The keyframe bundles Kj provide multi-view visual evidence—such as full-frame, focus-crop, before, and after views—that grounds the skill in the environment. This representation allows the skill to be used both textually and visually, with the visual components serving as diagnostic references rather than direct action templates. The skill package is designed to be compact and reusable, with a text-only variant being the degenerate case where no visual evidence is included.
The skill generation pipeline transforms public, non-test trajectories into a domain-specific skill library. This process begins with embedding and clustering task instructions and trajectory metadata to form semantically focused clusters. For each cluster, a large language model (LLM)-based agent proposes atomic skills, defining workflow boundaries, completion conditions, and task coverage. These proposals are then merged and generalized into consolidated skill specifications, with overly broad skills being rejected. The next phase involves drafting the textual components of the skill—descriptor, procedure, and state cards—without referencing images. Finally, the skill is grounded by selecting keyframes, constructing multi-view bundles, and auditing the package to ensure consistency and relevance. This pipeline is controlled by a meta-skill that provides reusable scripts and quality gates, ensuring the generated skills are coherent and useful.
During inference, the main visual agent operates in a branch-loaded manner to avoid the contextual overload that would result from directly loading a full skill package. The agent maintains a short history and observes the current visual state, deciding at each step whether to act directly or consult a selected skill. When a skill is consulted, a temporary branch is activated, isolating the skill-environment grounding from the main trajectory. This branch operates in two stages. In the first stage, a gated view selector evaluates the current observation and recent history to determine whether visual evidence is needed and, if so, which state cards and view types to load. This decision is based on evidence goals such as locating a control, recognizing a pre-change state, or verifying a post-change result. The second stage, the planner guidance module, uses the selected evidence to generate a structured guidance tuple Gt=(applicablet,subgoalt,plant,do_not_dot,verifyt), which includes an applicability judgment, a local subgoal, a skill-conditioned plan, negative constraints, and a verification check. This guidance is returned to the main agent, which uses it as decision support while maintaining action grounding in the live observation.
The main agent’s decision to consult a skill is governed by a policy that evaluates the relevance of the skill’s hints and the current state of the task. The agent may consult a skill at most a limited number of times, and exhausted skills are removed from the available list. The branch-loaded design ensures that the agent receives compact, structured guidance without being distracted by irrelevant visual references, preserving the integrity of the live environment observation. This architecture allows the framework to be applied across diverse visual environments, including graphical user interfaces and video games, by adapting the skill representation and generation process to the specific domain.
Experiment
The evaluation tests multimodal procedural knowledge against no-skill and text-only baselines across desktop GUI and open-ended game environments to validate its impact on agent performance and decision dynamics. Component ablations confirm that combining runtime state discrimination cards with visual keyframes in a filtered branch-loading architecture is necessary for accurate skill retrieval and context preservation. Usage analysis reveals that multimodal guidance increases skill invocation frequency while effectively shortening task trajectories by reducing redundant exploration. Finally, behavioral tracking demonstrates that these skills fundamentally transform agent execution from trial-and-error clicking into structured, state-aware planning with stronger completion awareness.
The authors analyze the distribution of tasks and trajectory clusters across different platforms and domains, showing that the majority of tasks and clusters are concentrated on the macOS platform, particularly in productivity and system apps. The data reflects a higher volume of tasks and more diverse clustering in macOS compared to Ubuntu, indicating a broader range of evaluated scenarios in the macOS environment. The majority of tasks and trajectory clusters are concentrated on the macOS platform. Productivity and system apps on macOS have the highest number of tasks and clusters. Ubuntu shows fewer tasks and clusters, with a more limited distribution across domains.
The authors evaluate the impact of MMSkills on visual agents across multiple models and tasks, showing that the integration of multimodal procedural knowledge improves success rates and alters agent behavior. Results demonstrate consistent performance gains across different model families and task domains, with MMSkills leading to shorter trajectories, reduced repetitive actions, and more efficient decision-making. MMSkills improve success rates across all evaluated models and tasks, with the largest gains seen in weaker models and visually grounded game settings. The use of MMSkills reduces interaction length and repetitive actions, indicating more efficient and goal-directed behavior. MMSkills shift agent behavior toward structured input and better completion awareness, reducing exploratory actions and increasing verification of task completion.
The authors analyze the impact of MMSkills on visual agents across multiple benchmarks, showing that the integration of multimodal procedural knowledge improves task success rates and alters agent behavior. Results indicate that MMSkills reduce the number of low-level actions, decrease repetitive behaviors, and shift action patterns toward more structured and goal-directed interactions, particularly for models with high click usage. MMSkills reduce the number of low-level actions and repetitive behaviors, leading to more efficient task execution. Agents using MMSkills exhibit a shift from click-heavy behaviors toward more structured input and completion judgments. The integration of multimodal skills shortens interaction trajectories and decreases the frequency of repeated actions across multiple models.
The authors evaluate the impact of multimodal procedural skills on visual agents across multiple benchmarks, including desktop environments and video games. Results show that incorporating these skills consistently improves success rates and behavioral efficiency, particularly for models with limited internal visual reasoning capabilities, while also reducing task-solving steps and repetitive actions. MMSkills improve success rates across all evaluated models and benchmarks, with the largest gains observed in weaker visual agents. MMSkills reduce the number of low-level actions and repetitive behaviors, leading to more efficient and structured task execution. The use of MMSkills results in shorter trajectories and more effective state recognition, especially when combined with branch loading and visual evidence filtering.
The authors evaluate the impact of MMSkills on visual agents across multiple benchmarks, including desktop applications and game environments. Results show that MMSkills consistently improve performance across different models and domains, with gains observed in both success rates and behavioral efficiency. The effectiveness of MMSkills is attributed to their multimodal nature, which enables better state recognition and more structured task execution. MMSkills improve performance across diverse domains and models, including desktop applications and game environments. The multimodal nature of MMSkills leads to more efficient and goal-directed agent behavior, reducing unnecessary actions and repetitive patterns. MMSkills enable agents to better recognize relevant states and use external knowledge in a targeted manner, improving task-solving efficiency.
The experimental suite evaluates the integration of multimodal procedural skills into visual agents across various desktop platforms, application domains, and model architectures to validate their impact on task execution and behavioral efficiency. Results consistently demonstrate that incorporating these skills enhances overall success rates while fundamentally reshaping agent interactions by eliminating redundant actions and shortening interaction trajectories. Ultimately, the approach fosters more structured and goal-directed workflows with improved state recognition, delivering the most substantial behavioral improvements for models with weaker intrinsic visual reasoning capabilities.