MMSkills: Towards Multimodal Skills for General Visual Agents

Abstract

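Reusable skills have become a core foundation for improving agent capabilities, yet most existing skill packages encode reusable behavior primarily as text prompts, executable code, or learned routines. For visual agents, however, procedural knowledge is inherently multimodal: reuse depends not only on which operations to perform, but also on recognizing the relevant state, interpreting visual evidence of progress or failure, and deciding the next action. We formalize this requirement as multimodal procedural knowledge and address three practical challenges: (I) what a multimodal skill package should contain, (II) where such packages can be derived from in public interaction experience, and (III) how an agent can consult multimodal evidence at inference time without excessive image context or over-anchoring to reference screenshots. We introduce MMSkills, a framework for representing, generating, and using reusable multimodal procedures for runtime visual decision-making. Each MMSkill is a compact, state-conditioned package that couples a textual procedure with runtime state cards and multi-view keyframes. To construct these packages, we develop an agentic trajectory-to-skill Generator that converts public non-evaluation trajectories into reusable multimodal skills through workflow grouping, procedure induction, visual grounding, and meta-skill-guided auditing. To use them, we introduce a branch-loaded multimodal skill agent: selected state cards and keyframes are inspected in a temporary branch, aligned with the live environment, and distilled into structured guidance for the main agent. Experiments across GUI and game-based visual-agent benchmarks show that MMSkills consistently improves both frontier and smaller multimodal agents, suggesting that external multimodal procedural knowledge is complementary to model-internal priors.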

One-sentence Summary

MMSkills equips visual agents with reusable multimodal procedural knowledge by coupling textual procedures with runtime state cards and multi-view keyframes, leveraging an agentic trajectory-to-skill Generator for package construction and a branch-loaded multimodal skill agent for live-environment alignment, which consistently improves both frontier and smaller multimodal agents across GUI and game-based visual-agent benchmarks.

Key Contributions

  • MMSkills is introduced as a framework that formalizes multimodal procedural knowledge by encoding reusable behaviors into compact packages coupling textual procedures with runtime state cards and multi-view keyframes.
  • An agentic trajectory-to-skill Generator automatically constructs these packages from public interaction logs through workflow grouping, procedure induction, visual grounding, and meta-skill-guided auditing.
  • A branch-loaded inference mechanism inspects selected evidence in a temporary branch to distill structured guidance, mitigating long-context degradation while consistently improving performance for frontier and smaller multimodal agents across GUI and game-based benchmarks.

Introduction

The authors address the growing reliance on reusable procedural knowledge to enhance multimodal AI agents operating in complex visual environments like desktop automation and interactive games. While prior skill representations effectively encode behaviors as text or code, they struggle to capture the visual state cues and conditional decision-making required when agents must interpret live screen evidence. Existing methods either produce verbose text-only instructions, rely on inflexible raw demonstrations, or overwhelm context windows by directly injecting full skill libraries, which often causes agents to anchor to reference screenshots rather than current observations. To bridge this gap, the authors introduce MMSkills, a framework that structures reusable knowledge into compact multimodal packages coupling textual procedures with runtime state cards and multi-view keyframes. They further develop an automated pipeline to distill these packages from public interaction trajectories and propose branch loading, a runtime mechanism that selectively aligns skill evidence with live observations to deliver precise, state-aware guidance without context saturation.

Dataset

  • Dataset composition and sources: The authors use four visual-agent benchmarks spanning realistic graphical user interfaces and open game environments. Skill data is drawn from the OpenCUA trajectory dataset, official training splits, and multiple gameplay runs, with all source material strictly separated from evaluation sets.
  • Subset details: OSWorld contains 360 Ubuntu desktop test cases covering browsers, office software, media applications, and system workflows. macOSWorld provides 143 cross-platform GUI cases focused on file management, productivity, and interface tasks. VAB-Minecraft evaluates item-acquisition tasks that require visual grounding, inventory tracking, and tool manipulation. Super Mario Bros is selected from LMGaME-Bench for its recurring visual scenarios that naturally support reusable skill extraction.
  • Data usage and splits: The authors extract all multimodal skills exclusively from non-test trajectories. They evaluate frontier and smaller models across three conditions: no-skill, text-only skills, and MMSkills. Source trajectories remain completely disjoint from the final test cases to ensure unbiased evaluation and prevent data leakage.
  • Processing and metadata construction: Agents plan directly from visual observations captured as desktop or game screenshots. For macOS skill extraction, raw OpenCUA trajectories undergo additional clustering and relevance filtering to align with benchmark categories. The resulting skill packages are structured into multimodal formats to guide agent decision-making.
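The separation between skill sources and test cases described above can be checked mechanically. The following is a minimal sketch, assuming each trajectory carries a `task_id` field; that field name and both helper functions are hypothetical and not taken from the paper.

```python
from typing import Dict, Iterable, List, Set


def leaked_ids(source_task_ids: Iterable[str], test_task_ids: Iterable[str]) -> Set[str]:
    """Return task ids that appear in both the skill sources and the test set."""
    return set(source_task_ids) & set(test_task_ids)


def filter_sources(trajectories: List[Dict[str, str]], test_task_ids: Iterable[str]) -> List[Dict[str, str]]:
    """Keep only trajectories whose (hypothetical) task_id is not a test case."""
    test = set(test_task_ids)
    return [t for t in trajectories if t["task_id"] not in test]


trajectories = [{"task_id": "t1"}, {"task_id": "t2"}, {"task_id": "t3"}]
test_ids = ["t2", "t9"]

overlap = leaked_ids((t["task_id"] for t in trajectories), test_ids)
clean = filter_sources(trajectories, test_ids)
```

An empty overlap after filtering is the property the authors describe; how they actually enforce it is not stated in this summary.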

Method

The MMSkills framework is structured around a modular architecture that enables visual agents to leverage reusable, multimodal procedural knowledge through a combination of skill representation, generation, and inference mechanisms. At its core, the framework consists of three main components: a multimodal skill package, a skill generation pipeline, and a branch-loaded multimodal skill agent. The overall system operates by first constructing a library of reusable skills from public interaction trajectories and then using a branch-loaded inference mechanism to consult these skills during task execution without directly embedding the full skill context into the main agent's reasoning.

A multimodal skill package encapsulates procedural knowledge as a state-conditioned procedure, represented as $M = (D, P, S, K)$, where $D$ is a compact descriptor, $P$ is a textual procedure, $S$ is a set of runtime state cards, and $K$ is a set of keyframe bundles. Each state card $S_j$ defines when a procedure should be applied or skipped, along with visible cues, verification cues, and available views, enabling the agent to make informed decisions about skill use. The keyframe bundles $K_j$ provide multi-view visual evidence (such as full-frame, focus-crop, before, and after views) that grounds the skill in the environment. This representation allows the skill to be used both textually and visually, with the visual components serving as diagnostic references rather than direct action templates. The skill package is designed to be compact and reusable, with a text-only variant being the degenerate case where no visual evidence is included.
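The tuple $M = (D, P, S, K)$ can be pictured as a small data structure. The sketch below is illustrative only: the class names, field names, and view-type strings are assumptions chosen to mirror the description, not the authors' schema.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, List

# Assumed view-type names, mirroring the four views mentioned in the text.
VIEW_TYPES = ("full_frame", "focus_crop", "before", "after")


@dataclass
class StateCard:
    """One runtime state card S_j (hypothetical field names)."""
    card_id: str
    apply_when: str
    skip_when: str
    visible_cues: List[str] = field(default_factory=list)
    verification_cues: List[str] = field(default_factory=list)
    available_views: List[str] = field(default_factory=list)


@dataclass
class KeyframeBundle:
    """One keyframe bundle K_j: view type -> image path."""
    card_id: str
    views: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        unknown = set(self.views) - set(VIEW_TYPES)
        if unknown:
            raise ValueError(f"unknown view types: {sorted(unknown)}")


@dataclass
class SkillPackage:
    """M = (D, P, S, K)."""
    descriptor: str                      # D
    procedure: List[str]                 # P
    state_cards: List[StateCard] = field(default_factory=list)     # S
    keyframes: List[KeyframeBundle] = field(default_factory=list)  # K

    def is_text_only(self) -> bool:
        return not self.keyframes

    def text_only(self) -> "SkillPackage":
        """Degenerate variant: same text, no visual evidence."""
        cards = [replace(c, available_views=[]) for c in self.state_cards]
        return SkillPackage(self.descriptor, list(self.procedure), cards, [])


card = StateCard("c1", "save dialog is open", "dialog is closed",
                 ["Save button"], ["file appears in folder"], ["full_frame"])
skill = SkillPackage("save file", ["open dialog", "click Save"], [card],
                     [KeyframeBundle("c1", {"full_frame": "f.png"})])
stripped = skill.text_only()
```

Keeping the text-only variant as a method on the same type reflects the statement that it is a degenerate case of the multimodal package rather than a separate format.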

The skill generation pipeline transforms public, non-test trajectories into a domain-specific skill library. This process begins with embedding and clustering task instructions and trajectory metadata to form semantically focused clusters. For each cluster, a large language model (LLM)-based agent proposes atomic skills, defining workflow boundaries, completion conditions, and task coverage. These proposals are then merged and generalized into consolidated skill specifications, with overly broad skills being rejected. The next phase involves drafting the textual components of the skill—descriptor, procedure, and state cards—without referencing images. Finally, the skill is grounded by selecting keyframes, constructing multi-view bundles, and auditing the package to ensure consistency and relevance. This pipeline is controlled by a meta-skill that provides reusable scripts and quality gates, ensuring the generated skills are coherent and useful.
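The staged flow of the pipeline (cluster, draft text without images, ground with keyframes, audit) can be sketched as follows. This is a simplified stand-in under stated assumptions: clustering uses a caller-supplied key instead of embeddings, the LLM proposal and merge steps are collapsed into a deterministic `draft_text` stub, and the breadth gate (`max_apps`) is a hypothetical criterion, since the summary does not say how "overly broad" is measured.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List

# Hypothetical record: {"instruction": str, "app": str, "frames": [str, ...]}
Trajectory = Dict[str, object]
Skill = Dict[str, object]


def cluster(trajs: Iterable[Trajectory], key: Callable[[Trajectory], str]) -> Dict[str, List[Trajectory]]:
    """Stand-in for embedding + clustering: group by a caller-supplied key."""
    groups: Dict[str, List[Trajectory]] = defaultdict(list)
    for t in trajs:
        groups[key(t)].append(t)
    return dict(groups)


def draft_text(name: str, members: List[Trajectory]) -> Skill:
    """Draft the textual parts only; no frames are read at this stage."""
    return {
        "descriptor": name,
        "procedure": [t["instruction"] for t in members],
        "apps": sorted({t["app"] for t in members}),
        "keyframes": [],
    }


def ground(skill: Skill, members: List[Trajectory]) -> Skill:
    """Attach one keyframe per member trajectory that has frames."""
    grounded = dict(skill)
    grounded["keyframes"] = [t["frames"][0] for t in members if t["frames"]]
    return grounded


def audit(skill: Skill, max_apps: int = 1) -> bool:
    """Assumed quality gate: reject broad skills and skills without keyframes."""
    return len(skill["apps"]) <= max_apps and len(skill["keyframes"]) > 0


def build_library(trajs: Iterable[Trajectory], key: Callable[[Trajectory], str], max_apps: int = 1) -> List[Skill]:
    library = []
    for name, members in sorted(cluster(trajs, key).items()):
        skill = ground(draft_text(name, members), members)
        if audit(skill, max_apps):
            library.append(skill)
    return library


trajs = [
    {"instruction": "export slide as pdf", "app": "impress", "frames": ["a.png"]},
    {"instruction": "export sheet as pdf", "app": "calc", "frames": ["b.png"]},
    {"instruction": "rename file", "app": "files", "frames": ["c.png"]},
]
library = build_library(trajs, key=lambda t: t["instruction"].split()[0])
```

In this toy run the "export" cluster spans two apps and is rejected by the gate, while "rename" passes. Separating `draft_text` from `ground` mirrors the described ordering in which text is drafted before any image is consulted.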

During inference, the main visual agent operates in a branch-loaded manner to avoid the contextual overload that would result from directly loading a full skill package. The agent maintains a short history and observes the current visual state, deciding at each step whether to act directly or consult a selected skill. When a skill is consulted, a temporary branch is activated, isolating the skill-environment grounding from the main trajectory. This branch operates in two stages. In the first stage, a gated view selector evaluates the current observation and recent history to determine whether visual evidence is needed and, if so, which state cards and view types to load. This decision is based on evidence goals such as locating a control, recognizing a pre-change state, or verifying a post-change result. The second stage, the planner guidance module, uses the selected evidence to generate a structured guidance tuple $G_t = (\text{applicable}_t, \text{subgoal}_t, \text{plan}_t, \text{do\_not\_do}_t, \text{verify}_t)$, which includes an applicability judgment, a local subgoal, a skill-conditioned plan, negative constraints, and a verification check. This guidance is returned to the main agent, which uses it as decision support while maintaining action grounding in the live observation.
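The two-stage branch can be sketched as a gate followed by a guidance step that returns only the five-field tuple to the main agent. Everything here is a hypothetical illustration: the goal-to-view mapping, the dictionary layout of `skill`, and the rule-based `rule_guide` (which stands in for a multimodal model call) are assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Guidance:
    """G_t = (applicable, subgoal, plan, do_not_do, verify)."""
    applicable: bool
    subgoal: str
    plan: Tuple[str, ...]
    do_not_do: Tuple[str, ...]
    verify: str


# Assumed mapping from evidence goal to the view types worth loading.
GOAL_TO_VIEWS: Dict[str, Tuple[str, ...]] = {
    "locate_control": ("focus_crop",),
    "recognize_pre_change": ("before",),
    "verify_post_change": ("after",),
}


def select_views(evidence_goal: Optional[str], available: List[str]) -> List[str]:
    """Stage 1 gate: views to load, or [] when no visual evidence is needed."""
    if evidence_goal is None:
        return []
    return [v for v in GOAL_TO_VIEWS.get(evidence_goal, ()) if v in available]


def rule_guide(skill: Dict, evidence: Dict[str, str], cues: Set[str]) -> Guidance:
    """Stage 2 stand-in. A real guide would inspect the `evidence` images;
    this stub ignores them and matches text cues only."""
    card = skill["state_card"]
    applicable = any(c in cues for c in card["visible_cues"])
    return Guidance(
        applicable=applicable,
        subgoal=skill["descriptor"] if applicable else "",
        plan=tuple(skill["procedure"]) if applicable else (),
        do_not_do=("copy coordinates from reference keyframes",),
        verify=card["verification_cues"][0] if applicable else "",
    )


def consult_branch(skill: Dict, evidence_goal: Optional[str], cues: Set[str],
                   guide: Callable[[Dict, Dict[str, str], Set[str]], Guidance] = rule_guide) -> Guidance:
    """Run the temporary branch; keyframes stay inside it, only Guidance leaves."""
    views = select_views(evidence_goal, skill["state_card"]["available_views"])
    evidence = {v: skill["keyframes"][v] for v in views}
    return guide(skill, evidence, cues)


skill = {
    "descriptor": "save file",
    "procedure": ["open dialog", "click Save"],
    "state_card": {
        "visible_cues": ["Save button"],
        "verification_cues": ["file appears in folder"],
        "available_views": ["full_frame", "focus_crop"],
    },
    "keyframes": {"full_frame": "f.png", "focus_crop": "c.png"},
}
g = consult_branch(skill, "locate_control", {"Save button"})
g_skip = consult_branch(skill, None, {"Menu bar"})
```

Because `consult_branch` returns a `Guidance` value and never the keyframes themselves, the main agent's context holds only the compact tuple, which is the isolation property the paragraph describes.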

The main agent’s decision to consult a skill is governed by a policy that evaluates the relevance of the skill’s hints and the current state of the task. Each skill can be consulted only a limited number of times, and exhausted skills are removed from the available list. The branch-loaded design ensures that the agent receives compact, structured guidance without being distracted by irrelevant visual references, so its actions stay grounded in the live observation. This architecture allows the framework to be applied across diverse visual environments, including graphical user interfaces and video games, by adapting the skill representation and generation process to the specific domain.
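The consultation limit can be sketched as a per-skill budget. The class name and the default limit are assumptions; the summary says only that the number of consultations is limited.

```python
from typing import Dict, Iterable, List


class ConsultBudget:
    """Tracks how many consultations each skill has left (hypothetical helper)."""

    def __init__(self, skill_ids: Iterable[str], max_consults: int = 2) -> None:
        # max_consults is an assumed value, not taken from the paper.
        self.remaining: Dict[str, int] = {s: max_consults for s in skill_ids}

    def available(self) -> List[str]:
        """Skills that can still be consulted; exhausted ones are dropped."""
        return [s for s, n in self.remaining.items() if n > 0]

    def consult(self, skill_id: str) -> bool:
        """Spend one consultation; return False if the skill is exhausted or unknown."""
        if self.remaining.get(skill_id, 0) <= 0:
            return False
        self.remaining[skill_id] -= 1
        return True


budget = ConsultBudget(["save_file", "rename_file"], max_consults=1)
first = budget.consult("save_file")
second = budget.consult("save_file")
still_available = budget.available()
```

The relevance policy itself (when a skill's hints match the current state) is not modeled here; this sketch covers only the budget and removal behavior.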

Experiment

The evaluation tests multimodal procedural knowledge against no-skill and text-only baselines across desktop GUI and open-ended game environments to validate its impact on agent performance and decision dynamics. Component ablations confirm that combining runtime state discrimination cards with visual keyframes in a filtered branch-loading architecture is necessary for accurate skill retrieval and context preservation. Usage analysis reveals that multimodal guidance increases skill invocation frequency while effectively shortening task trajectories by reducing redundant exploration. Finally, behavioral tracking demonstrates that these skills fundamentally transform agent execution from trial-and-error clicking into structured, state-aware planning with stronger completion awareness.

The authors analyze the distribution of tasks and trajectory clusters across platforms and domains. Most tasks and clusters are concentrated on macOS, particularly in productivity and system apps, which have the highest task and cluster counts. Ubuntu has fewer tasks and clusters, spread over a narrower set of domains, so the macOS environment covers a broader range of scenarios.

The authors evaluate the impact of MMSkills on visual agents across multiple models and tasks. MMSkills improve success rates for all evaluated models and tasks, with the largest gains in weaker models and visually grounded game settings. They also shorten interactions and reduce repetitive actions, indicating more goal-directed behavior, and shift agents toward structured input and better completion awareness, with fewer exploratory actions and more verification of task completion.

The authors analyze how MMSkills change agent behavior across multiple benchmarks. Agents using MMSkills take fewer low-level actions and repeat actions less often, which shortens interaction trajectories across multiple models. Action patterns shift from click-heavy behavior toward structured input and completion judgments, particularly for models with high click usage.

The authors evaluate multimodal procedural skills across multiple benchmarks, including desktop environments and video games. Success rates improve for all evaluated models and benchmarks, with the largest gains in weaker visual agents, that is, models with limited internal visual reasoning capabilities. Agents also need fewer task-solving steps and low-level actions and repeat themselves less, and trajectories become shorter with more effective state recognition, especially when MMSkills are combined with branch loading and visual evidence filtering.

The authors evaluate MMSkills across multiple benchmarks, including desktop applications and game environments. MMSkills consistently improve both success rates and behavioral efficiency across models and domains. The authors attribute this to the multimodal nature of the skills, which lets agents better recognize relevant states, use external knowledge in a targeted manner, and avoid unnecessary or repetitive actions.

The experimental suite evaluates the integration of multimodal procedural skills into visual agents across desktop platforms, application domains, and model architectures. Results consistently show that these skills raise overall success rates while reducing redundant actions and shortening interaction trajectories. The approach leads to more structured, goal-directed workflows with improved state recognition, and the largest behavioral improvements appear in models with weaker intrinsic visual reasoning capabilities.

