SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning

Haoyu Huang, Jinfa Huang, Zhongwei Wan, Xiawu Zheng, Rongrong Ji, Jiebo Luo

Abstract

Agentic multimodal large language models (MLLMs) (e.g., OpenAI o3 and Gemini Agentic Vision) achieve remarkable reasoning capabilities through iterative visual tool invocation. However, the cascaded perception, reasoning, and tool-calling loops introduce significant sequential overhead. This overhead, termed agentic depth, incurs prohibitive latency and seriously limits system-level concurrency. To this end, we propose SpecEyes, an agentic-level speculative acceleration framework that breaks this sequential bottleneck. Our key insight is that a lightweight, tool-free MLLM can serve as a speculative planner to predict the execution trajectory, enabling early termination of expensive tool chains without sacrificing accuracy. To regulate this speculative planning, we introduce a cognitive gating mechanism based on answer separability, which quantifies the model's confidence for self-verification without requiring oracle labels. Furthermore, we design a heterogeneous parallel funnel that exploits the stateless concurrency of the small model to mask the stateful serial execution of the large model, maximizing system throughput. Extensive experiments on V* Bench, HR-Bench, and POPE demonstrate that SpecEyes achieves 1.1-3.35x speedup over the agentic baseline while preserving or even improving accuracy (up to +6.7%), thereby boosting serving throughput under concurrent workloads.

One-sentence Summary

Researchers from Xiamen University, University of Rochester, and The Ohio State University propose SpecEyes, an agentic-level speculative acceleration framework that employs a lightweight tool-free MLLM as a planner to predict execution trajectories. This approach reduces sequential overhead in multimodal LLMs through cognitive gating and parallel funnels, achieving significant speedups while maintaining accuracy on complex reasoning benchmarks.

Key Contributions

  • The paper introduces SpecEyes, an agentic-level speculative acceleration framework that employs a lightweight, tool-free model to predict execution trajectories and bypass expensive tool chains for queries that do not require deep reasoning.
  • A cognitive gating mechanism based on answer separability is presented to quantify model confidence for self-verification, enabling reliable switching between the small speculative model and the large agentic model without requiring oracle labels.
  • Experiments on V* Bench, HR-Bench, and POPE demonstrate that the approach achieves a 1.1 to 3.35 times speedup over agentic baselines while preserving or improving accuracy, alongside increased serving throughput under concurrent workloads.

Introduction

Agentic multimodal LLMs achieve superior reasoning by iteratively invoking visual tools, yet this process creates a severe efficiency crisis where strict data dependencies between perception and reasoning steps cause latency to explode and prevent GPU batching. Prior optimization methods like token-level speculative decoding or token pruning fail to address this issue because they only accelerate individual steps within the fixed, serial tool-use loop rather than questioning the necessity of the loop itself. The authors leverage a lightweight, tool-free model to speculate on answers for queries that do not require deep tool interaction, introducing SpecEyes as the first framework to lift speculative acceleration from the token level to the agentic level. By employing a novel cognitive gating mechanism based on answer separability to verify confidence and a heterogeneous parallel funnel to mask serial execution, SpecEyes bypasses expensive tool chains for simple queries while preserving or improving accuracy.

Method

The authors formalize the agentic multimodal large language model (MLLM) as a stateful reasoning system where the model maintains a state trajectory over multiple reasoning steps. A critical property of this system is that subsequent tool selections depend causally on prior observations, creating a strict data dependency. This dependency renders the agentic pipeline inherently sequential, as step $d+1$ cannot begin until step $d$ completes. Consequently, the end-to-end latency for a single query scales linearly with the agentic depth.
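The linear scaling in depth can be made concrete with a toy latency model. The decomposition of a step into perception, reasoning, and tool-call costs is an illustrative assumption, not the paper's formal notation:

```python
def agentic_latency(depth, t_perceive, t_reason, t_tool):
    """End-to-end latency of a strictly sequential agentic loop.

    Because step d+1 cannot start before step d completes, the per-step
    costs add up: total latency grows linearly with the agentic depth.
    (Illustrative model; the per-step cost terms are assumptions.)
    """
    return depth * (t_perceive + t_reason + t_tool)
```

Doubling the depth doubles the latency, which is exactly the sequential overhead SpecEyes targets.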

To address this bottleneck, the authors propose SpecEyes, a four-phase speculative acceleration framework designed to bypass expensive tool chains whenever a smaller, non-agentic model is sufficiently confident. The pipeline processes a batch of queries through a funnel that splits them into tool-free and tool-required paths.

The execution flow begins with Phase I, Tool-Use Judgment. The large agentic model $\mathcal{M}_L$ determines whether tool invocation is necessary by generating a single binary token. Queries judged as tool-free proceed to Phase II, Speculative Prediction. Here, a small stateless model $\mathcal{M}_S$ generates an answer and the full output logit distribution without any tool execution. This inference is performed concurrently for all queries in the batch.

In Phase III, Cognitive Gating, the logits from the small model are passed to a gating function that quantifies answer confidence. The authors introduce an answer separability score $S_{\text{sep}}$ that measures the decision margin between the top prediction and its competitors, rather than relying on raw softmax probabilities. If the score exceeds a threshold $\tau$, the answer is accepted immediately. Otherwise, the query falls back to Phase IV, Agentic Fallback, where the full agentic model $\mathcal{M}_L$ executes the complete stateful perception-reasoning loop.
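A minimal sketch of such a gate is below. The exact definition of $S_{\text{sep}}$ is not given in this summary, so the form here is an assumption: a per-token top-1 vs. runner-up logit margin, aggregated with a minimum over the answer tokens (consistent with the min-token aggregation the experiments mention):

```python
def answer_separability(token_logits):
    """Hypothetical S_sep: for each answer token, the margin between the
    top-1 and runner-up logits, aggregated with min over tokens.
    (The paper's exact formula is an assumption here.)"""
    margins = []
    for logits in token_logits:                  # one logit vector per token
        top, second = sorted(logits)[-2:][::-1]  # largest and runner-up
        margins.append(top - second)
    return min(margins)                          # min-token aggregation

def cognitive_gate(token_logits, tau=2.0):
    """Accept the speculative answer iff separability clears threshold tau."""
    return answer_separability(token_logits) >= tau
```

Using the minimum means a single ambiguous token is enough to reject the speculative answer, which is a conservative choice for self-verification without oracle labels.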

Beyond per-query latency reduction, the framework enables system-level throughput gains by organizing these phases into a heterogeneous parallel funnel. The front-end screening and speculative inference are stateless and fully batch-parallelizable, while the fallback remains sequential. This architecture decouples stateless concurrency from stateful execution, significantly reducing the number of queries that incur the full agentic cost.
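The four phases and the funnel can be sketched as a single control-flow function. All callables (`needs_tool`, `speculate`, `gate`, `agentic_fallback`) are hypothetical stand-ins for the paper's models, and the batched forward passes are simplified to loops:

```python
def run_funnel(queries, needs_tool, speculate, gate, agentic_fallback):
    """Heterogeneous parallel funnel (sketch, not the reference implementation).

    The stateless stages (judgment, speculation, gating) operate over the
    whole batch; only queries the gate rejects pay the serial agentic cost.
    """
    # Phase I: tool-use judgment splits the batch into two paths.
    tool_free = [q for q in queries if not needs_tool(q)]
    tool_required = [q for q in queries if needs_tool(q)]

    answers = {}
    # Phases II-III: speculative prediction plus cognitive gating.
    for q in tool_free:               # in practice: one batched forward pass
        ans, logits = speculate(q)
        if gate(logits):
            answers[q] = ans          # accepted speculative answer
        else:
            tool_required.append(q)   # gate rejection: fall back

    # Phase IV: stateful agentic fallback, executed serially per query.
    for q in tool_required:
        answers[q] = agentic_fallback(q)
    return answers
```

The key design point is that only the final loop is inherently serial; everything above it is stateless and batchable, which is what lets the small model's concurrency mask the large model's stateful execution.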

The expected per-query latency is dominated by the lightweight front-end cost when the screening ratio and gate acceptance rate are high. The resulting throughput speedup is approximately $1/(1-\beta\alpha)$, where $\beta$ is the tool-free screening ratio and $\alpha$ is the cognitive gate acceptance rate. This approach effectively converts per-query latency savings into system-level throughput gains while maintaining accuracy comparable to the agentic baseline.
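The speedup formula is easy to sanity-check numerically. The illustrative rates below are assumptions, not figures from the paper; they merely show how the bound behaves:

```python
def throughput_speedup(beta, alpha):
    """Approximate throughput speedup 1/(1 - beta * alpha).

    beta:  fraction of queries screened as tool-free (Phase I)
    alpha: fraction of those accepted by the cognitive gate (Phase III)
    Valid when the lightweight front-end cost is negligible relative
    to the full agentic loop.
    """
    accepted = beta * alpha          # fraction resolved speculatively
    return 1.0 / (1.0 - accepted)
```

For instance, with an assumed $\beta = 0.8$ and $\alpha = 0.875$, 70% of queries skip the agentic loop and the model predicts roughly a 3.3x speedup, in the same regime as the paper's reported upper range.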

Experiment

  • Experiments on V*, HR-Bench, and POPE benchmarks validate that SpecEyes achieves significant speedups while improving or maintaining accuracy compared to agentic baselines, with the most substantial gains observed in hallucination reduction and spatial reasoning tasks.
  • Qualitative analysis confirms that the min-token confidence aggregation strategy provides the best accuracy-speed trade-off by effectively distinguishing correct from incorrect answers, whereas other aggregation methods suffer from distribution overlap.
  • Ablation studies demonstrate that the gating threshold serves as a robust control knob for balancing efficiency and performance, while larger serving batch sizes improve throughput by amortizing the stateless speculative stage without affecting model accuracy.
  • Results indicate that SpecEyes generalizes across different agentic backbones and outperforms alternative speculative approaches that incur high token overhead, though high-resolution tasks remain a bottleneck due to the frequent necessity of tool-assisted inspection.
