
Colon-Bench: An Agentic Workflow for Scalable Dense Lesion Annotation in Full-Procedure Colonoscopy Videos

Abdullah Hamdi, Changchun Yang, Xin Gao

Abstract

Early screening via colonoscopy is critical for colon cancer prevention, yet developing robust AI systems for this domain is hindered by the lack of densely annotated, long-sequence video datasets. Existing datasets predominantly focus on single-class polyp detection and lack the rich spatial, temporal, and linguistic annotations required to evaluate modern Multimodal Large Language Models (MLLMs). To address this critical gap, we introduce Colon-Bench, generated via a novel multi-stage agentic workflow. Our pipeline seamlessly integrates temporal proposals, bounding-box tracking, AI-driven visual confirmation, and human-in-the-loop review to scalably annotate full-procedure videos. The resulting verified benchmark is unprecedented in scope, encompassing 528 videos, 14 distinct lesion categories (including polyps, ulcers, and bleeding), over 300,000 bounding boxes, 213,000 segmentation masks, and 133,000 words of clinical descriptions. We utilize Colon-Bench to rigorously evaluate state-of-the-art MLLMs across lesion classification, Open-Vocabulary Video Object Segmentation (OV-VOS), and video Visual Question Answering (VQA). The results show that MLLMs achieve surprisingly strong localization performance in this medical domain compared to SAM-3. Finally, we analyze common VQA errors from MLLMs to introduce a novel "Colon-Skill" prompting strategy, improving zero-shot performance by up to 9.7% across most MLLMs. The dataset and code are available at https://abdullahamdi.com/colon-bench.

One-sentence Summary

Researchers from King Abdullah University of Science and Technology introduce Colon-Bench, a comprehensive benchmark created via a novel multi-stage agentic workflow that overcomes prior data scarcity by providing dense spatiotemporal annotations for 14 lesion categories. This resource enables rigorous evaluation of Multimodal Large Language Models on complex colonoscopy tasks and demonstrates that a new colon-skill prompting strategy significantly boosts zero-shot performance without additional training.

Key Contributions

  • The paper introduces Colon-Bench, a comprehensive benchmark for evaluating Multimodal Large Language Models on full-procedure colonoscopy videos, which demonstrates that these models outperform specialized baselines like Endo-CLIP by 30% in lesion detection tasks.
  • A two-stage agentic workflow is presented that extracts cross-model error patterns to synthesize structured Colon-Skill prompts, resulting in training-free performance improvements of up to 9.7% on medical VQA tasks.
  • Extensive experiments establish that utilizing temporal context from multiple video frames significantly enhances segmentation quality and VQA accuracy compared to single-frame inputs, with results showing a mean IoU increase from 43.1% to 54.4% when expanding context from one to seven frames.
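The temporal-context ablation above is reported in mean IoU between predicted and ground-truth lesion masks, averaged over a clip's frames. A minimal sketch of that metric, assuming masks are represented as sets of `(row, col)` pixel coordinates (a simplification for illustration; the benchmark's actual evaluation code is not shown here):

```python
def frame_iou(pred: set, gt: set) -> float:
    """IoU between predicted and ground-truth lesion pixel sets for one frame."""
    union = pred | gt
    if not union:
        return 1.0  # both masks empty: treat as a perfect match
    return len(pred & gt) / len(union)

def mean_iou(pred_frames, gt_frames) -> float:
    """Mean IoU averaged over all frames of a clip."""
    ious = [frame_iou(p, g) for p, g in zip(pred_frames, gt_frames)]
    return sum(ious) / len(ious)

# Toy two-frame clip: one perfect frame, one complete miss -> mean IoU 0.5
preds = [{(0, 0), (0, 1)}, {(2, 2)}]
gts = [{(0, 0), (0, 1)}, {(5, 5)}]
print(mean_iou(preds, gts))  # 0.5
```

Averaging per-frame IoU over the clip is what makes longer temporal context measurable: extra detection frames can only help the tracker if they improve masks across many frames, not just one.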


Dataset

Colon-Bench Dataset Overview

The authors introduce Colon-Bench, a comprehensive multi-task benchmark designed to evaluate Multimodal Large Language Models (MLLMs) on full-procedure colonoscopy videos. The dataset addresses the scarcity of densely annotated, long-sequence medical video data by leveraging a novel agentic workflow.

  • Dataset Composition and Sources

    • The core data originates from 60 video sequences in the REAL-COLON dataset.
    • The final curated benchmark spans 528 verified video windows across 59 sequences, totaling 464,035 frames (approximately 12.89 hours).
    • It covers 14 distinct lesion categories, including sessile polyps, bleeding, ulcers, and erythematous lesions, with a long-tailed distribution where sessile polyps are the most frequent.
    • Annotations include over 300,000 bounding boxes, 213,000 segmentation masks, and 133,000 words of verified clinical text descriptions.
  • Key Details for Each Subset

    • Binary Classification: Comprises 790 clips (518 lesion-free and 272 lesion-positive) to test lesion presence detection.
    • Detection and Segmentation: Utilizes 272 and 264 lesion-positive clips respectively, providing 61,538 per-frame bounding boxes and 57,550 per-frame masks.
    • Visual Question Answering (VQA): Divided into two tiers:
      • Prompted VQA: 1,485 five-choice questions over 499 clips featuring bounding-box overlays on confirmed lesions.
      • Unprompted VQA: 2,740 questions over 918 clips using raw frames, including non-lesion windows to test open-ended reasoning.
  • Data Usage and Processing Strategy

    • Agentic Workflow: The authors employ a multi-stage pipeline starting with a vision-language model (Gemini-2.5-flash-lite) to identify 1,325 candidate lesion windows.
    • Filtering and Verification: Successive agents perform verification filtering, bounding-box tracking using EdgeTAM, and AI-driven visual confirmation (using Gemini-3 variants) to prune false positives.
    • Human-in-the-Loop: A final review by a surgeon rejected only 69 windows (11.6% of those presented), ensuring high-quality spatial and textual labels.
    • Debiasing: To prevent text-only shortcuts in VQA, the authors apply a two-stage debiasing process involving adversarial distractor regeneration and blind text-only stress tests.
  • Metadata and Annotation Construction

    • Spatial Annotations: The pipeline generates dense tracking data, establishing the first Open-Vocabulary Video Object Segmentation (OV-VOS) benchmark for colonoscopy.
    • Textual Descriptions: Free-form clinical descriptions are generated and verified, averaging 252.4 words per window, which are used to derive multi-label lesion categories via keyword matching.
    • Evaluation Setup: The benchmark evaluates MLLMs on lesion classification, OV-VOS, and VQA, utilizing 3-frame box detections to prompt the EdgeTAM tracker for segmentation tasks.
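The keyword-matching step that derives multi-label lesion categories from the verified descriptions can be sketched as below. The lexicon here is hypothetical (the paper's exact keyword list is not given); only the category names echo the lesion classes mentioned above:

```python
# Hypothetical keyword lexicon; the benchmark's actual lexicon is not published here.
LESION_KEYWORDS = {
    "sessile polyp": ["sessile polyp", "sessile lesion"],
    "ulcer": ["ulcer", "ulceration", "ulcerated"],
    "bleeding": ["bleeding", "hemorrhage", "active bleed"],
    "erythematous": ["erythema", "erythematous"],
}

def derive_labels(description: str) -> set:
    """Return the set of lesion categories whose keywords appear in a clinical description."""
    text = description.lower()
    return {category
            for category, keywords in LESION_KEYWORDS.items()
            if any(kw in text for kw in keywords)}

labels = derive_labels("A sessile polyp with surrounding erythematous mucosa, no bleeding source identified.")
print(sorted(labels))
```

Note the simple substring test would also fire on negated mentions ("no bleeding source"), which is one reason the human-in-the-loop review described above matters for label quality.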

Experiment

  • Colon-Bench experiments demonstrate that top-tier MLLMs such as Gemini 3 Pro and Flash outperform specialized models in lesion detection and segmentation; open-weight models such as Seed 1.6 also show strong overall performance, although some model families struggle with classification tasks.
  • Ablation studies confirm that utilizing temporal context from video clips significantly improves VQA accuracy and segmentation quality compared to single-frame inputs, with increasing the number of detection frames yielding steady gains in downstream segmentation metrics.
  • The proposed Colon-Skill framework validates that injecting distilled domain knowledge into prompts enhances VQA performance for high-capacity models, whereas smaller models show limited benefit from this additional context.
  • Validation of the annotation pipeline reveals that verification filtering and tracking stages provide the most substantial precision improvements, while human review offers marginal but consistent refinements to the final dataset quality.
