
Colon-Bench: An Agentic Workflow for Scalable Dense Lesion Annotation in Full-Procedure Colonoscopy Videos

Abdullah Hamdi, Changchun Yang, Xin Gao

Abstract

Early detection via colonoscopy is critical for preventing colorectal cancer. However, the development of robust AI systems in this domain is hindered by the lack of densely annotated, long-sequence video datasets. Existing datasets focus primarily on single-class polyp detection and lack the rich spatial, temporal, and linguistic annotations required to evaluate modern Multimodal Large Language Models (MLLMs). To close this critical gap, we present Colon-Bench, generated through a novel multi-stage agentic workflow. Our pipeline seamlessly integrates temporal proposals, bounding-box tracking, AI-driven visual confirmation, and human-in-the-loop review to annotate full-procedure videos at scale. The resulting verified benchmark is unprecedented in scope: it comprises 528 videos, 14 distinct lesion categories (including polyps, ulcers, and bleeding), over 300,000 bounding boxes, 213,000 segmentation masks, and 133,000 words of clinical descriptions. We use Colon-Bench to rigorously evaluate state-of-the-art MLLMs on lesion classification, open-vocabulary video object segmentation (OV-VOS), and video visual question answering (VQA). The results show surprisingly strong localization performance by MLLMs in the medical domain compared to SAM-3. Finally, we analyze common VQA failures of MLLMs to introduce a novel "colon-skill" prompting strategy that improves the zero-shot performance of most models by up to 9.7%. The dataset and code are available at https://abdullahamdi.com/colon-bench.

One-sentence Summary

Researchers from King Abdullah University of Science and Technology introduce Colon-Bench, a comprehensive benchmark created via a novel multi-stage agentic workflow that overcomes prior data scarcity by providing dense spatiotemporal annotations for 14 lesion categories. This resource enables rigorous evaluation of Multimodal Large Language Models on complex colonoscopy tasks and demonstrates that a new colon-skill prompting strategy significantly boosts zero-shot performance without additional training.

Key Contributions

  • The paper introduces Colon-Bench, a comprehensive benchmark for evaluating Multimodal Large Language Models on full-procedure colonoscopy videos, which demonstrates that these models outperform specialized baselines like Endo-CLIP by 30% in lesion detection tasks.
  • A two-stage agentic workflow is presented that extracts cross-model error patterns to synthesize structured Colon-Skill prompts, resulting in training-free performance improvements of up to 9.7% on medical VQA tasks.
  • Extensive experiments establish that utilizing temporal context from multiple video frames significantly enhances segmentation quality and VQA accuracy compared to single-frame inputs, with results showing a mean IoU increase from 43.1% to 54.4% when expanding context from one to seven frames.
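For reference, the mean IoU reported in these ablations is the standard intersection-over-union averaged over paired predicted and ground-truth boxes. The helper below is an illustrative sketch, not the authors' evaluation code:

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mean_iou(preds, gts):
    """Mean IoU over paired predicted/ground-truth boxes."""
    return sum(box_iou(p, g) for p, g in zip(preds, gts)) / len(preds)
```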


Dataset

Colon-Bench Dataset Overview

The authors introduce Colon-Bench, a comprehensive multi-task benchmark designed to evaluate Multimodal Large Language Models (MLLMs) on full-procedure colonoscopy videos. The dataset addresses the scarcity of densely annotated, long-sequence medical video data by leveraging a novel agentic workflow.

  • Dataset Composition and Sources

    • The core data originates from 60 video sequences in the REAL-COLON dataset.
    • The final curated benchmark spans 528 verified video windows across 59 sequences, totaling 464,035 frames (approximately 12.89 hours).
    • It covers 14 distinct lesion categories, including sessile polyps, bleeding, ulcers, and erythematous lesions, with a long-tailed distribution where sessile polyps are the most frequent.
    • Annotations include over 300,000 bounding boxes, 213,000 segmentation masks, and 133,000 words of verified clinical text descriptions.
  • Key Details for Each Subset

    • Binary Classification: Comprises 790 clips (518 lesion-free and 272 lesion-positive) to test lesion presence detection.
    • Detection and Segmentation: Utilizes 272 and 264 lesion-positive clips respectively, providing 61,538 per-frame bounding boxes and 57,550 per-frame masks.
    • Visual Question Answering (VQA): Divided into two tiers:
      • Prompted VQA: 1,485 five-choice questions over 499 clips featuring bounding-box overlays on confirmed lesions.
      • Unprompted VQA: 2,740 questions over 918 clips using raw frames, including non-lesion windows to test open-ended reasoning.
  • Data Usage and Processing Strategy

    • Agentic Workflow: The authors employ a multi-stage pipeline starting with a vision-language model (Gemini-2.5-flash-lite) to identify 1,325 candidate lesion windows.
    • Filtering and Verification: Successive agents perform verification filtering, bounding-box tracking using EdgeTAM, and AI-driven visual confirmation (using Gemini-3 variants) to prune false positives.
    • Human-in-the-Loop: A final review by a surgeon rejected only 69 windows (11.6% of those presented), ensuring high-quality spatial and textual labels.
    • Debiasing: To prevent text-only shortcuts in VQA, the authors apply a two-stage debiasing process involving adversarial distractor regeneration and blind text-only stress tests.
  • Metadata and Annotation Construction

    • Spatial Annotations: The pipeline generates dense tracking data, establishing the first Open-Vocabulary Video Object Segmentation (OV-VOS) benchmark for colonoscopy.
    • Textual Descriptions: Free-form clinical descriptions are generated and verified, averaging 252.4 words per window, which are used to derive multi-label lesion categories via keyword matching.
    • Evaluation Setup: The benchmark evaluates MLLMs on lesion classification, OV-VOS, and VQA, utilizing 3-frame box detections to prompt the EdgeTAM tracker for segmentation tasks.
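The staged filtering described above can be sketched as a simple pipeline in which each agent prunes or enriches the candidate windows. All stage callables below are hypothetical stand-ins for the actual components (Gemini-2.5-flash-lite proposals, EdgeTAM tracking, Gemini-3 visual confirmation, surgeon review):

```python
# Minimal control-flow sketch of the multi-stage annotation pipeline.
# Each entry in `stages` is a placeholder for a real agent in the workflow.

def annotate_video(video, stages):
    """Run candidate lesion windows through successive filtering stages."""
    windows = stages["propose"](video)                       # temporal proposals (VLM)
    windows = [w for w in windows if stages["verify"](w)]    # verification filtering
    windows = [stages["track"](w) for w in windows]          # bounding-box tracking
    windows = [w for w in windows if stages["confirm"](w)]   # AI visual confirmation
    return [w for w in windows if stages["review"](w)]       # human-in-the-loop review
```

Swapping any stage for a stricter filter only ever shrinks the surviving set, which mirrors how the verification and tracking stages account for most of the precision gains reported below.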
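Deriving multi-label lesion categories from the free-form descriptions via keyword matching might look like the following sketch; the keyword lexicon here is an assumption for illustration, not the paper's actual vocabulary:

```python
# Hypothetical lexicon; category names follow the paper, keyword lists are assumptions.
LESION_KEYWORDS = {
    "sessile polyp": ["sessile"],
    "ulcer": ["ulcer", "ulceration"],
    "bleeding": ["bleeding", "hemorrhage"],
    "erythematous lesion": ["erythematous", "erythema"],
}

def labels_from_description(text):
    """Derive multi-label lesion categories from a clinical description."""
    text = text.lower()
    return sorted(cat for cat, kws in LESION_KEYWORDS.items()
                  if any(kw in text for kw in kws))
```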

Experiment

  • Colon-Bench experiments demonstrate that top-tier MLLMs like Gemini 3 Pro and Flash outperform specialized models in lesion detection and segmentation, while open-weight models such as Seed 1.6 show strong overall performance despite some families struggling with classification tasks.
  • Ablation studies confirm that utilizing temporal context from video clips significantly improves VQA accuracy and segmentation quality compared to single-frame inputs, with increasing the number of detection frames yielding steady gains in downstream segmentation metrics.
  • The proposed Colon-Skill framework validates that injecting distilled domain knowledge into prompts enhances VQA performance for high-capacity models, whereas smaller models show limited benefit from this additional context.
  • Validation of the annotation pipeline reveals that verification filtering and tracking stages provide the most substantial precision improvements, while human review offers marginal but consistent refinements to the final dataset quality.
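A training-free prompting strategy of this kind amounts to prepending distilled domain hints to each question before querying the MLLM. The skill sentences and prompt template below are illustrative assumptions, not the paper's exact Colon-Skill prompts:

```python
# Illustrative sketch of "Colon-Skill" prompting: distilled domain hints
# are injected into the prompt, with no model fine-tuning involved.
COLON_SKILLS = [
    "Check mucosal texture and vascular pattern before naming a lesion type.",
    "Distinguish sessile from pedunculated polyps by the presence of a stalk.",
]

def build_prompt(question, choices, skills=COLON_SKILLS):
    """Assemble a multiple-choice VQA prompt with skill hints prepended."""
    skill_block = "\n".join(f"- {s}" for s in skills)
    options = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
    return (f"Domain skills:\n{skill_block}\n\n"
            f"Question: {question}\n{options}\n"
            f"Answer with a single letter.")
```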
