Colon-Bench: An Agentic Workflow for Scalable Dense Lesion Annotation in Full-Procedure Colonoscopy Videos
Abdullah Hamdi, Changchun Yang, Xin Gao
Abstract
Early screening via endoscopy is crucial for preventing colorectal cancer. However, the development of robust AI systems in this field has been hindered by the scarcity of densely annotated, long-sequence video datasets. Existing datasets focus primarily on single-class polyp detection and lack the rich spatial, temporal, and linguistic annotations needed to evaluate modern multimodal large language models (MLLMs). To address this critical gap, we introduce Colon-Bench, generated by a novel multi-stage agentic workflow. Our pipeline seamlessly integrates temporal proposal, bounding-box tracking, AI-driven visual confirmation, and human-in-the-loop review, enabling scalable annotation of full-procedure videos. The resulting verified benchmark is of unprecedented scale, comprising 528 videos, 14 lesion categories including polyps, ulcers, and bleeding, over 300,000 bounding boxes, 213,000 segmentation masks, and 133,000 words of clinical descriptions. Using Colon-Bench, we rigorously evaluate state-of-the-art MLLMs on lesion classification, open-vocabulary video object segmentation (OV-VOS), and video visual question answering (VQA). We find that MLLMs exhibit surprisingly strong localization performance in the medical domain compared to SAM-3. Finally, we analyze common VQA errors made by MLLMs and introduce a novel "colon-skill" prompting strategy, which improves zero-shot performance by up to 9.7% for the majority of MLLMs. The dataset and code are available at https://abdullahamdi.com/colon-bench.
One-sentence Summary
Researchers from King Abdullah University of Science and Technology introduce Colon-Bench, a comprehensive benchmark created via a novel multi-stage agentic workflow that overcomes prior data scarcity by providing dense spatiotemporal annotations for 14 lesion categories. This resource enables rigorous evaluation of Multimodal Large Language Models on complex colonoscopy tasks and demonstrates that a new colon-skill prompting strategy significantly boosts zero-shot performance without additional training.
Key Contributions
- The paper introduces Colon-Bench, a comprehensive benchmark for evaluating Multimodal Large Language Models on full-procedure colonoscopy videos, which demonstrates that these models outperform specialized baselines like Endo-CLIP by 30% in lesion detection tasks.
- A two-stage agentic workflow is presented that extracts cross-model error patterns to synthesize structured Colon-Skill prompts, resulting in training-free performance improvements of up to 9.7% on medical VQA tasks.
- Extensive experiments establish that utilizing temporal context from multiple video frames significantly enhances segmentation quality and VQA accuracy compared to single-frame inputs, with results showing a mean IoU increase from 43.1% to 54.4% when expanding context from one to seven frames.
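The mean IoU figures above are clip-level averages of the per-frame intersection-over-union between predicted and ground-truth masks. As a minimal illustration (not the authors' evaluation code), per-frame IoU on binary masks can be computed like this, here representing each mask as a set of pixel coordinates:

```python
def frame_iou(pred: set, gt: set) -> float:
    """IoU between two binary masks given as sets of (row, col) pixels."""
    union = pred | gt
    if not union:
        return 1.0  # both masks empty: count as a perfect match
    return len(pred & gt) / len(union)

def mean_iou(pairs) -> float:
    """Clip-level mean IoU: average of per-frame IoUs over (pred, gt) pairs."""
    return sum(frame_iou(p, g) for p, g in pairs) / len(pairs)

# Toy example: two masks sharing 2 of 6 union pixels -> IoU = 1/3.
pred = {(0, 0), (0, 1), (0, 2), (0, 3)}
gt = {(0, 2), (0, 3), (1, 2), (1, 3)}
print(round(frame_iou(pred, gt), 3))  # 0.333
```

The same metric applies whether masks come from single-frame or multi-frame prompting, which is what makes the one-frame vs. seven-frame comparison direct.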
Dataset
Colon-Bench Dataset Overview
The authors introduce Colon-Bench, a comprehensive multi-task benchmark designed to evaluate Multimodal Large Language Models (MLLMs) on full-procedure colonoscopy videos. The dataset addresses the scarcity of densely annotated, long-sequence medical video data by leveraging a novel agentic workflow.
Dataset Composition and Sources
- The core data originates from 60 video sequences in the REAL-COLON dataset.
- The final curated benchmark spans 528 verified video windows across 59 sequences, totaling 464,035 frames (approximately 12.89 hours).
- It covers 14 distinct lesion categories, including sessile polyps, bleeding, ulcers, and erythematous lesions, with a long-tailed distribution where sessile polyps are the most frequent.
- Annotations include over 300,000 bounding boxes, 213,000 segmentation masks, and 133,000 words of verified clinical text descriptions.
Key Details for Each Subset
- Binary Classification: Comprises 790 clips (518 lesion-free and 272 lesion-positive) to test lesion presence detection.
- Detection and Segmentation: Utilizes 272 and 264 lesion-positive clips respectively, providing 61,538 per-frame bounding boxes and 57,550 per-frame masks.
- Visual Question Answering (VQA): Divided into two tiers:
- Prompted VQA: 1,485 five-choice questions over 499 clips featuring bounding-box overlays on confirmed lesions.
- Unprompted VQA: 2,740 questions over 918 clips using raw frames, including non-lesion windows to test open-ended reasoning.
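Both VQA tiers use five-choice questions scored by exact-match accuracy. A minimal scoring sketch follows; the record fields (`qid`, `choices`, `answer`) are illustrative placeholders, not the released Colon-Bench schema:

```python
def score_vqa(items, predictions):
    """Accuracy over five-choice questions; unanswered items count as wrong."""
    correct = 0
    for item in items:
        pred = predictions.get(item["qid"])
        if pred is not None and pred == item["answer"]:
            correct += 1
    return correct / len(items)

# Hypothetical five-choice items in the style of the two VQA tiers.
items = [
    {"qid": "q1", "question": "What lesion type is boxed?",
     "choices": ["sessile polyp", "ulcer", "bleeding", "erythema", "none"],
     "answer": "sessile polyp"},
    {"qid": "q2", "question": "Is a lesion visible in this window?",
     "choices": ["yes", "no", "unclear", "partially", "occluded"],
     "answer": "no"},
]
print(score_vqa(items, {"q1": "sessile polyp", "q2": "yes"}))  # 0.5
```

Counting unanswered questions as wrong keeps the metric comparable across models that refuse to answer and models that always commit to a choice.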
Data Usage and Processing Strategy
- Agentic Workflow: The authors employ a multi-stage pipeline starting with a vision-language model (Gemini-2.5-flash-lite) to identify 1,325 candidate lesion windows.
- Filtering and Verification: Successive agents perform verification filtering, bounding-box tracking using EdgeTAM, and AI-driven visual confirmation (using Gemini-3 variants) to prune false positives.
- Human-in-the-Loop: A final review by a surgeon rejected only 69 windows (11.6% of those presented), ensuring high-quality spatial and textual labels.
- Debiasing: To prevent text-only shortcuts in VQA, the authors apply a two-stage debiasing process involving adversarial distractor regeneration and blind text-only stress tests.
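The multi-stage workflow above is, structurally, a sequence of pruning stages applied to candidate lesion windows. The sketch below shows that control flow only, with toy predicates standing in for the real agents (the proposal VLM, EdgeTAM tracking, Gemini-based visual confirmation, and human review); all names and fields are assumptions for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Window:
    """A candidate lesion window; fields are illustrative placeholders."""
    start: int
    end: int
    votes: dict = field(default_factory=dict)

def run_pipeline(candidates, stages):
    """Apply each named stage in order; a stage keeps or drops each window."""
    for name, keep_fn in stages:
        candidates = [w for w in candidates if keep_fn(w)]
        print(f"after {name}: {len(candidates)} windows")
    return candidates

# Toy stand-ins for the paper's stages, reading precomputed per-stage votes.
stages = [
    ("verification filtering", lambda w: w.votes.get("verify", False)),
    ("tracking",               lambda w: w.votes.get("track", False)),
    ("visual confirmation",    lambda w: w.votes.get("confirm", False)),
    ("human review",           lambda w: w.votes.get("human", True)),
]
cands = [
    Window(0, 100, {"verify": True, "track": True, "confirm": True}),
    Window(200, 260, {"verify": True, "track": False}),
    Window(400, 450, {"verify": False}),
]
survivors = run_pipeline(cands, stages)
print(len(survivors))  # 1
```

Ordering cheap automatic filters before expensive ones (and human review last) is what makes the workflow scalable: each stage only sees what its predecessors kept.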
Metadata and Annotation Construction
- Spatial Annotations: The pipeline generates dense tracking data, establishing the first Open-Vocabulary Video Object Segmentation (OV-VOS) benchmark for colonoscopy.
- Textual Descriptions: Free-form clinical descriptions are generated and verified, averaging 252.4 words per window, which are used to derive multi-label lesion categories via keyword matching.
- Evaluation Setup: The benchmark evaluates MLLMs on lesion classification, OV-VOS, and VQA, utilizing 3-frame box detections to prompt the EdgeTAM tracker for segmentation tasks.
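Deriving multi-label categories from free-form descriptions via keyword matching can be sketched as follows. The category names follow the paper's lesion classes, but this lexicon and the matching rule are assumptions, not the authors' actual keyword lists:

```python
import re

# Illustrative keyword lexicon; the real keyword lists are not published here.
LEXICON = {
    "sessile polyp": ["sessile"],
    "ulcer": ["ulcer", "ulceration"],
    "bleeding": ["bleeding", "hemorrhage"],
    "erythematous lesion": ["erythema", "erythematous"],
}

def labels_from_description(text: str) -> set:
    """Derive multi-label lesion categories by whole-word keyword matching."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return {cat for cat, kws in LEXICON.items()
            if any(kw in tokens for kw in kws)}

desc = "A small sessile polyp with surrounding erythematous mucosa."
print(sorted(labels_from_description(desc)))
```

Matching on whole tokens rather than substrings avoids spurious hits (e.g. "session" would otherwise trigger "sessile" under naive substring search).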
Experiment
- Colon-Bench experiments demonstrate that top-tier MLLMs like Gemini 3 Pro and Flash outperform specialized models in lesion detection and segmentation, while open-weight models such as Seed 1.6 show strong overall performance despite some families struggling with classification tasks.
- Ablation studies confirm that utilizing temporal context from video clips significantly improves VQA accuracy and segmentation quality compared to single-frame inputs, with increasing the number of detection frames yielding steady gains in downstream segmentation metrics.
- The proposed Colon-Skill framework validates that injecting distilled domain knowledge into prompts enhances VQA performance for high-capacity models, whereas smaller models show limited benefit from this additional context.
- Validation of the annotation pipeline reveals that verification filtering and tracking stages provide the most substantial precision improvements, while human review offers marginal but consistent refinements to the final dataset quality.