2 months ago

Abdullah Hamdi Changchun Yang Xin Gao

Abstract

Early screening via colonoscopy is critical for colon cancer prevention, but developing robust AI systems for this domain is hindered by the lack of densely annotated, long-sequence video datasets. Existing datasets focus predominantly on single-class polyp detection and lack the rich spatial, temporal, and linguistic annotations needed to evaluate modern Multimodal Large Language Models (MLLMs). To address this gap, we introduce Colon-Bench, generated via a novel multi-stage agentic workflow. Our pipeline integrates temporal proposals, bounding-box tracking, AI-driven visual confirmation, and human-in-the-loop review to annotate full-procedure videos at scale. The resulting verified benchmark is unprecedented in scope, encompassing 528 videos, 14 distinct lesion categories (including polyps, ulcers, and bleeding), over 300,000 bounding boxes, 213,000 segmentation masks, and 133,000 words of clinical descriptions. We use Colon-Bench to rigorously evaluate state-of-the-art MLLMs on lesion classification, Open-Vocabulary Video Object Segmentation (OV-VOS), and video Visual Question Answering (VQA). The MLLM results show surprisingly high localization performance in the medical domain compared with SAM-3. Finally, we analyze common VQA errors made by MLLMs and introduce a novel "colon-skill" prompting strategy, which improves zero-shot performance by up to 9.7% across most MLLMs. The dataset and code are available at https://abdullahamdi.com/colon-bench.

One-sentence Summary

Researchers from King Abdullah University of Science and Technology introduce Colon-Bench, a comprehensive benchmark created via a novel multi-stage agentic workflow that overcomes prior data scarcity by providing dense spatiotemporal annotations for 14 lesion categories. This resource enables rigorous evaluation of Multimodal Large Language Models on complex colonoscopy tasks and demonstrates that a new colon-skill prompting strategy significantly boosts zero-shot performance without additional training.

Key Contributions

The paper introduces Colon-Bench, a comprehensive benchmark for evaluating Multimodal Large Language Models on full-procedure colonoscopy videos, which demonstrates that these models outperform specialized baselines like Endo-CLIP by 30% in lesion detection tasks.
A two-stage agentic workflow is presented that extracts cross-model error patterns to synthesize structured Colon-Skill prompts, resulting in training-free performance improvements of up to 9.7% on medical VQA tasks.
Extensive experiments establish that utilizing temporal context from multiple video frames significantly enhances segmentation quality and VQA accuracy compared to single-frame inputs, with results showing a mean IoU increase from 43.1% to 54.4% when expanding context from one to seven frames.
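
To make the mean IoU figures above concrete, the following is a minimal sketch of how intersection over union is typically computed for binary masks, followed by the arithmetic on the reported 43.1% and 54.4% values. The helpers `mask_iou` and `mean_iou` are hypothetical, and the handling of empty masks is an assumption; the paper's exact evaluation protocol is not described in this summary.

```python
from typing import List, Sequence

Mask = Sequence[Sequence[int]]  # 2D grid of 0/1 values (assumed representation)


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two binary masks with the same shape."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumption: two empty masks are treated as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_iou(preds: List[Mask], targets: List[Mask]) -> float:
    """Average IoU over paired predicted and reference masks."""
    scores = [mask_iou(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)


# Toy masks: 2 overlapping pixels, 4 pixels in the union.
pred = [[1, 1, 0], [0, 1, 0]]
target = [[1, 0, 0], [0, 1, 1]]
toy_iou = mask_iou(pred, target)

# Arithmetic on the figures reported above (one frame vs. seven frames).
one_frame, seven_frames = 43.1, 54.4
abs_gain = seven_frames - one_frame        # percentage points
rel_gain = 100 * abs_gain / one_frame      # relative improvement in percent
print(toy_iou, round(abs_gain, 1), round(rel_gain, 1))
```

Read this way, the reported change is a gain of about 11.3 percentage points, or roughly 26% relative to the single-frame setting.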

Dataset

Colon-Bench Dataset Overview

The authors introduce Colon-Bench, a comprehensive multi-task benchmark designed to evaluate Multimodal Large Language Models (MLLMs) on full-procedure colonoscopy videos. The dataset addresses the scarcity of densely annotated, long-sequence medical video data by leveraging a novel agentic workflow.

Dataset Composition and Sources
- The core data originates from 60 video sequences in the REAL-COLON dataset.
- The final curated benchmark spans 528 verified video windows across 59 sequences, totaling 464,035 frames (approximately 12.89 hours).
- It covers 14 distinct lesion categories, including sessile polyps, bleeding, ulcers, and erythematous lesions, with a long-tailed distribution where sessile polyps are the most frequent.
- Annotations include over 300,000 bounding boxes, 213,000 segmentation masks, and 133,000 words of verified clinical text descriptions.
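
As a quick sanity check on the figures above, the sketch below derives the frame rate and average window length implied by the reported totals. The frame rate is not stated in this summary; it is inferred here from 464,035 frames and 12.89 hours, and the per-window values are plain averages that say nothing about how window lengths are distributed.

```python
# Figures quoted in the list above.
TOTAL_FRAMES = 464_035
TOTAL_HOURS = 12.89
NUM_WINDOWS = 528

total_seconds = TOTAL_HOURS * 3600

# Inferred, not stated in the source: frames divided by duration.
implied_fps = TOTAL_FRAMES / total_seconds

# Simple averages over the 528 verified windows.
frames_per_window = TOTAL_FRAMES / NUM_WINDOWS
seconds_per_window = total_seconds / NUM_WINDOWS

print(f"implied frame rate: {implied_fps:.1f} fps")
print(f"average window: {frames_per_window:.0f} frames, {seconds_per_window:.0f} s")
```

Under these numbers the totals are consistent with roughly 10 frames per second and an average window of about a minute and a half.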

Key Details for Each Subset
- Binary Classification: Comprises 790 clips (518 lesion-free and 272 lesion-positive) to test lesion presence detection.
- Detection and Segmentation: Utilizes 272 and 264 lesion-positive clips respectively, providing 61,538 per-frame bounding boxes and 57,550 per-frame masks.
- Visual Question Answering (VQA): Divided into two tiers:
  - Prompted VQA: 1,485 five-choice questions over 499 clips featuring bounding-box overlays on confirmed lesions.
  - Unprompted VQA: 2,740 questions over 918 clips using raw frames, including non-lesion windows to test open-ended reasoning.
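
The subset sizes above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the `VqaTier` class is a hypothetical container, and every derived ratio is an average computed from the quoted counts rather than a figure reported by the authors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VqaTier:
    """Hypothetical container for one VQA tier's quoted counts."""
    name: str
    questions: int
    clips: int

    @property
    def questions_per_clip(self) -> float:
        return self.questions / self.clips


# Binary classification: the two groups should add up to the total.
lesion_free, lesion_positive, total_clips = 518, 272, 790
positive_share = 100 * lesion_positive / total_clips

# Detection and segmentation: average annotations per clip.
boxes_per_clip = 61_538 / 272
masks_per_clip = 57_550 / 264

# VQA tiers.
prompted = VqaTier("prompted", 1_485, 499)
unprompted = VqaTier("unprompted", 2_740, 918)

# Expected accuracy of uniform random guessing on five-choice questions.
chance_accuracy = 100 / 5

print(lesion_free + lesion_positive == total_clips, round(positive_share, 1))
print(round(boxes_per_clip, 1), round(masks_per_clip, 1))
print(round(prompted.questions_per_clip, 2), round(unprompted.questions_per_clip, 2))
```

Both VQA tiers work out to just under three questions per clip, and the 20% chance level is the natural floor against which the prompted five-choice accuracies can be read.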

Data Usage and Processing Strategy
- Agentic Workflow: The authors employ a multi-stage pipeline starting with a vision-language model (Gemini-2.5-flash-lite) to identify 1,325 candidate lesion windows.
- Filtering and Verification: Successive agents perform verification filtering, bounding-box tracking using EdgeTAM, and AI-driven visual confirmation (using Gemini-3 variants) to prune false positives.
- Human-in-the-Loop: A final review by a surgeon rejected only 69 windows (11.6% of those presented), ensuring high-quality spatial and textual labels.
- Debiasing: To prevent text-only shortcuts in VQA, the authors apply a two-stage debiasing process involving adversarial distractor regeneration and blind text-only stress tests.
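
The staged filtering described above can be sketched as a chain of keep-or-drop predicates. This is a structural illustration only, not the authors' implementation: the `Window` fields, the stage predicates, and the 0.5 tracking threshold are all hypothetical stand-ins for the model calls and the surgeon review. The final lines check the quoted rejection rate under the assumption that the 528 verified windows are exactly those the surgeon accepted.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Window:
    """Hypothetical candidate lesion window with precomputed stage verdicts."""
    window_id: int
    num_frames: int
    verified: bool       # stand-in for the verification agent
    tracked_frames: int  # frames where the tracker kept a box
    confirmed: bool      # stand-in for AI visual confirmation
    surgeon_ok: bool     # stand-in for human review


Stage = Tuple[str, Callable[[Window], bool]]

STAGES: List[Stage] = [
    ("verification", lambda w: w.verified),
    # Arbitrary illustrative threshold: track must cover half the window.
    ("tracking", lambda w: w.tracked_frames >= 0.5 * w.num_frames),
    ("visual_confirmation", lambda w: w.confirmed),
    ("human_review", lambda w: w.surgeon_ok),
]


def run_pipeline(candidates: Sequence[Window], stages: Sequence[Stage] = STAGES):
    """Apply each stage in order and record how many windows survive."""
    survivors = list(candidates)
    counts = [("proposals", len(survivors))]
    for name, keep in stages:
        survivors = [w for w in survivors if keep(w)]
        counts.append((name, len(survivors)))
    return survivors, counts


toy = [
    Window(0, 100, True, 90, True, True),
    Window(1, 100, False, 90, True, True),   # dropped at verification
    Window(2, 100, True, 10, True, True),    # dropped at tracking
    Window(3, 100, True, 80, False, True),   # dropped at visual confirmation
    Window(4, 100, True, 80, True, False),   # dropped at human review
]
kept, counts = run_pipeline(toy)
print(counts)

# Assumption: presented windows = accepted (528) + rejected (69).
rejection_rate = 100 * 69 / (528 + 69)
print(round(rejection_rate, 1))
```

Under that assumption the surgeon saw 597 windows, and 69 rejections out of 597 reproduces the quoted 11.6%.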

Metadata and Annotation Construction
- Spatial Annotations: The pipeline generates dense tracking data, establishing the first Open-Vocabulary Video Object Segmentation (OV-VOS) benchmark for colonoscopy.
- Textual Descriptions: Free-form clinical descriptions are generated and verified, averaging 252.4 words per window, which are used to derive multi-label lesion categories via keyword matching.
- Evaluation Setup: The benchmark evaluates MLLMs on lesion classification, OV-VOS, and VQA, utilizing 3-frame box detections to prompt the EdgeTAM tracker for segmentation tasks.
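
The keyword-matching step mentioned above can be illustrated with a short sketch. The keyword table below is hypothetical: it covers only the four categories named in this summary, and the specific keywords are assumptions rather than the authors' list.

```python
import re
from typing import Dict, List, Set

# Hypothetical keyword table; the benchmark's actual 14-category
# vocabulary is not given in this summary.
CATEGORY_KEYWORDS: Dict[str, List[str]] = {
    "sessile polyp": ["sessile polyp", "sessile lesion"],
    "bleeding": ["bleeding", "hemorrhage"],
    "ulcer": ["ulcer", "ulceration"],
    "erythematous lesion": ["erythematous", "erythema"],
}


def categories_from_description(text: str) -> Set[str]:
    """Return every category whose keywords appear as whole words in text."""
    lowered = text.lower()
    found: Set[str] = set()
    for category, keywords in CATEGORY_KEYWORDS.items():
        for keyword in keywords:
            # Word boundaries avoid matching inside longer words.
            if re.search(r"\b" + re.escape(keyword) + r"\b", lowered):
                found.add(category)
                break
    return found


description = "Small sessile polyp on a fold with mild erythema of the surrounding mucosa."
labels = sorted(categories_from_description(description))
print(labels)
```

A single description can map to several categories, which is what makes the labels multi-label. Plain keyword matching does not handle negation (for example "no active bleeding" would still match "bleeding"), so a production version would need additional handling that is not shown here.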

Experiment

Colon-Bench experiments demonstrate that top-tier MLLMs like Gemini 3 Pro and Flash outperform specialized models in lesion detection and segmentation, while open-weight models such as Seed 1.6 show strong overall performance despite some families struggling with classification tasks.
Ablation studies confirm that utilizing temporal context from video clips significantly improves VQA accuracy and segmentation quality compared to single-frame inputs, with increasing the number of detection frames yielding steady gains in downstream segmentation metrics.
The proposed Colon-Skill framework shows that injecting distilled domain knowledge into prompts improves VQA performance for high-capacity models, whereas smaller models benefit little from the additional context.
Validation of the annotation pipeline reveals that verification filtering and tracking stages provide the most substantial precision improvements, while human review offers marginal but consistent refinements to the final dataset quality.
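
The Colon-Skill idea of collecting error patterns shared across models and injecting them into the prompt can be sketched as two small functions. Everything here is an assumption made for illustration: the function names, the error tags, the guidance wording, and the prompt layout are invented placeholders, not the authors' prompts.

```python
from collections import Counter
from typing import Dict, List, Optional, Sequence


def recurring_errors(errors_by_model: Dict[str, List[str]], min_models: int = 2) -> List[str]:
    """Stage 1 (sketch): keep error tags observed in at least min_models models."""
    counts = Counter(tag for tags in errors_by_model.values() for tag in set(tags))
    return sorted(tag for tag, n in counts.items() if n >= min_models)


def build_prompt(question: str, options: Sequence[str],
                 skill_notes: Optional[Sequence[str]] = None) -> str:
    """Stage 2 (sketch): prepend distilled guidance to a five-choice question."""
    lines: List[str] = []
    if skill_notes:
        lines.append("Guidance distilled from common errors:")
        lines.extend(f"- {note}" for note in skill_notes)
        lines.append("")
    lines.append(question)
    for letter, option in zip("ABCDE", options):
        lines.append(f"{letter}. {option}")
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


# Invented placeholder tags, one list per hypothetical model.
errors = {
    "model_a": ["overcounts_lesions", "ignores_temporal_order"],
    "model_b": ["overcounts_lesions"],
    "model_c": ["ignores_temporal_order", "overcounts_lesions", "misreads_overlay"],
}
shared = recurring_errors(errors)

question = "How many distinct lesions are visible in the clip?"
options = ["0", "1", "2", "3", "4 or more"]
baseline_prompt = build_prompt(question, options)
skill_prompt = build_prompt(question, options, [f"Avoid: {tag}" for tag in shared])
print(skill_prompt)
```

Because the guidance is added purely as text, the approach stays training-free: the same model is queried with `baseline_prompt` and `skill_prompt`, and any accuracy difference is attributable to the added context.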

Colon-Bench: An Agentic Workflow for Scalable Dense Lesion Annotation in Full-Procedure Colonoscopy Videos
