ScribblePrompt: Fast and Flexible Interactive Segmentation for Any Biomedical Image
Halle E. Wong Marianne Rakic John Guttag Adrian V. Dalca
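Abstract
Biomedical image segmentation plays a critical role in both scientific research and clinical care. Given enough labeled data, deep learning models can be trained to accurately automate specific biomedical image segmentation tasks. However, manually segmenting images to create training data is labor-intensive and requires a high level of expertise. In this paper, we present ScribblePrompt, a flexible neural network-based interactive segmentation tool for biomedical imaging that enables annotators to segment previously unseen structures using scribbles, clicks, and bounding boxes. Through rigorous quantitative experiments, we demonstrate that, given comparable amounts of interaction, ScribblePrompt produces more accurate segmentations than previous methods on datasets unseen during training. In a user study with domain experts, ScribblePrompt improved the Dice score by 15% while reducing annotation time by 28% compared to the next best method. ScribblePrompt's success rests on a set of careful design decisions. These include a training strategy that incorporates a highly diverse set of images and tasks, novel algorithms for simulating user interactions and labels, and a network architecture that enables fast inference. We showcase ScribblePrompt in an interactive demo and release the code and a dataset of scribble annotations at https://scribbleprompt.csail.mit.edu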
One-sentence Summary
Researchers from MIT propose ScribblePrompt, a flexible neural network-based interactive segmentation tool that utilizes scribbles, clicks, and bounding boxes to segment unseen biomedical structures, outperforming previous methods by reducing expert annotation time by 28% and improving Dice scores by 15% through a diverse training strategy and novel simulated interaction algorithms.
Key Contributions
- The paper introduces ScribblePrompt, a flexible neural network-based interactive segmentation framework that supports multiple user inputs including scribbles, clicks, and bounding boxes. This method allows for the segmentation of previously unseen biomedical structures at inference time without the need for task-specific retraining.
- The work presents novel algorithms for simulating realistic user interactions and generating synthetic labels, which facilitates training on a highly diverse set of images and tasks. This simulation engine enables the model to generalize effectively to new datasets and specialized medical imaging modalities.
- Experimental results and user studies demonstrate that the system outperforms previous methods by achieving a 15% improvement in Dice score and a 28% reduction in annotation time compared to the next best baseline. The ScribblePrompt-UNet architecture also provides computational efficiency capable of running on a CPU.
Introduction
Accurate biomedical image segmentation is essential for clinical care and scientific research, yet manual annotation remains a labor-intensive process requiring significant domain expertise. Existing deep learning methods often struggle with generalization, as they are typically trained for specific tasks or modalities and fail when encountering unseen structures. While vision foundation models like SAM show promise, they often perform poorly on subtle biomedical delineations and support only a limited set of interaction types. The authors introduce ScribblePrompt, a framework for flexible, interactive segmentation across diverse biomedical images using scribbles, clicks, and bounding boxes. By combining a novel scribble simulation engine with a diverse training strategy, the authors provide a tool that generalizes to unseen tasks without retraining, significantly reducing annotation time while improving segmentation accuracy.
Dataset
The authors developed a comprehensive biomedical imaging framework using the following data strategies:
- Dataset Composition and Sources: The training collection builds on large-scale efforts like MegaMedical and comprises 77 open-access biomedical imaging datasets. This collection includes over 54,000 scans across 16 image types and 711 labels, spanning diverse domains such as the brain, thorax, abdomen, spine, cells, skin, eyes, and more.
- Task Definition and Processing:
  - 2D segmentation tasks are defined by a combination of the dataset, the specific axis (for 3D modalities), and the label.
  - For datasets with multiple labels, each is treated as a separate binary segmentation task.
  - For 3D volumes, the authors extract the middle slice and the slice containing the maximum label area.
  - To prevent overfitting, the authors implement a synthetic label mechanism where a superpixel algorithm partitions an image into a multi-label mask, from which a single label is randomly selected to replace the ground truth with a specific probability.
- Training Strategy: The authors use hierarchical sampling during training to balance datasets of different sizes. This sampling is performed by dataset and modality, then by axis, and finally by label. Both the input images and the sampled segmentations undergo data augmentation before simulating user interactions.
- MedScribble Dataset: For manual evaluation, the authors curated the MedScribble dataset, which contains manual scribble annotations from three annotators for 64 image-segmentation pairs. These pairs were randomly selected from the validation splits of 14 different datasets. A subset of 31 image-segmentation pairs from 7 unseen datasets is used to report results for manual scribble evaluation.
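The hierarchical sampling described under Training Strategy can be sketched as follows. This is a minimal illustration rather than the authors' code: the nested `index` layout, the `sample_task` name, and the toy dataset names are all assumptions made for the example.

```python
import random

def sample_task(index, rng):
    """Pick one training example by sampling uniformly at each level.

    `index` is a hypothetical nested dict:
    (dataset, modality) -> axis -> label -> list of example ids.
    Sampling level by level gives every dataset the same chance of being
    chosen, regardless of how many examples it contains.
    """
    group = rng.choice(sorted(index))                # dataset and modality
    axis = rng.choice(sorted(index[group]))          # axis (for 3D data)
    label = rng.choice(sorted(index[group][axis]))   # label
    example = rng.choice(index[group][axis][label])  # one image/mask pair
    return group, axis, label, example

# Toy index with one large and one small dataset (names are made up).
index = {
    ("big_ct", "CT"): {0: {"liver": list(range(1000)),
                           "spleen": list(range(1000))}},
    ("small_us", "Ultrasound"): {0: {"nerve": list(range(10))}},
}

rng = random.Random(0)
counts = {}
for _ in range(10_000):
    group, _, _, _ = sample_task(index, rng)
    counts[group] = counts.get(group, 0) + 1
```

In this toy setup the small dataset is drawn about as often as the large one, which is the balancing effect the text attributes to hierarchical sampling. Sampling uniformly over all examples would instead favor the large dataset by roughly 200 to 1.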
Method
The ScribblePrompt framework is designed as an interactive segmentation method that generalizes across diverse biomedical imaging modalities. The core objective is to learn a function $f_\theta(x^t, u_i, \hat{y}^t_{i-1})$ that produces an iterative segmentation $\hat{y}^t_i$ given an input image $x^t$, a set of user interactions $u_i$, and the previous prediction $\hat{y}^t_{i-1}$. The model is optimized by minimizing the difference between the true segmentation $y^t$ and the $k$ iterative predictions through a supervised segmentation loss:
$$\mathcal{L}(\theta; T) = \mathbb{E}_{t \in T}\left[\mathbb{E}_{(x^t, y^t) \in t}\left[\sum_{i=1}^{k} \mathcal{L}_{Seg}\left(y^t, f_\theta(x^t, u_i, \hat{y}^t_{i-1})\right)\right]\right]$$
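To make the inner sum of this loss concrete, the sketch below accumulates a segmentation loss over k interaction steps for a single example. It rests on several assumptions: soft Dice stands in for the segmentation loss (the text only says "supervised segmentation loss"), the previous prediction at the first step is taken to be all zeros, and `toy_model` and `toy_simulate` are hypothetical placeholders for the network and the interaction simulator.

```python
def soft_dice_loss(target, pred, eps=1e-6):
    """1 - soft Dice between two flat lists of values in [0, 1]."""
    inter = sum(t * p for t, p in zip(target, pred))
    total = sum(target) + sum(pred)
    return 1.0 - (2.0 * inter + eps) / (total + eps)

def interactive_loss(model, simulate, image, target, k):
    """Sum the segmentation loss over k interaction steps for one example."""
    pred = [0.0] * len(target)  # assumed: empty prediction before step 1
    total = 0.0
    for step in range(1, k + 1):
        interactions = simulate(target, pred, step)  # u_i
        pred = model(image, interactions, pred)      # prediction at step i
        total += soft_dice_loss(target, pred)
    return total

def toy_simulate(target, pred, step):
    """Hypothetical user: marks the correct value of the first wrong pixel."""
    for i, (t, p) in enumerate(zip(target, pred)):
        if round(p) != t:
            return {i: t}
    return {}

def toy_model(image, interactions, prev):
    """Hypothetical model: copies the previous prediction, applies the marks."""
    out = list(prev)
    for i, value in interactions.items():
        out[i] = float(value)
    return out

image = [0.2, 0.8, 0.9, 0.1]
target = [0, 1, 1, 0]
loss = interactive_loss(toy_model, toy_simulate, image, target, k=3)
```

The outer expectations in the loss correspond to averaging this per-example sum over sampled tasks and over image/segmentation pairs within each task.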
The training process simulates a sequence of interactive steps, as illustrated in the paper's framework diagram.
Initially, a set of interactions $u_1$, such as bounding boxes, clicks, or scribbles, is simulated from the ground truth $y^t$ to produce the first prediction $\hat{y}^t_1$. In subsequent steps, the framework simulates user corrections by identifying the error region $\varepsilon^t_i = y^t - \hat{y}^t_i$ and generating new interactions $u_{i+1}$ based on this error. This iterative process repeats for $k$ steps to refine the segmentation.
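The correction step can be sketched on small binary masks. This is a simplified illustration under stated assumptions: masks are lists of rows of 0/1 values, and a correction is a single click drawn uniformly from the wrong pixels, whereas the paper describes richer click and scribble strategies. The function names are hypothetical.

```python
import random

def error_region(target, pred):
    """Pixel-wise difference between truth and prediction.

    +1 marks missed foreground (needs a positive interaction),
    -1 marks spurious foreground (needs a negative interaction).
    """
    return [[t - p for t, p in zip(t_row, p_row)]
            for t_row, p_row in zip(target, pred)]

def sample_correction_click(target, pred, rng):
    """Return (row, col, is_positive) for a random wrong pixel, or None."""
    err = error_region(target, pred)
    wrong = [(r, c) for r, row in enumerate(err)
             for c, e in enumerate(row) if e != 0]
    if not wrong:
        return None  # prediction already matches the ground truth
    r, c = rng.choice(wrong)
    return r, c, err[r][c] > 0

target = [[0, 1, 1],
          [0, 1, 1]]
pred = [[1, 1, 0],
        [0, 1, 0]]
err = error_region(target, pred)
click = sample_correction_click(target, pred, random.Random(0))
```

The sign of the error decides whether the simulated user adds a positive or a negative prompt, which is why the difference is kept signed rather than reduced to a plain mismatch mask.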
To enhance generalization and prevent the model from overfitting to specific tasks, the authors incorporate a synthetic label generation mechanism. During training, the ground-truth label of a sample $(x_0, y_0)$ may be replaced by a synthetic label $y_{synth}$ with probability $p_{synth}$. This is achieved by applying a superpixel algorithm to the image $x_0$ to create a map of $k$ superpixels, from which a single superpixel is randomly selected to serve as the synthetic target. The overall training flow, including this optional augmentation, is illustrated in a figure in the original paper.
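The synthetic label mechanism can be sketched in a few lines. The partitioning function below is a crude stand-in, not the superpixel algorithm the authors use: it groups 4-connected pixels that fall into the same intensity bin. The function names and the `levels` parameter are assumptions made for this example.

```python
import random

def crude_superpixels(image, levels=2):
    """Label 4-connected regions of similar intensity.

    `image` is a list of rows with values in [0, 1]. Returns the label map
    and the number of regions. This is a stand-in for a real superpixel
    algorithm.
    """
    h, w = len(image), len(image[0])
    bins = [[min(int(v * levels), levels - 1) for v in row] for row in image]
    labels = [[-1] * w for _ in range(h)]
    next_label = 0
    for r0 in range(h):
        for c0 in range(w):
            if labels[r0][c0] != -1:
                continue
            labels[r0][c0] = next_label
            stack = [(r0, c0)]
            while stack:  # flood fill over pixels in the same intensity bin
                r, c = stack.pop()
                for rr, cc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                    if (0 <= rr < h and 0 <= cc < w and labels[rr][cc] == -1
                            and bins[rr][cc] == bins[r0][c0]):
                        labels[rr][cc] = next_label
                        stack.append((rr, cc))
            next_label += 1
    return labels, next_label

def maybe_synthetic_label(image, target, p_synth, rng):
    """With probability p_synth, swap the true mask for one random region."""
    if rng.random() >= p_synth:
        return target
    labels, n = crude_superpixels(image)
    chosen = rng.randrange(n)
    return [[1 if lab == chosen else 0 for lab in row] for row in labels]

image = [[0.1, 0.1, 0.9],
         [0.1, 0.9, 0.9],
         [0.2, 0.2, 0.9]]
target = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
labels, n_regions = crude_superpixels(image)
synthetic = maybe_synthetic_label(image, target, p_synth=1.0, rng=random.Random(0))
```

Because the synthetic target follows image structure rather than anatomy, the model is pushed to segment whatever region the prompts indicate instead of memorizing the labels seen in training.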
The interaction simulation utilizes several strategies for different prompt types. For scribbles, the authors implement line, centerline, and contour strategies, which are then corrupted through random masking and deformation to mimic human variability. For clicks, they employ random, center, or interior border region sampling. Bounding boxes are simulated by computing the minimum enclosing box of the label and enlarging it slightly. These prompts are encoded into input channels, allowing the network to process them efficiently. For the ScribblePrompt-UNet architecture, the input consists of five channels: the image, bounding box encoding, positive click/scribble encoding, negative click/scribble encoding, and the previous prediction logits.
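As a rough illustration of the bounding box simulation and the five-channel input, the sketch below works on plain Python lists. Several details are assumptions: the fixed `margin` (the text only says the box is enlarged "slightly"), the binary-mask encoding of boxes and clicks, and all function names.

```python
def simulate_box(mask, margin=1):
    """Minimum enclosing box of a binary mask, enlarged by `margin` pixels
    and clipped to the image. Returns inclusive (r0, c0, r1, c1) or None."""
    h, w = len(mask), len(mask[0])
    rows = [r for r in range(h) if any(mask[r])]
    cols = [c for c in range(w) if any(mask[r][c] for r in range(h))]
    if not rows:
        return None  # empty label: nothing to enclose
    r0, r1 = max(rows[0] - margin, 0), min(rows[-1] + margin, h - 1)
    c0, c1 = max(cols[0] - margin, 0), min(cols[-1] + margin, w - 1)
    return r0, c0, r1, c1

def box_to_channel(box, h, w):
    """Encode a box as a filled binary mask (assumed encoding)."""
    r0, c0, r1, c1 = box
    return [[1.0 if r0 <= r <= r1 and c0 <= c <= c1 else 0.0
             for c in range(w)] for r in range(h)]

def points_to_channel(points, h, w):
    """Encode click or scribble pixels as a binary mask (assumed encoding)."""
    chan = [[0.0] * w for _ in range(h)]
    for r, c in points:
        chan[r][c] = 1.0
    return chan

def build_input(image, box, pos_points, neg_points, prev_logits):
    """Stack the five input channels in the order listed in the text."""
    h, w = len(image), len(image[0])
    return [
        image,                               # 1. image
        box_to_channel(box, h, w),           # 2. bounding box
        points_to_channel(pos_points, h, w), # 3. positive clicks/scribbles
        points_to_channel(neg_points, h, w), # 4. negative clicks/scribbles
        prev_logits,                         # 5. previous prediction logits
    ]

mask = [[0] * 5 for _ in range(5)]
for r in (2, 3):
    for c in (1, 2):
        mask[r][c] = 1
image = [[0.5] * 5 for _ in range(5)]
prev_logits = [[0.0] * 5 for _ in range(5)]
box = simulate_box(mask)
channels = build_input(image, box, [(2, 1)], [(0, 4)], prev_logits)
```

Encoding every prompt type as an image-sized channel means a single convolutional network can accept any mix of boxes, clicks, and scribbles without a separate prompt encoder.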
Experiment
The researchers evaluated ScribblePrompt through manual scribble tests, simulated iterative interactions, a user study with experienced neuroimaging researchers, and computational runtime analysis. The experiments validate the model's ability to generalize to unseen medical modalities and anatomical regions using flexible prompts such as bounding boxes, clicks, and scribbles. Findings demonstrate that ScribblePrompt provides superior segmentation accuracy and greater efficiency than existing generalist models, offering a more responsive and user-friendly experience that significantly reduces annotation time.
The authors compare ScribblePrompt-UNet against the SAM (ViT-b) model in a user study with experienced annotators. ScribblePrompt-UNet achieves a higher mean Dice score and lower HD95 values than the SAM baseline, and users completed tasks in less time and with fewer interaction steps per task.
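For readers unfamiliar with the Dice score reported throughout these results, the sketch below computes it for two flat binary masks. The function name and the handling of two empty masks are choices made for this example, not details taken from the paper.

```python
def dice_score(pred, target):
    """Dice = 2 * |A and B| / (|A| + |B|) for two flat binary masks."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    size = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if size == 0:
        return 1.0  # convention assumed here: two empty masks agree perfectly
    return 2.0 * inter / size

target = [0, 1, 1, 1, 0, 0]
pred = [0, 1, 1, 0, 1, 0]
score = dice_score(pred, target)  # 2 * 2 / (3 + 3)
```

Dice ranges from 0 (no overlap) to 1 (perfect overlap). HD95, the other metric in this section, instead measures boundary distance, so lower values are better.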
The authors evaluate the ScribblePrompt models against several baseline methods using manual scribble inputs across different datasets. Results show that both ScribblePrompt-UNet and ScribblePrompt-SAM achieve higher Dice scores and lower Hausdorff Distance compared to existing interactive segmentation methods. ScribblePrompt models outperform SAM variants and MedSAM when using manual scribble prompts. The ScribblePrompt models demonstrate superior accuracy and boundary adherence compared to MIDeepSeg. ScribblePrompt variants achieve the highest Dice scores among all compared methods in the manual scribble evaluation.
The authors evaluate several interactive segmentation models using manual scribbles on the MedScribble and ACDC datasets. Results show that the ScribblePrompt models achieve the highest Dice scores and the lowest Hausdorff Distance compared to all other tested methods. ScribblePrompt-SAM and ScribblePrompt-UNet outperform existing methods like SAM, SAM-Med2D, and MedSAM in segmentation accuracy. ScribblePrompt-UNet achieves the best overall performance on the ACDC dataset in terms of both Dice score and Hausdorff Distance. The ScribblePrompt models demonstrate superior ability to handle manual scribble inputs compared to SAM-based baselines.
The authors compare the computational efficiency of ScribblePrompt-UNet against several SAM-based models and MIDeepSeg. Results show that ScribblePrompt-UNet maintains low latency on both CPU and GPU hardware. ScribblePrompt-UNet achieves significantly faster GPU inference times compared to all other evaluated models. On a single CPU, ScribblePrompt-UNet demonstrates competitive performance with low latency, outperforming larger SAM variants. The models show a clear trade-off between parameter count and inference speed, with ScribblePrompt-UNet providing high efficiency with a small parameter footprint.
The authors compare ScribblePrompt-UNet to ScribFormer, a scribble-supervised learning method, using the ACDC dataset. ScribblePrompt-UNet achieves a Dice score comparable to ScribFormer's while attaining a significantly lower HD95, which indicates much better boundary accuracy. Its performance is therefore competitive with specialized scribble-supervised models.
The authors evaluated ScribblePrompt models through user studies, comparative segmentation benchmarks on MedScribble and ACDC datasets, and computational efficiency tests. The results demonstrate that ScribblePrompt models provide superior segmentation accuracy and better boundary adherence than existing SAM variants and specialized scribble-supervised methods. Furthermore, the models enhance user productivity by requiring fewer interactions and offer high computational efficiency with low latency across different hardware.