Crowd-SAM: SAM as a Smart Annotator for Object Detection in Crowded Scenes

In computer vision, object detection is an important task that finds application in many scenarios. However, obtaining extensive labels can be challenging, especially in crowded scenes. Recently, the Segment Anything Model (SAM) has been proposed as a powerful zero-shot segmenter, offering a novel approach to instance segmentation tasks. Yet the accuracy and efficiency of SAM and its variants are often compromised when handling objects in crowded and occluded scenes. In this paper, we introduce Crowd-SAM, a SAM-based framework designed to enhance SAM's performance in crowded and occluded scenes at the cost of only a few learnable parameters and minimal labeled images. It combines an efficient prompt sampler (EPS) and a part-whole discrimination network (PWD-Net) to improve mask selection and accuracy in crowded scenes. Despite its simplicity, Crowd-SAM rivals state-of-the-art (SOTA) fully-supervised object detection methods on several benchmarks, including CrowdHuman and CityPersons. Our code is available at https://github.com/FelixCaae/CrowdSAM.
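
For a concrete picture of the kind of pipeline the abstract describes, the sketch below shows a generic point-grid prompting and mask-selection loop built on the public `segment_anything` API (`SamPredictor`). This is only an illustrative sketch: the uniform grid sampler, the IoU-score threshold, and the greedy mask NMS are simple stand-ins for the paper's EPS and PWD-Net, whose designs are not specified in this abstract, and the checkpoint path and thresholds are placeholders.

```python
# Illustrative sketch only: dense point-grid prompting with simple
# score-based mask selection. The EPS and PWD-Net modules of Crowd-SAM
# are replaced here by a uniform grid and an IoU-score threshold.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def grid_points(h, w, points_per_side=32):
    """Uniform grid of (x, y) point prompts over an h x w image."""
    xs = np.linspace(0, w - 1, points_per_side)
    ys = np.linspace(0, h - 1, points_per_side)
    return np.array([(x, y) for y in ys for x in xs], dtype=np.float32)

def mask_iou(a, b):
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def detect(image, checkpoint="sam_vit_h.pth", score_thr=0.85, nms_iou=0.6):
    # image: HxWx3 uint8 RGB array; checkpoint path is a placeholder.
    sam = sam_model_registry["vit_h"](checkpoint=checkpoint)
    predictor = SamPredictor(sam)
    predictor.set_image(image)

    candidates = []
    for pt in grid_points(*image.shape[:2]):
        masks, scores, _ = predictor.predict(
            point_coords=pt[None, :],
            point_labels=np.array([1]),  # foreground point prompt
            multimask_output=True,       # whole / part / sub-part proposals
        )
        best = int(scores.argmax())      # crude stand-in for PWD-Net scoring
        if scores[best] >= score_thr:
            candidates.append((float(scores[best]), masks[best]))

    # Greedy mask-level NMS so heavily overlapping proposals collapse to one.
    candidates.sort(key=lambda c: c[0], reverse=True)
    kept = []
    for score, mask in candidates:
        if all(mask_iou(mask, k) < nms_iou for _, k in kept):
            kept.append((score, mask))
    return kept
```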