
An Empirical Study of CLIP for Text-based Person Search

Cao, Min; Bai, Yang; Zeng, Ziyin; Ye, Mang; Zhang, Min
Abstract

Text-based Person Search (TBPS) aims to retrieve person images using natural language descriptions. Recently, Contrastive Language-Image Pre-training (CLIP), a universal large-scale cross-modal vision-language pre-training model, has performed remarkably well on various cross-modal downstream tasks thanks to its powerful cross-modal semantic learning capacity. TBPS, as a fine-grained cross-modal retrieval task, has likewise seen a rise in CLIP-based research. To explore the potential of vision-language pre-training models for downstream TBPS tasks, this paper makes the first attempt to conduct a comprehensive empirical study of CLIP for TBPS and thus contributes a straightforward, incremental, yet strong TBPS-CLIP baseline to the TBPS community. We revisit critical design considerations under CLIP, including data augmentation and the loss function. With these designs and practical training tricks, the model attains satisfactory performance without any sophisticated modules. We also conduct probing experiments on TBPS-CLIP in terms of model generalization and model compression, demonstrating the effectiveness of TBPS-CLIP from various aspects. This work is expected to provide empirical insights and highlight directions for future CLIP-based TBPS research.
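For context, the loss-function design that CLIP-based retrieval work revisits typically builds on CLIP's symmetric image-text contrastive (InfoNCE) objective. Below is a minimal PyTorch sketch of that standard objective as a reference point; it is not the paper's exact loss formulation, and the function name and temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image-text pairs.

    image_feats, text_feats: (B, D) embeddings from the two encoders;
    row i of each tensor corresponds to the same person, so the diagonal
    of the similarity matrix holds the positive pairs.
    """
    # L2-normalize so dot products are cosine similarities.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # (B, B) similarity matrix, sharpened by the temperature.
    logits = image_feats @ text_feats.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both retrieval directions, then average.
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image
    return (loss_i2t + loss_t2i) / 2
```

For fine-grained tasks like TBPS, studies in this line commonly tune or extend this objective (e.g., temperature, hard negatives, auxiliary losses) rather than replace it outright.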
