Densely Connected Parameter-Efficient Tuning for Referring Image Segmentation

In the domain of computer vision, Parameter-Efficient Tuning (PET) is increasingly replacing the traditional paradigm of pre-training followed by full fine-tuning. PET is particularly favored for large foundation models, as it streamlines transfer-learning costs and optimizes hardware utilization. However, current PET methods are designed mainly for single-modal optimization. While some pioneering studies have undertaken preliminary explorations, they remain limited to aligned encoders (e.g., CLIP) and leave misaligned encoders unexplored. These methods show sub-optimal performance with misaligned encoders, as they fail to effectively align the multimodal features during fine-tuning. In this paper, we introduce DETRIS, a parameter-efficient tuning framework designed to enhance low-rank visual feature propagation by establishing dense interconnections between each layer and all preceding layers, which enables effective cross-modal feature interaction and adaptation to misaligned encoders. We also propose text adapters to improve textual features. Our simple yet efficient approach greatly surpasses state-of-the-art methods while updating only 0.9% to 1.8% of backbone parameters, evaluated on challenging benchmarks. Our project is available at \url{https://github.com/jiaqihuang01/DETRIS}.
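To make the densely connected adapter idea concrete, below is a minimal PyTorch sketch of the connectivity pattern the abstract describes: a low-rank adapter attached to each frozen backbone layer, where adapter i also receives the outputs of all preceding adapters, DenseNet-style. This is an illustrative assumption, not the released DETRIS implementation; the class names, the rank value, and the choice to fuse earlier outputs by summation are all hypothetical.

```python
import torch
import torch.nn as nn

class DenseLowRankAdapter(nn.Module):
    """Low-rank adapter that fuses the current layer's feature with
    the outputs of all preceding adapters (dense connectivity).
    Hypothetical sketch; not the authors' released code."""
    def __init__(self, dim: int, rank: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, rank)  # low-rank down-projection
        self.up = nn.Linear(rank, dim)    # low-rank up-projection
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor, prior_outputs: list) -> torch.Tensor:
        # Dense interconnection: aggregate every earlier adapter output
        # with the current feature before the low-rank projection.
        fused = x + sum(prior_outputs) if prior_outputs else x
        return self.up(self.act(self.down(fused)))

class DenselyConnectedTuner(nn.Module):
    """One adapter per frozen backbone layer; adapter i sees the
    outputs of adapters 1..i-1, so low-rank features propagate
    across the whole stack."""
    def __init__(self, num_layers: int, dim: int, rank: int = 16):
        super().__init__()
        self.adapters = nn.ModuleList(
            DenseLowRankAdapter(dim, rank) for _ in range(num_layers)
        )

    def forward(self, layer_feats: list) -> list:
        # layer_feats: per-layer features from the frozen backbone.
        outputs = []
        for feat, adapter in zip(layer_feats, self.adapters):
            outputs.append(adapter(feat, outputs))
        return outputs
```

Only the adapter parameters are trained under this scheme; the backbone stays frozen, which is what keeps the updated-parameter budget in the sub-2% range the abstract reports.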