ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models

Chunjiang Ge, Sijie Cheng, Ziming Wang, Jiale Yuan, Yuan Gao, Jun Song, Shiji Song, Gao Huang, Bo Zheng
Abstract

High-resolution Large Multimodal Models (LMMs) encounter the challenges of excessive visual tokens and quadratic visual complexity. Current high-resolution LMMs address the quadratic complexity while still generating excessive visual tokens. However, the redundancy in visual tokens is the key problem, as it leads to more substantial compute. To mitigate this issue, we propose ConvLLaVA, which employs ConvNeXt, a hierarchical backbone, as the visual encoder of the LMM to replace the Vision Transformer (ViT). ConvLLaVA compresses high-resolution images into information-rich visual features, effectively preventing the generation of excessive visual tokens. To enhance the capabilities of ConvLLaVA, we propose two critical optimizations. Since the low-resolution pretrained ConvNeXt underperforms when directly applied to high-resolution inputs, we update it to bridge the gap. Moreover, since ConvNeXt's original compression ratio is inadequate for much higher resolution inputs, we train a successive stage to further compress the visual tokens, thereby reducing redundancy. These optimizations enable ConvLLaVA to support inputs of 1536x1536 resolution while generating only 576 visual tokens, and to handle images of arbitrary aspect ratios. Experimental results demonstrate that our method achieves competitive performance with state-of-the-art models on mainstream benchmarks. The ConvLLaVA model series are publicly available at https://github.com/alibaba/conv-llava.
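To make the compression arithmetic concrete, the sketch below illustrates the idea of appending one extra downsampling stage after a hierarchical backbone: a standard ConvNeXt downsamples by 32x, and one additional 2x stage raises this to 64x, so a 1536x1536 input yields a 24x24 grid, i.e. 576 visual tokens. This is a minimal, hypothetical PyTorch sketch, not the paper's actual implementation; the channel dimensions, module structure, and names (e.g. `ExtraCompressionStage`) are assumptions for illustration.

```python
import torch
import torch.nn as nn


class ExtraCompressionStage(nn.Module):
    """Hypothetical 'stage 5': downsamples the backbone's final feature map by 2x,
    raising the overall compression ratio from 32x to 64x."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(in_dim)
        self.down = nn.Conv2d(in_dim, out_dim, kernel_size=2, stride=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map from the hierarchical backbone (32x downsampled)
        x = self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)  # channels-last norm
        return self.down(x)  # now 64x downsampled overall


# Token-count arithmetic from the abstract: 1536 / 64 = 24 per side -> 24 * 24 = 576 tokens.
feat = torch.randn(1, 1536, 1536 // 32, 1536 // 32)      # assumed stage-4 output: 48x48 grid
stage5 = ExtraCompressionStage(in_dim=1536, out_dim=3072)  # channel widths are illustrative
out = stage5(feat)                                         # (1, 3072, 24, 24)
visual_tokens = out.flatten(2).transpose(1, 2)             # (1, 576, 3072), passed to the LLM
print(visual_tokens.shape)
```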
