
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution

Zuyan Liu, Yuhao Dong, Ziwei Liu, Winston Hu, Jiwen Lu, Yongming Rao
Abstract

Visual data comes in various forms, ranging from small icons of just a few pixels to long videos spanning hours. Existing multi-modal LLMs usually standardize these diverse visual inputs to a fixed resolution for visual encoders and yield similar numbers of tokens for LLMs. This approach is non-optimal for multimodal understanding and inefficient for processing inputs with long and short visual content. To solve the problem, we propose Oryx, a unified multimodal architecture for the spatial-temporal understanding of images, videos, and multi-view 3D scenes. Oryx offers an on-demand solution to seamlessly and efficiently process visual inputs with arbitrary spatial sizes and temporal lengths through two core innovations: 1) a pre-trained OryxViT model that can encode images at any resolution into LLM-friendly visual representations; 2) a dynamic compressor module that supports 1x to 16x compression on visual tokens by request. These design features enable Oryx to accommodate extremely long visual contexts, such as videos, with lower resolution and high compression, while maintaining high recognition precision for tasks like document understanding with native resolution and no compression. Beyond the architectural improvements, enhanced data curation and specialized training on long-context retrieval and spatial-aware data help Oryx achieve strong capabilities in image, video, and 3D multimodal understanding simultaneously. Our work is open-sourced at https://github.com/Oryx-mllm/Oryx.
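To make the on-demand compression idea concrete, the sketch below shows how a per-request compression ratio could reduce the number of visual tokens handed to an LLM via spatial pooling. This is a minimal illustration, not the paper's actual module: the `DynamicCompressor` class, its average-pooling strategy, and the `proj` layer are assumptions for exposition; the real implementation lives in the repository linked above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicCompressor(nn.Module):
    """Minimal sketch of request-driven visual-token compression.

    Assumption: a square pooling window approximates the 1x-16x ratios
    described in the abstract (ratio=4 -> 2x2 window, ratio=16 -> 4x4).
    """

    def __init__(self, dim: int):
        super().__init__()
        # Hypothetical projection mapping pooled features into the LLM space.
        self.proj = nn.Linear(dim, dim)

    def forward(self, tokens: torch.Tensor, ratio: int = 1) -> torch.Tensor:
        # tokens: (batch, H, W, dim) grid of features from the visual encoder.
        if ratio > 1:
            side = int(ratio ** 0.5)          # pooling window per axis
            x = tokens.permute(0, 3, 1, 2)    # (batch, dim, H, W)
            x = F.avg_pool2d(x, side)         # shrink token count by ~ratio
            tokens = x.permute(0, 2, 3, 1)    # back to (batch, H', W', dim)
        b, h, w, d = tokens.shape
        # Flatten the grid into a token sequence for the LLM.
        return self.proj(tokens.reshape(b, h * w, d))

# Usage: long videos might request ratio=16; documents keep ratio=1.
compressor = DynamicCompressor(dim=1024)
frame_tokens = torch.randn(1, 24, 24, 1024)   # 576 tokens per frame
compact = compressor(frame_tokens, ratio=16)  # -> (1, 36, 1024)
```

The point of the sketch is the control flow, not the operator: the same input can yield 576 tokens for fine-grained tasks or 36 tokens for long-context video, chosen at inference time.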
