Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts

Large multi-modal models (LMMs) exhibit remarkable performance across numerous tasks. However, generalist LMMs often suffer from performance degradation when tuned over a large collection of tasks. Recent research suggests that Mixture of Experts (MoE) architectures are useful for instruction tuning, but for LMMs of parameter size around O(50-100B), the prohibitive cost of replicating and storing the expert models severely limits the number of experts we can use. We propose Omni-SMoLA, an architecture that uses the Soft MoE approach to (softly) mix many multimodal low-rank experts, and avoids introducing a significant number of new parameters compared to conventional MoE models. The core intuition here is that the large model provides a foundational backbone, while different lightweight experts residually learn specialized knowledge, either per-modality or multimodally. Extensive experiments demonstrate that the SMoLA approach helps improve the generalist performance across a broad range of generative vision-and-language tasks, achieving new SoTA generalist performance that often matches or outperforms single specialized LMM baselines, as well as new SoTA specialist performance.
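To make the core idea concrete, the sketch below shows one way to softly mix low-rank (LoRA-style) experts on top of a frozen linear projection: a per-token softmax router weights each expert's low-rank update, and the weighted sum is added residually to the frozen base output. This is a minimal illustration of the abstract's description, not the authors' implementation; the class name, `num_experts`, `rank`, and the simple softmax gate (rather than a full Soft MoE dispatch/combine scheme) are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftMixtureOfLowRankExperts(nn.Module):
    """Frozen base linear layer plus a softly mixed set of low-rank experts.

    Hypothetical sketch: each expert is a LoRA-style pair (A_k, B_k); a
    per-token router produces softmax mixing weights, and the experts'
    low-rank corrections are added residually to the frozen backbone output.
    """

    def __init__(self, d_in: int, d_out: int, num_experts: int = 4, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        self.base.weight.requires_grad_(False)   # backbone weights stay frozen
        self.base.bias.requires_grad_(False)
        self.router = nn.Linear(d_in, num_experts)                 # soft routing logits
        self.A = nn.Parameter(torch.randn(num_experts, d_in, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_experts, rank, d_out))  # zero-init: starts as identity residual

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, d_in)
        weights = F.softmax(self.router(x), dim=-1)                 # (b, t, E) soft mixing weights
        low_rank = torch.einsum("btd,edr,ero->bteo", x, self.A, self.B)  # per-expert low-rank outputs
        mixed = torch.einsum("bte,bteo->bto", weights, low_rank)    # soft combination over experts
        return self.base(x) + mixed                                 # residual expert correction

# Usage example (hypothetical shapes):
# layer = SoftMixtureOfLowRankExperts(d_in=1024, d_out=1024, num_experts=4, rank=8)
# y = layer(torch.randn(2, 16, 1024))   # -> (2, 16, 1024)
```

Because each expert only adds rank-r factors A_k and B_k rather than a full copy of the layer, the parameter overhead grows with `num_experts * rank * (d_in + d_out)` instead of replicating entire expert models, which is the property the abstract highlights for large (50-100B) LMMs.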