MoMa Architecture

The MoMa framework (full name: Mixture of Modality-Aware Experts) is a modality-aware mixture-of-experts (MoE) architecture proposed by Meta in the paper “MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts”, designed for pre-training mixed-modality, early-fusion language models.

MoMa processes arbitrary interleaved sequences of images and text by partitioning expert modules into modality-specific groups. Each group handles only tokens of its designated modality, while learned routing within the group preserves semantically informed adaptivity.

Experiments show that this modality-specific parameter allocation substantially improves pre-training efficiency. Under a 1-trillion-token training budget, the 1.4B-parameter MoMa model with 4 text experts and 4 image experts achieves 3.7x overall FLOPs savings relative to a compute-equivalent dense baseline (2.6x for text and 5.2x for image processing), measured by pre-training loss. This outperforms a standard expert-choice MoE with 8 mixed-modality experts, which achieves 3x overall FLOPs savings (3x for text, 2.8x for image). Combining MoMa with mixture-of-depths (MoD) further increases overall FLOPs savings to 4.2x (text: 3.4x, image: 5.3x), although this combination degrades causal inference performance due to increased sensitivity to router accuracy. These results suggest that MoMa can significantly improve the efficiency of mixed-modality, early-fusion language model pre-training, paving the way for more resource-efficient and capable multimodal AI systems.
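
The core idea can be illustrated with a minimal PyTorch sketch: each modality owns its own group of feed-forward experts with its own learned router, and a modality mask dispatches each token to the group for its modality. The module names, dimensions, and the top-1 token-choice routing used inside each group here are illustrative assumptions chosen for brevity, not the paper's exact configuration.

```python
# Minimal sketch of a modality-aware MoE layer (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpertGroup(nn.Module):
    """A group of feed-forward experts with a learned router, serving one modality."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.GELU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model) -- tokens of a single modality.
        logits = self.router(x)                  # (num_tokens, num_experts)
        weights = F.softmax(logits, dim=-1)
        top_w, top_idx = weights.max(dim=-1)     # top-1 routing, purely for brevity
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                out[mask] = top_w[mask].unsqueeze(-1) * expert(x[mask])
        return out


class ModalityAwareMoELayer(nn.Module):
    """Dispatches each token to the expert group of its modality (text or image)."""

    def __init__(self, d_model: int = 512, d_hidden: int = 2048,
                 text_experts: int = 4, image_experts: int = 4):
        super().__init__()
        self.text_group = ExpertGroup(d_model, d_hidden, text_experts)
        self.image_group = ExpertGroup(d_model, d_hidden, image_experts)

    def forward(self, x: torch.Tensor, is_image: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model); is_image: (num_tokens,) boolean modality mask.
        out = torch.empty_like(x)
        text_mask = ~is_image
        if text_mask.any():
            out[text_mask] = self.text_group(x[text_mask])
        if is_image.any():
            out[is_image] = self.image_group(x[is_image])
        return out


if __name__ == "__main__":
    layer = ModalityAwareMoELayer()
    tokens = torch.randn(10, 512)
    modality = torch.tensor([0, 0, 1, 1, 1, 0, 1, 0, 0, 1], dtype=torch.bool)
    print(layer(tokens, modality).shape)  # torch.Size([10, 512])
```

Because each group only ever sees tokens of its own modality, its expert parameters can specialize without interference from the other modality, which is the intuition behind the modality-specific parameter allocation and the reported efficiency gains.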