2 months ago

UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning

Zihao Huang Yu Bao Qiyang Min Siyan Chen Ran Guo Hongzhi Huang Defa Zhu Yutao Zeng Banggu Wu Xun Zhou

Abstract

While Mixture of Experts (MoE) models achieve remarkable efficiency by activating only subsets of parameters, they suffer from high memory access costs during inference. Memory-layer architectures offer an appealing alternative with far fewer memory accesses, but previous attempts like UltraMem have only matched the performance of 2-expert MoE models, falling significantly short of state-of-the-art 8-expert configurations. We present UltraMemV2, a redesigned memory-layer architecture that closes this performance gap. Our approach introduces five key improvements: integrating memory layers into every transformer block, simplifying value expansion with single linear projections, adopting FFN-based value processing from PEER, implementing principled parameter initialization, and rebalancing memory-to-FFN computation ratios. Through extensive evaluation, we demonstrate that UltraMemV2 achieves performance parity with 8-expert MoE models under the same computation and parameter count, but with significantly lower memory access. Notably, UltraMemV2 shows superior performance on memory-intensive tasks, with improvements of +1.6 points on long-context memorization, +6.2 points on multi-round memorization, and +7.9 points on in-context learning. We validate our approach at scale with models up to 2.5B activated parameters from 120B total parameters, and establish that activation density has a greater impact on performance than total sparse parameter count. Our work brings memory-layer architectures to performance parity with state-of-the-art MoE models, presenting a compelling alternative for efficient sparse computation.
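To make the memory-access argument concrete, the following is a minimal illustrative sketch of a generic sparse key-value memory lookup. It is not the UltraMemV2 implementation. The function name `memory_layer_lookup`, the table sizes, and the softmax weighting are all assumptions chosen for illustration. The sketch also scores every key exhaustively for simplicity, which practical memory layers typically avoid through factorized key retrieval. The point it shows is that only `top_k` value rows are read per query, however large the value table is.

```python
import math
import random


def memory_layer_lookup(query, keys, values, top_k):
    """Return a softmax-weighted sum of the top_k values whose keys best match the query.

    Hypothetical helper for illustration only; not the paper's method.
    """
    # Score every key against the query with a dot product (exhaustive, for simplicity).
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Keep only the indices of the top_k highest-scoring memory slots.
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Softmax over the selected scores only (subtract the max for numerical stability).
    max_score = max(scores[i] for i in top)
    weights = [math.exp(scores[i] - max_score) for i in top]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Only top_k value rows are read here; all other rows are never touched.
    dim = len(values[0])
    out = [0.0] * dim
    for w, i in zip(weights, top):
        for d in range(dim):
            out[d] += w * values[i][d]
    return out, top


random.seed(0)
num_slots, key_dim, value_dim, top_k = 1024, 8, 4, 4  # assumed toy sizes
keys = [[random.gauss(0, 1) for _ in range(key_dim)] for _ in range(num_slots)]
values = [[random.gauss(0, 1) for _ in range(value_dim)] for _ in range(num_slots)]
query = [random.gauss(0, 1) for _ in range(key_dim)]

output, selected = memory_layer_lookup(query, keys, values, top_k)

# Fraction of value rows read per query in this toy setup.
activation_ratio = top_k / num_slots
# Fraction of parameters activated at the largest scale quoted in the abstract
# (2.5B activated out of 120B total).
activated_fraction = 2.5e9 / 120e9
```

In this toy setup each query reads 4 of 1024 value rows. The abstract's largest configuration activates 2.5B of 120B parameters, roughly 2% of the total.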

