
Memorization-Compression Cycles Improve Generalization

Fangyuan Yu
Publication date: 5/15/2025
Abstract

We prove theoretically that generalization improves not only through data scaling but also by compressing internal representations. To operationalize this insight, we introduce the Information Bottleneck Language Modeling (IBLM) objective, which reframes language modeling as a constrained optimization problem: minimizing representation entropy subject to optimal prediction performance. Empirically, we observe an emergent memorization-compression cycle during LLM pretraining, evidenced by oscillating positive/negative gradient alignment between cross-entropy and Matrix-Based Entropy (MBE), a measure of representation entropy. This pattern closely mirrors the predictive-compressive trade-off prescribed by IBLM and also parallels the biological alternation between awake learning and sleep consolidation. Motivated by this observation, we propose Gated Phase Transition (GAPT), a training algorithm that adaptively switches between memorization and compression phases. When applied to GPT-2 pretraining on the FineWeb dataset, GAPT reduces MBE by 50% and improves cross-entropy by 4.8%. GAPT improves OOD generalization by 35% in a pretraining task on arithmetic multiplication. In a setting designed to simulate catastrophic forgetting, GAPT reduces interference by compressing and separating representations, achieving a 97% improvement in separation, paralleling the functional role of sleep consolidation.
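To make the two quantities in the abstract concrete, the sketch below assumes a common matrix-based entropy estimator (the von Neumann-style entropy of a trace-normalized Gram matrix of hidden states) and a toy gradient-alignment gate that switches between a "memorization" and a "compression" phase when the cross-entropy and MBE gradients conflict. The function names, the normalization choices, and the gating rule are illustrative assumptions, not the paper's exact MBE definition or GAPT schedule.

```python
import torch
import torch.nn.functional as F


def matrix_based_entropy(z: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """Illustrative matrix-based entropy of a batch of hidden states z (n x d).

    Assumption: L2-normalize rows, build a trace-normalized Gram matrix,
    and return the entropy of its eigenvalue spectrum.
    """
    z = F.normalize(z, dim=-1)
    gram = z @ z.T                      # (n, n) pairwise similarity matrix
    gram = gram / gram.trace()          # eigenvalues now sum to 1
    eigvals = torch.linalg.eigvalsh(gram).clamp_min(eps)
    return -(eigvals * eigvals.log()).sum()


def gapt_phase(ce_grad: torch.Tensor, mbe_grad: torch.Tensor) -> str:
    """Toy gate (hypothetical): memorize while the cross-entropy and MBE
    gradients point in a compatible direction, compress when they conflict."""
    alignment = F.cosine_similarity(ce_grad.flatten(), mbe_grad.flatten(), dim=0)
    return "memorization" if alignment >= 0 else "compression"
```

In this reading, the IBLM objective corresponds to minimizing matrix_based_entropy(z) subject to the cross-entropy loss staying near its optimum, and the oscillating sign of the alignment score is what the abstract describes as the memorization-compression cycle.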