
Where to find Grokking in LLM Pretraining? Monitor Memorization-to-Generalization without Test

Ziyue Li, Chenrui Fan, Tianyi Zhou
Abstract

Grokking, i.e., test performance continuing to improve long after the training loss has converged, has recently been observed in neural network training, making the mechanism of generalization and of other emerging capabilities such as reasoning mysterious. While prior studies usually train small models on a few toy or highly specific tasks for thousands of epochs, we conduct the first study of grokking on checkpoints from the one-pass pretraining of a 7B large language model (LLM), i.e., OLMoE. We compute the training loss and evaluate generalization on diverse benchmark tasks, including math reasoning, code generation, and commonsense/domain-specific knowledge retrieval. Our study verifies, for the first time, that grokking still happens in the pretraining of large-scale foundation models, though different data may enter grokking stages asynchronously. We further demystify grokking's "emergence of generalization" by investigating LLM internal dynamics. Specifically, we find that training samples' pathways (i.e., expert choices across layers) evolve from random and instance-specific to more structured and shareable between samples during grokking. The complexity of a sample's pathway also decreases despite the converged loss. These findings indicate a memorization-to-generalization conversion, providing a mechanistic explanation of delayed generalization. In this study, we develop two novel metrics that quantify the distance between pathways and the complexity of a single pathway, and show that they predict the generalization improvement on diverse downstream tasks. They are efficient, simple to compute, and depend solely on training data; hence they have practical value for pretraining, enabling us to monitor generalization without finetuning or test sets. Theoretically, we show that more structured pathways reduce model complexity and improve the generalization bound.
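The abstract does not give the exact formulas for the two pathway metrics, but the idea of treating a sample's "pathway" as its sequence of expert choices across MoE layers can be illustrated with a minimal sketch. Below, pathway_distance (a normalized Hamming distance between two expert-choice sequences) and pathway_complexity (the entropy of one sample's expert-usage histogram) are hypothetical stand-ins chosen for illustration, not the paper's definitions; expert counts and routing sequences are made up.

```python
import numpy as np

def pathway_distance(path_a, path_b):
    """Fraction of layers at which two samples route to different experts."""
    assert len(path_a) == len(path_b), "pathways must cover the same layers"
    diffs = sum(a != b for a, b in zip(path_a, path_b))
    return diffs / len(path_a)

def pathway_complexity(path, num_experts):
    """Shannon entropy (bits) of a single pathway's expert-usage histogram.

    Lower entropy means the sample reuses fewer experts, i.e., a more
    structured, compressible pathway.
    """
    counts = np.bincount(np.asarray(path), minlength=num_experts)
    probs = counts / counts.sum()
    probs = probs[probs > 0]
    return float(-(probs * np.log2(probs)).sum())

# Toy example: two samples routed through a 6-layer MoE with 8 experts.
sample_1 = [3, 3, 1, 7, 2, 3]
sample_2 = [3, 0, 1, 7, 2, 3]
print(pathway_distance(sample_1, sample_2))          # ~0.17: similar pathways
print(pathway_complexity(sample_1, num_experts=8))   # ~1.79 bits
```

Under this reading, grokking would show up as the average pairwise pathway distance and per-sample pathway entropy both falling over pretraining checkpoints, even while the training loss stays flat.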
