Grokking

In the field of deep learning, grokking refers to a phenomenon in which a neural network achieves good generalization long after its training error has decayed. What makes grokking interesting is that it is a dynamical phenomenon: the gap between training loss and test loss exists only in the middle of training. A network that groks will eventually generalize, so that both training loss and test loss are very low by the end of training.
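
The dynamics are easiest to see in the setting where grokking was first reported: small algorithmic datasets such as modular arithmetic. Below is a minimal sketch of such an experiment; the modulus, architecture, optimizer, and weight decay are illustrative assumptions rather than a prescribed recipe. With suitable hyperparameters, the logged training loss collapses early while the test loss drops only much later.

```python
# Minimal grokking sketch: train a small MLP on modular addition and log
# train/test loss over a long run. All hyperparameters here (modulus,
# width, weight decay, step count) are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
p = 97                                           # modulus (assumption)
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p
perm = torch.randperm(len(pairs))
split = len(pairs) // 2                          # 50% of pairs for training

def encode(idx):
    # One-hot encode both operands and concatenate them.
    x = torch.cat([F.one_hot(pairs[idx, 0], p),
                   F.one_hot(pairs[idx, 1], p)], dim=1).float()
    return x, labels[idx]

x_train, y_train = encode(perm[:split])
x_test, y_test = encode(perm[split:])

model = nn.Sequential(nn.Linear(2 * p, 256), nn.ReLU(), nn.Linear(256, p))
# Weight decay is commonly used to elicit grokking in this setting.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)

for step in range(20001):                        # grokking needs long training
    opt.zero_grad()
    loss = F.cross_entropy(model(x_train), y_train)
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            test_loss = F.cross_entropy(model(x_test), y_test)
        # Typically: train loss collapses early, test loss follows much later.
        print(f"step {step:6d}  train {loss.item():.4f}  test {test_loss.item():.4f}")
```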

This behavior suggests that in the initial stage the network mainly learns superficial features or patterns of the data (or simply memorizes the training set), which drives the training loss down rapidly. In later stages, the network gradually comes to capture the deeper features and structure of the data, and the test loss then begins to drop significantly. Grokking may thus mark a transition from a simple feature-learning stage to a more complex one, or it may reflect some other dynamical change in the network's learning process.

"Grokking as the transition from lazy to rich training dynamics"The paper proposes that the phenomenon of grokking (the test loss of a neural network drops significantly after the training loss) is due to the transition from initial "lazy" training to subsequent "rich" feature learning. Using polynomial regression on a two-layer network, the authors show that grokking occurs when the network shifts from fitting the data with the initial features to learning new features for better generalization. They argue that the rate of feature learning and initial feature alignment are key to this delayed generalization, a concept that may apply to more complex neural networks.

The grokking phenomenon can therefore be viewed as a transition from a kernel regime to a feature-learning regime: its signature is that the training loss of the network falls significantly earlier than the test loss, which can happen when the network switches from lazy training dynamics to a richer feature-learning mode.
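
One practical way to probe this transition is to track how far the network's feature kernel drifts from its value at initialization: if the kernel barely changes while the training loss falls, the network is operating in the lazy (kernel) regime, while a later, pronounced drift signals the switch to feature learning. The sketch below assumes a toy regression task and a simple relative Frobenius distance as the drift measure; both are illustrative choices.

```python
# Track drift of the hidden-layer feature kernel K_t = phi_t(X) phi_t(X)^T
# from its value at initialization. Near-zero drift = lazy/kernel regime;
# a sharp rise = feature learning. Task and drift measure are assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(128, 16)
y = (x[:, 0] * x[:, 1]).unsqueeze(1)             # toy nonlinear target

hidden = nn.Sequential(nn.Linear(16, 256), nn.ReLU())
head = nn.Linear(256, 1)
opt = torch.optim.Adam([*hidden.parameters(), *head.parameters()], lr=1e-3)

def feature_kernel():
    with torch.no_grad():
        phi = hidden(x)                          # hidden-layer features phi_t(X)
        return phi @ phi.T

K0 = feature_kernel()                            # kernel at initialization
for step in range(5001):
    opt.zero_grad()
    loss = ((head(hidden(x)) - y) ** 2).mean()
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        drift = (feature_kernel() - K0).norm() / K0.norm()
        print(f"step {step:5d}  loss {loss.item():.4f}  kernel drift {drift.item():.3f}")
```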

References

[1] "Grokking as the transition from lazy to rich training dynamics", ICLR 2024.