
Grokking in the Wild: Data Augmentation for Real-World Multi-Hop Reasoning with Transformers

Roman Abramov, Felix Steinbauer, Gjergji Kasneci
Published: 5/11/2025
Abstract

Transformers have achieved great success in numerous NLP tasks but continue to exhibit notable gaps in multi-step factual reasoning, especially when real-world knowledge is sparse. Recent advances in grokking have demonstrated that neural networks can transition from memorizing to perfectly generalizing once they detect underlying logical patterns - yet these studies have primarily used small, synthetic tasks. In this paper, for the first time, we extend grokking to real-world factual data and address the challenge of dataset sparsity by augmenting existing knowledge graphs with carefully designed synthetic data to raise the ratio phi_r of inferred facts to atomic facts above the threshold required for grokking. Surprisingly, we find that even factually incorrect synthetic data can strengthen emergent reasoning circuits rather than degrade accuracy, as it forces the model to rely on relational structure rather than memorization. When evaluated on multi-hop reasoning benchmarks, our approach achieves up to 95-100% accuracy on 2WikiMultiHopQA - substantially improving over strong baselines and matching or exceeding current state-of-the-art results. We further provide an in-depth analysis of how increasing phi_r drives the formation of generalizing circuits inside Transformers. Our findings suggest that grokking-based data augmentation can unlock implicit multi-hop reasoning capabilities, opening the door to more robust and interpretable factual reasoning in large-scale language models.
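To make the ratio phi_r concrete, the following is a minimal illustrative sketch (not the authors' implementation) of how one might compute the ratio of two-hop inferred facts to atomic facts in a toy knowledge graph, and how adding a synthetic atomic fact can raise that ratio. All entity and relation names here are hypothetical.

```python
# Illustrative sketch only: toy knowledge graph of (head, relation, tail)
# triples; phi_r is the ratio of two-hop inferred facts to atomic facts.
from itertools import product

atomic = {
    ("alice", "mother", "bob"),
    ("bob", "employer", "acme"),
    ("carol", "mother", "dave"),
}

def infer_two_hop(facts):
    """Compose (h, r1, m) with (m, r2, t) into an inferred fact (h, r1_r2, t)."""
    inferred = set()
    for (h, r1, m1), (m2, r2, t) in product(facts, facts):
        if m1 == m2 and h != t:
            inferred.add((h, f"{r1}_{r2}", t))
    return inferred

def phi_r(facts):
    """Ratio of inferred facts to atomic facts."""
    return len(infer_two_hop(facts)) / len(facts)

print(phi_r(atomic))  # only (alice, mother_employer, acme) is inferable: 1/3

# Adding one synthetic atomic fact creates a new two-hop composition
# ((carol, mother_employer, synthco)), raising phi_r from 1/3 to 2/4.
augmented = atomic | {("dave", "employer", "synthco")}
print(phi_r(augmented))
```

In this toy setting, the synthetic fact need not be true: what matters for the inferred-to-atomic ratio is that it completes a relational composition, which is consistent with the paper's observation that even factually incorrect synthetic data can strengthen the emergent reasoning circuits.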