To Understand AI, Study Its Evolution: How Training Dynamics Reveal What Models Truly Do
This article offers a compelling perspective on the growing field of AI interpretability through the work of Naomi Saphra, whose research bridges the gap between building AI models and deeply understanding them. Her analogy to evolutionary biology is both powerful and refreshing. By emphasizing stochastic gradient descent, the core algorithm behind training, she shifts the focus from static snapshots of finished models to the dynamic, often unpredictable journey of how models actually learn (a toy illustration of that path dependence appears in the first sketch below). This reframing matters because it challenges the common assumption that whatever we observe in a trained model is necessarily meaningful or essential. Instead, Saphra urges researchers to ask not just how a model works, but why it works that way, echoing the scientific rigor of evolutionary biology.

One of the most striking aspects of her story is how personal adversity shaped her research path. Losing the ability to type due to a neurological condition forced her to adapt, leading her to focus on training dynamics, a niche area at the time. The experience underscores how constraints can foster innovation and long-term thinking: working at a slower pace allowed her to avoid the hype cycles that drive much short-term research, enabling deeper, more original insights. Such resilience and perspective are rare and valuable in a field dominated by rapid progress and flashy results.

Her critique of current interpretability methods is particularly sharp. Researchers often identify patterns in trained models, such as a neuron that activates when the model outputs French, and treat them as functional explanations without asking whether they are causally relevant or even beneficial. Her example from image classification, where neurons that appear important turn out to hinder performance (models do better when those neurons are prevented from developing), shows how salient features can be evolutionary byproducts rather than functional necessities. This distinction between correlation and causation is central to robust scientific inquiry.

The article also highlights a key limitation of post-training causal interventions: manipulating a neuron might disrupt the model not because that neuron implements the behavior in question, but because it is deeply interconnected with the rest of the network (the second sketch below shows the mechanics of such an intervention). Only by observing the training process can researchers detect whether a structure and a behavior emerge together, which would suggest a genuine causal link. That temporal alignment offers stronger evidence than an isolated intervention (the third sketch outlines the bookkeeping involved).

Ultimately, Saphra's call for precision in language and methodology in interpretability research is vital. If we are to trust or regulate AI systems, we must understand not just what they do, but how and why they do it. Her work reminds us that true understanding comes not from dissecting a finished product, but from watching it grow.
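To ground the first point, here is a toy illustration of the stochasticity Saphra highlights. It is not from the article, just standard background: stochastic gradient descent updates parameters using gradients computed on randomly ordered examples, so two runs can take different paths to similar end states. Everything in the sketch (the one-parameter model, the learning rate, the seeds) is invented for illustration.

```python
import random

# Minimal SGD on a one-parameter least-squares problem: fit w so that
# w * x ~ y, where y was generated with w = 3 plus noise. The point is
# the stochasticity: each run shuffles the data differently, so the
# trajectory of w differs run to run even when the endpoint is similar.
data = [(i / 100, 3.0 * (i / 100) + random.gauss(0, 0.1)) for i in range(1, 101)]

def sgd(seed, lr=0.05, epochs=20):
    rng = random.Random(seed)
    w = 0.0
    trajectory = [w]
    for _ in range(epochs):
        rng.shuffle(data)  # a different example order each epoch and each seed
        for x, y in data:
            grad = 2.0 * (w * x - y) * x  # d/dw of (w * x - y) ** 2
            w -= lr * grad
            trajectory.append(w)
    return w, trajectory

# Two seeds converge to nearly the same w by different paths.
for seed in (0, 1):
    w, traj = sgd(seed)
    print(f"seed={seed}: final w={w:.3f}, early steps={[round(t, 3) for t in traj[:4]]}")
```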
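The ablation critique is easier to see with the mechanics in front of you. This second sketch zeroes out a single hidden unit in a toy PyTorch network and measures how the output shifts; the model, layer, and neuron index are all hypothetical. The point is only that a large shift does not by itself distinguish "this unit implements the behavior" from "this unit is merely deeply interconnected."

```python
import torch
import torch.nn as nn

# A toy network standing in for "a trained model"; the layer and the
# neuron index are hypothetical, chosen only to show the mechanics.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
model.eval()

NEURON = 7  # the unit we suspect of implementing some behavior

def ablate(module, inputs, output):
    # Zero one hidden unit's activation on every forward pass.
    output = output.clone()
    output[:, NEURON] = 0.0
    return output

x = torch.randn(8, 16)
with torch.no_grad():
    baseline = model(x)
    handle = model[1].register_forward_hook(ablate)  # hook on the ReLU output
    ablated = model(x)
    handle.remove()

# A large shift shows the unit matters to the *current* computation, but,
# per Saphra's point, not that it is the functional cause of a behavior:
# a deeply interconnected unit can be load-bearing without encoding the
# feature we attribute to it.
print("mean output shift after ablation:", (baseline - ablated).abs().mean().item())
```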
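Finally, the training-dynamics alternative amounts to simple bookkeeping over saved checkpoints: measure a structure and a behavior at each step and look for curves that rise together. In this third sketch the loader and both metrics are placeholder stubs, not real library calls; only the shape of the analysis is the point.

```python
import random

def load_checkpoint(path):
    # Placeholder stub: stands in for loading a saved model from disk.
    return path

def structure_score(model):
    # Placeholder stub: e.g. accuracy of a probe for some internal structure.
    return random.random()

def behavior_score(model):
    # Placeholder stub: e.g. the model's score on the behavior of interest.
    return random.random()

# Measure both quantities at each saved training step.
steps = [1_000, 5_000, 10_000, 50_000, 100_000]
curves = []
for step in steps:
    model = load_checkpoint(f"ckpt_{step}.pt")  # hypothetical checkpoint name
    curves.append((step, structure_score(model), behavior_score(model)))

# If structure and behavior jump at the same step, that temporal alignment
# is stronger evidence of a genuine link than a single post-hoc ablation,
# which is the kind of evidence Saphra argues for.
for step, s, b in curves:
    print(f"step {step:>7}: structure={s:.3f} behavior={b:.3f}")
```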
