OpenAI's Sparse Model Offers Insight into AI's Inner Thoughts
In the era of rapid AI advancement driven by large language models, one fundamental challenge remains: the inner workings of these systems are still largely opaque. We build neural networks to power AI, but we don't write down their logic step by step. Instead, models learn by adjusting billions of weights during training until they perform well; the result is often a dense, tangled structure that humans struggle to interpret. As AI increasingly influences critical domains such as science, education, healthcare, and public safety, this lack of understanding has become deeply concerning. OpenAI is now working to change that.

In a recent exclusive interview with MIT Technology Review, OpenAI research scientist Leo Gao revealed a new experimental large language model called the weight-sparse transformer. Its performance falls far short of state-of-the-art models like GPT-5, Claude, or Gemini; in capability it is roughly comparable to OpenAI's 2018 GPT-1, though no direct comparison has been made. But it has a rare and powerful trait: it can be truly understood by humans.

Why does a comprehensible model matter?

Today's large models are both impressive and unsettling. They generate answers without explaining their reasoning, sometimes hallucinate without clear triggers, and display complex reasoning without anyone knowing whether it is reliable. Most LLMs rely on dense neural networks, where every neuron connects to nearly every neuron in the adjacent layers. While effective for learning, this design scatters knowledge across countless connections, creating a complex, entangled web that resists analysis. In such models, a single concept may be split across multiple distant neurons; a single neuron may serve multiple purposes; and tracing the full path of a logical inference is nearly impossible. As a result, large models are often likened to "airplane engines no one dares to open." OpenAI's approach flips this paradigm.
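The contrast between dense and sparse connectivity can be made concrete with a small sketch. The code below is purely illustrative (the layer width, sparsity level, and `fan_in` helper are invented for this example, and this is not OpenAI's architecture or training procedure); it shows why sparsity aids tracing: in a sparse weight matrix, each output neuron depends on only a few inputs, so attributing its behavior is tractable.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8  # tiny layer width, chosen only for illustration

# Dense layer: every output neuron can depend on every input neuron.
W_dense = rng.normal(size=(n, n))

# Sparse layer: force most weights to zero, keeping ~20% of connections.
# (Illustrative random mask; not how a weight-sparse transformer is trained.)
mask = rng.random(size=(n, n)) < 0.2
W_sparse = W_dense * mask

def fan_in(W, out_idx):
    """Indices of input neurons that can influence a given output neuron."""
    return np.flatnonzero(W[out_idx] != 0)

print(fan_in(W_dense, 0).size)   # 8: every input can influence this output
print(fan_in(W_sparse, 0).size)  # typically only a few inputs remain
```

Tracing "which inputs mattered" for one output of the dense layer means following all `n` connections; for the sparse layer it means following a handful, which is the property that makes circuit analysis feasible.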
Instead of optimizing for performance, the team is pursuing mechanistic interpretability: understanding not just what a model outputs, but how it arrives at that output. They built a model structurally similar to GPT-2 but introduced a key change: they forced most weights to zero, creating a sparse network where each neuron connects to only a few others. This design forces the model to localize features and functions rather than spreading them across the network. The trade-off is clear: the model is slower and less capable. But in return, its internal structure becomes readable, traceable, and interpretable. "The difference in explainability is striking," Gao said.

The team tested the model on simple tasks, such as determining whether a string should be closed with a single or double quote based on its opening quote. In a dense model, this process is nearly invisible. But in the sparse model, researchers could clearly observe the full computation chain: the model encoded single and double quotes into separate channels; an MLP module processed them into "is a quote" and "which type of quote" features; an attention head skipped over the intermediate tokens to locate the opening quote; and the final output copied the matching closing type. This is the first time the internal logic of a model has been visualized with such clarity.

For more complex tasks, such as variable binding in Python code, the full circuit is harder to map, but OpenAI still identified key functional pathways. For example, one attention head copied a variable's name to its definition, and another carried the type from the definition to its usage. These "partial circuits" already allow researchers to predict the model's behavior.

Can this approach scale to models like GPT-3 or beyond? Mathematician Elisenda Grigsby of Boston College remains skeptical, noting that larger models must handle far more complex and diverse tasks.
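The quote-closing circuit described above can be mimicked in a few lines of toy code. Everything here (the vocabulary, the one-hot channels, the `closing_quote` helper) is a hypothetical reconstruction of the steps the article describes, not the actual features OpenAI recovered from its model:

```python
import numpy as np

VOCAB = ["'", '"', "a", "b"]          # tiny vocabulary for illustration
IDX = {t: i for i, t in enumerate(VOCAB)}

def embed(tokens):
    # One-hot embedding: single and double quotes occupy separate channels.
    x = np.zeros((len(tokens), len(VOCAB)))
    for pos, t in enumerate(tokens):
        x[pos, IDX[t]] = 1.0
    return x

def mlp_quote_features(x):
    # Stand-ins for the "is a quote" and "which type of quote" features.
    is_quote = x[:, IDX["'"]] + x[:, IDX['"']]
    which = x[:, IDX['"']]            # 0.0 = single quote, 1.0 = double quote
    return is_quote, which

def attend_to_opening_quote(is_quote, which):
    # Toy attention head: skip over intermediate tokens, attend to the
    # first position flagged as a quote, and copy its type.
    open_pos = int(np.argmax(is_quote))
    return which[open_pos]

def closing_quote(tokens):
    x = embed(tokens)
    is_quote, which = mlp_quote_features(x)
    return '"' if attend_to_opening_quote(is_quote, which) else "'"

print(closing_quote(["'", "a", "b"]))   # '
print(closing_quote(['"', "b", "a"]))   # "
```

The point of the sketch is that each stage is a separate, inspectable function; in the sparse model the claim is that analogous stages can be read off the network's weights themselves.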
Gao and his team acknowledge the current limitations and agree this approach will never match the performance of the most advanced models. But they believe that, with further development, it may one day be possible to build a GPT-3-level model that is fully interpretable. "Maybe within a few years, we'll have a fully explainable GPT-3," Gao said. "You could walk into every part of it and understand how it performs every task. If we can build such a system, we'll learn so much."

Whether or not the method can scale to the largest models, the weight-sparse transformer represents a pivotal step. It is not about building the most powerful AI, but about answering a fundamental question: can we make AI not just strong, but transparent and trustworthy? While mechanistic interpretability is still in its early stages, this experiment shows that a more open, understandable future for AI is not just possible; it is already within reach.
