MIT and UC San Diego Researchers Develop Method to Detect and Manipulate Hidden Concepts in Large Language Models
Large language models (LLMs) like ChatGPT and Claude have evolved beyond simple text generators, absorbing vast amounts of human knowledge and expressing complex abstract ideas such as biases, moods, personalities, and stances. How these concepts are stored and represented inside the models, however, has remained largely hidden. Now, a team from MIT and the University of California San Diego has developed a method to uncover, analyze, and even manipulate these hidden representations.
The researchers built a targeted approach around a predictive modeling algorithm called a recursive feature machine (RFM), which efficiently identifies specific patterns in a model’s internal data that correspond to a concept of interest. Traditional unsupervised methods sift through massive amounts of data, like casting a wide net. This method works more like using bait to attract a specific type of fish: it focuses precisely on the concept being studied.
The team tested the method on more than 500 general concepts across several of the largest LLMs and vision-language models. These included fears (like fear of marriage or insects), expert personas (such as social influencer or medievalist), moods (boastful, detachedly amused), location preferences (Boston, Kuala Lumpur), and archetypal identities (Ada Lovelace, Neil deGrasse Tyson). The algorithm located representations of these concepts and could then steer the model’s output by amplifying or weakening them.
For example, when the team enhanced the “conspiracy theorist” concept in a vision-language model, the model’s response about the famous “Blue Marble” image of Earth took on a conspiratorial tone, claiming the photo was staged or manipulated. Similarly, when the team amplified the “anti-refusal” concept, the model bypassed its usual safety protocols and provided instructions for illegal activities, such as robbing a bank.
While the ability to extract and manipulate such concepts raises ethical concerns, especially around misuse, the researchers emphasize the potential for positive applications. The method can help identify and reduce harmful biases or vulnerabilities in models, improving safety. It can also be used to customize models for specific tasks by enhancing traits like brevity, logical reasoning, or empathy.
The approach works by training RFMs to detect numerical patterns in the model’s internal representations: the vectors of numbers that encode words and concepts as the model processes prompts. By comparing the representations produced by prompts related to a concept with those produced by unrelated prompts, the algorithm learns the concept’s distinctive signature inside the model. Once that signature is identified, researchers can mathematically adjust the strength of the concept during inference, directly shaping the model’s output.
The team published its findings in the journal Science and has released the underlying code publicly, enabling broader research and responsible development. According to Adityanarayanan “Adit” Radhakrishnan, assistant professor of mathematics at MIT, the work reveals that LLMs store abstract concepts in structured, manipulable forms, not just as surface-level responses. “This shows that these concepts are present in the model, but not always active,” Radhakrishnan said. “Our method allows us to extract and control them in ways that traditional prompting cannot achieve.”
The research was supported by the National Science Foundation, the Simons Foundation, the TILOS institute, and the U.S. Office of Naval Research.
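For readers who want a concrete picture of the compare-then-steer idea described in this article, the following is a minimal sketch under loud assumptions. It does not implement the team's recursive feature machine; it swaps in a much simpler difference-of-means probe, and the "activations" are synthetic random vectors rather than hidden states from a real LLM. The names `concept_direction`, `steer`, and `strength` are hypothetical and chosen only for illustration.

```python
import numpy as np


def concept_direction(concept_acts, baseline_acts):
    """Unit vector pointing from baseline activations toward concept activations.

    Stand-in for the learned concept signature (NOT the RFM used in the paper).
    """
    diff = concept_acts.mean(axis=0) - baseline_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def steer(hidden, direction, strength):
    """Shift a hidden state along the concept direction.

    Positive strength amplifies the concept; negative strength weakens it.
    """
    return hidden + strength * direction


# Synthetic stand-ins for hidden states (assumption: 16-dimensional vectors).
rng = np.random.default_rng(0)
dim = 16
true_direction = np.zeros(dim)
true_direction[3] = 1.0  # pretend the concept lives along coordinate 3

# Activations for unrelated prompts vs. concept-related prompts.
baseline_acts = rng.normal(size=(200, dim))
concept_acts = rng.normal(size=(200, dim)) + 2.0 * true_direction

# Learn the signature by comparing the two groups.
direction = concept_direction(concept_acts, baseline_acts)

# Adjust the concept's strength in a new hidden state.
hidden = rng.normal(size=dim)
amplified = steer(hidden, direction, strength=4.0)
weakened = steer(hidden, direction, strength=-4.0)

print("concept score, original: ", float(hidden @ direction))
print("concept score, amplified:", float(amplified @ direction))
print("concept score, weakened: ", float(weakened @ direction))
```

In a real model, the two activation sets would come from a chosen layer's hidden states for concept-related and unrelated prompts, and the shift would be applied to that layer during generation. The published method learns its concept signatures with RFMs rather than a simple mean difference.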
