Micro-Adjusting Input Embeddings Enhances LLM Reasoning to Near-Perfect Levels
A team of researchers from Brown University has discovered that fine-tuning just the input embedding layer of a large language model (LLM) can enable near-perfect performance on complex reasoning tasks, challenging long-standing assumptions about the model's inherent reasoning capabilities. This finding has significant implications for how we understand and develop artificial intelligence.

For years, a central debate in AI research has been whether LLMs possess true "abstract reasoning" abilities. Abstract reasoning is widely considered a hallmark of general, flexible intelligence, and if LLMs lack it, the current paradigm of AI development may need a fundamental reevaluation. However, both the definition of "abstract reasoning" and the criteria for measuring it remain ambiguous. Some studies have argued that LLMs fail on tasks involving visual, analogical, and quantitative reasoning, suggesting a lack of deep cognitive capacity.

The Brown team re-examined this framework, successfully reproducing prior results showing that unmodified pre-trained LLMs perform poorly on these tasks. But they also uncovered a striking new insight: by fine-tuning only the input embedding layer while keeping the Transformer blocks frozen, performance improved dramatically, matching or even surpassing full-model fine-tuning and achieving near-perfect results on certain tasks.

Further experiments revealed that this effect is not limited to text. In visual reasoning tasks, when the visual encoder was fine-tuned on the target domain, a frozen pre-trained LLM still delivered strong results. This suggests that the core reasoning capability of LLMs is highly transferable, but that its full potential is unlocked only when the input representation is properly adapted to the task. These findings lead to a deeper philosophical question: what does it mean to be an "abstract reasoner"?
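The freezing recipe described above can be illustrated with a toy model: a trainable embedding table feeds a frozen "reasoning" layer, and gradient updates touch only the embeddings. The sketch below is not the authors' code; the task, dimensions, and learning rate are invented for illustration, with the frozen random matrix `W` standing in for the pre-trained Transformer blocks.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, C = 8, 8, 8                 # vocab size, embedding dim, output classes

E = rng.normal(0, 0.1, (V, D))    # trainable input embeddings
W = rng.normal(0, 0.5, (D, C))    # frozen "reasoning" weights (never updated)
W_frozen = W.copy()

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

x = np.arange(V)                  # one toy example per token
y = np.arange(V)                  # toy target: map each token to itself

def loss_fn(E):
    p = softmax(E[x] @ W)         # embed tokens, pass through frozen layer
    return -np.log(p[np.arange(V), y]).mean()

initial = loss_fn(E)
for _ in range(200):
    p = softmax(E[x] @ W)
    d_logits = p.copy()
    d_logits[np.arange(V), y] -= 1.0
    d_logits /= V
    E[x] -= 1.0 * (d_logits @ W.T)  # gradient step on the embeddings ONLY
final = loss_fn(E)
```

Even though `W` is random and never trained, the loss drops substantially, because the trainable input side can reshape its representations to suit the frozen computation downstream; this is the same division of labor the Brown experiments exploit at much larger scale.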
The team argues that before asking whether LLMs are abstract reasoners, researchers must first clarify why the question matters. The motivation may stem from two distinct goals: (1) understanding whether LLMs emulate human-like cognition, in which case minimal modification is essential, or (2) building more effective and efficient AI systems, where fine-tuning is a valid and necessary tool. These goals require different experimental designs and cannot be addressed within a single framework.

The work has been well received by the academic community. The area chair and reviewers praised the paper for its rigorous replication of benchmark experiments and its challenge to the prevailing view. One reviewer noted that the study "demonstrates convincingly that minimal input-level adaptation can yield dramatic performance gains," while the area chair highlighted the "important implications" of the findings and commended the authors for their deep reflection on the definition of abstract reasoning and the motivations behind studying it.

The practical applications of this discovery are broad. First, it could drastically reduce the cost of adapting LLMs to downstream tasks: developers could fine-tune only the input layer instead of the entire model, saving significant compute and training time. Second, it enables faster, lighter-weight adaptation, making large models more suitable for resource-constrained environments such as mobile and embedded devices. Third, it supports the design of unified multimodal interfaces, where shared embedding spaces allow seamless integration across vision, language, and other modalities.

The project began as an attempt to use in-context planning to teach LLMs to complete tasks in simulated environments. The required compute proved too high, however, so the team shifted focus to visual reasoning tasks, which still demand strong reasoning but are more feasible to study.
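To get a feel for the scale of those savings, compare an embedding table against the full parameter count of a 7B-class model. The vocabulary size, hidden dimension, and total count below are illustrative round numbers loosely modeled on publicly described 7B models, not figures from the paper:

```python
# Illustrative sizes for a 7B-class LLM (round-number assumptions,
# not figures from the Brown paper).
vocab_size = 32_000
hidden_dim = 4_096
total_params = 7_000_000_000

# The input embedding table is just vocab_size x hidden_dim weights.
embedding_params = vocab_size * hidden_dim
fraction = embedding_params / total_params

print(f"embedding params: {embedding_params:,}")
print(f"fraction of full model: {fraction:.2%}")
```

Under these assumptions the trainable surface shrinks to roughly 131M parameters, under 2% of the model, which is why input-only fine-tuning translates directly into lower memory and compute budgets.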
The work then centered on two key questions: (1) what kind of visual representation is best for reasoning, and (2) how to simplify the model architecture. The team explored object-centric representations, which proved highly effective. They adopted a LLaVA-like architecture but sought to avoid full-model fine-tuning. This led to a pivotal question: if the Transformer blocks are already good at abstract reasoning, why not just adapt the input? The team tested the idea and found that it worked, in both text and visual tasks.

A key moment came when team member Chen questioned whether the fine-tuned visual encoder generalized well. The results were surprising and counterintuitive, prompting Ellie to suggest exploring input embedding fine-tuning in the language model instead. This shift ultimately led to the paper's core discovery.

The lead author, Yuntian (Cloud) Chen, is a Ph.D. student in computer science at Brown University, working under Professors Chen Sun and Ellie Pavlick. His research focuses on multimodal learning, particularly vision-language models, and on model interpretability. He is currently a research scientist intern at Meta, collaborating with Hengduo Li. Chen holds a B.S. in both Computer Science and Statistics from Wake Forest University and an M.S. in Computer Science from Brown. He began working with his advisors during his master's program and later collaborated with Bo Pang and Ashish Thapliyal at Google Research.

The team plans to release the paper on arXiv and open-source the code soon.
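The LLaVA-like recipe discussed above, a trainable visual side feeding a frozen language model, can be sketched in the same toy style as the embedding experiment: here a trainable linear projector maps fixed "image features" into the frozen layer's input space, and only the projector receives gradients. All names, dimensions, and data are invented for illustration and do not reflect the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
N, Dv, D, C = 8, 8, 8, 8          # samples, visual feature dim, embed dim, classes

F = rng.normal(0, 1.0, (N, Dv))   # fixed "visual features" from an encoder
P = rng.normal(0, 0.1, (Dv, D))   # trainable projector (the adapted input side)
W = rng.normal(0, 0.5, (D, C))    # frozen "language model" weights
W_frozen = W.copy()
y = np.arange(N)                  # toy labels, one per sample

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def loss_fn():
    p = softmax(F @ P @ W)        # project features into the frozen model's space
    return -np.log(p[np.arange(N), y]).mean()

initial = loss_fn()
for _ in range(400):
    p = softmax(F @ P @ W)
    g = p.copy()
    g[np.arange(N), y] -= 1.0
    g /= N
    P -= 0.05 * F.T @ (g @ W.T)   # gradient step on the projector ONLY
final = loss_fn()
```

As with the text-side toy, the loss falls while `W` stays untouched: adapting only the interface between modality and frozen model is enough for the combined system to learn the task.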
