HyperAI

New Framework Separates Knowledge and Logic in LLM Reasoning, Enhancing Evaluation Across Domains

13 days ago

Recent advancements in large language models (LLMs) such as OpenAI’s o1/o3 and DeepSeek-R1 have significantly improved performance on complex reasoning tasks. Despite these gains, the inner workings of these models, particularly their step-by-step reasoning, remain largely opaque. Most evaluations focus solely on the accuracy of the final answer, which reveals little about the reasoning mechanisms involved and fails to distinguish correct answers derived from logical reasoning from those recalled from internal knowledge or prior deductions.

To address this, researchers from UC Santa Cruz, Stanford, and Tongji University have developed a framework that separates the reasoning process into two components, factual knowledge and logical steps, measured by two metrics: the Knowledge Index (KI) for factual accuracy and Information Gain (InfoGain) for reasoning quality. The Knowledge Index evaluates how accurately an LLM’s reasoning steps align with established facts and expert knowledge, ensuring the model’s statements are grounded in reality and reducing errors that arise from incorrect assumptions or faulty information. Information Gain, by contrast, measures how much each reasoning step reduces uncertainty about the final answer, quantifying how much new, relevant information the model contributes at each stage and whether the reasoning is effective and efficient. (A toy sketch of both metrics appears at the end of this article.)

The researchers applied the framework to the Qwen2.5-7B base model and its DeepSeek-R1-distilled variant (Qwen-R1) across the math and medical domains. In mathematical tasks, sound reasoning is often more critical than factual recall, since problems typically require a series of abstract steps to reach a solution. Medical tasks, in contrast, depend heavily on factual knowledge, making the distinction between the two components especially important.

The findings were revealing. Supervised fine-tuning (SFT) consistently improved the factual accuracy and overall performance of Qwen-Base on medical tasks, but it sometimes reduced reasoning depth. This suggests that SFT excels at instilling domain-specific knowledge but does not always strengthen the model’s ability to work through complex problems. Reinforcement learning (RL), conversely, refined reasoning by pruning irrelevant information and improving the logical flow of the model’s steps, yet RL alone did not sufficiently improve factual accuracy. A combination of SFT and RL therefore appears necessary for optimal performance on reasoning tasks.

For example, the distilled Qwen-R1 performed worse than Qwen-Base on medical tasks despite additional fine-tuning and reinforcement. The researchers attribute this gap to Qwen-R1’s initial training focus on math and coding, a mismatch with the kind of reasoning medical tasks require, and conclude that domain-specific training is vital for strong performance in specialized fields. The study also found that reasoning skills do not transfer easily between domains: a model trained extensively on math did not perform well on medical tasks, even after subsequent training.
This indicates that each domain has unique reasoning requirements, and generalized models may struggle to meet them without targeted adjustments.

The framework provides valuable insight into the strengths and weaknesses of LLMs in different contexts. By breaking the reasoning process into measurable components, it lets developers identify and address specific weaknesses in a model’s performance, which could lead to more reliable and interpretable LLMs, particularly in high-stakes areas such as medicine and law, where accurate and logical reasoning is paramount.

The framework’s applicability also extends beyond math and medicine. It can be adapted to evaluate LLMs in other critical domains, such as finance and legal services, where systematic and precise reasoning is equally important. Separating knowledge from logic enables tailored training strategies that improve factual accuracy while maintaining or deepening reasoning.

Industry insiders have praised the research for its potential to bring transparency and trustworthiness to LLMs. The framework could support more informed decision-making in AI development, ensuring that models are not only accurate but also robust in their reasoning, and companies like OpenAI and DeepMind stand to benefit from these insights as they refine their own models.

The team, drawn from UC Santa Cruz, Stanford, and Tongji University, combines expertise in computer science, linguistics, and domain-specific knowledge, yielding a rigorous method for evaluating LLMs and underscoring the importance of collaboration between academia and industry. The paper, code, and project page are available for further exploration, and interested readers can follow the researchers on Twitter and join the 99k+ ML SubReddit for updates and discussions.
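To make the two metrics more concrete, below is a minimal, illustrative Python sketch. It is not the authors’ implementation: the paper defines InfoGain and KI over the evaluated model’s own reasoning traces, with a verifier judging factual claims, whereas here the answer distributions and verifier verdicts are invented toy values, and the function names are illustrative rather than taken from the released code.

```python
import math
from typing import Dict, List

def entropy(probs: Dict[str, float]) -> float:
    """Shannon entropy (in bits) of a distribution over candidate answers."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

def info_gain(before: Dict[str, float], after: Dict[str, float]) -> float:
    """Uncertainty reduction contributed by one reasoning step:
    entropy over answers before the step minus entropy after it."""
    return entropy(before) - entropy(after)

def knowledge_index(step_verdicts: List[bool]) -> float:
    """Fraction of factual claims in the reasoning chain that a verifier
    judged consistent with reference knowledge."""
    return sum(step_verdicts) / len(step_verdicts) if step_verdicts else 0.0

# Toy walkthrough: answer distribution over three candidates before and
# after one reasoning step (values are made up for illustration).
p_before_step = {"A": 0.4, "B": 0.4, "C": 0.2}
p_after_step = {"A": 0.8, "B": 0.15, "C": 0.05}
print(f"InfoGain of step: {info_gain(p_before_step, p_after_step):.3f} bits")

# Suppose a fact-checker marked 3 of 4 claims in the chain as supported.
print(f"Knowledge Index:  {knowledge_index([True, True, False, True]):.2f}")
```

In the paper’s setting, the “before” and “after” distributions would come from the model conditioned on progressively longer reasoning prefixes, and the verdicts from a knowledge-grounded verifier; the sketch only shows how the two numbers capture different failure modes, with uninformative steps lowering InfoGain and factually unsupported steps lowering KI.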
