Advancements in reasoning-focused LLMs such as OpenAI's o1/o3 and DeepSeek-R1 have improved performance on complex tasks, yet their reasoning processes remain opaque.
Current evaluations of LLMs often focus on final-answer accuracy, which hides the intermediate reasoning steps and how factual knowledge and logic combine to produce them.
In domains such as math and medicine, factual errors and shallow reasoning can go undetected, exposing the limitations of final-answer-only evaluation.
The researchers propose a new framework that assesses LLM reasoning by separating factual knowledge from logical reasoning steps, measured with two metrics: the Knowledge Index (KI) and Information Gain.
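To make the idea concrete, here is a minimal sketch of how such step-level metrics might be computed. The function names (`answer_prob`, `verify_fact`) and both metric definitions are illustrative assumptions, not the paper's actual formulas: Information Gain is approximated as the change in the model's confidence in the ground-truth answer as each reasoning step is appended, and the Knowledge Index as the fraction of steps whose factual claims check out.

```python
from typing import Callable, List


def info_gain(steps: List[str],
              answer_prob: Callable[[str], float]) -> List[float]:
    """Per-step gain: how much appending each step raises the model's
    probability of the ground-truth answer (assumed definition)."""
    gains = []
    prefix = ""
    prev = answer_prob(prefix)      # confidence with no reasoning shown
    for step in steps:
        prefix += step + "\n"
        cur = answer_prob(prefix)   # confidence after adding this step
        gains.append(cur - prev)
        prev = cur
    return gains


def knowledge_index(steps: List[str],
                    verify_fact: Callable[[str], bool]) -> float:
    """Fraction of steps whose factual content is verified correct
    (assumed definition)."""
    if not steps:
        return 0.0
    return sum(verify_fact(s) for s in steps) / len(steps)
```

Under these assumed definitions, a step with near-zero Information Gain contributes little logical progress even if it is factually correct, while a low Knowledge Index flags hallucinated facts even when the final answer happens to be right.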
Evaluation of Qwen models across math and medicine tasks shows that reasoning skills do not easily transfer between domains.
The study compares supervised fine-tuning (SFT) and reinforcement learning (RL) on domain-specific tasks, examining their impact on accuracy, knowledge retention, and reasoning depth.
Results indicate that while SFT enhances factual accuracy, it may weaken reasoning depth, whereas RL improves both reasoning and knowledge.
The framework introduced in the study aims to make LLMs more interpretable and trustworthy, particularly in critical fields like medicine and math.