Researchers propose a method for strengthening the graph reasoning capabilities of Large Language Models (LLMs) by applying reinforcement learning to synthetic graph data.
The approach designs both solution-based rewards (which score only the final answer) and process-based rewards (which score intermediate reasoning steps) for synthetic graph problems, with the aim of helping LLMs internalize the underlying principles of graph reasoning rather than overfit to surface patterns in the training data.
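As a rough illustration of the distinction, the sketch below contrasts the two reward types on a synthetic shortest-path problem. The function names, answer format, and 50/50 weighting are assumptions made for illustration, not the paper's actual implementation.

```python
import networkx as nx


def solution_reward(answer: int, graph: nx.Graph, src, dst) -> float:
    """Solution-based: score only the final answer against the ground truth."""
    return 1.0 if answer == nx.shortest_path_length(graph, src, dst) else 0.0


def process_reward(path: list, graph: nx.Graph, src, dst) -> float:
    """Process-based: score the intermediate steps of the proposed path.

    Partial credit is given per valid hop, so a partly correct reasoning
    trace still receives a learning signal instead of all-or-nothing feedback.
    """
    if not path or path[0] != src:
        return 0.0
    valid_hops = sum(1 for u, v in zip(path, path[1:]) if graph.has_edge(u, v))
    step_score = valid_hops / max(len(path) - 1, 1)
    reached_goal = 1.0 if path[-1] == dst and valid_hops == len(path) - 1 else 0.0
    return 0.5 * step_score + 0.5 * reached_goal


# Example: on the path graph 0-1-2-3, a fully correct trace earns the full reward.
g = nx.path_graph(4)
print(solution_reward(3, g, 0, 3))            # 1.0
print(process_reward([0, 1, 2, 3], g, 0, 3))  # 1.0
```

The key design point is that the process-based reward can credit a response whose final answer is wrong but whose intermediate steps are mostly valid, which is where the denser learning signal comes from.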
Experiments with GRPO (a reinforcement learning algorithm) and DPO (a preference optimization method) show significant improvements in LLM performance across multiple benchmarks, including real-world tasks with implicit graph structures.
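For concreteness, the following is a minimal sketch of the two objectives in their standard formulations (group-relative advantages for GRPO, pairwise preference loss for DPO); it is an assumed generic setup, not the paper's training code.

```python
import torch
import torch.nn.functional as F


def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO: normalize each completion's reward within its sampled group.

    rewards has shape (num_prompts, group_size); the group statistics serve as
    the baseline, so no separate value model is needed.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)


def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO: push the policy to prefer the chosen completion over the rejected
    one, measured relative to a frozen reference model."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -F.logsigmoid(margin).mean()
```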
The study highlights the advantage of process-based rewards over solution-based ones, the potential gains from mixing synthetic and real-world task data during training, and the remaining challenges of compositionality and producing explainable intermediate steps in LLM learning.
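One simple way to realize the synthetic/real-world mixing mentioned above is to interleave the two sources at a fixed ratio before shuffling; the ratio and argument names below are placeholders rather than values reported in the study.

```python
import random


def mix_datasets(synthetic, real_world, synthetic_fraction=0.5, seed=0):
    """Combine two example pools so that roughly `synthetic_fraction` of the
    resulting training set comes from the synthetic graph problems."""
    rng = random.Random(seed)
    n_syn = int(len(real_world) * synthetic_fraction / (1 - synthetic_fraction))
    mixed = rng.sample(synthetic, min(n_syn, len(synthetic))) + list(real_world)
    rng.shuffle(mixed)
    return mixed
```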