<ul><li>Reinforcement Learning Finetuning (RFT) has advanced the reasoning capabilities of large language models (LLMs), including more effective tool use.</li><li>VTool-R1 is a framework that trains vision-language models (VLMs) to generate multimodal chains of thought that combine text and visual reasoning steps.</li><li>VTool-R1 integrates visual editing tools into the RFT process, enabling VLMs to learn when and how to use visual reasoning steps.</li><li>Experiments show that VTool-R1 improves reasoning performance by teaching VLMs to think with images and generate multimodal chains of thought.</li></ul>

VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use
