DiaTool-DPO is a novel method that enhances the dialogue capabilities of Tool-Augmented Large Language Models (TA-LLMs) through Direct Preference Optimization.
DiaTool-DPO models TA-LLM interactions as a Markov Decision Process and categorizes user queries into three types based on their state-transition trajectories.
By introducing a specialized loss objective for dialogue control, DiaTool-DPO achieves substantial improvements over the baseline in information gathering (94.8% vs. 44%) and tool-call rejection (91% vs. 9.6%) while maintaining core functionality.
DiaTool-DPO enables the development of TA-LLMs that can handle diverse real-world scenarios without requiring additional expert demonstrations or human labeling.
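DiaTool-DPO's dialogue-control objective builds on Direct Preference Optimization, which trains the policy to prefer a chosen completion over a rejected one while staying close to a frozen reference model. The sketch below shows only the standard per-pair DPO loss in plain Python; the function name and scalar interface are illustrative, and DiaTool-DPO's specific modifications for dialogue control are not reproduced here.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the total log-probability of a completion under
    the policy or the frozen reference model; beta scales how strongly
    the policy is pushed away from the reference.
    """
    # Log-ratio of policy vs. reference for each completion.
    chosen = policy_chosen_logp - ref_chosen_logp
    rejected = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen - rejected)
    # -log(sigmoid(margin)): small when the policy favors the chosen
    # completion over the rejected one by a wide margin.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

In a preference pair for dialogue control, the "chosen" trajectory might ask a clarifying question or refuse an out-of-scope tool call, while the "rejected" trajectory calls a tool prematurely; the loss then pushes the policy toward the desired dialogue behavior.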