The paper introduces the concept of learning to search (L2S) from expert demonstrations to address the limitations of behavioral cloning (BC) in imitation learning, most notably compounding errors.
L2S involves learning a world model and a reward model so that, at test time, the agent can plan to match expert outcomes, recovering even after it makes mistakes.
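To make the planning step concrete, here is a minimal sketch of test-time search over a learned world model and reward model. It is a toy random-shooting planner, not SAILOR's actual implementation; the class and function names (WorldModel, RewardModel, plan), the linear dynamics, and the distance-based reward are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for learned models; in L2S these would be trained from expert
# demonstrations and rollouts (toy, hypothetical versions shown here).
class WorldModel:
    def __init__(self, state_dim, action_dim):
        self.A = rng.normal(size=(state_dim, state_dim)) * 0.1
        self.B = rng.normal(size=(state_dim, action_dim)) * 0.1

    def predict(self, state, action):
        # Toy linear dynamics standing in for a learned dynamics model.
        return state + self.A @ state + self.B @ action

class RewardModel:
    def __init__(self, goal):
        self.goal = goal

    def score(self, state):
        # Higher reward for states closer to expert-like outcomes.
        return -np.linalg.norm(state - self.goal)

def plan(world_model, reward_model, state, action_dim, horizon=10, n_samples=256):
    """Random-shooting search: sample action sequences, roll them out in the
    world model, score them with the reward model, keep the best first action."""
    best_return, best_action = -np.inf, None
    for _ in range(n_samples):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s, total = state.copy(), 0.0
        for a in actions:
            s = world_model.predict(s, a)
            total += reward_model.score(s)
        if total > best_return:
            best_return, best_action = total, actions[0]
    return best_action  # executed, then replanned at the next step (MPC-style)

state_dim, action_dim = 4, 2
wm = WorldModel(state_dim, action_dim)
rm = RewardModel(goal=np.zeros(state_dim))
action = plan(wm, rm, state=rng.normal(size=state_dim), action_dim=action_dim)
```

Because the planner re-scores imagined rollouts at every step, it can steer back toward expert outcomes after a mistake, which is the property BC alone lacks.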
The resulting approach, SAILOR, consistently outperforms state-of-the-art Diffusion Policies trained via BC on visual manipulation tasks drawn from several benchmarks.
SAILOR can identify nuanced failures, is robust to reward hacking, and retains its advantage even when the BC baseline is trained with 5-10x more demonstrations.