RIPT-VLA is a reinforcement-learning-based interactive post-training paradigm for Vision-Language-Action (VLA) models that learns from sparse binary success rewards.
Existing VLA training pipelines rely heavily on offline expert demonstrations and supervised imitation, which limits their adaptability to new tasks. RIPT-VLA instead lets a pretrained model interact with its environment and improves it with a stable policy optimization algorithm.
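To make the sparse-reward setup concrete, below is a minimal PyTorch sketch of what one interactive post-training step could look like. The `policy.rollout` and `env.evaluate` interfaces, the rollout count, and the leave-one-out baseline are illustrative assumptions rather than the confirmed algorithm; the summary above only states that optimization is a stable policy-gradient method driven by binary success rewards.

```python
import torch

def interactive_post_training_step(policy, optimizer, env, context, k_rollouts=8):
    """Hedged sketch of one RL post-training step: sample K rollouts for the
    same task context, score each with a sparse binary success reward, and
    take a policy-gradient step with a leave-one-out baseline for variance
    reduction. `policy`, `env`, and `context` are hypothetical interfaces."""
    log_probs, rewards = [], []
    for _ in range(k_rollouts):
        actions, log_prob = policy.rollout(context)   # assumed API: returns actions + summed log-prob
        success = env.evaluate(context, actions)      # assumed API: 1.0 on task success, else 0.0
        log_probs.append(log_prob)
        rewards.append(float(success))

    rewards_t = torch.tensor(rewards)
    # Leave-one-out baseline: compare each rollout against the mean reward of
    # the other K-1 rollouts (a common choice for sparse-reward policy
    # gradients; the exact estimator is an assumption here).
    baseline = (rewards_t.sum() - rewards_t) / (k_rollouts - 1)
    advantages = rewards_t - baseline

    # REINFORCE-style loss: raise the log-likelihood of above-baseline rollouts.
    loss = -(advantages * torch.stack(log_probs)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return rewards_t.mean().item()
```

Note that if all K rollouts receive the same binary reward, every advantage is exactly zero and the step is a no-op; a practical implementation would skip or resample such uninformative groups.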
RIPT-VLA applies across different VLA models and consistently improves their success rates, while remaining computationally and data-efficient enough to enhance a model with minimal supervision. Reported results indicate that these gains generalize across tasks and scenarios.