Researchers have developed V-JEPA 2, a self-supervised approach for understanding, predicting, and planning in the physical world.
Pre-trained on over 1 million hours of internet video, V-JEPA 2 achieves state-of-the-art performance on motion understanding (Something-Something v2) and human action anticipation (Epic-Kitchens-100).
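The pre-training recipe is a joint-embedding predictive architecture: a predictor regresses the latent representations of masked video regions from visible context, with targets produced by an exponential-moving-average copy of the encoder, so no pixels are ever reconstructed. Below is a minimal PyTorch sketch of this idea; the module names, the mask-by-zeroing simplification, and the predictor signature are illustrative assumptions, not the paper's actual implementation.

```python
import copy
import torch
import torch.nn.functional as F
from torch import nn

class JEPA(nn.Module):
    """Minimal JEPA-style masked latent prediction (illustrative sketch)."""

    def __init__(self, encoder: nn.Module, predictor: nn.Module, ema: float = 0.999):
        super().__init__()
        self.encoder = encoder        # context encoder, trained by gradient
        self.predictor = predictor    # predicts target tokens in latent space
        self.target_encoder = copy.deepcopy(encoder)  # EMA copy, no gradients
        for p in self.target_encoder.parameters():
            p.requires_grad = False
        self.ema = ema

    @torch.no_grad()
    def update_target(self):
        # Slowly track the encoder weights with an exponential moving average.
        for p, tp in zip(self.encoder.parameters(), self.target_encoder.parameters()):
            tp.mul_(self.ema).add_(p, alpha=1.0 - self.ema)

    def forward(self, video_tokens, context_mask, target_mask):
        # Zeroing masked tokens is a simplification; real implementations
        # typically drop masked tokens before encoding.
        ctx = self.encoder(video_tokens * context_mask)
        # Assumed predictor signature: predict target-region latents from context.
        pred = self.predictor(ctx, target_mask)
        with torch.no_grad():
            tgt = self.target_encoder(video_tokens) * target_mask
        # Regression in latent space rather than pixel reconstruction.
        return F.l1_loss(pred, tgt)
```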
Aligning V-JEPA 2 with a large language model further yields state-of-the-art performance on multiple video question-answering benchmarks at the 8-billion-parameter scale.
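A common recipe for this kind of alignment, sketched below under assumed interfaces (the hypothetical projector MLP and a HuggingFace-style `inputs_embeds` call), is to freeze the pretrained video encoder and train a small projector that maps visual tokens into the LLM's embedding space, so the language model can attend to them like ordinary text tokens.

```python
import torch
from torch import nn

class VideoLLM(nn.Module):
    """Hypothetical video-LLM alignment: frozen encoder + trainable projector."""

    def __init__(self, video_encoder: nn.Module, llm: nn.Module,
                 vis_dim: int, llm_dim: int):
        super().__init__()
        self.video_encoder = video_encoder
        for p in self.video_encoder.parameters():
            p.requires_grad = False      # keep the pretrained encoder frozen
        # Small MLP maps visual tokens into the LLM token-embedding space.
        self.projector = nn.Sequential(
            nn.Linear(vis_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )
        self.llm = llm

    def forward(self, video, text_embeds):
        with torch.no_grad():
            vis_tokens = self.video_encoder(video)   # (B, N, vis_dim)
        vis_embeds = self.projector(vis_tokens)      # (B, N, llm_dim)
        # Prepend projected visual tokens to the text token embeddings;
        # `inputs_embeds` follows the HuggingFace convention and is an assumption.
        inputs = torch.cat([vis_embeds, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```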
The researchers further demonstrate how self-supervised learning can be applied to robotic planning: post-training an action-conditioned world model, V-JEPA 2-AC, on unlabeled robot videos enables object manipulation.
Deployed zero-shot on Franka arms in different lab environments, V-JEPA 2-AC enables picking and placing objects through planning with image goals.
Notably, this is achieved without any task-specific training or reward, and without collecting any robot data in the target environments.
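Planning with image goals can be realized by rolling the action-conditioned model forward in latent space and searching for an action sequence whose predicted outcome is close to the encoded goal image. The sketch below uses a standard cross-entropy-method (CEM) planner with hypothetical `encode`/`predict` interfaces; it illustrates the idea rather than reproducing the paper's exact procedure.

```python
import torch

def plan_to_goal(world_model, obs, goal_image, horizon=5, action_dim=7,
                 samples=256, elites=32, iters=10):
    """CEM planning toward a goal image in latent space (illustrative sketch)."""
    z0 = world_model.encode(obs)             # current latent state, shape (1, D)
    z_goal = world_model.encode(goal_image)  # latent goal from the image
    mean = torch.zeros(horizon, action_dim)
    std = torch.ones(horizon, action_dim)
    for _ in range(iters):
        # Sample candidate action sequences from the current distribution.
        actions = mean + std * torch.randn(samples, horizon, action_dim)
        z = z0.expand(samples, -1)
        for t in range(horizon):
            z = world_model.predict(z, actions[:, t])    # latent rollout
        # Rank candidates by distance of the predicted state to the goal latent.
        cost = (z - z_goal).abs().mean(dim=-1)
        elite = actions[cost.topk(elites, largest=False).indices]
        mean, std = elite.mean(dim=0), elite.std(dim=0)  # refit distribution
    # Receding horizon: execute only the first action, then re-plan.
    return mean[0]
```

In a receding-horizon loop, the robot executes the returned action, observes the new state, and re-plans, repeating until the predicted state matches the goal image.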