Researchers have developed V-JEPA 2, a self-supervised approach for understanding, predicting, and planning in the physical world.
Pre-trained on over 1 million hours of internet video, V-JEPA 2 achieves state-of-the-art performance on motion understanding (Something-Something v2) and human action anticipation (Epic-Kitchens-100).
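The pre-training recipe is a joint-embedding predictive architecture: a predictor regresses the latent representations of masked video regions from visible context, with targets produced by an exponential-moving-average copy of the encoder, so no pixels are ever reconstructed. Below is a minimal PyTorch sketch of this idea; the module names, the mask-by-zeroing simplification, and the predictor signature are illustrative assumptions, not the paper's actual implementation.

```python
import copy
import torch
import torch.nn.functional as F
from torch import nn

class JEPA(nn.Module):
    """Minimal JEPA-style masked latent prediction (illustrative sketch)."""

    def __init__(self, encoder: nn.Module, predictor: nn.Module, ema: float = 0.999):
        super().__init__()
        self.encoder = encoder        # context encoder, trained by gradient
        self.predictor = predictor    # predicts target tokens in latent space
        self.target_encoder = copy.deepcopy(encoder)  # EMA copy, no gradients
        for p in self.target_encoder.parameters():
            p.requires_grad = False
        self.ema = ema

    @torch.no_grad()
    def update_target(self):
        # Slowly track the encoder weights with an exponential moving average.
        for p, tp in zip(self.encoder.parameters(), self.target_encoder.parameters()):
            tp.mul_(self.ema).add_(p, alpha=1.0 - self.ema)

    def forward(self, video_tokens, context_mask, target_mask):
        # Zeroing masked tokens is a simplification; real implementations
        # typically drop masked tokens before encoding.
        ctx = self.encoder(video_tokens * context_mask)
        # Assumed predictor signature: predict target-region latents from context.
        pred = self.predictor(ctx, target_mask)
        with torch.no_grad():
            tgt = self.target_encoder(video_tokens) * target_mask
        # Regression in latent space rather than pixel reconstruction.
        return F.l1_loss(pred, tgt)
```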
Aligning V-JEPA 2 with a large language model further yields state-of-the-art performance on multiple video question-answering benchmarks at the 8-billion-parameter scale.
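A common recipe for this kind of alignment, sketched below under assumed interfaces (the hypothetical projector MLP and a HuggingFace-style `inputs_embeds` call), is to freeze the pretrained video encoder and train a small projector that maps visual tokens into the LLM's embedding space, so the language model can attend to them like ordinary text tokens.

```python
import torch
from torch import nn

class VideoLLM(nn.Module):
    """Hypothetical video-LLM alignment: frozen encoder + trainable projector."""

    def __init__(self, video_encoder: nn.Module, llm: nn.Module,
                 vis_dim: int, llm_dim: int):
        super().__init__()
        self.video_encoder = video_encoder
        for p in self.video_encoder.parameters():
            p.requires_grad = False      # keep the pretrained encoder frozen
        # Small MLP maps visual tokens into the LLM token-embedding space.
        self.projector = nn.Sequential(
            nn.Linear(vis_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )
        self.llm = llm

    def forward(self, video, text_embeds):
        with torch.no_grad():
            vis_tokens = self.video_encoder(video)   # (B, N, vis_dim)
        vis_embeds = self.projector(vis_tokens)      # (B, N, llm_dim)
        # Prepend projected visual tokens to the text token embeddings;
        # `inputs_embeds` follows the HuggingFace convention and is an assumption.
        inputs = torch.cat([vis_embeds, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```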
The researchers further demonstrate how self-supervised learning can be applied to robotic planning: post-training an action-conditioned world model, V-JEPA 2-AC, on unlabeled robot videos enables object manipulation.
Deployed zero-shot on Franka arms in different lab environments, V-JEPA 2-AC enables picking and placing objects through planning with image goals.
Notably, this is achieved without any task-specific training or reward, and without collecting any robot data in the target environments.
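Planning with image goals can be realized by rolling the action-conditioned model forward in latent space and searching for an action sequence whose predicted outcome is close to the encoded goal image. The sketch below uses a standard cross-entropy-method (CEM) planner with hypothetical `encode`/`predict` interfaces; it illustrates the idea rather than reproducing the paper's exact procedure.

```python
import torch

def plan_to_goal(world_model, obs, goal_image, horizon=5, action_dim=7,
                 samples=256, elites=32, iters=10):
    """CEM planning toward a goal image in latent space (illustrative sketch)."""
    z0 = world_model.encode(obs)             # current latent state, shape (1, D)
    z_goal = world_model.encode(goal_image)  # latent goal from the image
    mean = torch.zeros(horizon, action_dim)
    std = torch.ones(horizon, action_dim)
    for _ in range(iters):
        # Sample candidate action sequences from the current distribution.
        actions = mean + std * torch.randn(samples, horizon, action_dim)
        z = z0.expand(samples, -1)
        for t in range(horizon):
            z = world_model.predict(z, actions[:, t])    # latent rollout
        # Rank candidates by distance of the predicted state to the goal latent.
        cost = (z - z_goal).abs().mean(dim=-1)
        elite = actions[cost.topk(elites, largest=False).indices]
        mean, std = elite.mean(dim=0), elite.std(dim=0)  # refit distribution
    # Receding horizon: execute only the first action, then re-plan.
    return mean[0]
```

In a receding-horizon loop, the robot executes the returned action, observes the new state, and re-plans, repeating until the predicted state matches the goal image.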