Meta AI has introduced V-JEPA 2, an open-source self-supervised world model for visual understanding, prediction, and planning.
V-JEPA 2 is pretrained on over 1 million hours of internet video and 1 million images using a visual mask-denoising objective.
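To make the objective concrete, the sketch below shows a JEPA-style mask-denoising setup in PyTorch: a predictor regresses, in representation space, the features that a frozen target encoder produces for masked video patches. The module names, sizes, masking ratio, and simple zero-masking are illustrative assumptions, not Meta's released training code.

```python
# Minimal sketch of a JEPA-style mask-denoising objective (illustrative only;
# encoder/predictor sizes and the masking scheme are assumptions, not the
# released V-JEPA 2 training code).
import copy
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for the video transformer encoder: maps patch tokens to features."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)

    def forward(self, tokens):                              # tokens: (B, N, dim)
        return self.block(self.proj(tokens))

def mask_denoising_loss(context_enc, target_enc, predictor, tokens, mask_ratio=0.75):
    B, N, D = tokens.shape
    num_masked = int(mask_ratio * N)
    perm = torch.rand(B, N).argsort(dim=1)
    masked_idx = perm[:, :num_masked]                       # positions to predict

    # Targets come from a frozen copy standing in for the EMA target encoder.
    with torch.no_grad():
        targets = target_enc(tokens)                        # (B, N, D)
    targets = torch.gather(targets, 1, masked_idx.unsqueeze(-1).expand(-1, -1, D))

    # Context encoder sees tokens with masked positions zeroed out (simplified).
    visible = tokens.clone()
    visible.scatter_(1, masked_idx.unsqueeze(-1).expand(-1, -1, D), 0.0)
    context = context_enc(visible)

    # Predictor "denoises": regress features of the masked patches from context.
    preds = predictor(context)
    preds = torch.gather(preds, 1, masked_idx.unsqueeze(-1).expand(-1, -1, D))
    return nn.functional.l1_loss(preds, targets)

# Toy usage with random patch tokens standing in for video tubelet embeddings.
dim = 256
context_enc = TinyEncoder(dim)
target_enc = copy.deepcopy(context_enc).requires_grad_(False)   # frozen target
predictor = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
tokens = torch.randn(2, 128, dim)
loss = mask_denoising_loss(context_enc, target_enc, predictor, tokens)
loss.backward()
```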
The model scales data, model size, training schedule, and spatial-temporal resolution, reaching an average accuracy of 88.2% across benchmark tasks.
It demonstrates strong motion and appearance understanding and produces transferable visual features.
The V-JEPA 2 encoder also performs well on temporal reasoning tasks, despite receiving no language supervision during pretraining.
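One common way to exploit such frozen features, roughly in the spirit of the lightweight probes used to evaluate self-supervised video encoders, is an attention-pooling classifier trained on top of the frozen encoder outputs. The sketch below is illustrative only; the probe architecture, feature dimension, and class count are assumptions rather than the paper's exact evaluation setup.

```python
# Minimal sketch of reading out frozen video features with an attention-pooling
# probe (layer sizes and the single learned query are assumptions, not the
# probe configuration reported in the V-JEPA 2 paper).
import torch
import torch.nn as nn

class AttentivePoolingProbe(nn.Module):
    """A learned query attends over frozen patch features; a linear head classifies."""
    def __init__(self, feat_dim=1024, num_classes=174, num_heads=8):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, frozen_feats):                  # (B, N, feat_dim)
        q = self.query.expand(frozen_feats.size(0), -1, -1)
        pooled, _ = self.attn(q, frozen_feats, frozen_feats)
        return self.head(pooled.squeeze(1))           # (B, num_classes)

# Toy usage: random tensors stand in for features from a frozen V-JEPA 2 encoder.
feats = torch.randn(4, 2048, 1024)
probe = AttentivePoolingProbe()
logits = probe(feats.detach())                        # only the probe is trained
print(logits.shape)                                   # torch.Size([4, 174])
```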
V-JEPA 2-AC is an action-conditioned variant fine-tuned on robot video data, enabling zero-shot planning through model-predictive control.
The model outperforms baselines in planning efficiency, achieving a 100% success rate on reaching tasks and strong results on grasping and object-manipulation tasks.
Operating from a single monocular RGB camera, V-JEPA 2-AC generalizes to real-world robot deployments.
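The sketch below illustrates how such zero-shot planning can work in principle: sample candidate action sequences, roll them out with an action-conditioned world model in representation space, and keep the sequences whose predicted outcome is closest to a goal embedding. The dummy dynamics function, dimensions, and the cross-entropy-method style refinement loop are assumptions for illustration, not the exact V-JEPA 2-AC planner.

```python
# Minimal sketch of model-predictive control with a learned world model:
# sample action sequences, roll them out, and score by distance to a goal
# embedding. All names and sizes are illustrative assumptions.
import torch

def dummy_dynamics(state, action):
    """Stand-in for the action-conditioned predictor: returns the next latent state."""
    return state + 0.1 * action.sum(dim=-1, keepdim=True) * torch.ones_like(state)

def plan(current_state, goal_state, horizon=5, action_dim=7,
         num_samples=256, num_elites=32, iterations=4):
    mean = torch.zeros(horizon, action_dim)
    std = torch.ones(horizon, action_dim)
    for _ in range(iterations):
        # Sample candidate action sequences: (num_samples, horizon, action_dim).
        actions = mean + std * torch.randn(num_samples, horizon, action_dim)
        # Roll every candidate forward through the world model.
        states = current_state.expand(num_samples, -1).clone()
        for t in range(horizon):
            states = dummy_dynamics(states, actions[:, t])
        # Score by distance to the goal representation (lower is better).
        costs = (states - goal_state).abs().mean(dim=-1)
        elite_idx = costs.topk(num_elites, largest=False).indices
        elites = actions[elite_idx]
        # Refit the sampling distribution to the elite sequences.
        mean, std = elites.mean(dim=0), elites.std(dim=0) + 1e-4
    return mean[0]  # execute only the first action, then replan

# Toy usage with random latent states standing in for encoder outputs.
current = torch.randn(1, 64)
goal = torch.randn(1, 64)
first_action = plan(current, goal)
print(first_action.shape)   # torch.Size([7])
```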
Meta's V-JEPA 2 signifies progress in self-supervised learning for physical intelligence, showcasing the potential of visual representations for perception and control.
The research paper, models on Hugging Face, and GitHub page are available for further exploration.