Meta AI introduces Perception Encoder (PE), a vision model family trained using a single contrastive vision-language objective and refined with alignment techniques tailored for downstream tasks.
PE is a general-purpose encoder for image and video inputs, available at three scales: PEcoreB, PEcoreL, and PEcoreG. The largest, the G-scale model, has 2B parameters.
PE demonstrates strong zero-shot generalization across a wide range of vision benchmarks, achieving competitive results on image classification and fine-grained datasets, as well as state-of-the-art performance on video tasks.
The release of PE, alongside its codebase and the PE Video Dataset, provides a foundation for building multimodal AI systems and advancing integrated and robust visual understanding.
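For readers unfamiliar with the contrastive vision-language objective mentioned above, the following is a minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss in plain Python. It is not PE's training code. The function names, the toy embeddings, and the temperature of 0.07 are illustrative assumptions; the summary does not specify PE's exact loss formulation or hyperparameters.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    image_embs[i] and text_embs[i] are assumed to describe the same sample.
    The temperature value is an assumed default, not taken from the source.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)

    # logits[i][j]: temperature-scaled cosine similarity of image i and text j.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]

    # Image-to-text direction: row i should place its mass on column i.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-image direction: column j should place its mass on row j.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (i2t + t2i)


# Toy batch: two orthogonal "image" embeddings and their matching "text" embeddings.
images = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

matched = contrastive_loss(images, matched_texts)
mismatched = contrastive_loss(images, swapped_texts)
```

With correctly paired embeddings the loss is close to zero, and it grows large when the pairs are swapped. Minimizing it pulls matching image and text embeddings together in a shared space, which is what makes zero-shot classification by comparing an image embedding against text-prompt embeddings possible.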