The University of California, Santa Cruz has introduced OpenVision, a new family of vision encoders positioned as fully open alternatives to models such as OpenAI's CLIP and Google's SigLIP.
Vision encoders convert images and other visual content into numerical representations (embeddings) that text-based models can consume, enabling tasks such as image recognition within large language models.
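To make that role concrete, here is a minimal, self-contained sketch of a ViT-style encoder that maps an image to a fixed-size embedding. The architecture, dimensions, and class name are illustrative assumptions for this example, not OpenVision's actual design:

```python
# Illustrative sketch: a vision encoder turns pixels into an embedding
# vector a language model can consume. Sizes are arbitrary examples.
import torch
import torch.nn as nn

class TinyVisionEncoder(nn.Module):
    def __init__(self, image_size=224, patch_size=16, dim=256):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Split the image into patches and project each patch to a vector.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, pixels):                    # pixels: (B, 3, 224, 224)
        x = self.patch_embed(pixels)              # (B, dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)          # (B, 196, dim)
        x = self.transformer(x + self.pos_embed)  # contextualize patches
        return x.mean(dim=1)                      # (B, dim) image embedding

encoder = TinyVisionEncoder()
embedding = encoder(torch.randn(1, 3, 224, 224))
print(embedding.shape)  # torch.Size([1, 256])
```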
OpenVision comprises 26 models ranging from 5.9 million to 632.1 million parameters, all released under the Apache 2.0 license, which permits commercial use.
The UCSC team trained the models using the CLIPS training pipeline and the Recap-DataComp-1B dataset.
The range of sizes covers distinct use cases: larger models suit accuracy-critical server workloads, while smaller ones are optimized for edge deployments with tight compute and memory budgets.
In benchmark evaluations across vision-language tasks, OpenVision matches or outperforms CLIP and SigLIP.
OpenVision uses a progressive resolution training strategy: models train first on low-resolution images and move to higher resolutions later, which speeds up training without sacrificing performance on resolution-sensitive tasks such as OCR.
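As an illustration of the general idea (not OpenVision's published recipe), the sketch below raises the input resolution in stages over a training run. The schedule, function signature, and loss interface are assumptions made for the example:

```python
# Hedged sketch of progressive-resolution training: cheap low-resolution
# batches dominate early training, full resolution comes last.
import torch
import torch.nn.functional as F

def train_progressive(model, optimizer, loader, loss_fn,
                      schedule=((84, 2), (168, 2), (224, 1))):
    """schedule: (resolution, epochs) pairs, ordered low to high.
    Values here are illustrative, not OpenVision's actual schedule."""
    for resolution, epochs in schedule:
        for _ in range(epochs):
            for images, targets in loader:
                # Downsample inputs to the current stage's resolution.
                # (A real ViT would also need its positional embeddings
                # resized to match the new patch grid.)
                images = F.interpolate(images, size=(resolution, resolution),
                                       mode="bilinear", align_corners=False)
                optimizer.zero_grad()
                loss = loss_fn(model(images), targets)
                loss.backward()
                optimizer.step()
```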
During training, the models learn from synthetic captions and an auxiliary text decoder, both of which strengthen the vision encoder's semantic representation learning.
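A hedged sketch of how such a combined objective might look: a CLIP-style contrastive loss plus an auxiliary caption-generation loss from a text decoder. The module interfaces (`vision_encoder`, `text_encoder`, `text_decoder`) are hypothetical placeholders, not OpenVision's API:

```python
# Illustrative combined objective: contrastive alignment plus an
# auxiliary generative captioning loss. All modules are hypothetical.
import torch
import torch.nn.functional as F

def training_step(vision_encoder, text_encoder, text_decoder,
                  images, caption_tokens, alpha=0.5):
    img_emb = F.normalize(vision_encoder(images), dim=-1)
    txt_emb = F.normalize(text_encoder(caption_tokens), dim=-1)

    # CLIP-style contrastive loss: matched pairs sit on the diagonal.
    logits = img_emb @ txt_emb.t() / 0.07
    labels = torch.arange(len(images), device=images.device)
    contrastive = (F.cross_entropy(logits, labels) +
                   F.cross_entropy(logits.t(), labels)) / 2

    # Auxiliary generative loss: predict the (synthetic) caption from
    # image features, pushing the encoder toward richer semantics.
    lm_logits = text_decoder(img_emb, caption_tokens[:, :-1])  # next-token logits
    generative = F.cross_entropy(
        lm_logits.reshape(-1, lm_logits.size(-1)),
        caption_tokens[:, 1:].reshape(-1))

    return contrastive + alpha * generative
```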
OpenVision encoders also integrate cleanly with small language models, making it possible to build compact multimodal systems within tight parameter budgets.
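One common wiring pattern, sketched below under assumed interfaces: a linear projector maps image embeddings into the language model's embedding space so they can be prepended to the text sequence, as in LLaVA-style designs. The class and argument names are illustrative:

```python
# Minimal sketch of attaching a vision encoder to a small language model.
import torch
import torch.nn as nn

class TinyMultimodalLM(nn.Module):
    def __init__(self, vision_encoder, language_model, vision_dim, lm_dim):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.language_model = language_model
        # Project image features into the LM's embedding space.
        self.projector = nn.Linear(vision_dim, lm_dim)

    def forward(self, pixels, text_embeds):
        img_tokens = self.projector(self.vision_encoder(pixels))   # (B, lm_dim)
        # Prepend the projected image token to the text embeddings.
        inputs = torch.cat([img_tokens.unsqueeze(1), text_embeds], dim=1)
        return self.language_model(inputs)
```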
OpenVision's open, modular design benefits AI engineering, data infrastructure, and security teams by offering a plug-and-play component for adding vision capabilities to existing systems.
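The plug-and-play point can be illustrated with a small sketch: if downstream code depends only on a generic `encoder(pixels) -> embeddings` interface, any compatible encoder can be swapped in. The pipeline below is a placeholder example, not OpenVision's published API:

```python
# Illustrative only: downstream code written against a generic embedding
# interface, so vision encoders become interchangeable components.
import torch
import torch.nn as nn
import torch.nn.functional as F

def build_image_search(encoder: nn.Module, index: torch.Tensor):
    """index: (N, D) matrix of precomputed, L2-normalized embeddings."""
    def search(pixels: torch.Tensor, top_k: int = 5):
        with torch.no_grad():
            query = F.normalize(encoder(pixels), dim=-1)  # (B, D)
        scores = query @ index.t()                        # cosine similarity
        return scores.topk(top_k, dim=-1).indices         # best matches
    return search

# Swapping CLIP or SigLIP for an OpenVision encoder means passing a
# different `encoder` object; the rest of the pipeline is untouched.
```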