Vision-language models (VLMs) are designed to handle tasks that combine visual and textual data, such as describing images or generating text conditioned on an image.
Most VLMs use transformer-based architectures adapted to process both pixel data and token-based text.
Dual-encoder VLMs use separate neural networks to process images and text, aligning their embeddings in a shared latent space, as in the sketch below.
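The following is a minimal PyTorch sketch of the dual-encoder idea, not an actual VLM implementation: the encoder classes, embedding dimension, and dummy inputs are illustrative assumptions. Two separate networks map images and tokenized text into the same embedding dimension, and a contrastive-style loss encourages matching image-text pairs to be close in that shared space.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageEncoder(nn.Module):
    """Toy stand-in for a vision backbone (e.g. a ResNet or ViT)."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, embed_dim))

    def forward(self, images):
        return self.net(images)

class TextEncoder(nn.Module):
    """Toy stand-in for a transformer text encoder: mean-pooled token embeddings."""
    def __init__(self, vocab_size=10000, embed_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, token_ids):
        return self.proj(self.embed(token_ids).mean(dim=1))

image_encoder = ImageEncoder()
text_encoder = TextEncoder()

images = torch.randn(4, 3, 224, 224)          # dummy image batch
token_ids = torch.randint(0, 10000, (4, 16))  # dummy tokenized captions

# Encode each modality with its own network, then L2-normalize so that
# dot products in the shared space are cosine similarities.
img_emb = F.normalize(image_encoder(images), dim=-1)
txt_emb = F.normalize(text_encoder(token_ids), dim=-1)

similarity = img_emb @ txt_emb.T              # (4, 4) image-text similarity matrix
# Contrastive training pushes the diagonal (matched pairs) to score highest.
labels = torch.arange(4)
loss = F.cross_entropy(similarity, labels)
print(similarity.shape, loss.item())
```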
One popular dual-encoder model, OpenAI’s CLIP, uses a transformer-based text encoder and a vision encoder such as ResNet or ViT for image processing.
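As a usage sketch, CLIP can be run through the Hugging Face `transformers` library for zero-shot image classification; the image file `cat.jpg` below is a placeholder, and the candidate captions are arbitrary examples.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")  # placeholder image path
texts = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; a softmax over the
# candidate captions turns them into zero-shot classification probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```

Because the two encoders share only the output embedding space, image and text embeddings can also be precomputed independently, which is what makes dual-encoder models efficient for large-scale retrieval.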