Applications and working of Vision Language Models (VLMs)

Medium


Vision Language Models (VLMs) have a wide range of applications, including image captioning, Visual Question Answering (VQA), zero-shot and few-shot image classification, and document analysis.
VLMs can also support text-based image search, object recognition, image segmentation, chatbots, automated data labeling, and the processing of videos.
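One common way to do zero-shot classification or text-based image search is to compare an image embedding against text embeddings in a shared space and pick the closest match. The sketch below illustrates only that comparison step. The embeddings are hand-made toy vectors, and the names `cosine_similarity` and `zero_shot_classify` are hypothetical helpers, not part of any particular VLM library.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, label_embeddings):
    """Return the label whose text embedding is closest to the image embedding,
    together with the similarity score of every label."""
    scores = {
        label: cosine_similarity(image_embedding, embedding)
        for label, embedding in label_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy embeddings (assumption): a real system would obtain these from the
# model's text encoder and vision encoder.
label_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.3, 0.1]

best_label, scores = zero_shot_classify(image_embedding, label_embeddings)
print(best_label)
```

No label-specific training is involved here: adding a new class only requires embedding a new text prompt. Text-based image search is the same comparison run in the other direction, ranking many image embeddings against one text query.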
VLMs can be trained using pre-training and supervised fine-tuning (SFT). At different stages of training, the weights of some VLM components are unfrozen and updated while others are kept frozen, for effective learning. Training also includes alignment with human preferences.
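The stage-wise freezing described above can be sketched as a schedule that marks which components are trainable in each stage. This is a minimal illustration under stated assumptions: the component names, the `STAGE_TRAINABLE` schedule, and the plain gradient step are all hypothetical, and real VLMs differ in which parts they unfreeze at which stage.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    weights: list
    trainable: bool = False


# Hypothetical schedule (assumption): which components are updated per stage.
STAGE_TRAINABLE = {
    "pretraining": {"fusion_layer"},
    "sft": {"fusion_layer", "language_model"},
}


def set_stage(components, stage):
    """Unfreeze the components listed for this stage and freeze the rest."""
    trainable = STAGE_TRAINABLE[stage]
    for component in components.values():
        component.trainable = component.name in trainable


def sgd_step(components, grads, lr=0.5):
    """Apply a plain gradient step to trainable components only."""
    for name, component in components.items():
        if not component.trainable:
            continue  # frozen weights are left untouched
        component.weights = [
            w - lr * g for w, g in zip(component.weights, grads[name])
        ]


components = {
    name: Component(name, [1.0, 1.0])
    for name in ("vision_encoder", "fusion_layer", "language_model")
}
grads = {name: [1.0, 1.0] for name in components}

set_stage(components, "pretraining")
sgd_step(components, grads)
after_pretraining = {n: list(c.weights) for n, c in components.items()}

set_stage(components, "sft")
sgd_step(components, grads)
after_sft = {n: list(c.weights) for n, c in components.items()}
```

In a deep learning framework the same idea is usually expressed by toggling a per-parameter "requires gradient" flag before each stage rather than by skipping updates by hand.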
A VLM typically consists of a vision encoder, a text encoder, and a decoder language model, which work together to extract information from images and text and fuse it to generate text output.
Attention-based fusion mechanisms such as cross-attention are used to combine the visual embeddings and text embeddings.
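To make the fusion step concrete, here is a minimal single-head cross-attention sketch in which each text embedding (the query) attends over the visual embeddings (the keys and values). It is a simplification under stated assumptions: the learned query, key, and value projections of a real Transformer layer are omitted, the vectors are toy values, and the function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_embeddings, visual_embeddings):
    """Scaled dot-product attention: text tokens query the visual tokens.

    Returns one fused vector per text token plus the attention weights.
    Learned projection matrices are left out for brevity (assumption).
    """
    d = len(visual_embeddings[0])
    fused, all_weights = [], []
    for query in text_embeddings:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in visual_embeddings
        ]
        weights = softmax(scores)
        # Weighted sum of the visual vectors, one output dimension at a time.
        context = [
            sum(w * value[j] for w, value in zip(weights, visual_embeddings))
            for j in range(d)
        ]
        fused.append(context)
        all_weights.append(weights)
    return fused, all_weights


# Toy inputs: two text tokens and three image patches, each of dimension 2.
text_embeddings = [[1.0, 0.0], [0.0, 1.0]]
visual_embeddings = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

fused, attention_weights = cross_attention(text_embeddings, visual_embeddings)
```

Each row of `attention_weights` sums to 1, so every fused vector is a weighted average of the image patches, weighted by how well each patch matches that text token.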
The type of training data and the training objective can change from one stage of training to the next, and synthetic data is also commonly used in VLM training to improve results.
VLM training data comes in the form of image-text pairs, interleaved image-text documents, image-instruction-answer triplets, and even PDFs.
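The three structured formats above can be pictured as simple record types. The sketch below is illustrative only: the class names, field names, and file paths are hypothetical and do not reflect any specific dataset's schema.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class ImageRef:
    path: str  # hypothetical location of the image file


@dataclass
class ImageTextPair:
    image: ImageRef
    caption: str


@dataclass
class InterleavedDocument:
    # Text and images in reading order, as in a web page or article.
    segments: List[Union[str, ImageRef]]


@dataclass
class InstructionExample:
    image: ImageRef
    instruction: str
    answer: str


def count_images(sample):
    """Number of images referenced by a training sample."""
    if isinstance(sample, InterleavedDocument):
        return sum(isinstance(s, ImageRef) for s in sample.segments)
    return 1  # pairs and instruction examples carry exactly one image


pair = ImageTextPair(ImageRef("images/cat.jpg"), "A cat sleeping on a sofa.")
document = InterleavedDocument(
    ["Intro text.", ImageRef("images/fig1.png"), "More text.", ImageRef("images/fig2.png")]
)
triplet = InstructionExample(
    ImageRef("images/chart.png"), "What is the highest bar?", "The bar for 2021."
)
```

Image-text pairs suit captioning-style objectives, while instruction triplets match the question-and-answer shape used in supervised fine-tuning.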
Most VLMs are built on Transformer models. The main idea of the VLM architecture is to extract visual and textual features, combine that information, and use it during text generation by the LLM.
VLMs are well suited for Vision-Language Navigation (VLN), multimodal machine translation, and text-to-image generation.
Parameter-Efficient Fine-Tuning (PEFT) techniques such as LoRA are commonly used when training the LLM component. Pre-training uses a variety of self-supervised learning objectives.
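As a rough illustration of why LoRA is parameter-efficient, the sketch below keeps a weight matrix W frozen and adds a trainable low-rank update, so the layer computes W x + (alpha / r) B A x. It is a toy, pure-Python version under stated assumptions: the class `LoRALinear` and its methods are hypothetical names, and no training loop is shown.

```python
import random


def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update B @ A of rank r."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen: never updated during fine-tuning
        self.scale = alpha / rank
        # A starts with small random values and B with zeros, so the
        # low-rank update is exactly zero before any training step.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_parameter_count(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# A 4x4 identity matrix stands in for a pretrained weight (assumption).
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
layer = LoRALinear(identity, rank=1)

x = [1.0, 2.0, 3.0, 4.0]
output_at_init = layer.forward(x)
full_parameter_count = len(identity) * len(identity[0])
lora_parameter_count = layer.trainable_parameter_count()
```

With rank 1, only 8 values are trainable instead of the 16 in the full matrix. The saving grows quickly with layer size, because the adapter scales with r times (d_in + d_out) rather than d_in times d_out.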
