- Large Multimodal Models (LMMs) often struggle with in-context learning (ICL) when performing new tasks from limited supervision.
- In smaller LMMs, ICL performance is inconsistent and does not always improve as more examples are provided.
- This inconsistency is attributed to the LMM being overwhelmed by task-irrelevant information in the image embeddings.
- A meta-learning approach is proposed to give LMMs few-shot capabilities: fixed soft prompts are distilled from task-relevant image features and can be adapted at test time with just a few examples, addressing the problem of overwhelming image-embedding information.
- An attention-mapper module, which can be integrated into the LLaVA v1.5 architecture, is introduced to perform this prompt distillation; it is learned jointly with the soft prompts, enabling task adaptation from minimal data with a few gradient steps (see the sketch after this list).
- Evaluation on the VL-ICL Bench shows that the proposed method consistently outperforms ICL and related prompt-tuning approaches, and it improves task induction and reasoning on visual question answering tasks even under image perturbations.
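
The following is a minimal PyTorch sketch of the general idea, not the authors' implementation: a cross-attention "mapper" distills image patch features into a small set of learnable soft prompts, and a few gradient steps on the support examples adapt those prompts at test time while the LMM backbone stays frozen. The class and function names (`AttentionMapper`, `adapt_few_shot`), the number of prompts, and the loss interface are all illustrative assumptions; how the module actually hooks into LLaVA v1.5 is not specified here.

```python
# Hypothetical sketch of soft-prompt distillation with an attention-mapper.
# Assumes image_feats come from a LLaVA-style vision encoder/projector;
# all names and hyperparameters are illustrative, not the paper's code.
import torch
import torch.nn as nn


class AttentionMapper(nn.Module):
    """Cross-attends a small set of learnable soft prompts over image patch
    features, so the prompts retain only task-relevant visual information."""

    def __init__(self, feat_dim: int, num_prompts: int = 8, num_heads: int = 8):
        super().__init__()
        # Learnable soft prompts used as attention queries.
        self.soft_prompts = nn.Parameter(torch.randn(num_prompts, feat_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, num_patches, feat_dim)
        batch = image_feats.size(0)
        queries = self.soft_prompts.unsqueeze(0).expand(batch, -1, -1)
        distilled, _ = self.cross_attn(queries, image_feats, image_feats)
        # Distilled prompts would be prepended to the LLM input embeddings downstream.
        return self.norm(distilled + queries)


def adapt_few_shot(mapper: AttentionMapper, loss_fn, support_batch,
                   steps: int = 5, lr: float = 1e-3) -> None:
    """Test-time adaptation: a few gradient steps on the support examples,
    updating only the mapper and soft prompts (the LMM backbone stays frozen)."""
    opt = torch.optim.SGD(mapper.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(mapper, support_batch)  # e.g., LM loss on support answers
        loss.backward()
        opt.step()
```

Under this reading, few-shot adaptation touches only the lightweight mapper and prompt parameters, which is what would make adaptation feasible from a handful of examples without fine-tuning the full model.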