Visual grounding is the task of localizing objects in images based on natural language expressions.
A new task named Multimodal Reference Visual Grounding (MRVG) is introduced, in which the model additionally has access to a database of reference images of the target objects.
A novel method named MRVG-Net is proposed to solve this new visual grounding problem, achieving superior performance compared to state-of-the-art Large Vision-Language Models (LVLMs).
The approach bridges the gap between few-shot object detection and visual grounding, unlocking new capabilities for visual understanding.