<ul><li>Visual grounding focuses on detecting objects in images based on language expressions.</li><li>A new task, Multimodal Reference Visual Grounding (MRVG), is introduced, in which a model has access to a set of reference images of objects in a database.</li><li>A novel method, MRVG-Net, is introduced to solve this visual grounding problem; it achieves superior performance compared to state-of-the-art large vision-language models (LVLMs).</li><li>The approach bridges the gap between few-shot detection and visual grounding, unlocking new capabilities for visual understanding.</li></ul>

Multimodal Reference Visual Grounding
