<ul><li>This research aims to build a multimodal foundation model for egocentric video understanding.</li><li>It generates a large dataset of high-quality QA samples for egocentric videos.</li><li>It introduces a challenging egocentric QA benchmark, consisting of videos and questions, to evaluate model performance.</li><li>It proposes a specialized multimodal architecture with a novel memory pointer prompting mechanism to improve video comprehension.</li></ul>

MM-Ego: Towards Building Egocentric Multimodal LLMs for Video QA
