This research focuses on building a multimodal foundation model for egocentric video understanding. It includes generating a large dataset of high-quality QA samples for egocentric videos, and it introduces a challenging egocentric QA benchmark of videos and questions to evaluate model performance. In addition, a specialized multimodal architecture with a novel memory pointer prompting mechanism is proposed to enhance video comprehension.