Researchers have developed an AI model that learns to connect vision and sound without human intervention, mimicking how humans naturally learn.
This approach could have applications in journalism, film production, and improving a robot's understanding of real-world environments.
The AI model was trained to align audio and visual data from video clips without human labels, improving performance in video retrieval tasks and scene classification.
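Cross-modal retrieval of this kind is typically performed by embedding an audio query and ranking candidate video clips by similarity. The sketch below illustrates the task with placeholder embeddings; the function name, embedding sizes, and scoring are assumptions for illustration, not the paper's evaluation code.

```python
import numpy as np

def retrieve_videos(audio_query_emb, video_embs, top_k=5):
    """Rank video clips by cosine similarity to an audio query embedding."""
    # Normalize so dot products equal cosine similarities.
    a = audio_query_emb / np.linalg.norm(audio_query_emb)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    scores = v @ a                      # similarity of each clip to the query
    return np.argsort(-scores)[:top_k]  # indices of the best-matching clips

# Example: 100 candidate clips with 512-dimensional embeddings (illustrative sizes).
rng = np.random.default_rng(0)
video_embs = rng.standard_normal((100, 512))
audio_query = rng.standard_normal(512)
print(retrieve_videos(audio_query, video_embs))
```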
The method developed by MIT researchers helps the model learn a finer-grained correspondence between video frames and accompanying audio.
The researchers also made architectural tweaks that help the system balance its two learning objectives, improving how it processes audiovisual information.
The model, named CAV-MAE Sync, splits the audio into smaller windows, each aligned with a few video frames, so it can learn this finer-grained correspondence.
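The windowing idea can be sketched as follows: rather than producing one representation for the whole audio track, the spectrogram is cut into windows that line up with sampled video frames. The function name, window count, and spectrogram sizes below are illustrative assumptions, not the authors' implementation.

```python
import torch

def split_audio_windows(spectrogram, num_windows):
    """Split a clip-level audio spectrogram (time steps x mel bins) into
    equal windows along the time axis, one per sampled video frame."""
    time_steps = spectrogram.shape[0]
    window_len = time_steps // num_windows
    # Drop any remainder so the windows divide the clip evenly (simplification).
    trimmed = spectrogram[: window_len * num_windows]
    return trimmed.reshape(num_windows, window_len, -1)

# Example: a clip with 1000 spectrogram time steps and 128 mel bins,
# paired with 10 sampled video frames -> 10 audio windows of 100 steps each.
spec = torch.randn(1000, 128)
windows = split_audio_windows(spec, num_windows=10)
print(windows.shape)  # torch.Size([10, 100, 128])
```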
The researchers also introduced separate data representations for the contrastive and reconstructive learning objectives, which boosted the model's performance.
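As a rough illustration of how two objectives fed by separate representations might be combined, the sketch below pairs an InfoNCE-style contrastive loss on dedicated contrastive embeddings with a masked-reconstruction loss on a separate set of outputs. The tensor shapes, loss weighting, and function names are assumptions, not the CAV-MAE Sync implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(video_tokens, audio_tokens, temperature=0.07):
    """Symmetric InfoNCE-style loss that pulls matching video/audio pairs together."""
    v = F.normalize(video_tokens, dim=-1)
    a = F.normalize(audio_tokens, dim=-1)
    logits = v @ a.t() / temperature      # pairwise similarities within the batch
    targets = torch.arange(len(v))        # the i-th video matches the i-th audio
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def total_loss(contrastive_video, contrastive_audio,
               reconstructed_patches, masked_patches, weight=0.01):
    """Weighted sum of the two objectives, each fed by its own representations."""
    recon = F.mse_loss(reconstructed_patches, masked_patches)
    return contrastive_loss(contrastive_video, contrastive_audio) + weight * recon

# Example with random tensors standing in for model outputs (batch of 8).
cv, ca = torch.randn(8, 256), torch.randn(8, 256)
rec, tgt = torch.randn(8, 196, 768), torch.randn(8, 196, 768)
print(total_loss(cv, ca, rec, tgt))
```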
Together, these enhancements improved the model's accuracy in video retrieval and in classifying audiovisual scenes.
Going forward, the researchers aim to incorporate models that produce better data representations and to add support for text data, further extending the system's capabilities.
Funding for this work is provided by the German Federal Ministry of Education and Research and the MIT-IBM Watson AI Lab.