Researchers propose a multimodal joint training framework called MMAudio for high-quality video-to-audio synthesis.
MMAudio is jointly trained on audio-visual data together with larger-scale text-audio data, enabling it to generate semantically aligned, high-quality audio samples.
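A minimal sketch of what such joint training could look like, assuming a single audio generator conditioned on optional video and text features; when a sample comes from a text-audio dataset, the missing video condition is replaced by a learned placeholder embedding so both data sources train the same weights. The `JointConditioner` class, the placeholder embeddings, and all dimensions below are illustrative assumptions, not the released MMAudio code.

```python
# Hypothetical sketch of multimodal joint training (assumed design, not MMAudio's code).
import torch
import torch.nn as nn


class JointConditioner(nn.Module):
    def __init__(self, video_dim=1024, text_dim=768, cond_dim=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, cond_dim)
        self.text_proj = nn.Linear(text_dim, cond_dim)
        # Learned placeholders for missing modalities (assumption).
        self.empty_video = nn.Parameter(torch.zeros(1, 1, cond_dim))
        self.empty_text = nn.Parameter(torch.zeros(1, 1, cond_dim))

    def forward(self, video_feats, text_feats, has_video, has_text):
        # video_feats: (B, Tv, video_dim), text_feats: (B, Tt, text_dim)
        # has_video / has_text: (B,) boolean masks marking which modality
        # is actually present for each sample in the mixed batch.
        v = self.video_proj(video_feats)
        t = self.text_proj(text_feats)
        v = torch.where(has_video[:, None, None], v, self.empty_video.expand_as(v))
        t = torch.where(has_text[:, None, None], t, self.empty_text.expand_as(t))
        # Concatenate along the sequence axis; the audio generator attends to both.
        return torch.cat([v, t], dim=1)


# Usage: a batch mixing two video-audio samples and two text-audio samples.
cond = JointConditioner()
video = torch.randn(4, 32, 1024)
text = torch.randn(4, 16, 768)
has_video = torch.tensor([True, True, False, False])
has_text = torch.tensor([False, False, True, True])
print(cond(video, text, has_video, has_text).shape)  # torch.Size([4, 48, 512])
```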
A conditional synchronization module aligns the video conditions with the audio latents at the frame level, improving audio-visual synchrony.
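The sketch below illustrates one plausible form of such frame-level conditioning: per-frame visual sync features are resampled to the audio latent frame rate and injected via adaptive scale/shift modulation. The `ConditionalSyncModule` name, the modulation scheme, and the shapes in the usage example are assumptions for illustration, not the paper's exact module.

```python
# Hypothetical sketch of frame-level synchronization conditioning (assumed design).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConditionalSyncModule(nn.Module):
    def __init__(self, sync_dim=256, latent_dim=512):
        super().__init__()
        self.norm = nn.LayerNorm(latent_dim, elementwise_affine=False)
        # Predict a per-frame scale and shift from the sync features (assumption).
        self.to_scale_shift = nn.Linear(sync_dim, 2 * latent_dim)

    def forward(self, audio_latents, sync_feats):
        # audio_latents: (B, Ta, latent_dim) at the audio latent frame rate
        # sync_feats:    (B, Tv, sync_dim)   at the video frame rate
        B, Ta, _ = audio_latents.shape
        # Resample the visual features along time so every audio latent frame
        # has a matching per-frame visual feature.
        sync = F.interpolate(
            sync_feats.transpose(1, 2), size=Ta, mode="linear", align_corners=False
        ).transpose(1, 2)                                  # (B, Ta, sync_dim)
        scale, shift = self.to_scale_shift(sync).chunk(2, dim=-1)
        return self.norm(audio_latents) * (1 + scale) + shift


# Usage with illustrative shapes: video features at 24 fps vs. a denser
# audio latent frame rate (numbers are assumptions, not the paper's values).
sync_module = ConditionalSyncModule()
latents = torch.randn(2, 248, 512)
sync_feats = torch.randn(2, 192, 256)
print(sync_module(latents, sync_feats).shape)  # torch.Size([2, 248, 512])
```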
MMAudio achieves state-of-the-art performance in audio quality, semantic alignment, and audio-visual synchronization, while keeping inference time and parameter count low.