AudioX is a unified Diffusion Transformer model for Anything-to-Audio and Music Generation.
It generates high-quality general audio and music, supports flexible natural-language control, and accepts inputs from several modalities, including text, video, image, music, and audio.
AudioX uses a multi-modal masked training strategy: it learns from inputs that are masked across modalities, which yields robust and unified cross-modal representations.
Extensive experiments show that AudioX outperforms state-of-the-art specialized models and exhibits remarkable versatility in handling diverse input modalities and generation tasks within a unified architecture.
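
To make the multi-modal masked training idea more concrete, here is a minimal sketch of how inputs might be masked before they reach the model. This is an illustration only, not the AudioX implementation: the function name `mask_inputs`, the parameters `drop_prob` and `token_mask_ratio`, their default values, and the choice to zero out masked tokens are all assumptions made for this example.

```python
import random


def mask_inputs(sample, drop_prob=0.3, token_mask_ratio=0.5, rng=None):
    """Return a masked copy of `sample` and the boolean masks that were applied.

    sample: dict mapping a modality name (e.g. "text", "video") to a list of
        token feature vectors, each a list of floats.
    drop_prob: chance that a whole modality is masked (assumed value).
    token_mask_ratio: fraction of tokens masked in a modality that is kept
        (assumed value).
    """
    if rng is None:
        rng = random.Random()
    masked, masks = {}, {}
    for name, tokens in sample.items():
        n = len(tokens)
        if rng.random() < drop_prob:
            # Mask the entire modality so the model must rely on the others.
            mask = [True] * n
        else:
            # Mask a random subset of token positions within this modality.
            hidden = set(rng.sample(range(n), int(n * token_mask_ratio)))
            mask = [i in hidden for i in range(n)]
        # Masked tokens become zero vectors; unmasked tokens are copied.
        masked[name] = [
            [0.0] * len(tok) if m else list(tok) for tok, m in zip(tokens, mask)
        ]
        masks[name] = mask
    return masked, masks


if __name__ == "__main__":
    example = {"text": [[1.0, 2.0]] * 4, "video": [[3.0]] * 6}
    masked_example, applied = mask_inputs(example, rng=random.Random(0))
    for modality, mask in applied.items():
        print(modality, sum(mask), "of", len(mask), "tokens masked")
```

In a training loop, the masked sample would be fed to the model while the masks indicate which positions were hidden. Dropping whole modalities at random is one plausible way to encourage a single model to cope with any subset of inputs at inference time.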