Recent advances in text-to-speech synthesis have improved the generation of high-quality short utterances for individual speakers.
However, current systems struggle to extend these capabilities to long, multi-speaker, spontaneous dialogues such as podcasts.
To address this gap, MoonCast is introduced as a solution for high-quality zero-shot podcast generation from text-only sources, using the voices of unseen speakers.
Experiments show that MoonCast outperforms the baselines, particularly in spontaneity and coherence.