<ul><li>Recent text-to-speech systems can generate high-quality short utterances for individual speakers.</li><li>However, current systems struggle with long, multi-speaker, spontaneous dialogues such as podcasts.</li><li>MoonCast addresses this gap: it generates high-quality podcasts zero-shot from text-only sources, using the voices of unseen speakers.</li><li>In experiments, MoonCast outperforms the baselines, particularly in spontaneity and coherence.</li></ul>

MoonCast: High-Quality Zero-Shot Podcast Generation
