Cosyvoice, maintained by Jichengdu on Replicate, is a scalable multilingual text-to-speech system known for advanced voice cloning capabilities.
The model is built on a large language model architecture and supports cross-lingual generation and streaming synthesis, including bidirectional streaming.
It emphasizes low latency and high output quality, which sets it apart from related models such as OpenVoice and Parler TTS.
Cosyvoice takes text and a reference audio clip as inputs and generates natural-sounding speech in multiple languages and styles, producing WAV audio at a 16 kHz sample rate.
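
Since the output is described as WAV audio at 16 kHz, a quick header check can confirm that a downloaded file matches before it goes into a downstream pipeline. The sketch below uses only Python's standard `wave` module. The helper names `describe_wav` and `make_silent_wav` are hypothetical, and the mono 16-bit layout of the stand-in file is an assumption for illustration, not something the model description states.

```python
import io
import wave

EXPECTED_RATE_HZ = 16_000  # sample rate stated in the model description


def describe_wav(source):
    """Return basic header facts for a WAV file path or binary file object."""
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        return {
            "sample_rate_hz": rate,
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "duration_s": frames / rate,
        }


def make_silent_wav(seconds=1.0, rate=EXPECTED_RATE_HZ):
    """Build an in-memory WAV of silence as a stand-in for model output.

    Assumption: mono, 16-bit PCM. The real output layout may differ.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    info = describe_wav(make_silent_wav(0.5))
    print(info)
    if info["sample_rate_hz"] != EXPECTED_RATE_HZ:
        print("Unexpected sample rate; resample before further processing.")
```

To check a real result, pass the path of the saved file to `describe_wav` in place of the synthetic buffer.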