The GPT-SoVITS pipeline has four stages: phoneme and BERT feature extraction, GPT semantic modeling, SoVITS decoding to speech, and speech output.
In GPT-SoVITS, the core model of the GPT semantic modeling stage converts phonemes and semantic features into semantic tokens.
The "GPT" in GPT-SoVITS refers to a custom Text-to-Semantic Transformer specialized for speech synthesis.
SoVITS is an extended version of VITS that generates audio from semantic tokens rather than raw text, offering improved naturalness and speaker fidelity.
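The stages described above can be sketched as a chain of functions, which may make the data flow easier to follow. This is only an illustrative sketch, not the GPT-SoVITS codebase: every name here (`TextFeatures`, `extract_features`, `text_to_semantic`, `decode_to_waveform`, `synthesize`) is hypothetical, and each body is a toy placeholder standing in for a real model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextFeatures:
    phonemes: List[str]        # phoneme sequence for the input text
    bert: List[List[float]]    # one placeholder feature vector per phoneme


def extract_features(text: str) -> TextFeatures:
    """Stage 1 (placeholder): phoneme and BERT feature extraction."""
    # Toy stand-in: treat each letter as a "phoneme".
    phonemes = [ch for ch in text.lower() if ch.isalpha()]
    # Dummy zero vectors; a real system would run a BERT model here.
    bert = [[0.0] * 4 for _ in phonemes]
    return TextFeatures(phonemes, bert)


def text_to_semantic(features: TextFeatures) -> List[int]:
    """Stage 2 (placeholder): the "GPT" stage maps features to semantic tokens."""
    # Toy token IDs; a real system would run a Text-to-Semantic Transformer.
    return [ord(p) % 64 for p in features.phonemes]


def decode_to_waveform(tokens: List[int], samples_per_token: int = 2) -> List[float]:
    """Stage 3 (placeholder): the SoVITS stage decodes semantic tokens to audio."""
    # Toy decoder: emit a fixed number of samples per token.
    return [t / 64.0 for t in tokens for _ in range(samples_per_token)]


def synthesize(text: str) -> List[float]:
    """Stage 4: chain the stages and return the output speech samples."""
    features = extract_features(text)
    tokens = text_to_semantic(features)
    return decode_to_waveform(tokens)
```

Calling `synthesize("Hi")` passes the text through all three placeholder stages in order. The point of the sketch is the interface between stages: the decoder receives semantic tokens, never the raw text.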