Advances in speech generation technology raise concerns about the potential misuse of synthetic speech.
The study addresses three key tasks: single-model attribution in an open-world scenario, model attribution in a closed-world scenario, and distinguishing synthetic from real speech.
The method uses the standardized average residual between audio signals and their filtered versions as a vocoder fingerprint for identifying the generating model.
These vocoder fingerprints achieve an average AUROC above 99% on the LJSpeech and JSUT datasets across the tasks considered.
An accompanying robustness study shows that the approach remains resilient to noise to a certain extent.
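
The fingerprint idea described above can be illustrated with a minimal sketch. The summary does not say which filter, feature representation, or scoring rule the study uses, so everything below is an assumption made for illustration: a moving-average low-pass filter, a time-averaged log-magnitude spectrum of the residual, and a correlation score. The helper `toy_vocoder` is hypothetical and only stands in for real vocoder output by adding a fixed spectral artifact to noise.

```python
import numpy as np

def residual(signal, kernel_size=5):
    """Signal minus a moving-average (low-pass) filtered copy of itself.
    The moving-average filter is an assumption, not the study's choice."""
    kernel = np.ones(kernel_size) / kernel_size
    smoothed = np.convolve(signal, kernel, mode="same")
    return signal - smoothed

def residual_features(signal, n_fft=256, hop=128):
    """Time-averaged log-magnitude spectrum of the residual (assumed feature)."""
    r = residual(signal)
    window = np.hanning(n_fft)
    starts = range(0, len(r) - n_fft + 1, hop)
    frames = np.stack([r[s:s + n_fft] * window for s in starts])
    spectrum = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(spectrum + 1e-8).mean(axis=0)

def standardize(v):
    """Shift to zero mean and scale to unit variance."""
    return (v - v.mean()) / (v.std() + 1e-12)

def fingerprint(signals):
    """Standardized average of the residual features over a set of signals."""
    feats = np.stack([residual_features(s) for s in signals])
    return standardize(feats.mean(axis=0))

def score(signal, fp):
    """Correlation between one signal's standardized features and a
    fingerprint; higher means a closer match (assumed scoring rule)."""
    f = standardize(residual_features(signal))
    return float(np.dot(f, fp) / len(fp))

def toy_vocoder(artifact_hz, rng, n=16000, fs=16000):
    """Hypothetical stand-in for vocoder output: noise plus a fixed tone."""
    t = np.arange(n) / fs
    phase = rng.uniform(0.0, 2.0 * np.pi)
    tone = 0.5 * np.sin(2.0 * np.pi * artifact_hz * t + phase)
    return 0.1 * rng.standard_normal(n) + tone

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fp_a = fingerprint([toy_vocoder(3000.0, rng) for _ in range(8)])
    print("same model:  ", score(toy_vocoder(3000.0, rng), fp_a))
    print("other model: ", score(toy_vocoder(6000.0, rng), fp_a))
```

In this sketch, single-model attribution amounts to thresholding the score against one model's fingerprint, and sweeping that threshold over scored samples is what yields an AUROC. A closed-world variant would compute one fingerprint per known model and pick the highest-scoring one.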