This paper introduces BemaGANv2, an advanced GAN-based vocoder for high-fidelity and long-term audio generation.
BemaGANv2 builds on the original BemaGAN architecture, adding architectural innovations such as the Anti-aliased Multi-Periodicity composition (AMP) module in the generator.
The generator in BemaGANv2 uses the Snake activation function to better model periodic structures in audio.
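To make the periodic inductive bias concrete, the following is a minimal sketch of the Snake activation in its commonly published form, snake(x) = x + sin²(αx)/α. It is an illustration rather than the BemaGANv2 implementation: the function names `snake` and `snake_all` are hypothetical, and α is treated as a fixed positive scalar, whereas in practice it is usually a learnable per-channel parameter.

```python
import math


def snake(x: float, alpha: float = 1.0) -> float:
    """Snake activation: x + sin^2(alpha * x) / alpha.

    Assumption: alpha is a fixed positive scalar here; trained models
    typically learn one alpha per channel.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return x + math.sin(alpha * x) ** 2 / alpha


def snake_all(xs, alpha: float = 1.0):
    """Apply the Snake activation element-wise to a sequence of floats."""
    return [snake(x, alpha) for x in xs]


if __name__ == "__main__":
    # The sin^2 term repeats every pi / alpha, so shifting the input by
    # that amount shifts the output by exactly the same amount.
    for x in (0.0, 0.5, 1.0, 2.0):
        print(x, snake(x, alpha=2.0))
```

The identity term keeps the function monotonic, so gradients never vanish, while the sin² term adds a component with period π/α that a network can tune to the periodicity of the signal.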
BemaGANv2's discriminator framework combines the Multi-Envelope Discriminator (MED), which extracts temporal envelope features, with the Multi-Resolution Discriminator (MRD), which models long-range dependencies.
BemaGANv2 is evaluated under several discriminator configurations, including MSD + MED, MSD + MRD, and MPD + MED + MRD (where MSD and MPD denote the Multi-Scale and Multi-Period Discriminators), using both objective metrics and subjective listening tests.
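The objective metrics are Fréchet Audio Distance (FAD), Structural Similarity Index Measure (SSIM), Pearson Linear Correlation Coefficient (PLCC), and Mel-Cepstral Distortion (MCD); the subjective evaluations use Mean Opinion Score (MOS) and Similarity Mean Opinion Score (SMOS).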
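Two of these metrics have short closed forms, sketched below using only the Python standard library. The sketch follows the textbook definitions of PLCC and MCD and is not taken from the BemaGANv2 evaluation code. The function names `plcc` and `mcd` are hypothetical, and the inputs are assumptions: equal-length feature sequences for PLCC, and time-aligned mel-cepstral frames with the 0th (energy) coefficient already removed for MCD.

```python
import math


def plcc(x, y):
    """Pearson linear correlation coefficient between two sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("PLCC is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def mcd(ref_frames, gen_frames):
    """Mean mel-cepstral distortion in dB over time-aligned frames.

    Assumption: each frame is a list of mel-cepstral coefficients with
    the 0th coefficient excluded, and both inputs are already aligned.
    """
    if len(ref_frames) != len(gen_frames) or not ref_frames:
        raise ValueError("inputs must be non-empty and equally long")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref, gen in zip(ref_frames, gen_frames):
        if len(ref) != len(gen):
            raise ValueError("frames must have the same dimension")
        sq_dist = sum((a - b) ** 2 for a, b in zip(ref, gen))
        total += scale * math.sqrt(2.0 * sq_dist)
    return total / len(ref_frames)
```

Higher PLCC (closer to 1) indicates closer linear agreement between reference and generated features, whereas lower MCD indicates a smaller spectral-envelope difference.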
The paper provides a tutorial on model architecture, training methodology, and implementation details to ensure reproducibility.
The code and pre-trained models for BemaGANv2 are available at https://github.com/dinhoitt/BemaGANv2.