Recent MIDI-to-audio synthesis methods using deep neural networks have been successful in generating high-quality, expressive instrumental tracks.
However, these methods typically require MIDI annotations for supervised training, which limits the diversity of instrument timbres and expression styles in their output.
CoSaRef is introduced as a MIDI-to-audio synthesis method that does not depend on paired MIDI-audio datasets.
CoSaRef operates in two stages: it first renders a synthetic audio track from the MIDI input via concatenative synthesis, then refines that track with a diffusion-based deep generative model trained without MIDI annotations.
By removing the need for MIDI annotations, CoSaRef enhances the diversity of timbres and expression styles in the generated output.
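To make the two-stage design concrete, here is a minimal, hypothetical sketch in Python/NumPy. The note representation, the `one_shot_samples` mapping, and the `denoiser` callable are assumptions made for illustration, and the refinement stage is shown as a generic DDPM-style conditional sampler rather than the paper's exact model.

```python
import numpy as np

SR = 44100  # sample rate in Hz; an assumption for this sketch


def concatenative_synthesis(notes, one_shot_samples, sr=SR):
    """First stage: render a rough track by placing one-shot samples at note onsets.

    `notes` is a list of (onset_sec, duration_sec, midi_pitch, velocity) tuples and
    `one_shot_samples` maps a MIDI pitch to a mono numpy array recorded at `sr`;
    both are simplified stand-ins for a real MIDI file and sample library.
    """
    length = int(sr * max(onset + dur for onset, dur, _, _ in notes)) + sr
    track = np.zeros(length, dtype=np.float32)
    for onset, dur, pitch, velocity in notes:
        sample = one_shot_samples[pitch][: int(dur * sr)]  # truncate to note length
        start = int(onset * sr)
        track[start:start + len(sample)] += (velocity / 127.0) * sample
    return track


def refine_with_diffusion(rough_track, denoiser, num_steps=50):
    """Second stage: a toy DDPM-style reverse loop that starts from noise and
    iteratively denoises, conditioning each step on the concatenative render.

    `denoiser(x_t, t, condition)` is a hypothetical network that predicts the
    added noise; in the paper's setup the refinement model is trained on audio
    alone, without MIDI annotations.
    """
    betas = np.linspace(1e-4, 0.02, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = np.random.randn(len(rough_track)).astype(np.float32)
    for t in reversed(range(num_steps)):
        eps = denoiser(x, t, rough_track)  # predicted noise at step t
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:  # add stochastic noise except at the final step
            x += np.sqrt(betas[t]) * np.random.randn(len(x))
    return x.astype(np.float32)
```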
CoSaRef also enables fine-grained control over timbre and expression through sample selection and additional MIDI design, much like traditional workflows in digital audio workstations; a usage sketch follows.
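As a hypothetical usage example building on the sketch above, timbre could be steered by swapping the sample bank fed to the concatenative stage, and expression by editing the MIDI notes, before the annotation-free refinement pass; `violin_bank`, `piano_bank`, and `trained_denoiser` are placeholder objects, not components defined by the paper.

```python
# Placeholder note list: (onset_sec, duration_sec, midi_pitch, velocity).
notes = [(0.0, 0.5, 60, 100), (0.5, 0.5, 64, 80), (1.0, 1.0, 67, 110)]

# Swap `violin_bank` for `piano_bank` (both hypothetical) to change the timbre,
# or edit `notes` to change phrasing, much as one would in a DAW.
rough = concatenative_synthesis(notes, one_shot_samples=violin_bank)
audio = refine_with_diffusion(rough, denoiser=trained_denoiser)
```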
Experiments demonstrated that CoSaRef can produce realistic tracks while maintaining detailed timbre control via one-shot samples.
Despite being trained without MIDI annotations, CoSaRef outperformed a state-of-the-art timbre-controllable method trained with MIDI supervision in both objective and subjective evaluations.