Speech separation (SS) seeks to disentangle a multi-talker speech mixture into single-talker speech streams.
Causal separation models, which rely only on past and present information, offer a promising solution for real-time streaming.
A novel frontend is introduced to mitigate the mismatch between training and run-time inference by incorporating future information into causal models through predictive patterns.
The frontend employs a transformer decoder network with a causal convolutional encoder as the backbone, and it is pretrained in a self-supervised manner with innovative pretext tasks.
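To make the causality constraint concrete, the following is a minimal sketch of a causal 1-D convolution, the basic operation a causal convolutional encoder is built on. It is not the frontend described here: the function name `causal_conv1d`, the kernel values, and the toy signal are hypothetical and chosen only for illustration. The sketch assumes zero left-padding, so that the output at time t depends only on samples up to and including t.

```python
def causal_conv1d(x, kernel):
    """Causal 1-D convolution: y[t] = sum_k kernel[k] * x[t - k].

    The input is left-padded with zeros (an assumption of this sketch),
    so y[t] depends only on x[0..t] and never on future samples.
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)
    out = []
    for t in range(len(x)):
        window = padded[t:t + k]  # covers x[t-k+1 .. t]
        # kernel[0] weights the current sample, kernel[k-1] the oldest one
        out.append(sum(w * v for w, v in zip(kernel, reversed(window))))
    return out


# Hypothetical toy signal and kernel, for illustration only.
signal = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [0.5, 0.3, 0.2]
y = causal_conv1d(signal, kernel)

# Perturbing samples at t >= 3 must leave the outputs at t < 3 unchanged.
perturbed = signal[:3] + [100.0, -100.0]
y_perturbed = causal_conv1d(perturbed, kernel)
assert y[:3] == y_perturbed[:3]
```

The final assertion checks the defining property of a causal model: changing future samples leaves all earlier outputs untouched, which is what permits frame-by-frame streaming.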