Dynamic Facial Expression Recognition (DFER) has received significant interest for enabling empathic and human-compatible technologies.
Multimodal emotion recognition based on audio and video data is being explored to improve the robustness of DFER models in real-world applications.
Recent advances in self-supervised learning (SSL) and in adapting pre-trained static models are increasingly being leveraged for multimodal DFER.
This work proposes adapting disjointly SSL-pre-trained unimodal encoders to improve multimodal DFER performance, achieving state-of-the-art results on DFER benchmarks.