Labels in spontaneous speech emotion data are often uncertain because graders disagree.
Using the probability density function of emotion grades as the training target, instead of the consensus grade, improves performance on benchmark evaluation sets.
Saliency-driven foundation model representation selection helps train a state-of-the-art speech emotion model for both dimensional and categorical emotion recognition.
Performance evaluation across multiple test sets, along with analysis across gender and speakers, is necessary to assess the usefulness of emotion models.
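
The second point above, training on the distribution of grades rather than on a single consensus grade, can be illustrated with a minimal sketch. It assumes discrete grade levels and uses a normalized histogram of grader scores as the target; the function names, the five-level scale, and the example grades are hypothetical and are not taken from the work summarized here, which may estimate the density differently.

```python
import numpy as np

def grade_distribution(grades, num_levels, smoothing=0.0):
    """Normalized histogram of one utterance's grader scores (soft target).

    Assumes grades are integers in 0..num_levels-1 (an illustrative choice).
    """
    counts = np.bincount(np.asarray(grades, dtype=int), minlength=num_levels)
    counts = counts.astype(float) + smoothing
    return counts / counts.sum()

def consensus_one_hot(grades, num_levels):
    """One-hot target at the most frequent grade (hard consensus label)."""
    counts = np.bincount(np.asarray(grades, dtype=int), minlength=num_levels)
    target = np.zeros(num_levels)
    target[np.argmax(counts)] = 1.0
    return target

def cross_entropy(target, logits):
    """Cross-entropy between a target distribution and softmax(logits)."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()                   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-(target * log_probs).sum())

# Five hypothetical graders score one utterance on a 5-level scale.
grades = [2, 2, 3, 1, 2]
soft = grade_distribution(grades, num_levels=5)   # keeps the disagreement
hard = consensus_one_hot(grades, num_levels=5)    # discards the disagreement
```

With the soft target, a model is penalized less for placing some probability on the neighbouring grades that a minority of graders chose, whereas the one-hot consensus target treats those grades as entirely wrong.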