Creating a Speech and Music Emotion Recognition System for Mixed Source Audio
Summary
While both speech emotion recognition and music emotion recognition have been studied extensively in their respective communities, little research has gone into recognizing emotion from mixed audio sources, i.e. when both speech and music are present. However, many application scenarios, such as television content, require models that can extract emotions from mixed audio sources. We coined this recognition problem MiSME recognition: Mixed Speech Music Emotion recognition. This master thesis studies how mixed audio affects both speech and music emotion recognition using a random forest model and a deep neural network model, investigates whether blind source separation of the mixed signal beforehand is beneficial, and performs a feature importance analysis. We created a mixed-audio dataset with 25% speech-music overlap and no contextual relationship between the two sources.
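As a concrete illustration of how such a mixed clip could be assembled, the sketch below overlays a music excerpt onto part of a speech recording. The file paths, sample rate, gain handling, and the placement of the 25% overlap are illustrative assumptions, not the exact procedure used to build the thesis dataset.

```python
# Hypothetical sketch: overlay a music excerpt onto the last 25% of a speech clip.
# File names, the sample rate, and the normalization step are illustrative assumptions.
import numpy as np
import soundfile as sf
import librosa

def mix_speech_music(speech_path, music_path, overlap_ratio=0.25, sr=16000):
    """Return a mono clip in which music overlaps the tail of the speech signal."""
    speech, _ = librosa.load(speech_path, sr=sr, mono=True)
    music, _ = librosa.load(music_path, sr=sr, mono=True)

    overlap_len = int(len(speech) * overlap_ratio)    # samples shared by both sources
    if len(music) < overlap_len:                       # pad music if it is too short
        music = np.pad(music, (0, overlap_len - len(music)))
    music = music[:overlap_len]                        # truncate music to the overlap window

    mixed = speech.copy()
    mixed[-overlap_len:] += music                      # additive mix in the overlap region
    mixed /= max(1.0, np.abs(mixed).max())             # avoid clipping after summation
    return mixed

if __name__ == "__main__":
    clip = mix_speech_music("speech.wav", "music.wav")
    sf.write("mixed.wav", clip, 16000)
```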
Both the speech and the music emotion recognition studies comprised six experiments, in which the models were trained and tested on different combinations of the three available audio types: single-source audio (speech-only / music-only), mixed audio, and blind-source-separated audio. Deezer's Spleeter tool was used to create the blind-source-separated version of the dataset.
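Spleeter exposes a simple Python API for this separation step. The sketch below shows a minimal invocation; the choice of the pretrained 2-stem model (vocals / accompaniment) and the file names are assumptions, since the summary does not state which Spleeter configuration was used.

```python
from spleeter.separator import Separator

# Load a pretrained 2-stem model (vocals / accompaniment); whether the thesis used
# this configuration or another Spleeter model is an assumption here.
separator = Separator('spleeter:2stems')

# Writes separated/mixed_clip/vocals.wav and separated/mixed_clip/accompaniment.wav.
separator.separate_to_file('mixed_clip.wav', 'separated/')
```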
The results showed that both speech and music emotion recognition are possible on mixed audio far above chance level, meaning that a functional MiSME system can indeed be built. The speech models performed best when blind source separation was included as a preprocessing step, but a performance gap remained compared to speech-only audio, suggesting that lower speech emotion recognition performance should be expected on mixed audio. The music models performed better on mixed audio than on music-only audio, with or without blind source separation depending on the model. We attributed this to the presence of speech forcing the models to favor less ambiguous features during training, resulting in better-generalizing models. The results also showed that both speech and music models trained on single-source audio achieve only chance-level performance on mixed audio, rendering them incapable of MiSME recognition.
The feature importance analysis produced many insights into which speech and music features are (un)important for mixed audio. For both speech and music emotion recognition, the optimal features were highly dissimilar between audio types, so features selected for one audio type do not transfer directly to another.
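As an illustration of how such an analysis could be set up for the random forest model, the sketch below ranks features by scikit-learn's impurity-based importances; the library, the importance measure, and the function names are assumptions, not necessarily the procedure used in the thesis.

```python
# Hypothetical feature importance analysis with a random forest (scikit-learn).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rank_features(X, y, feature_names, n_estimators=500, random_state=0):
    """Fit a random forest and return (name, importance) pairs, most important first."""
    forest = RandomForestClassifier(n_estimators=n_estimators, random_state=random_state)
    forest.fit(X, y)
    order = np.argsort(forest.feature_importances_)[::-1]
    return [(feature_names[i], forest.feature_importances_[i]) for i in order]

# Comparing the rankings obtained on speech-only, mixed, and separated versions of the
# same clips would show how the optimal feature set shifts between audio types.
```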
This research thus not only shows that both speech and music emotion recognition are possible far above chance level on mixed audio, but also gives insight into the use of blind source separation and common speech and music features in a mixed-audio scenario. This is important knowledge when estimating emotion from real-world data, where individual speech and music tracks are often not available.