Predicting Relevance of Emotion Tags
Summary
Emotion tags form an important tool for searching and categorizing large music collections such as online music services, e.g. Last.fm, or production music databases. However, the tags are often not provided or evaluated by experts, resulting in noisy and less useful tag sets. The main goal of this research is to provide a method for automatically evaluating the relevance of those tags. The method uses the distance between an emotion tag and an audio clip, both plotted in arousal/valence (AV) space, as a predictor of tag relevance. To this end, first, AV prediction regression models for audio clips are trained and tested with cross-validation on three different datasets. Results on the train/test sets match state-of-the-art R² scores; however, performance deteriorates when the models are validated on other datasets, especially for the valence prediction models. Therefore, the human-rated rather than the predicted AV values of the clips are used in the next step. Second, the relevance of four emotion tags (angry, happy, sad and tension) and of one set of ten tags combined is predicted for audio clips from two datasets, using regression models trained and tested with two different predictors separately: (1) the distance between the AV values of the clip and the emotion words, and (2) audio features directly. Except for ‘angry’, the AV distance predictors outperform the audio feature predictors (e.g. for ‘happy’, R² > .65 vs. R² > .18, respectively). A second evaluation on a self-compiled dataset with a larger set of different emotion tags did not lead to useful results. Whether the method generalizes to other tags therefore remains inconclusive. These findings (1) indicate that AV prediction of music is still in a developmental phase, and (2) point toward a promising new method for evaluating at least a few emotion tags.
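To make the core idea concrete, the sketch below illustrates (under assumed, hypothetical data and variable names, not the actual implementation used in this research) how tag relevance could be regressed on the Euclidean distance between a clip and an emotion word in AV space, with cross-validated R² as the evaluation score.

```python
# Illustrative sketch only (not the thesis code): predict emotion-tag relevance
# from the Euclidean distance between a clip and a tag in arousal/valence space.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

# Hypothetical placeholder data: human-rated AV coordinates of clips, the AV
# position of one emotion word (e.g. 'happy'), and relevance ratings for that tag.
rng = np.random.default_rng(0)
clip_av = rng.random((200, 2))        # columns: [valence, arousal] per clip
tag_av = np.array([0.8, 0.6])         # assumed AV position of the emotion word
relevance = rng.random(200)           # placeholder human relevance ratings

# Predictor (1): distance between each clip and the tag in AV space.
av_distance = np.linalg.norm(clip_av - tag_av, axis=1).reshape(-1, 1)

# Regression model trained and tested with cross-validation, scored with R².
model = SVR(kernel="rbf")
r2_scores = cross_val_score(model, av_distance, relevance, cv=10, scoring="r2")
print("mean R2:", r2_scores.mean())
```

With real rated data, the same pipeline could be rerun with audio features in place of the AV distance to compare the two predictor types, as done in the evaluation described above.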