Show simple item record

dc.rights.license: CC-BY-NC-ND
dc.contributor: none
dc.contributor.advisor: Gauthier, David
dc.contributor.author: Bakker, Ilse
dc.date.accessioned: 2025-08-28T00:01:03Z
dc.date.available: 2025-08-28T00:01:03Z
dc.date.issued: 2025
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/50015
dc.description.abstract: This thesis investigates the potential of the Wigner-Ville Distribution (WVD) as a high-resolution alternative time-frequency music representation to the widely used, but less detailed, Mel-spectrogram as input for Convolutional Recurrent Neural Network (CRNN)-based audio feature extraction (Hershey et al., 2017). While the WVD has previously been applied in urban sound classification tasks using CNNs (Christonasis et al., 2023), its use for musical feature extraction, specifically genre, energy, and valence, remains largely unexplored. The deep learning architecture employed in this study builds upon the Music2Vec pipeline proposed by Hebbar (2020), combining convolutional layers to extract spatial features from time-frequency representations, Long Short-Term Memory (LSTM) layers to capture temporal dependencies, and a Deep Neural Network (DNN) for both classification and regression tasks. Experiments were conducted on the Free Music Archive (FMA) dataset, consisting of 8 balanced musical genres with 1,000 tracks each (30 seconds per track). For each of the 8,000 tracks, five Pseudo-WVD (PWVD) spectrogram versions were generated by varying segment length and frequency resolution, alongside comparative Mel-spectrograms. A final experiment combined both representations as a two-channel input. Preliminary results show that Mel-spectrograms outperform PWVD in genre classification, likely due to their perceptual scaling and time resolution, which effectively capture rhythm, harmony, and instrumentation. Conversely, PWVD excels at predicting energy, benefiting from its fine-grained spectral detail. Valence prediction showed no clear pattern, although the best performance was achieved with a PWVD projected onto the Mel scale, suggesting that combining perceptual and detailed spectral information may be advantageous for valence prediction. Moreover, the combined PWVD and Mel input did not yield a substantial improvement, though the approach remains promising with a different architecture.
dc.description.sponsorship: Utrecht University
dc.language.iso: EN
dc.subject: H8 What you hear is what you get: An exploration of audio level feature extraction for music recommendations
dc.title: H8 What you hear is what you get: An exploration of audio level feature extraction for music recommendations
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.courseuu: Applied Data Science
dc.thesis.id: 52762
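The abstract above contrasts PWVD spectrograms with Mel-spectrograms as model inputs. As context for readers unfamiliar with the representation, here is a minimal sketch of a discrete pseudo Wigner-Ville distribution in numpy: the instantaneous autocorrelation x[t+τ]·x*[t−τ] is windowed over lag (the "pseudo" smoothing) and Fourier-transformed along the lag axis. This is a generic textbook construction for illustration, not the thesis's actual preprocessing code; the function name and window choice (Hann) are assumptions.

```python
import numpy as np

def pseudo_wvd(x, win_len=63):
    """Discrete pseudo Wigner-Ville distribution (illustrative sketch).

    Computes the instantaneous autocorrelation x[t+tau] * conj(x[t-tau]),
    smoothed over lag by a Hann window (the 'pseudo' part, which tames
    cross-terms), then Fourier-transformed along the lag axis.
    win_len should be odd so the lag axis is symmetric around tau = 0.
    """
    x = np.asarray(x, dtype=complex)
    n = len(x)
    half = win_len // 2
    win = np.hanning(win_len)
    taus = np.arange(-half, half + 1)  # lag axis; tau = 0 sits at index `half`
    r = np.zeros((n, win_len), dtype=complex)
    for t in range(n):
        lo, hi = t + taus, t - taus
        ok = (lo >= 0) & (lo < n) & (hi >= 0) & (hi < n)
        r[t, ok] = x[lo[ok]] * np.conj(x[hi[ok]]) * win[ok]
    # Move tau = 0 to index 0 so the FFT phases line up; the lag kernel is
    # conjugate-symmetric with a symmetric window, so the result is real.
    return np.fft.fft(np.fft.ifftshift(r, axes=1), axis=1).real

# Note: the lag kernel of a tone oscillates at twice the signal frequency,
# so frequency bin m corresponds to m / (2 * win_len) cycles per sample.
```

For a pure tone, the resulting time-frequency plane concentrates energy along a single frequency line, which is the fine spectral detail the abstract credits for PWVD's advantage on energy prediction.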

