Show simple item record

dc.rights.license: CC-BY-NC-ND
dc.contributor: none
dc.contributor.advisor: Gauthier, David
dc.contributor.author: Bakker, Ilse
dc.date.accessioned: 2025-08-28T00:01:03Z
dc.date.available: 2025-08-28T00:01:03Z
dc.date.issued: 2025
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/50015
dc.description.abstract: This thesis investigates the potential of the Wigner-Ville Distribution (WVD) as a high-resolution alternative time-frequency music representation to the widely used, but less detailed, Mel-spectrogram as input for Convolutional Recurrent Neural Network (CRNN)-based audio feature extraction (Hershey et al., 2017). While the WVD has previously been applied in urban sound classification tasks using CNNs (Christonasis et al., 2023), its use for musical feature extraction, specifically genre, energy, and valence, remains largely unexplored. The deep learning architecture employed in this study builds upon the Music2Vec pipeline proposed by Hebbar (2020), combining convolutional layers to extract spatial features from time-frequency representations, Long Short-Term Memory (LSTM) layers to capture temporal dependencies, and a Deep Neural Network (DNN) for both classification and regression tasks. Experiments were conducted on the Free Music Archive (FMA) dataset, consisting of 8 balanced musical genres with 1,000 tracks each (30 seconds per track). For each of the 8,000 tracks, five Pseudo-WVD (PWVD) spectrogram versions were generated by varying segment length and frequency resolution, alongside comparative Mel-spectrograms. A final experiment combined both representations as a two-channel input. Preliminary results show that Mel-spectrograms outperform PWVD in genre classification, likely due to their perceptual scaling and time resolution, which effectively capture rhythm, harmony, and instrumentation. Conversely, PWVD excels at predicting energy, benefiting from its fine-grained spectral detail. Valence prediction showed no clear pattern, although the best performance was achieved with a PWVD projected onto the Mel scale, suggesting that combining perceptual and detailed spectral information may be advantageous for valence prediction. Moreover, the combined PWVD and Mel input did not yield a substantial improvement, though the approach remains promising with a different architecture.
dc.description.sponsorship: Utrecht University
dc.language.iso: EN
dc.subject: H8 What you hear is what you get: An exploration of audio level feature extraction for music recommendations
dc.title: H8 What you hear is what you get: An exploration of audio level feature extraction for music recommendations
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.courseuu: Applied Data Science
dc.thesis.id: 52762
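The abstract above contrasts PWVD spectrograms with Mel-spectrograms as model inputs. As context for readers unfamiliar with the representation, here is a minimal sketch of a discrete pseudo Wigner-Ville distribution in numpy: the instantaneous autocorrelation x[t+τ]·x*[t−τ] is windowed over lag (the "pseudo" smoothing) and Fourier-transformed along the lag axis. This is a generic textbook construction for illustration, not the thesis's actual preprocessing code; the function name and window choice (Hann) are assumptions.

```python
import numpy as np

def pseudo_wvd(x, win_len=63):
    """Discrete pseudo Wigner-Ville distribution (illustrative sketch).

    Computes the instantaneous autocorrelation x[t+tau] * conj(x[t-tau]),
    smoothed over lag by a Hann window (the 'pseudo' part, which tames
    cross-terms), then Fourier-transformed along the lag axis.
    win_len should be odd so the lag axis is symmetric around tau = 0.
    """
    x = np.asarray(x, dtype=complex)
    n = len(x)
    half = win_len // 2
    win = np.hanning(win_len)
    taus = np.arange(-half, half + 1)  # lag axis; tau = 0 sits at index `half`
    r = np.zeros((n, win_len), dtype=complex)
    for t in range(n):
        lo, hi = t + taus, t - taus
        ok = (lo >= 0) & (lo < n) & (hi >= 0) & (hi < n)
        r[t, ok] = x[lo[ok]] * np.conj(x[hi[ok]]) * win[ok]
    # Move tau = 0 to index 0 so the FFT phases line up; the lag kernel is
    # conjugate-symmetric with a symmetric window, so the result is real.
    return np.fft.fft(np.fft.ifftshift(r, axes=1), axis=1).real

# Note: the lag kernel of a tone oscillates at twice the signal frequency,
# so frequency bin m corresponds to m / (2 * win_len) cycles per sample.
```

For a pure tone, the resulting time-frequency plane concentrates energy along a single frequency line, which is the fine spectral detail the abstract credits for PWVD's advantage on energy prediction.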

