        What you hear is what you get: An exploration of audio level feature extraction for music recommendations

        View/Open
        Thesis_Ilse_Bakker.docx (3.098 MB)
        Publication date
        2025
        Author
        Bakker, Ilse
        Summary
        This thesis investigates the potential of the Wigner-Ville Distribution (WVD) as a high-resolution alternative time-frequency music representation to the widely used, but less detailed, Mel-spectrogram as input for Convolutional Recurrent Neural Network (CRNN)-based audio feature extraction (Hershey et al., 2017). While the WVD has previously been applied in urban sound classification tasks using CNNs (Christonasis et al., 2023), its use for musical feature extraction (specifically genre, energy, and valence) remains largely unexplored. The deep learning architecture employed in this study builds upon the Music2Vec pipeline proposed by Hebbar (2020), combining convolutional layers to extract spatial features from time-frequency representations, Long Short-Term Memory (LSTM) layers to capture temporal dependencies, and a Deep Neural Network (DNN) for both classification and regression tasks. Experiments were conducted on the Free Music Archive (FMA) dataset, consisting of 8 balanced musical genres with 1,000 tracks each (30 seconds per track). For each of the 8,000 tracks, five Pseudo WVD (PWVD) spectrogram versions were generated by varying segment length and frequency resolution, alongside comparative Mel-spectrograms. A final experiment combined both representations as a two-channel input. Preliminary results show that Mel-spectrograms outperform PWVD in genre classification, likely owing to their perceptual scaling and time resolution, which effectively capture rhythm, harmony, and instrumentation. Conversely, PWVD excels at predicting energy, benefiting from its fine-grained spectral detail. Valence prediction showed no clear pattern, although the best performance was achieved with a PWVD projected onto the Mel scale, suggesting that combining perceptual and detailed spectral information may be advantageous for valence prediction. Moreover, the combined PWVD & Mel two-channel input did not show substantial improvement, but remains promising with a different architecture.
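        As an illustration of the kind of pipeline the summary describes (not the thesis code, which is not reproduced here), the sketch below shows how a two-channel Mel + PWVD input and a small conv + LSTM + dense network might be assembled in Python with librosa and PyTorch. All layer sizes, sample rates, and the precomputed, resampled PWVD array are assumptions for illustration only.

import numpy as np
import librosa
import torch
import torch.nn as nn

def mel_db(path, sr=22050, n_mels=128):
    # Log-scaled Mel-spectrogram of a 30-second clip; parameters are illustrative guesses.
    y, sr = librosa.load(path, sr=sr, duration=30.0)
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(S, ref=np.max)

def two_channel_input(mel, pwvd):
    # Stack a Mel-spectrogram and a PWVD of matching shape into a (2, freq, time)
    # array; the PWVD is assumed to be precomputed and resampled to the Mel grid.
    return np.stack([mel, pwvd], axis=0).astype(np.float32)

class CRNN(nn.Module):
    # Minimal conv -> LSTM -> dense sketch for input of shape (batch, 2, 128, time).
    # Layer sizes are illustrative, not the configuration used in the thesis.
    def __init__(self, in_channels=2, n_genres=8):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # After two 2x2 poolings the 128 frequency bins become 32, so each
        # time step feeds 64 * 32 features into the LSTM.
        self.lstm = nn.LSTM(input_size=64 * 32, hidden_size=128, batch_first=True)
        self.genre_head = nn.Linear(128, n_genres)  # classification (8 FMA genres)
        self.va_head = nn.Linear(128, 2)            # regression (energy, valence)

    def forward(self, x):
        h = self.conv(x)                                 # (batch, 64, 32, time/4)
        b, c, f, t = h.shape
        h = h.permute(0, 3, 1, 2).reshape(b, t, c * f)   # time-major sequence
        _, (h_n, _) = self.lstm(h)                       # last hidden state
        h_last = h_n[-1]
        return self.genre_head(h_last), self.va_head(h_last)

        A forward pass on one stacked example, e.g. logits, va = CRNN()(torch.from_numpy(x)[None]), then yields genre logits and (energy, valence) estimates from the shared recurrent representation.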
        URI
        https://studenttheses.uu.nl/handle/20.500.12932/50015
        Collections
        • Theses