        What you hear is what you get: An exploration of audio level feature extraction for music recommendations

        View/Open
        Thesis_Ilse_Bakker.docx (3.098 MB)
        Publication date
        2025
        Author
        Bakker, Ilse
        Summary
        This thesis investigates the potential of the Wigner-Ville Distribution (WVD) as a high-resolution alternative time-frequency music representation to the widely used, but less detailed, Mel-spectrogram as input for Convolutional Recurrent Neural Network (CRNN)-based audio feature extraction (Hershey et al., 2017). While the WVD has previously been applied in urban sound classification tasks using CNNs (Christonasis et al., 2023), its use for musical feature extraction (specifically genre, energy, and valence) remains largely unexplored. The deep learning architecture employed in this study builds upon the Music2Vec pipeline proposed by Hebbar (2020), combining convolutional layers to extract spatial features from time-frequency representations, Long Short-Term Memory (LSTM) layers to capture temporal dependencies, and a Deep Neural Network (DNN) for both classification and regression tasks. Experiments were conducted on the Free Music Archive (FMA) dataset, consisting of 8 balanced musical genres with 1,000 tracks each (30 seconds per track). For each of the 8,000 tracks, five Pseudo WVD (PWVD) spectrogram versions were generated by varying segment length and frequency resolution, alongside comparative Mel-spectrograms. A final experiment combined both representations as a two-channel input. Preliminary results show that Mel-spectrograms outperform PWVD in genre classification, likely owing to their perceptual scaling and time resolution, which effectively capture rhythm, harmony, and instrumentation. Conversely, PWVD excels at predicting energy, benefiting from its fine-grained spectral detail. Valence prediction showed no clear pattern, although the best performance was achieved with a PWVD projected onto the Mel scale, suggesting that combining perceptual and detailed spectral information may be advantageous for valence prediction. Moreover, the combined PWVD & Mel two-channel input did not show substantial improvement, but remains promising with a different architecture.
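        As an illustration of the kind of pipeline the summary describes (not the thesis code, which is not reproduced here), the sketch below shows how a two-channel Mel + PWVD input and a small conv + LSTM + dense network might be assembled in Python with librosa and PyTorch. All layer sizes, sample rates, and the precomputed, resampled PWVD array are assumptions for illustration only.

import numpy as np
import librosa
import torch
import torch.nn as nn

def mel_db(path, sr=22050, n_mels=128):
    # Log-scaled Mel-spectrogram of a 30-second clip; parameters are illustrative guesses.
    y, sr = librosa.load(path, sr=sr, duration=30.0)
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(S, ref=np.max)

def two_channel_input(mel, pwvd):
    # Stack a Mel-spectrogram and a PWVD of matching shape into a (2, freq, time)
    # array; the PWVD is assumed to be precomputed and resampled to the Mel grid.
    return np.stack([mel, pwvd], axis=0).astype(np.float32)

class CRNN(nn.Module):
    # Minimal conv -> LSTM -> dense sketch for input of shape (batch, 2, 128, time).
    # Layer sizes are illustrative, not the configuration used in the thesis.
    def __init__(self, in_channels=2, n_genres=8):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # After two 2x2 poolings the 128 frequency bins become 32, so each
        # time step feeds 64 * 32 features into the LSTM.
        self.lstm = nn.LSTM(input_size=64 * 32, hidden_size=128, batch_first=True)
        self.genre_head = nn.Linear(128, n_genres)  # classification (8 FMA genres)
        self.va_head = nn.Linear(128, 2)            # regression (energy, valence)

    def forward(self, x):
        h = self.conv(x)                                 # (batch, 64, 32, time/4)
        b, c, f, t = h.shape
        h = h.permute(0, 3, 1, 2).reshape(b, t, c * f)   # time-major sequence
        _, (h_n, _) = self.lstm(h)                       # last hidden state
        h_last = h_n[-1]
        return self.genre_head(h_last), self.va_head(h_last)

        A forward pass on one stacked example, e.g. logits, va = CRNN()(torch.from_numpy(x)[None]), then yields genre logits and (energy, valence) estimates from the shared recurrent representation.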
        URI
        https://studenttheses.uu.nl/handle/20.500.12932/50015
        Collections
        • Theses