dc.description.abstract | In the age of digital music streaming, the ability to automatically understand
and organize audio content is critical for applications such as recommendation,
retrieval, and genre detection. This thesis introduces a self-supervised learning
approach for extracting semantic embeddings from raw audio waveforms using a
hybrid CNN-Transformer model. These embeddings are trained using contrastive learning (Barlow Twins) and are intended for use in content-based music recommendation systems. By combining local acoustic feature extraction via CNNs with
sequence modelling via Transformers, the system learns rich representations without labelled data. We evaluate the embeddings using t-SNE visualizations, FAISS-based similarity retrieval, and a prototype interactive recommendation demo. The results demonstrate the effectiveness of our approach in organizing music meaningfully and enabling cold-start recommendation without user history. | |