What you hear is what you get: An exploration of audio level feature extraction for music recommendations
Summary
In the age of digital music streaming, the ability to automatically understand
and organize audio content is critical for applications such as recommendation,
retrieval, and genre detection. This thesis introduces a self-supervised learning
approach for extracting semantic embeddings from raw audio waveforms using a
hybrid CNN-Transformer model. The embeddings are trained with the Barlow Twins
self-supervised objective and are intended for use in content-based music
recommendation systems. By combining local acoustic detail extraction via CNNs
with sequence modelling via Transformers, the system learns rich representations
without labelled data. We evaluate the embeddings using t-SNE visualizations,
FAISS-based similarity retrieval, and a prototype interactive recommendation
demo. The results show that the learned embeddings organize music meaningfully
and enable cold-start recommendation without user listening history.
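For concreteness, the sketch below shows one common formulation of the Barlow Twins loss used for this kind of self-supervised training. The function name, the lambd weight, and the standardisation details are illustrative assumptions, not the exact implementation used in this thesis.

import torch

def barlow_twins_loss(z_a: torch.Tensor, z_b: torch.Tensor, lambd: float = 5e-3) -> torch.Tensor:
    """Barlow Twins redundancy-reduction loss for two views of the same audio batch.

    z_a, z_b: (batch, dim) embeddings of two augmented views of the same clips.
    lambd: weight of the off-diagonal (redundancy) term; 5e-3 is an assumed default.
    """
    n, d = z_a.shape
    # Standardise each embedding dimension across the batch.
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-6)
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-6)
    # Empirical cross-correlation matrix between the two views; ideally the identity.
    c = (z_a.T @ z_b) / n                                  # (dim, dim)
    c_diff = (c - torch.eye(d, device=c.device)).pow(2)
    on_diag = torch.diagonal(c_diff).sum()                 # pull C_ii towards 1
    off_diag = c_diff.sum() - on_diag                      # push C_ij (i != j) towards 0
    return on_diag + lambd * off_diag

In training, z_a and z_b would typically come from passing two differently augmented versions of the same waveform through the CNN-Transformer encoder and a projection head.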
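FAISS-based similarity retrieval over the learned embeddings can be sketched as follows. The embedding dimensionality, the exact inner-product index, and the random placeholder data are assumptions for illustration only.

import numpy as np
import faiss

dim = 256                                           # assumed embedding size
n_tracks = 10_000

# Placeholder for one encoder embedding per track (random here for illustration).
track_embeddings = np.random.rand(n_tracks, dim).astype("float32")
faiss.normalize_L2(track_embeddings)                # cosine similarity via inner product

index = faiss.IndexFlatIP(dim)                      # exact inner-product search
index.add(track_embeddings)

# Cold-start query: embed a new, unseen track and retrieve its nearest neighbours,
# without needing any listening history for it.
query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, neighbours = index.search(query, 10)
print(neighbours[0])                                # indices of the 10 most similar tracks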