dc.description.abstract | In the age of digital music streaming, the ability to automatically understand
and organize audio content is critical for applications such as recommendation,
retrieval, and genre detection. This thesis introduces a self-supervised learning
approach for extracting semantic embeddings from raw audio waveforms using a
hybrid CNN-Transformer model. These embeddings are trained using contrastive learning (Barlow Twins) and are intended for use in content-based music recommendation systems. By combining local acoustic feature extraction via CNNs with
sequence modelling via Transformers, the system learns rich representations without labelled data. We evaluate the embeddings using t-SNE visualizations, FAISS-based similarity retrieval, and a prototype interactive recommendation demo. The results demonstrate the effectiveness of our approach in organizing music meaningfully and enabling cold-start recommendation without user history. | |