A Multimodal Approach: Acoustic-Linguistic Modelling for Neural Extractive Speech Summarisation on Podcasts

Çalik, Berk

Publication date

2023

Author

Çalik, Berk

Summary

Podcasts, a contemporary medium of audio-only content, have grown rapidly in both consumption and production across the internet. As their popularity has accelerated in recent years, effectively publicising podcast shows has become a need for podcast creators, listeners and streaming platforms alike. To improve the overall visibility of podcast content and enhance user engagement, episode summaries are needed both by users and by search and recommendation systems, where they can replace or complement keywords, manual descriptions and transcripts. Since manually summarising podcast episodes is time-consuming, automatic summarisation is a valuable task. In the automatic summarisation of spoken documents specifically, the salience of the extracted information depends not only on what is said but also on how it is said. This thesis therefore investigates summarisation models for podcasts and proposes a multimodal approach that exploits both acoustic and linguistic features. Accordingly, we explore how to automatically generate an extractive summary of a podcast episode in a multimodal way. We employ a lexical-only pre-trained transformer model (SentenceBERT) to embed the sentences in the transcripts. In this work, speech summarisation of podcasts is defined as a classification task: the goal is to extract meaningful sentences from the transcribed text, where the importance of each sentence is predicted by combining acoustic and linguistic information. To build an experimental setup for analysing the impact of acoustic features, we integrated a two-layer multilayer perceptron on top of the SentenceBERT model. Feature projection, ranking and selection were also performed to analyse the importance of the acoustic features.
After projection and selection of acoustic features, our proposed multimodal model outperforms the text-only baseline and achieves moderately better ROUGE scores. With this project, we aim not to find a complete solution for the automatic summarisation of podcast episodes but to understand a critical piece of the puzzle: how to incorporate acoustic features into podcast summarisation.
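
The pipeline described in the summary (sentence embeddings combined with acoustic features, scored by a two-layer multilayer perceptron, then used to pick sentences) can be illustrated with a minimal sketch. This is not the thesis implementation. The fusion by concatenation, the ReLU and sigmoid activations, the top-k selection rule, the vector sizes, and all names (`mlp_score`, `extractive_summary`, `params`) are assumptions made for illustration. The vectors and weights are random placeholders standing in for SentenceBERT embeddings, real acoustic features and trained parameters, so the selected sentences are arbitrary.

```python
import math
import random


def mlp_score(features, w1, b1, w2, b2):
    """Two-layer MLP: ReLU hidden layer, then a sigmoid output in (0, 1)."""
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(w1, b1)
    ]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))


def extractive_summary(sentences, text_emb, acoustic, params, k=2):
    """Score each sentence on fused features; keep the top k in transcript order."""
    # Assumption: the two modalities are fused by simple concatenation.
    scores = [mlp_score(t + a, *params) for t, a in zip(text_emb, acoustic)]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


# Hypothetical toy sizes; real sentence embeddings are far larger.
TEXT_DIM, ACOUSTIC_DIM, HIDDEN = 4, 2, 3
rng = random.Random(0)
in_dim = TEXT_DIM + ACOUSTIC_DIM

# Untrained random weights, standing in for a trained classifier.
params = (
    [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(HIDDEN)],  # w1
    [0.0] * HIDDEN,                                                        # b1
    [rng.uniform(-1, 1) for _ in range(HIDDEN)],                           # w2
    0.0,                                                                   # b2
)

sentences = [
    "Welcome to the show.",
    "Today we discuss speech summarisation.",
    "Thanks for listening.",
]
# Placeholder vectors instead of real text embeddings and acoustic features.
text_emb = [[rng.uniform(-1, 1) for _ in range(TEXT_DIM)] for _ in sentences]
acoustic = [[rng.uniform(-1, 1) for _ in range(ACOUSTIC_DIM)] for _ in sentences]

summary = extractive_summary(sentences, text_emb, acoustic, params, k=2)
print(summary)
```

In this sketch a text-only baseline would correspond to scoring `text_emb` alone with a classifier whose input size is `TEXT_DIM`, which is one way to picture the comparison the summary reports.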

URI

https://studenttheses.uu.nl/handle/20.500.12932/43582

Collections

Theses