Improving the Automatic Speech Recognition Model Whisper with Voice Activity Detection

Dielen, Isa

dc.rights.license	CC-BY-NC-ND
dc.contributor.advisor	Karnstedt-Hulpus, I.R.
dc.contributor.author	Dielen, Isa
dc.date.accessioned	2023-09-06T10:08:44Z
dc.date.available	2023-09-06T10:08:44Z
dc.date.issued	2023
dc.identifier.uri	https://studenttheses.uu.nl/handle/20.500.12932/45045
dc.description.abstract	The Dutch National Police possesses a substantial amount of audio data, and transcribing these audio files manually for analysis purposes is very labour-intensive. That is why the department TROI developed an application that uses Whisper, a state-of-the-art Automatic Speech Recognition model, to automatically transcribe these audio files. Although the accuracy of Whisper is quite high, its execution is relatively slow, which poses a challenge when large amounts of audio must be transcribed. Considering Whisper’s minimalist strategy for data pre-processing, it is conceivable that incorporating advanced pre-processing techniques could further optimize its performance in terms of running time, without a considerable compromise in accuracy. Therefore, this research aims to investigate the influence of the pre-processing method voice activity detection on Whisper’s performance. The experiments in this research compare the performance of Whisper, Faster-Whisper and WhisperX on Dutch long-form audio data from two datasets: CGN and NFI-FRITS. Faster-Whisper and WhisperX each incorporate a voice activity detection model: Silero VAD and PyAnnote VAD, respectively. Additional experiments with hyperparameter tuning and testing of the voice activity detection models are conducted. The evaluation metrics used in this research are Word Error Rate (WER), precision, recall, F1 score and Real-Time Factor (RTF). The results demonstrate that both Faster-Whisper and WhisperX outperform the baseline Whisper model. They exhibit improved WER, precision, recall, F1 score, and RTF, indicating the advantages of incorporating voice activity detection models within the Whisper framework. Generally, WhisperX outperforms Faster-Whisper across the different datasets and settings on almost all performance metrics. Furthermore, the effects of tuning the speech probability thresholds in Faster-Whisper and WhisperX are not clear, as they do not show a specific trend.
The comparison between Silero VAD and PyAnnote VAD shows variations in precision and recall: PyAnnote VAD, as incorporated in WhisperX, performs better on precision, while Silero VAD, as incorporated in Faster-Whisper, performs better on recall. In conclusion, incorporating a voice activity detection model as a pre-processing technique enhances the performance of Whisper by improving transcription accuracy, as measured by the Word Error Rate, precision, recall, and F1 score, and by reducing the execution time, as measured by the Real-Time Factor. However, the effects of tuning the speech probability threshold in Faster-Whisper and WhisperX are limited. Future research is recommended to examine alternative voice activity detection models, test the memory usage of the models, investigate differences between Whisper implementations, and look into alternative pre-processing techniques.
dc.description.sponsorship	Utrecht University
dc.language.iso	EN
dc.subject	Considering the minimalist strategy for data pre-processing employed by the Whisper algorithm, it is conceivable that incorporating pre-processing techniques could further optimize its performance. This potential enhancement could reduce the execution time of Whisper without a considerable compromise in accuracy. This research aims to compare the performance of Whisper (baseline) with two re-implementations that incorporate the pre-processing technique Voice Activity Detection.
dc.title	Improving the Automatic Speech Recognition Model Whisper with Voice Activity Detection
dc.type.content	Master Thesis
dc.rights.accessrights	Open Access
dc.subject.keywords	Automatic Speech Recognition; ASR; Whisper; Pre-processing; Voice Activity Detection
dc.subject.courseuu	Applied Data Science
dc.thesis.id	23821
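
The abstract names Word Error Rate and Real-Time Factor as two of its evaluation metrics without defining them. The following is a minimal sketch of how these metrics are commonly computed, not the thesis's own evaluation code: WER as the word-level edit distance divided by the number of reference words, and RTF as processing time divided by audio duration. The function names and the example sentences are hypothetical, and text normalization (casing, punctuation), which real evaluations usually apply first, is assumed to have been done already.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(
                min(
                    previous[j] + 1,                       # deletion
                    current[j - 1] + 1,                    # insertion
                    previous[j - 1] + substitution_cost,   # match or substitution
                )
            )
        previous = current
    return previous[-1] / len(ref_words)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    # Hypothetical example: one substitution and one deletion against 6 reference words.
    print(word_error_rate("de kat zit op de mat", "de kat zat op mat"))
    # Hypothetical timing: 30 s of processing for 120 s of audio.
    print(real_time_factor(30.0, 120.0))
```

Under these definitions, lower values are better for both metrics, which is the sense in which the abstract reports improved WER and RTF.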

Files in this item

Name:: MScThesis_IsaDielen_Final.pdf
Size:: 1.294Mb
Format:: PDF

This item appears in the following Collection(s)

Theses
