
dc.rights.license CC-BY-NC-ND
dc.contributor.advisor Karnstedt-Hulpus, I.R.
dc.contributor.author Dielen, Isa
dc.date.accessioned 2023-09-06T10:08:44Z
dc.date.available 2023-09-06T10:08:44Z
dc.date.issued 2023
dc.identifier.uri https://studenttheses.uu.nl/handle/20.500.12932/45045
dc.description.abstract The Dutch National Police possesses a substantial amount of audio data, and transcribing these audio files manually for analysis is very labour-intensive. For this reason, the department TROI developed an application that uses Whisper, a state-of-the-art Automatic Speech Recognition model, to transcribe these audio files automatically. Although Whisper’s accuracy is quite high, it runs relatively slowly, which poses a challenge when large amounts of audio must be transcribed. Considering Whisper’s minimalist strategy for data pre-processing, it is conceivable that incorporating advanced pre-processing techniques could further reduce its running time without a considerable compromise in accuracy. Therefore, this research investigates the influence of the pre-processing method voice activity detection on Whisper’s performance. The experiments compare the performance of Whisper, Faster-Whisper and WhisperX on Dutch long-form audio data from two datasets: CGN and NFI-FRITS. Faster-Whisper and WhisperX each incorporate a voice activity detection model: Silero VAD and PyAnnote VAD, respectively. Additional experiments cover hyperparameter tuning and testing of the voice activity detection models. The evaluation metrics are Word Error Rate (WER), precision, recall, F1 score and Real-Time Factor (RTF). The results demonstrate that both Faster-Whisper and WhisperX outperform the baseline Whisper model. They exhibit improved WER, precision, recall, F1 score and RTF, indicating the advantages of incorporating voice activity detection models within the Whisper framework. In general, WhisperX outperforms Faster-Whisper across the different datasets and settings on almost all performance metrics. Furthermore, the effects of tuning the speech probability thresholds in Faster-Whisper and WhisperX are unclear, as they show no specific trend.
The comparison between Silero and PyAnnote VAD shows variations in precision and recall: PyAnnote VAD, as incorporated in WhisperX, performs better on precision, and Silero VAD, as incorporated in Faster-Whisper, performs better on recall. In conclusion, incorporating a voice activity detection model as a pre-processing technique enhances the performance of Whisper by improving transcription accuracy, measured by Word Error Rate, precision, recall and F1 score, and by reducing execution time, measured by the Real-Time Factor. However, the effects of tuning the speech probability threshold in Faster-Whisper and WhisperX are limited. Future research is recommended to examine alternative voice activity detection models, test the memory usage of the models, investigate differences between Whisper implementations, and look into alternative pre-processing techniques.
dc.description.sponsorship Utrecht University
dc.language.iso EN
dc.subject Considering the minimalist strategy for data pre-processing employed by the Whisper algorithm, it is conceivable that incorporating pre-processing techniques could further optimize its performance. This potential enhancement could reduce the execution time of Whisper without a considerable compromise in accuracy. This research aims to compare the performance of Whisper (baseline) with two re-implementations that incorporate the pre-processing technique Voice Activity Detection.
dc.title Improving the Automatic Speech Recognition Model Whisper with Voice Activity Detection
dc.type.content Master Thesis
dc.rights.accessrights Open Access
dc.subject.keywords Automatic Speech Recognition; ASR; Whisper; Pre-processing; Voice Activity Detection
dc.subject.courseuu Applied Data Science
dc.thesis.id 23821

