Improving the effectiveness of different Automatic Speech Recognition models with hyperparameter tuning

Acosta, Christian

dc.rights.license	CC-BY-NC-ND
dc.contributor.advisor	Karnstedt-Hulpus, I.R.
dc.contributor.author	Acosta, Christian
dc.date.accessioned	2023-09-06T09:40:36Z
dc.date.available	2023-09-06T09:40:36Z
dc.date.issued	2023
dc.identifier.uri	https://studenttheses.uu.nl/handle/20.500.12932/44959
dc.description.abstract	Automatic Speech Recognition (ASR) models have shown great progress in recent years. Whisper is one of the latest models, showing state-of-the-art performance on a broad range of unseen datasets. This makes it useful for many applications, such as converting audio files into text transcripts. Detectives of the National Police Corps have a large amount of audio data to process for their investigations, and manual processing is tedious and resource intensive. Whisper can help speed up investigations and alleviate the workload. While Whisper performs well out of the box, its performance can be improved further. By tuning hyperparameters and comparing different implementations of Whisper, we optimized processing time, memory usage, and accuracy. First, reducing computational precision improved performance in all models tested. Second, reducing the beam size toward a greedier decoding strategy reduced processing time and memory usage with minimal influence on accuracy. Third, larger batch sizes decreased processing time and increased accuracy, but also increased memory usage. Lastly, implementing Voice Activity Detection increased accuracy and decreased processing time without increasing memory usage. We conclude that Faster-Whisper is the best-performing model overall for the current use case, as it offers the best trade-off between processing time, memory usage, and accuracy. Consequently, it allows for the greatest transcription throughput when multiple instances of the model are run in parallel.
dc.description.sponsorship	Utrecht University
dc.language.iso	EN
dc.subject	Improvement of the performance and accuracy of the Whisper Automatic Speech Recognition model and its variants through tuning of various hyperparameters. The aim of this study was to increase processing throughput without negatively affecting model accuracy.
dc.title	Improving the effectiveness of different Automatic Speech Recognition models with hyperparameter tuning
dc.type.content	Master Thesis
dc.rights.accessrights	Open Access
dc.subject.keywords	Automatic Speech Recognition; Speech to Text; Whisper
dc.subject.courseuu	Applied Data Science
dc.thesis.id	23516
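
The abstract's closing point, that a lower memory footprint allows more parallel instances and therefore higher total throughput, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the Profile class, the function names, the configuration names, and all numbers are hypothetical placeholders, not measurements or code from the thesis. It also assumes ideal linear scaling across instances, which real hardware may not deliver.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    """Hypothetical per-instance measurements; placeholders, not thesis results."""
    name: str
    seconds_per_audio_hour: float  # processing time for one hour of audio
    memory_gb: float               # peak GPU memory used by one model instance


def parallel_instances(profile: Profile, gpu_memory_gb: float) -> int:
    """Number of instances that fit in GPU memory at once (ignores other overhead)."""
    return int(gpu_memory_gb // profile.memory_gb)


def throughput(profile: Profile, gpu_memory_gb: float) -> float:
    """Audio hours transcribed per wall-clock hour, assuming ideal linear scaling."""
    per_instance = 3600.0 / profile.seconds_per_audio_hour
    return parallel_instances(profile, gpu_memory_gb) * per_instance


if __name__ == "__main__":
    # Illustrative values only: config_a is faster per instance,
    # config_b is slower per instance but uses less memory.
    candidates = [
        Profile("config_a", seconds_per_audio_hour=300.0, memory_gb=10.0),
        Profile("config_b", seconds_per_audio_hour=400.0, memory_gb=4.0),
    ]
    for p in candidates:
        print(p.name, parallel_instances(p, 24.0), round(throughput(p, 24.0), 1))
```

With these placeholder numbers, the configuration that is slower per instance still achieves the higher total throughput because more copies fit on the same device, which is the kind of trade-off the abstract describes.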

Files in this item

Name: Master thesis ADS_Christian ...
Size: 852.3Kb
Format: PDF


This item appears in the following Collection(s)

Theses
