dc.description.abstract | Automatic Speech Recognition (ASR) models have shown great progress in recent years. Whisper is one of the latest models, showing state-of-the-art performance on a broad range of unseen datasets. This makes it a useful model for a broad range of applications, such as converting audio files into text transcripts.
Detectives of the National Police Corps have a large amount of audio data to process for their investigations. Manual processing is tedious and resource-intensive. Whisper can be a useful tool for speeding up investigations and alleviating the workload. While Whisper performs well out of the box, its performance can be improved further. Through hyperparameter tuning and a comparison of different Whisper implementations, we optimized processing time, memory usage, and accuracy.
First, we show that reducing computational precision improved performance in all models tested. Second, reducing the beam size toward a greedier decoding strategy reduced processing time and memory usage with minimal influence on accuracy. Third, larger batch sizes decreased processing time and increased accuracy, but also increased memory usage. Lastly, implementing Voice Activity Detection increased accuracy and decreased processing time without increasing memory usage.
We conclude that Faster-Whisper is the best-performing model overall for the current use case, as it offers the best trade-off between processing time, memory usage, and accuracy. Consequently, it allows for the greatest transcription throughput when multiple instances of the model are run in parallel. | |