Improving the Automatic Speech Recognition Model Whisper with Voice Activity Detection
Summary
The Dutch National Police holds a substantial amount of audio data, and transcribing these audio files manually for analysis is very labour-intensive. For this reason, the TROI department developed an application that uses Whisper, a state-of-the-art Automatic Speech Recognition model, to transcribe these files automatically. Although Whisper’s accuracy is quite high, it is relatively slow, which poses a challenge when large numbers of audio files need to be transcribed. Given Whisper’s minimalist approach to data pre-processing, it is conceivable that more advanced pre-processing techniques could reduce its running time without a considerable compromise in accuracy. This research therefore investigates the influence of the pre-processing method voice activity detection on Whisper’s performance. The experiments compare the performance of Whisper, Faster-Whisper and WhisperX on Dutch long-form audio data from two datasets: CGN and NFI-FRITS. Faster-Whisper and WhisperX each incorporate a voice activity detection model: Silero VAD and PyAnnote VAD, respectively. Additional experiments cover hyperparameter tuning and testing of the voice activity detection models themselves. The evaluation metrics are Word Error Rate (WER), precision, recall, F1-score and Real-Time Factor (RTF).
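To make two of these metrics concrete, the following is a minimal sketch of how Word Error Rate and Real-Time Factor are commonly computed. It is an illustration under stated assumptions, not the evaluation code used in this research: the function names are hypothetical, the text is split on whitespace without any normalisation, and the Real-Time Factor is taken as processing time divided by audio duration, so lower values mean faster transcription.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are obtained by whitespace splitting; no casing or
    punctuation normalisation is applied in this sketch.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration (assumed definition)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds
```

For example, a hypothesis with one substituted and one missing word against a six-word reference gives a WER of 2/6, and transcribing 120 seconds of audio in 30 seconds gives an RTF of 0.25.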
The results demonstrate that both Faster-Whisper and WhisperX outperform the baseline Whisper model. They exhibit improved WER, precision, recall, F1-score and RTF, indicating the advantages of incorporating voice activity detection models within the Whisper framework. In general, WhisperX outperforms Faster-Whisper across the different datasets and settings on almost all performance metrics. The effects of tuning the speech probability thresholds in Faster-Whisper and WhisperX are unclear, as they show no specific trend. The comparison between Silero VAD and PyAnnote VAD shows differences in precision and recall: PyAnnote VAD, as incorporated in WhisperX, performs better on precision, whereas Silero VAD, as incorporated in Faster-Whisper, performs better on recall. In conclusion, incorporating a voice activity detection model as a pre-processing technique enhances the performance of Whisper by improving transcription accuracy, measured by Word Error Rate, precision, recall and F1-score, and by reducing execution time, measured by the Real-Time Factor. However, the effects of tuning the speech probability threshold in Faster-Whisper and WhisperX are limited. Future research is recommended to examine alternative voice activity detection models, test the memory usage of the models, investigate differences between Whisper implementations, and look into alternative pre-processing techniques.
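As an illustration of what a speech probability threshold does, the sketch below turns per-frame speech probabilities into speech segments. It is a deliberately simplified stand-in and not the logic of Silero VAD or PyAnnote VAD, which apply further post-processing; the function name, the default threshold of 0.5 and the frame length of 0.032 seconds are assumptions chosen for the example.

```python
def speech_segments(probs, threshold=0.5, frame_seconds=0.032):
    """Return (start, end) times in seconds of runs of frames whose
    speech probability is at least `threshold`.

    Hypothetical helper: plain thresholding only, without the smoothing,
    padding or minimum-duration rules that real VAD models add.
    """
    segments = []
    start = None  # index of the first frame of the current speech run
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i
        elif p < threshold and start is not None:
            segments.append((start * frame_seconds, i * frame_seconds))
            start = None
    if start is not None:
        # The audio ends while still inside a speech run.
        segments.append((start * frame_seconds, len(probs) * frame_seconds))
    return segments


# Usage with one-second frames, chosen only to keep the numbers readable.
probs = [0.1, 0.6, 0.9, 0.4, 0.2, 0.7, 0.8]
print(speech_segments(probs, threshold=0.5, frame_seconds=1.0))
print(speech_segments(probs, threshold=0.75, frame_seconds=1.0))
```

In this simplified setting, raising the threshold can only shrink the set of frames passed on to the transcription model, which is why the threshold is a natural candidate for tuning the balance between discarded speech and retained non-speech.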