Enhancing Dutch Audio Transcription through Integration of Speaker Diarization into the Automatic Speech Recognition model Whisper
Summary
The Dutch National Police faces the challenge of efficiently processing and transcribing the large volume of audio data collected during investigations. To assist detectives in their work, artificial intelligence (AI) models such as Whisper, an automatic speech recognition (ASR) model, are integrated into user-friendly applications. However, Whisper cannot distinguish between speakers, which limits its usefulness in scenarios involving multiple speakers and overlapping speech. This thesis evaluates the speaker diarization pipelines
from PyAnnote and NeMo on the VoxConverse and NFI-FRITS datasets. Additionally, experiments are conducted to improve the performance of both pipelines on these datasets by selecting appropriate hyperparameter settings. By incorporating a speaker diarization system alongside Whisper, this work aims to enhance the robustness and comprehensiveness of an existing speech-to-text application. The evaluation shows promising results: hyperparameter tuning and domain-specific configurations substantially reduce the Diarization Error Rate (DER)
for both datasets. PyAnnote benefits from adjusted segmentation and clustering thresholds, as well as changes in the clustering method. NeMo’s clustering diarizer outperforms the neural diarizer, and domain-specific configurations enhance performance. In general, NeMo demonstrates superior performance on both datasets
in terms of DER compared to PyAnnote. However, this improved performance comes at the cost of higher computational demands, both in runtime and in memory usage. By augmenting Whisper with speaker diarization, investigators can analyze transcribed text attributed to individual speakers, improving the accuracy and efficiency of audio data analysis. Further research should focus on compiling a larger domain-specific dataset with varying numbers of speakers to enable more targeted hyperparameter tuning and achieve
better performance. Additionally, optimizing resource usage for the better-performing NeMo pipeline would improve its speed and memory efficiency. Overall, this research contributes to advancing speaker diarization methods alongside the Whisper ASR model. These advancements can lead to more effective speech analysis tools for law enforcement and other fields that rely on accurate and comprehensive audio processing. The code is available at https://github.com/anouk1512/MSc_WhisperSpeakerDiarization.git.
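To illustrate how such a combined system can work in practice, the sketch below attributes Whisper transcript segments to speakers found by a PyAnnote diarization pipeline. This is a minimal sketch only: the model names, audio path, access token, and the overlap-based merging heuristic are assumptions made for illustration, not the exact configuration used in this thesis.

```python
# Minimal sketch: attribute Whisper transcript segments to speakers found by a
# PyAnnote diarization pipeline. Model names, the audio path, the token, and the
# overlap-based merging heuristic are illustrative assumptions, not the exact
# configuration used in the thesis.
import whisper
from pyannote.audio import Pipeline

AUDIO = "interview.wav"  # hypothetical input file

# 1) Transcribe with Whisper: yields segments with start/end times and text.
asr_model = whisper.load_model("small")
transcript = asr_model.transcribe(AUDIO)

# 2) Diarize with PyAnnote: yields speaker turns (who speaks when).
diarizer = Pipeline.from_pretrained(
    "pyannote/speaker-diarization",  # pretrained pipeline name (version may differ)
    use_auth_token="HF_TOKEN",       # Hugging Face token assumed
)
diarization = diarizer(AUDIO)
turns = [(turn.start, turn.end, speaker)
         for turn, _, speaker in diarization.itertracks(yield_label=True)]

def dominant_speaker(seg_start, seg_end):
    """Return the speaker whose turns overlap this ASR segment the most."""
    overlap = {}
    for start, end, speaker in turns:
        shared = min(seg_end, end) - max(seg_start, start)
        if shared > 0:
            overlap[speaker] = overlap.get(speaker, 0.0) + shared
    return max(overlap, key=overlap.get) if overlap else "UNKNOWN"

# 3) Merge: label each transcribed segment with its dominant speaker.
for seg in transcript["segments"]:
    speaker = dominant_speaker(seg["start"], seg["end"])
    print(f"[{seg['start']:7.2f}-{seg['end']:7.2f}] {speaker}: {seg['text'].strip()}")
```

Assigning each segment to the speaker with the largest temporal overlap is a deliberately simple heuristic; overlapping speech and short back-channel utterances may still be attributed to a single speaker, which is one of the limitations this thesis examines.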