        •   Utrecht University Student Theses Repository Home
        • UU Theses Repository
        • Theses
        • View Item


        Enhancing Dutch Audio Transcription through Integration of Speaker Diarization into the Automatic Speech Recognition model Whisper

        View/Open
        1429264_MUL_Thesis_FINAL_VERSION.pdf (672.8Kb)
        Publication date
        2023
        Author
        Mul, Anouk
        Summary
        The Dutch National Police faces the challenge of efficiently processing and transcribing the large amount of audio data collected during investigations. To assist detectives in their work, artificial intelligence (AI) models such as Whisper, an automatic speech recognition (ASR) model, are integrated into user-friendly applications. However, Whisper cannot distinguish between speakers, which limits its use on recordings with multiple speakers and overlapping speech. This thesis explores the performance of speaker diarization pipelines from PyAnnote and NeMo on the VoxConverse and NFI-FRITS datasets. Additionally, experiments are conducted to improve the performance of the pipelines on both datasets by choosing appropriate hyperparameter settings. By incorporating a speaker diarization system alongside Whisper, the aim is to enhance the robustness and comprehensiveness of an existing speech-to-text application. The evaluation reveals promising results: hyperparameter tuning and domain-specific configurations significantly improve the Diarization Error Rate (DER) on both datasets. PyAnnote benefits from adjusted segmentation and clustering thresholds, as well as from a change of clustering method. NeMo’s clustering diarizer outperforms its neural diarizer, and domain-specific configurations enhance performance further. In general, NeMo achieves a better DER than PyAnnote on both datasets. However, this improved performance comes at the cost of increased computational requirements in terms of speed and memory usage. By augmenting Whisper with speaker diarization, investigators can efficiently analyze transcribed text ascribed to individual speakers, improving the accuracy and efficiency of audio data analysis.
Further research should focus on compiling a larger domain-specific dataset with varying numbers of speakers, to enable more specific hyperparameter tuning and better performance. Additionally, optimizing resource usage for the better-performing NeMo model would improve its speed and memory efficiency. Overall, this research contributes to advancing speaker diarization methods alongside the Whisper ASR model. These advancements will lead to more effective speech analysis tools for law enforcement and other fields that rely on accurate and comprehensive audio processing. The code is available at https://github.com/anouk1512/MSc_WhisperSpeakerDiarization.git.
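        As an illustration of what "transcribed text ascribed to individual speakers" can mean in practice, the following is a minimal sketch of one common way to combine the two outputs: each transcript segment is given the speaker whose diarization turns overlap it the longest. It is not taken from the thesis or its repository. The `Turn` and `Segment` types, the `assign_speakers` function, and the `UNKNOWN` fallback label are hypothetical names introduced here, and the timestamps are made up; no Whisper, PyAnnote, or NeMo API is called.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """A diarization turn: one speaker talking from start to end (seconds)."""
    start: float
    end: float
    speaker: str


@dataclass
class Segment:
    """A transcript segment with timestamps (seconds), as an ASR model might emit."""
    start: float
    end: float
    text: str


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns, unknown="UNKNOWN"):
    """Label each segment with the speaker that overlaps it the longest.

    Assumption for this sketch: a segment belongs to exactly one speaker.
    Overlapping speech inside a segment is therefore collapsed to the
    dominant speaker, and segments with no overlapping turn get `unknown`.
    """
    labelled = []
    for seg in segments:
        totals = {}  # speaker -> total overlapping seconds with this segment
        for turn in turns:
            shared = overlap(seg.start, seg.end, turn.start, turn.end)
            if shared > 0:
                totals[turn.speaker] = totals.get(turn.speaker, 0.0) + shared
        speaker = max(totals, key=totals.get) if totals else unknown
        labelled.append((speaker, seg.text))
    return labelled


# Made-up example data, for illustration only.
turns = [Turn(0.0, 4.0, "SPEAKER_00"), Turn(4.0, 9.0, "SPEAKER_01")]
segments = [
    Segment(0.5, 3.5, "first utterance"),
    Segment(3.0, 8.0, "second utterance"),
    Segment(10.0, 11.0, "third utterance"),
]
result = assign_speakers(segments, turns)
for speaker, text in result:
    print(f"{speaker}: {text}")
```

        A per-segment assignment like this is simple but coarse; assigning speakers per word, where word-level timestamps are available, would handle speaker changes inside a segment better.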
        URI
        https://studenttheses.uu.nl/handle/20.500.12932/44643
        Collections
        • Theses