dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Nazarov, Aleksei
dc.contributor.author: Hamandouche, Daniel
dc.date.accessioned: 2022-04-08T00:00:42Z
dc.date.available: 2022-04-08T00:00:42Z
dc.date.issued: 2022
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/41469
dc.description.abstract: Voice activity detection is the task of detecting the presence or absence of speech in audio files. Obstructions such as noisy environments make this task harder. Here, a noisy environment is defined as a recording with a low signal-to-noise ratio (SNR). This thesis aims to detect speech in very noisy environments with an SNR of -10 dB and lower. Because current deep learning methods were not designed for such noisy environments and therefore do not perform well in them, a different model is adapted to the task. Wav2vec2.0 was created for speech recognition and uses raw audio as input. This thesis adopted and fine-tuned Wav2vec2.0 for speech detection and extended it with spectrograms as a second input. The resulting model is compared to two existing speech detection models, one using an architecture based on LeNet-5 and one using a U-net-shaped architecture. All three models are tested on the QUT-NOISE-TIMIT dataset. The results show that the Wav2vec2.0 downstream model performs best at all noise levels. For high noise, it had the lowest half-total error rate: 7.71% at an SNR of -10 dB and 18.56% at an SNR of -15 dB. However, all models missed at least 40% of the speech at the higher noise levels, so no model is stable. Furthermore, a sub-question investigated whether the results benefit follow-up tasks. To test this, the predictions of each model were used as a pre-processing step that removed all segments without speech. The results show that Wav2vec2.0 not only improves speech detection in high-noise environments but that this improvement also carries over to speech emotion recognition in these environments. Additionally, this thesis shows that both self-supervised learning methods and raw audio input are beneficial to the task of speech detection.
dc.description.sponsorship: Utrecht University
dc.language.iso: EN
dc.subject: In this thesis, Wav2vec was adapted to the task of detecting speech in noisy environments. The outcomes were compared with a CNN and a U-net model. Additionally, Speech Emotion Recognition was used as a follow-up task to assess the benefit of the Wav2vec model as a pre-processing step.
dc.title: Speech Detection for noisy audio files
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.keywords: Voice Activity Detection; Wav2vec; U-net; Speech Emotion Recognition
dc.subject.courseuu: Artificial Intelligence
dc.thesis.id: 3265

