Speech Detection for noisy audio files

Hamandouche, Daniel

dc.rights.license	CC-BY-NC-ND
dc.contributor.advisor	Nazarov, Aleksei
dc.contributor.author	Hamandouche, Daniel
dc.date.accessioned	2022-04-08T00:00:42Z
dc.date.available	2022-04-08T00:00:42Z
dc.date.issued	2022
dc.identifier.uri	https://studenttheses.uu.nl/handle/20.500.12932/41469
dc.description.abstract	Voice activity detection is the task of detecting the presence or absence of speech in audio files. Obstructions such as noisy environments make this task harder. Here, a noisy environment is defined as a recording with a low signal-to-noise ratio (SNR). This thesis aims to detect speech in very noisy environments with an SNR of -10 dB and lower. Because current deep learning methods were not designed for such environments and therefore do not perform well in them, a new model is adapted to the task. Wav2vec2.0 was created for speech recognition and uses raw audio as input. This thesis adopted and fine-tuned Wav2vec2.0 for speech detection and extended it by using spectrograms as a second input. The resulting model is compared to two existing speech detection models, one using an architecture based on LeNet-5 and one using a U-net-shaped architecture. All three models are tested on the QUT-NOISE-TIMIT dataset. The results show that the Wav2vec2.0 downstream model performs best at all noise levels. Under high noise, Wav2vec2.0 had the lowest half-total error rate: 7.71% at an SNR of -10 dB and 18.56% at an SNR of -15 dB. All models missed at least 40% of speech at higher noise levels, so no model is stable. Furthermore, a sub-question investigated whether the results benefit follow-up tasks. To test this, the predictions of each model were used as a pre-processing step that removes all segments without speech. The results show that Wav2vec2.0 not only improves speech detection in high-noise environments, but that this improvement also carries over to speech emotion recognition in these environments. Additionally, this thesis shows that both self-supervised learning methods and the use of raw audio are beneficial to the task of speech detection.
dc.description.sponsorship	Utrecht University
dc.language.iso	EN
dc.subject	In this thesis, Wav2vec was adapted to the task of detecting speech in noisy environments. The outcomes were compared with a CNN and a U-net model. Additionally, Speech Emotion Recognition was used as a follow-up task to see the benefit of the Wav2vec model as a pre-processing step.
dc.title	Speech Detection for noisy audio files
dc.type.content	Master Thesis
dc.rights.accessrights	Open Access
dc.subject.keywords	Voice Activity Detection; Wav2vec; U-net; Speech Emotion Recognition
dc.subject.courseuu	Artificial Intelligence
dc.thesis.id	3265
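
The abstract reports results as a half-total error rate at a given signal-to-noise ratio. As a reading aid, the following is a minimal sketch of how these two quantities are commonly computed, assuming frame-level binary labels (1 = speech, 0 = non-speech) and the usual definition of the half-total error rate as the mean of the miss rate and the false-alarm rate. The function names and the toy labels are hypothetical and are not taken from the thesis, whose exact evaluation procedure may differ.

```python
import math


def half_total_error_rate(reference, predicted):
    """Mean of miss rate and false-alarm rate over frame labels (1 = speech)."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")

    speech_frames = sum(1 for r in reference if r == 1)
    non_speech_frames = len(reference) - speech_frames
    if speech_frames == 0 or non_speech_frames == 0:
        raise ValueError("both speech and non-speech frames are required")

    # Speech frames that the detector labelled as non-speech.
    missed = sum(1 for r, p in zip(reference, predicted) if r == 1 and p == 0)
    # Non-speech frames that the detector labelled as speech.
    false_alarms = sum(1 for r, p in zip(reference, predicted) if r == 0 and p == 1)

    miss_rate = missed / speech_frames
    false_alarm_rate = false_alarms / non_speech_frames
    return (miss_rate + false_alarm_rate) / 2


def snr_db(signal_power, noise_power):
    """Signal-to-noise ratio in decibels from average signal and noise power."""
    return 10 * math.log10(signal_power / noise_power)


# Toy example (hypothetical labels, not thesis data).
reference = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 0, 1]
hter = half_total_error_rate(reference, predicted)
noisy_snr = snr_db(signal_power=1.0, noise_power=10.0)
```

In the toy example, two of four speech frames are missed and one of four non-speech frames is a false alarm, so the half-total error rate is (0.5 + 0.25) / 2 = 0.375. A noise power ten times the signal power corresponds to an SNR of -10 dB, the upper bound of the noise range the thesis targets.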

Files in this item

Name: Master Thesis Daniel Hamandouc ...
Size: 1.750Mb
Format: PDF

This item appears in the following Collection(s)

Theses
