dc.description.abstract | Voice activity detection is the task of detecting the presence or absence of speech in audio recordings. Obstructions such as noisy environments make this task harder. Here, a noisy environment is defined as a recording with a low signal-to-noise ratio (SNR).
This thesis aims to detect speech in very noisy environments with an SNR of -10 dB and lower. Because current deep learning methods were not designed for such noise levels and therefore perform poorly, a new model is adapted to the task. Wav2vec2.0 was created for speech recognition and uses raw audio as input. This thesis adopts and fine-tunes Wav2vec2.0 for speech detection and extends it with spectrograms as a second input. The resulting model is compared to two existing speech detection models, one using an architecture based on LeNet-5 and one using a U-Net-shaped architecture. All three models are tested on the QUT-NOISE-TIMIT dataset. The results show that the Wav2vec2.0 downstream model performs best at all noise levels. Under high noise, it achieved the lowest half-total error rate: 7.71% at an SNR of -10 dB and 18.56% at an SNR of -15 dB. However, all models missed at least 40% of speech at higher noise levels, so no model can be considered stable.
Furthermore, a sub-question investigated whether these results benefit follow-up tasks. To that end, the predictions of each model were used in a pre-processing step that removes all segments without speech.
The results show that Wav2vec2.0 not only improves speech detection in high-noise environments, but that this improvement also benefits speech emotion recognition in these environments. Additionally, this thesis shows that both self-supervised learning and raw audio input are beneficial to the task of speech detection. | |