Implementing Wav2Vec 2.0 into an Automated Reading Tutor
Summary
The number of low-literate adults in the Netherlands has been steadily increasing over the past decades.
Research shows that proper reading instruction requires repeated individualized feedback. However, teachers
often do not have the time or resources to provide this. Computer-assisted reading tutors could provide a
solution. Most current systems show good results at detecting word-level errors, but struggle to identify
mispronunciations. Recent studies have shown that the use of large semi-supervised models like Wav2Vec
2.0 could improve the performance of mispronunciation detection models. The goal of this thesis is to research
the effectiveness of Wav2Vec 2.0 for the task of mispronunciation detection in Dutch children, and to implement
it into an automated reading tutor. First, two types of Wav2Vec 2.0 models were created for classification
of mispronunciation data from the speech therapy domain. Specifically, the task was target phone detection
(TPD), where the pronunciation of each phone in a word is assessed individually. The first model performs
end-to-end phonetic transcription. The second model pools the Wav2Vec 2.0 embeddings over the time
dimension and then attempts to classify mispronunciations directly. Both of these models were then
integrated into a reading error detection (RED) model to see whether the mispronunciation detection
aspect of the RED model could be improved. For TPD, the models significantly improved over a baseline
goodness of pronunciation (GOP) model. For RED, the use of Wav2Vec 2.0 led to a small improvement for
the classification of phone-level errors.
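
The second model type described above (pooling over the time dimension, then direct classification) can be
illustrated with a minimal sketch. It is not the implementation used in this thesis. It assumes mean pooling
(the summary does not state which pooling operation was used), a single logistic output layer, and random
arrays standing in for real Wav2Vec 2.0 embeddings. The function names, the embedding size of 768, and the
frame count of 49 are all hypothetical choices made for illustration.

```python
import numpy as np


def mean_pool_over_time(frames: np.ndarray) -> np.ndarray:
    """Collapse a (T, D) sequence of frame embeddings into one (D,) vector."""
    return frames.mean(axis=0)


def mispronunciation_probability(
    frames: np.ndarray, weights: np.ndarray, bias: float
) -> float:
    """Pool the frames, then apply a logistic classifier to the pooled vector.

    The single linear layer is an assumption made for this sketch; a real
    classifier head would be trained jointly with or on top of the encoder.
    """
    pooled = mean_pool_over_time(frames)
    logit = float(pooled @ weights + bias)
    return 1.0 / (1.0 + np.exp(-logit))


# Random stand-ins for the embeddings of one audio segment (hypothetical sizes).
rng = np.random.default_rng(0)
frames = rng.normal(size=(49, 768))          # T = 49 frames, D = 768 features
weights = rng.normal(scale=0.01, size=768)   # untrained, illustrative weights
prob = mispronunciation_probability(frames, weights, bias=0.0)
```

Because pooling removes the time axis, segments of different durations all map to a vector of the same size,
which is what allows a fixed-size classifier to be applied directly to the pooled embedding.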