Applying Image Recognition to Automatic Speech Recognition: Determining Suitability of Spectrograms for Training a Deep Neural Network for Speech Recognition
In speech recognition, neural networks are used to recognise the sequence of phonemes in an audio signal. These networks are trained on audio data pre-processed into some type of spectral vector. We present an alternative method that pre-processes speech utterances into visual representations, called spectrograms, and trains a neural network suited to image recognition to identify phonemes. The resulting network correctly classified 99.73% of a set of vowels containing samples of ‘iy’, ‘ah’ and ‘uw’, 91.87% of a set of vowels containing samples of ‘iy’, ‘ih’ and ‘eh’, and 75.97% of the full dataset of twelve vowels. These results show that using image recognition in automatic speech recognition is worth investigating further.
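To make the pre-processing step concrete, the following minimal sketch shows one common way to turn an utterance into a spectrogram image that an image-recognition network could take as input. It is not the pipeline used in this work: the function names `log_spectrogram` and `to_image`, the 16 kHz sampling rate, the 25 ms frame length and the 10 ms hop are all illustrative assumptions.

```python
import numpy as np

def log_spectrogram(signal, frame_len=400, hop=160, eps=1e-10):
    """Log-power spectrogram via a short-time Fourier transform.

    frame_len=400 and hop=160 correspond to 25 ms frames with a 10 ms hop
    at an assumed 16 kHz sampling rate (illustrative values, not taken
    from the source).
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping frames and apply the window.
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # Power spectrum of each frame (real-input FFT keeps non-negative freqs).
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Convert to decibels; rows are frequency bins, columns are time frames.
    return 10.0 * np.log10(power + eps).T

def to_image(spec):
    """Rescale a spectrogram to an 8-bit greyscale image array."""
    lo, hi = spec.min(), spec.max()
    scaled = (spec - lo) / max(hi - lo, 1e-12)
    return np.round(255 * scaled).astype(np.uint8)

# Usage with a synthetic 1-second, 1 kHz tone standing in for a vowel sample.
sample_rate = 16000  # assumed sampling rate
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)
image = to_image(log_spectrogram(tone))
```

The resulting two-dimensional `uint8` array can be saved or fed to an image classifier like any other greyscale image. The log scaling is a design choice that compresses the large dynamic range of speech energy so that weaker spectral components remain visible.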