An Empirical Evaluation of Convolutional and Recurrent Neural Networks for Lip Reading
Summary
The 3D convolutional neural network (3DCNN) and the long short-term memory network (LSTM) are both suited to video classification because they can take temporal information into account; however, the two models do so in very different ways. The aim of this work is to investigate which of the two models is better suited for automatic lip reading, and which is better suited for transfer learning. We conducted two groups of experiments. In the first group, both models were trained from scratch and tested under several conditions. The second group was designed to determine which of the two models is better suited for transfer learning: we took a 3DCNN and an LSTM pretrained in the first group of experiments and verified whether their accuracy on a different dataset improved compared to training from scratch. From the first group of experiments, we concluded that the 3DCNN is better suited for automatic lip reading because it achieves a higher test-set accuracy than the LSTM, although it takes considerably longer to train. From the second group, we concluded that the 3DCNN is, overall, also better suited for transfer learning. On the basis of all the experiments conducted, the 3DCNN appears to be the better choice for automatic lip reading under a wide range of conditions.
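To make the contrast between the two models concrete, the sketch below shows how each consumes a clip of mouth-region frames: the 3DCNN convolves jointly over time and space, whereas the LSTM reads the frames one by one and carries temporal context in its hidden state. This is an illustrative sketch only, written in PyTorch; the layer sizes, class names, and the use of flattened grayscale frames are assumptions for the example, not the exact architectures evaluated in this work.

import torch
import torch.nn as nn

class Tiny3DCNN(nn.Module):
    # Convolves jointly over time and space: the 3-D kernels see several
    # consecutive frames at once, so lip motion is captured by the filters.
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1),   # (B, 16, T, H, W)
            nn.ReLU(),
            nn.MaxPool3d(2),                               # halves T, H and W
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),                       # (B, 32, 1, 1, 1)
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, clips):                              # clips: (B, 1, T, H, W)
        return self.classifier(self.features(clips).flatten(1))

class TinyLSTM(nn.Module):
    # Reads the clip frame by frame and accumulates temporal context in its
    # hidden state; here each frame is simply flattened into a vector.
    def __init__(self, frame_dim=64 * 64, hidden_size=128, num_classes=10):
        super().__init__()
        self.lstm = nn.LSTM(frame_dim, hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, clips):                              # clips: (B, 1, T, H, W)
        frames = clips.squeeze(1).flatten(2)               # (B, T, H*W)
        _, (h_n, _) = self.lstm(frames)
        return self.classifier(h_n[-1])                    # last hidden state -> class scores

clips = torch.randn(4, 1, 16, 64, 64)                      # 4 clips of 16 grayscale 64x64 frames
print(Tiny3DCNN()(clips).shape, TinyLSTM()(clips).shape)   # both: torch.Size([4, 10])

For the transfer-learning experiments, a pretrained network can be reused on a new dataset. The snippet below shows one common recipe, freezing the pretrained trunk and retraining only the classification head; it is given as an assumption for illustration, not as the exact procedure used in this work.

pretrained = Tiny3DCNN(num_classes=10)
# pretrained.load_state_dict(torch.load("3dcnn_source.pt"))  # hypothetical checkpoint path
for p in pretrained.features.parameters():
    p.requires_grad = False                                  # freeze the pretrained trunk
pretrained.classifier = nn.Linear(32, 20)                    # new head for the target vocabulary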