Towards Explaining Neural Networks
Summary
In recent years, neural networks (NNs) and deep learning (DL) have achieved exceptional performance in applications such as computer vision, natural language processing, audio recognition, and machine translation. However, NN predictors are usually not interpretable in practice, and their learning mechanism is not yet well understood theoretically, which is why NNs are often called "black boxes". To investigate how NNs work, we perform several types of empirical analysis on models trained for a simple supervised classification task on one-dimensional signals, including analysis of hidden-layer activations, visualization by gradient ascent, experiments on learning noisy labels, and measurement of distances in the high-dimensional feature space. In practice, the NN models surpass traditional signal-processing methods on this task. Regarding how the NNs work, we make three observations. First, for certain NN structures with limited expressivity, the solution to this task can be interpreted directly from the learned weights. Second, the NNs empirically learn a smoothed first-derivative extractor on this task, from which we suggest that NN models learn "principal subpatterns". Third, by measuring the inner- and inter-class distances of the data samples, we find that networks trained on real, structured data use their internal hidden layers to shrink the activation representations of same-class samples into a narrow range of encodings, whereas networks fitting random noise labels resort to brute-force memorization. This difference in behaviour also offers a plausible answer to the question of why over-parameterized NNs are still able to generalize.
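To make the gradient-ascent visualization mentioned above concrete, the following is a minimal sketch of the technique, not the exact procedure or architecture used in this work: starting from a blank one-dimensional signal, we repeatedly ascend the gradient of a chosen class logit with respect to the input, so the result shows the input pattern that most strongly activates that class. The network `SmallNet`, the signal length, and the optimizer settings are illustrative assumptions.

```python
# Sketch of gradient-ascent input visualization for a 1-D signal classifier.
# SmallNet, SIGNAL_LEN and the hyperparameters are illustrative, not the
# models or settings used in the study.
import torch
import torch.nn as nn

SIGNAL_LEN, NUM_CLASSES = 128, 2

class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(SIGNAL_LEN, 64), nn.ReLU(),
            nn.Linear(64, NUM_CLASSES),
        )

    def forward(self, x):
        return self.net(x)

def visualize_class(model, target_class, steps=200, lr=0.1):
    """Ascend the gradient of the target-class logit w.r.t. the input signal."""
    x = torch.zeros(1, SIGNAL_LEN, requires_grad=True)
    optimizer = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        logit = model(x)[0, target_class]
        (-logit).backward()      # maximizing the logit = minimizing its negative
        optimizer.step()
    return x.detach().squeeze(0)

model = SmallNet()               # in practice, a trained model would be loaded here
pattern = visualize_class(model, target_class=1)
print(pattern.shape)             # the input pattern the chosen class "prefers"
```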
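Similarly, the inner- and inter-class distance measurement can be sketched as follows. This is only one plausible formulation (mean pairwise Euclidean distance within a class versus mean distance between class centroids); the arrays `activations` and `labels` stand in for hidden-layer features extracted from a trained network and their class labels.

```python
# Sketch of an inner-/inter-class distance measurement on hidden-layer
# activations; the pairwise-Euclidean formulation is an assumption.
import numpy as np

def class_distances(activations: np.ndarray, labels: np.ndarray):
    """Return (mean within-class pairwise distance, mean distance between class centroids)."""
    classes = np.unique(labels)
    centroids = np.stack([activations[labels == c].mean(axis=0) for c in classes])

    inner = []
    for c in classes:
        feats = activations[labels == c]
        d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
        inner.append(d[~np.eye(len(feats), dtype=bool)].mean())  # exclude self-pairs

    d = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    inter = d[~np.eye(len(classes), dtype=bool)].mean()
    return float(np.mean(inner)), float(inter)

# Toy usage with random "activations" for two classes of 1-D signals.
rng = np.random.default_rng(0)
acts = rng.normal(size=(100, 64))
labs = rng.integers(0, 2, size=100)
print(class_distances(acts, labs))
```

A shrinking within-class distance relative to the between-class distance across layers would indicate the compression behaviour described above for networks trained on structured data.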