## A Convolutional Neural Attention Approach to the Identifier Naming Problem in Program Code

##### Summary

Conditional random fields were previously used by Raychev, Vechev, and Krause (2015) to address the VarNaming problem (Allamanis, Barr, Bird, & Sutton, 2014), which is defined as the problem of assigning good, mutually consistent names to the locally defined variable identifiers within a piece of source code. The conditional random field model assigns a probability to each possible assignment of names to the variable identifiers and then searches for the assignment of maximum probability. By design, the implementation of Raychev et al. (2015) makes some locality assumptions during inference, which might preclude it from obtaining a more global picture of the optimal name for each identifier.

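
To make the notion of a maximum-probability assignment concrete, the following is a minimal sketch under stated assumptions, not the inference procedure of Raychev et al. (2015). The identifiers, candidate names, and all scores are hypothetical. Because the probability a conditional random field assigns is proportional to the exponential of a total score, maximizing the score also maximizes the probability. The sketch compares exhaustive joint search with naming each identifier in isolation, which is only an extreme illustration of how limited context can miss a mutually consistent naming.

```python
from itertools import product

# Hypothetical toy problem: two unnamed identifiers and their candidate names.
CANDIDATES = {"v0": ["i", "item"], "v1": ["n", "items"]}

# Hypothetical unary scores: how well a name fits one identifier on its own.
UNARY = {
    ("v0", "i"): 1.0, ("v0", "item"): 0.8,
    ("v1", "n"): 0.9, ("v1", "items"): 0.7,
}

# Hypothetical pairwise scores: how well two names fit together (default 0).
PAIRWISE = {("i", "n"): 0.5, ("item", "items"): 1.5}


def score(assignment):
    """Total score of a full assignment: unary terms plus the pairwise term."""
    total = sum(UNARY[(var, name)] for var, name in assignment.items())
    total += PAIRWISE.get((assignment["v0"], assignment["v1"]), 0.0)
    return total


def map_assignment():
    """Brute-force search for the highest-scoring joint assignment."""
    variables = sorted(CANDIDATES)
    best = None
    for names in product(*(CANDIDATES[v] for v in variables)):
        assignment = dict(zip(variables, names))
        if best is None or score(assignment) > score(best):
            best = assignment
    return best


def independent_assignment():
    """Pick the best name for each identifier using unary scores only."""
    return {
        var: max(names, key=lambda name: UNARY[(var, name)])
        for var, names in CANDIDATES.items()
    }
```

With these made-up scores, the joint search prefers the consistent pair `item`/`items`, whereas the independent choice settles on `i`/`n`, which has a lower total score. Exhaustive search is only feasible for toy inputs such as this one.
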
In recent years, convolutional neural networks have become invaluable tools in computer vision and image processing. Convolutional neural networks achieve state-of-the-art prediction performance in image classification by aggregating local information into a hierarchy of increasingly higher-level features that eventually inform the final prediction. In this thesis, we propose (and experimentally verify) an application of convolutional neural network architectures to the domain of inferring good identifier names by interpreting a particular graph-based representation of the source code as a kind of generalized image. From this perspective, source code identifiers are seen as pixels belonging to some higher-dimensional picture, with the normal adjacency relations between pixels replaced by various syntactic and semantic relations that are extracted from the source code. Interpreted this way, standard convolutional architectures become applicable to the source code domain, and the goal is to design a convolutional architecture that can complement the conditional random field model by extracting a set of distinctive higher-order features that can be used to improve the initial predictions made by the conditional random field model.

Although the idea is seemingly straightforward, a concrete implementation raises many technical and conceptual questions. Topologically, these generalized images are a lot more complicated than their ordinary two-dimensional counterparts, and can be as complicated as any general multigraph. The adjacency relation between the pixels is very sparse, by which we mean that almost all 'pixels' lie on the border, which, even in the conventional setting, makes them more difficult to handle. Typical concepts like K-by-K receptive fields, stride, and pooling need to be re-evaluated in this new environment. Another serious difference is that in our generalized setting, pixel values are categorical instead of numerical. At first glance this prevents us from applying any numerical kernel operations, like averaging over the pixel values in a neighborhood, but by adequately embedding the categorical labels into a finite-dimensional vector space, the use of numerical kernels can be recovered.
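
The last point can be illustrated with a minimal sketch, which is not the architecture developed in this thesis. It assumes a hypothetical fixed embedding table (in practice the embedding would be learned), a hypothetical three-node multigraph with made-up relation names, and a plain averaging kernel that ignores the relation types. Once each categorical label is mapped to a vector, averaging over a neighborhood is well defined, even when the neighborhood comes from a multigraph rather than a pixel grid.

```python
# Hypothetical embedding: each categorical label maps to a 2-dimensional vector.
DIM = 2
EMBEDDING = {
    "count": [1.0, 0.0],
    "index": [0.0, 1.0],
    "length": [1.0, 1.0],
}

# Hypothetical graph: each node ('pixel') carries a categorical label.
LABELS = {"a": "count", "b": "index", "c": "length"}

# Hypothetical multigraph edges (source, target, relation).
# Parallel edges are allowed: "a" and "c" are linked by two relations.
EDGES = [
    ("a", "b", "assigned_from"),
    ("a", "c", "compared_with"),
    ("a", "c", "passed_with"),
]


def neighbors(node):
    """Neighbors of a node, with one entry per incident edge."""
    result = []
    for src, dst, _relation in EDGES:
        if src == node:
            result.append(dst)
        elif dst == node:
            result.append(src)
    return result


def mean_neighbor_embedding(node):
    """Averaging kernel: mean of the embedded labels of the node's neighbors."""
    vectors = [EMBEDDING[LABELS[n]] for n in neighbors(node)]
    if not vectors:
        return [0.0] * DIM  # isolated node: no neighborhood to average over
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(DIM)]
```

Here `mean_neighbor_embedding("a")` averages the vectors of `b`, `c`, and `c` again, so the doubly connected neighbor contributes twice. Whether parallel edges should be weighted this way, and how relation types should enter the kernel, are exactly the kind of design questions raised above.
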