Statistical Modeling at the Syntax-Semantics Interface: Exploiting Automatically Induced Lexical Classes Evaluated through Variational Bayesian Inference
Summary
Automatic verb classification has been widely explored with both supervised and unsupervised machine learning techniques, using syntactic and semantic features, in close connection with argument structure theory and Levin’s (1993) verb classes. The present study goes a step beyond previous research in this field (e.g. Lapata and Brew, 2004; Merlo and Stevenson, 2001; Sun and Korhonen, 2009) by using automatically induced verb classes not as a goal but as the starting point for a lexicon induction experiment on individual verbs. Inspired by Rooth, Riezler, Prescher, Carroll, and Beil (1999), a first experiment clusters verbs represented by co-occurrence vectors of argument nouns extracted from the subcategorization frames of transitive and intransitive verbs. Building on the resulting model, a second experiment shows that lexicons of argument nouns for fixed verbs can be created by re-estimating the nouns’ absolute frequencies with respect to the same verb, modified by cluster-related probabilities from the model. Beyond these relatively simple statistical inference steps, the study’s relevance also lies in the detailed, combined evaluation system used for model selection, which includes a pseudo-disambiguation task, in-depth cluster metrics, and a Variational Bayes Gaussian Mixture. Argument selectional preference was found to be a good indicator of verb classes, especially for the data set that included verbs of the alternation in which the object of the transitive is the subject of the intransitive. Moreover, a quantitative, WordNet-based method showed that such classes bear relatively little resemblance to Levin’s classes. Future research could explore adjunct slots and extend the evaluation architecture to other clustering tasks within NLP.
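The re-estimation step of the second experiment can be illustrated with a minimal sketch. It assumes a latent-class model of the general form p(v, n) = sum_c p(c) p(v|c) p(n|c), as in Rooth et al. (1999), and it assumes that "modified by cluster-related probabilities" means weighting each observed count f(v, n) by the posterior p(c | v, n). That weighting, the function name, and the toy numbers are illustrative assumptions, not the study's stated procedure or data.

```python
import numpy as np

def reestimate_lexicon(freq, p_c, p_v_given_c, p_n_given_c, verb):
    """Split the counts f(verb, n) across clusters by the posterior p(c | verb, n).

    freq:        (V, N) absolute verb-noun co-occurrence counts
    p_c:         (C,)   cluster priors
    p_v_given_c: (C, V) verb probabilities per cluster
    p_n_given_c: (C, N) noun probabilities per cluster
    Returns a (C, N) array of cluster-specific noun frequencies for `verb`.
    """
    # joint[c, n] = p(c) * p(verb | c) * p(n | c)
    joint = p_c[:, None] * p_v_given_c[:, verb][:, None] * p_n_given_c
    total = joint.sum(axis=0, keepdims=True)
    # Posterior p(c | verb, n); columns with zero total mass stay at zero.
    posterior = np.divide(joint, total, out=np.zeros_like(joint), where=total > 0)
    return posterior * freq[verb][None, :]

# Hypothetical toy model: 2 clusters, 2 verbs, 3 nouns.
p_c = np.array([0.5, 0.5])
p_v_given_c = np.array([[0.9, 0.1],
                        [0.2, 0.8]])
p_n_given_c = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.3, 0.6]])
freq = np.array([[10.0, 4.0, 2.0],
                 [1.0, 5.0, 9.0]])

lexicon = reestimate_lexicon(freq, p_c, p_v_given_c, p_n_given_c, verb=0)
print(lexicon)
```

Under this assumption, each noun's cluster-specific frequencies sum back to its observed count with the fixed verb, so the step redistributes the counts across clusters without changing their total.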