## Mining Exceptional Linear Regression Models with Tree-Constrained Gradient Ascent

##### Summary

The Exceptional Model Mining (EMM) framework is a problem encountered in exploratory data mining, in which the task is to find subgroups in a dataset, for which a model induced within that group deviates substantially from the same kind of model induced on the entire dataset.
In this work, we consider the EMM problem with linear regression models, where we seek to find subgroups within which the regression model's coefficients deviate substantially from those of the global model. In particular, we consider the Beam Search (BS) strategy that is often used for solving this problem heuristically, and seek to develop a search strategy that can improve on the performance of that algorithm within this context.
Specifically, then, we propose a search strategy that uses individual records' influence on the parameter-estimation of the model within, and hence on the quality of, a subgroup being optimized during the search. The key idea is to generalize, during this search, the notion of a subgroup to that of a \emph{soft} subgroup. We use this conceptualization to rewrite the quality measure as a differentiable function of the individual records' degree of inclusion in this soft subgroup.
Optimization is then performed using what may be interpreted as a form of constrained gradient ascent. Specifically, we introduce a generalization of classification trees, which may be grown on any differentiable objective function. We show that the resulting model may be interpreted, colloquially, as trying to find the \emph{weakest} possible constraint on the objective function, while still conforming to the tree structure. Optimization is then performed by assigning all records to the tree's leafs, and the records' degree of inclusion in the subgroup is optimized \emph{en bloc}, for all records assigned to the same leaf, using gradient ascent. The tree structure then enables us, once converged, to readily express the resulting optimum in a rule-based format that is meaningful in the EMM context. We refer to this tree-constrained gradient ascent strategy as GRADCAT, for Gradient Ascent with Differentiable ClAssification Trees.
We furthermore propose a post-processing procedure that is used to reduce overfitting on the training data, and to remove superfluous conditions in the resulting subgroup description. We refer in this work to the composite algorithm, which constitutes the novel EMM-specific search strategy proper, as GRADCAT-EMM.
By showing that EMM with linear regression models is a strict generalization of the Subgroup Discovery paradigm, i.e., that the latter is strictly a special case of the former, we find that the novel algorithm can also be immediately applied in that context.
Experiments are performed in which GRADCAT-EMM's performance is compared with that of Beam Search, as well as with that of a composite algorithm that uses the proposed post-processing together with the BS algorithm's method of finding subgroup descriptions. We perform these experiments on both synthetic and real world datasets. On the synthetic data, we find that GRADCAT-EMM consistently statistically significantly outperforms both of the BS-based strategies. On the real world datasets, we initially find that BS performs significantly better than both other algorithms. However, comparing characteristics of the results obtained, we argue that this does not properly reflect our preferences of these characteristics. Arguing that this is due to an oversight in the evaluation measure used, we re-run the experiments on the real world data, using an adapted measure that is meant to compensate for this observed effect. From these experiments, we find that we no longer observe a significant difference in the characteristics of the results obtained. We also find that the BS algorithm with post-processing now performs marginally better than the other strategies, but we largely fail to detect any significant difference between the algorithms.
We end up arguing that the observed relative difference in the algorithms' performance, between the synthetic and real world datasets, can largely be attributed to a difference in the characteristics of these different datasets. In attempting to characterize the difference between these datasets, we do not find a real preference for any of the algorithms if the dataset represents a relatively small and well-structured search space. However, we conclude by hypothesizing that GRADCAT-EMM may be expected to perform particularly well on datasets that represent a large search space with a very low signal-to-noise ratio.