Identification and interpretation of heterogeneous sparse conditional independence structures using Gaussian Mixture Modelling and Gaussian Graphical Modelling
Summary
Humans, and especially researchers, often categorise things in order to understand them better. Take, for example, subtypes of cancer, mental illness, personality, political affiliation, or opinion. When it is unknown which subtypes exist, identifying them is a challenge. In artificial intelligence and computational data science, this challenge can be addressed with clustering, a form of unsupervised learning. Clustering methods divide data into subgroups based on the features of the data. However, it is often difficult to infer from this division how or why the subgroups differ. In this thesis, I address this shortcoming by combining a clustering method, Gaussian Mixture Modelling, with a structural estimation method, Gaussian Graphical Modelling. Structural estimation methods reveal relations between variables, which can be visualised as networks and then analysed and interpreted. The combined method divides the data based on the structure of these networks. By comparing the network structures, we can infer structural differences between subgroups. The method is first tested on artificial data, which shows that it is sensitive to small sample sizes. The method is then applied to two datasets: data on Social Media Disorder among Dutch adolescents, and European data on public opinion about immigrants and refugees. Results for the Social Media Disorder data show three subtypes that differ only slightly. Results for the public opinion data suggest three clearly distinguishable subtypes.
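To make the idea of pairing a mixture model with per-subgroup networks concrete, here is a minimal Python sketch. It is a simplified two-step approximation, not the combined method of this thesis: it first clusters with scikit-learn's `GaussianMixture` and only afterwards fits a sparse network per cluster with `GraphicalLasso`, whereas the combined method divides the data based on the network structure itself. The function name `cluster_then_estimate_networks`, the synthetic data, and the penalty value `alpha=0.1` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso
from sklearn.mixture import GaussianMixture


def cluster_then_estimate_networks(X, n_subgroups, alpha=0.1, seed=0):
    """Two-step sketch: cluster with a GMM, then fit a sparse GGM per cluster.

    Hypothetical helper for illustration; alpha is an arbitrary sparsity penalty.
    """
    gmm = GaussianMixture(n_components=n_subgroups, covariance_type="full",
                          random_state=seed)
    labels = gmm.fit_predict(X)

    networks = []
    for k in range(n_subgroups):
        members = X[labels == k]
        precision = GraphicalLasso(alpha=alpha).fit(members).precision_
        # A non-zero off-diagonal precision entry means the two variables are
        # conditionally dependent, i.e. connected by an edge in the network.
        adjacency = np.abs(precision) > 1e-8
        adjacency = adjacency | adjacency.T
        np.fill_diagonal(adjacency, False)
        networks.append(adjacency)
    return labels, networks


# Synthetic data (assumed): two subgroups with different sparse structures.
rng = np.random.default_rng(0)
p = 4

# Subgroup A: chain network 0-1-2-3.
precision_a = np.eye(p)
for i in range(p - 1):
    precision_a[i, i + 1] = precision_a[i + 1, i] = 0.4

# Subgroup B: a single edge between variables 0 and 3.
precision_b = np.eye(p)
precision_b[0, 3] = precision_b[3, 0] = 0.4

# Well-separated means keep the clustering step easy in this toy example.
group_a = rng.multivariate_normal(np.zeros(p), np.linalg.inv(precision_a), size=200)
group_b = rng.multivariate_normal(np.full(p, 6.0), np.linalg.inv(precision_b), size=200)
X = np.vstack([group_a, group_b])

labels, networks = cluster_then_estimate_networks(X, n_subgroups=2)

for k, adjacency in enumerate(networks):
    edges = [(i, j) for i in range(p) for j in range(i + 1, p) if adjacency[i, j]]
    print(f"subgroup {k}: {int((labels == k).sum())} samples, edges {edges}")
```

Comparing the printed edge lists is the toy analogue of comparing network structures between subgroups. Which edges survive depends on the penalty and on the sample size per subgroup, so the recovered networks should not be expected to match the generating structures exactly.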