Predicting Patient Churn: Features that predict when Breast Cancer Patients leave their online community

Sternheim, A.M.

Publication date

2018

Summary

Patient organisations often maintain an online community for the patients they serve. The Dutch breast cancer organisation (BVN), for example, hosts the forum ‘de Amazones’, which is aimed at breast cancer patients. It is an important medium for patients and the association alike: for the association, it is a means of helping patients achieve the best possible quality of life, and for patients, participation on the forum is empowering. Over the past year, however, activity on ‘de Amazones’ decreased strongly. This thesis applies the principle of customer churn (unsubscription) to the forum in an effort to identify those users who will leave the forum within one, two or three months. Identifying churners is a first step towards a programme that lets online communities such as that of the Dutch breast cancer association recognise likely churners and respond accordingly.

Churn prediction was approached as a supervised machine learning problem. Each forum post was described by twelve simple, easy-to-annotate variables. Half of them describe a single point in time (static features), while the other half summarise the past month (retrospective features). The variables were organised into five feature groups: inactivity, textual, textual (retrospective), opinion, and opinion (retrospective). Different combinations of these variables were used to predict whether the writer of a post would churn within one, two, or three months. The algorithm used is XGBoost, which builds an ensemble of decision trees by gradient boosting. The resulting models were compared to determine which feature groups were the most influential. Predictive performance was measured as ROC-AUC.

The results show that, on a realistic test set, churn within one month can be predicted most accurately (AUC = 0.670). This result is examined further using the false negative rate (FNR), which indicates what share of the actual churners the model missed and, conversely, how many were correctly identified. Both scores were visibly affected whenever only a few samples from the data were available. The two retrospective feature groups proved to be the most influential.
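The two evaluation measures named in the summary can be illustrated with a minimal sketch. It is not the thesis's evaluation code: the function names, the decision threshold of 0.5, and the toy labels and scores are all assumptions made purely for illustration. ROC-AUC is computed here as the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner, and FNR as the fraction of actual churners whose score falls below the threshold.

```python
def roc_auc(labels, scores):
    """ROC-AUC as a rank statistic: the chance that a random churner
    (label 1) is scored above a random non-churner (label 0).
    Tied scores count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one churner and one non-churner")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def false_negative_rate(labels, scores, threshold=0.5):
    """Share of actual churners that the model misses, i.e. whose
    score is below the (assumed) decision threshold."""
    churner_scores = [s for y, s in zip(labels, scores) if y == 1]
    if not churner_scores:
        raise ValueError("need at least one churner")
    missed = sum(s < threshold for s in churner_scores)
    return missed / len(churner_scores)


# Toy data, invented for illustration only (1 = churner, 0 = stays).
labels = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.6, 0.3, 0.4, 0.2, 0.1, 0.7, 0.8]

print(roc_auc(labels, scores))              # 0.8125
print(false_negative_rate(labels, scores))  # 0.25
```

With very few churners in a test set, a single misclassified post shifts both numbers considerably, which is consistent with the summary's remark that the scores were sensitive to small sample sizes.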

URI

https://studenttheses.uu.nl/handle/20.500.12932/30697

Collections

Theses