Predicting Patient Churn: Features that predict when Breast Cancer Patients leave their online community
Summary
Patient organisations often maintain an online community for the patients they
represent. The Dutch breast cancer organisation (BVN), for example, hosts the forum
‘de Amazones’, which is aimed at breast cancer patients. It is an important medium for
patients and the association alike. For the association, it is a means of helping patients
achieve the best possible quality of life. For patients, participation on the forum is
empowering. In the past year, however, activity on ‘de Amazones’ was observed to
decrease strongly.
This thesis applies the principle of customer churn (unsubscription) to the forum in
an effort to identify those forum users who will leave the forum within one, two, or three
months. Identifying churners is a first step towards a program that lets social communities
like the Dutch breast cancer association detect likely churners and respond accordingly.
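To make the prediction target concrete, the sketch below labels each post with whether its author leaves within 30, 60, or 90 days. It is only an illustration under stated assumptions: a user is taken to have left at the time of their last post in the data, a month is approximated as 30 days, and the names `churn_labels`, `posts`, and `data_end` are hypothetical. The definition of churn used in the thesis itself may differ.

```python
from datetime import datetime, timedelta

def churn_labels(posts, data_end, horizons_days=(30, 60, 90)):
    """Label each (user, timestamp) post per horizon.

    Assumption: a user "leaves" at their last post in the data.
    A label is None when the horizon extends past data_end, because
    the outcome cannot be observed yet.
    """
    # Find the timestamp of each user's final post.
    last_post = {}
    for user, ts in posts:
        if user not in last_post or ts > last_post[user]:
            last_post[user] = ts

    labels = []
    for user, ts in posts:
        row = {}
        for h in horizons_days:
            window_end = ts + timedelta(days=h)
            if window_end > data_end:
                row[h] = None  # outcome not observable within the data
            else:
                # Churn within h days: no post after the window closes.
                row[h] = last_post[user] <= window_end
        labels.append(row)
    return labels

# Hypothetical toy data, not taken from the forum.
posts = [
    ("anna", datetime(2015, 1, 1)),
    ("anna", datetime(2015, 1, 20)),
    ("anna", datetime(2015, 6, 1)),
    ("bea", datetime(2015, 1, 5)),
    ("carla", datetime(2015, 12, 20)),
]
labels = churn_labels(posts, data_end=datetime(2015, 12, 31))
```

Posts written close to the end of the data cannot be labelled for the longer horizons, which is one reason why fewer usable samples may remain as the horizon grows.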
The problem of churn prediction was approached from a supervised machine learning
perspective. Twelve simple and easy-to-annotate variables were used to describe each
forum post. Half of them described a single point in time (static features), while the
other half summarised the past month (retrospective features). They were grouped into
five feature groups: inactivity, textual, textual (retrospective), opinion, and
opinion (retrospective) features. Different combinations of these groups were used to
predict whether or not the writer of the post would churn within one, two, or
three months. The algorithm used was XGBoost, which builds an ensemble of decision
trees by gradient boosting. The resulting models were compared to determine which
groups of features were the most influential. Predictive performance was measured by
ROC-AUC.
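As a reminder of what the evaluation metric measures, ROC-AUC can be read as the probability that a randomly chosen churner receives a higher predicted score than a randomly chosen non-churner, with ties counted as one half. The minimal sketch below computes it directly from that definition; the function name `roc_auc` and the example scores are illustrative and not taken from the thesis.

```python
def roc_auc(y_true, y_score):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    y_true holds 1 for churners and 0 for non-churners;
    y_score holds the model's predicted churn scores.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative sample")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # churner ranked above non-churner
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Hypothetical scores for five posts.
y_true = [1, 0, 1, 0, 0]
y_score = [0.9, 0.4, 0.35, 0.2, 0.8]
auc = roc_auc(y_true, y_score)  # 4 of 6 pairs ranked correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking. The pairwise loop is quadratic in the number of samples, so it is meant for illustration rather than for large test sets.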
The results show that, on a realistic test set, churn within one month can be predicted
most accurately (AUC = 0.670). This result is further examined with the false negative
rate (FNR), which reflects the share of churners that were not identified. Both scores
were visibly affected whenever only a few samples from the data were available. The
two retrospective feature groups also proved to be the most influential feature
groups.
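The false negative rate is the number of churners the model missed divided by the total number of actual churners, FN / (FN + TP). The sketch below computes it from binary predictions; the function name `false_negative_rate` and the example labels are illustrative assumptions, not values from the thesis.

```python
def false_negative_rate(y_true, y_pred):
    """FNR = FN / (FN + TP): the share of actual churners that were missed.

    y_true and y_pred hold 1 for churn and 0 for no churn.
    """
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    if fn + tp == 0:
        raise ValueError("FNR is undefined without any actual churners")
    return fn / (fn + tp)

# Hypothetical labels: four actual churners, two of them missed.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0]
fnr = false_negative_rate(y_true, y_pred)
```

With only a handful of churners in a sample, the FNR can take just a few distinct values (with two churners: 0, 0.5, or 1), which is one plausible reason why such scores become unstable on small samples.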