dc.description.abstract | Purpose: Breast and colorectal cancer are among the most common types of cancer in terms of incidence and mortality. Cancer staging is a critical part of treating cancer patients, but stage is not recorded in healthcare claims, even though these claims are a rich source of insight into cancer treatment. The purpose of this study is to predict cancer stage from healthcare claims, evaluating both model performance and predictor importance. The study attempts to improve on previous work by broadening the range of predictors to include indirectly linked activities and prescribed medicines, and by classifying all four stages of cancer separately.
Methods: Data sets for the breast and colorectal cancer studies were constructed by combining clinical patient data and care activity data from several hospitals in the Netherlands. Multiple preprocessing steps were applied to these data sets, including SMOTE and AENN to combat class imbalance. On the processed data sets, neural network, random forest, support vector machine and Super Learner models were trained to predict cancer stage from healthcare activities. The models were assessed on AUC, sensitivity and specificity. Finally, predictor importance was determined by combining a model-agnostic interpretation method with a scoring system.
Results: The best performing model for breast cancer stage prediction was the random forest model, with an AUC of 0.71. For the colorectal cancer study, the best performing model was the Super Learner model with feature selection, SMOTE and AENN, with an AUC of 0.61. These results show that the models did not improve on results from previous studies. Predictor importance analysis showed a broad range of variables with high importance scores, including directly linked activities, indirectly linked activities and prescribed medicines. However, these predictors do not correspond to the treatment patterns described in the literature: directly linked activities are underrepresented among the important predictors compared with the literature.
Conclusion: This study has shown that small and imbalanced data sets make it difficult to construct viable models for predicting breast and colorectal cancer stage. However, including a broader range of predictors was shown to be a possible improvement over previous studies. This motivates further research with larger, more balanced data sets. | |