Self-Service Data Science in Healthcare: Using AutoML in the knowledge discovery process
Summary
Introduction: The healthcare industry has lagged in adopting analytics.
One reason is the shortage of data scientists in the healthcare sector.
Advancements in Machine Learning (ML) and research on its accessibility for non-experts gave rise to the research field of Automated Machine Learning (AutoML). Because
AutoML is designed to make ML accessible to non-expert users, this research aims to
find out how researcher-physicians can be supported in their knowledge discovery process by applying AutoML as part of the research field of Applied Data Science (ADS).
This is the first study, to the best of our knowledge, to test AutoML methods with domain experts in the healthcare domain.
Method: The method used in this research is design science. First, we selected TPOT
as the AutoML method based on the results of a benchmark test and requirements from
researcher-physicians. We integrated TPOT into two artefacts, a web-application and a
notebook. We evaluated the artefacts with the framework for evaluation in design
science to find out which method suits researcher-physicians best.
Results: The benchmark test found that no AutoML method consistently
outperformed all other methods under the one-hour and four-hour budgets. However, TPOT and
Auto-Sklearn performed best on both tests. As TPOT was the method that satisfied
most requirements, we integrated TPOT into two artefacts. Both artefacts had a similar
workflow, but different user interfaces because of a conflict in requirements. Artefact
A, a web-application, was perceived as better for uploading a dataset and comparing results. Artefact B, a Jupyter notebook, was perceived as better regarding the workflow and
being in control of model construction. Thus, a hybrid artefact would be best for researcher-physicians. However, both artefacts lacked model explainability and an explanation of variable importance for the created model. Hence, the researcher-physicians indicated that they would use AutoML only for the exploratory phase of their
knowledge discovery process.
Discussion: The results suggest that AutoML methods need work on explaining the
created models and their route to model creation. Another issue is the stability of the
(Auto)ML models; the models created by evolutionary-algorithm-based AutoML
methods are hard to reproduce because of their random initialisation. Even changing only the
random seed can change the outcome for a single patient.
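
The seed sensitivity described above can be illustrated with a minimal sketch. It does not use TPOT; it is a toy evolutionary search written only with the Python standard library, and the synthetic cohort, the single biomarker, the function names, and the borderline patient value are all assumptions made for illustration. Whether the predictions actually differ between seeds depends on the data and on the seeds tried.

```python
import random


def make_cohort(n=200, seed=0):
    """Synthetic cohort (assumption): one biomarker value and a 0/1 label per patient."""
    rng = random.Random(seed)
    cohort = []
    for _ in range(n):
        label = int(rng.random() < 0.5)
        value = rng.gauss(6.0 if label else 4.0, 1.5)
        cohort.append((value, label))
    return cohort


def predict(threshold, value):
    """Threshold 'model': predict class 1 when the biomarker reaches the threshold."""
    return int(value >= threshold)


def accuracy(threshold, cohort):
    hits = sum(predict(threshold, value) == label for value, label in cohort)
    return hits / len(cohort)


def evolve_threshold(cohort, seed, generations=5, population_size=6):
    """Toy evolutionary search: keep the best half, refill with mutated copies."""
    rng = random.Random(seed)  # the seed drives both initialisation and mutation
    population = [rng.uniform(0.0, 10.0) for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=lambda t: accuracy(t, cohort), reverse=True)
        survivors = population[: population_size // 2]
        children = [t + rng.gauss(0.0, 0.5) for t in survivors]
        population = survivors + children
    return max(population, key=lambda t: accuracy(t, cohort))


cohort = make_cohort()
borderline_patient = 5.0  # hypothetical biomarker value near the class boundary

# Same data, same search budget, different seeds.
thresholds = {seed: evolve_threshold(cohort, seed) for seed in range(10)}
predictions = {seed: predict(t, borderline_patient) for seed, t in thresholds.items()}

for seed in thresholds:
    print(f"seed={seed} threshold={thresholds[seed]:.3f} "
          f"prediction={predictions[seed]}")
```

Fixing the seed makes a single run repeatable, but it does not remove the underlying sensitivity: the selected model, and therefore the prediction for a patient close to the decision boundary, may still differ from one seed to the next.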