Self-Service Data Science in Healthcare: Using AutoML in the knowledge discovery process

Ooms, R.L.J.

Publication date

2019

Summary

Introduction: The healthcare industry has lagged in the adoption of analytics, in part because of a shortage of data scientists in the sector. Advancements in Machine Learning (ML) and research on its accessibility for non-experts gave rise to the research field of Automated Machine Learning (AutoML). Because AutoML is designed to make ML accessible to non-expert users, this research investigates how researcher-physicians can be supported in their knowledge discovery process by applying AutoML as part of the research field of Applied Data Science (ADS). To the best of our knowledge, this is the first study to test AutoML methods with domain experts in the healthcare domain.

Method: This research follows a design science approach. First, we selected TPOT as the AutoML method based on the results of a benchmark test and requirements from researcher-physicians. We integrated TPOT into two artefacts, a web application and a notebook. We evaluated the artefacts with the framework for evaluation in design science to find out which suits researcher-physicians best.

Results: The benchmark test found that no AutoML method consistently outperformed all other methods under the one-hour and four-hour budgets. However, TPOT and Auto-Sklearn performed best on both tests. As TPOT was the method that satisfied the most requirements, we integrated TPOT into two artefacts. Both artefacts had a similar workflow but different user interfaces because of a conflict in requirements. Artefact A, a web application, was perceived as better for uploading a dataset and comparing results. Artefact B, a Jupyter notebook, was perceived as better regarding the workflow and being in control of model construction. Thus, a hybrid artefact would be best for researcher-physicians. However, both artefacts lacked model explainability and an explanation of variable importance for the created model. Hence, the researcher-physicians indicated that they would only use AutoML for the explorative phase of their knowledge discovery process.

Discussion: The results suggest that AutoML methods need work on explaining the created models and their route to model creation. Another issue is the stability of the (Auto)ML models: models created by AutoML methods based on evolutionary algorithms are hard to reproduce because of their random initialization. Even changing the seed can change the outcome for a single patient.
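The seed sensitivity described in the discussion can be illustrated with a minimal sketch. This is not TPOT and not the thesis's experiment: it is a toy evolutionary search for a single decision threshold, written with the Python standard library only. The data, the function names (`accuracy`, `evolve_threshold`, `predict_for_seeds`) and all parameter values are hypothetical and chosen purely for illustration. The point is that several thresholds fit the training data equally well, so which one the search returns depends on the seed, and a borderline patient may be classified differently from one run to the next.

```python
import random


def accuracy(threshold, samples):
    """Fraction of (value, label) pairs classified correctly by `value >= threshold`."""
    return sum((value >= threshold) == label for value, label in samples) / len(samples)


def evolve_threshold(samples, seed, generations=20, population_size=8):
    """Toy evolutionary search for a decision threshold (illustrative assumption only)."""
    rng = random.Random(seed)  # all randomness flows from this seed
    population = [rng.uniform(0.0, 10.0) for _ in range(population_size)]
    for _ in range(generations):
        # Keep the better half, then refill with mutated copies of the survivors.
        population.sort(key=lambda t: accuracy(t, samples), reverse=True)
        survivors = population[: population_size // 2]
        children = [t + rng.gauss(0.0, 0.5) for t in survivors]
        population = survivors + children
    return max(population, key=lambda t: accuracy(t, samples))


# Hypothetical data: (measurement, has_condition). No sample lies between 4.0
# and 6.0, so every threshold in that gap fits the training data equally well.
SAMPLES = [(1.0, False), (2.0, False), (3.0, False), (4.0, False),
           (6.0, True), (7.0, True), (8.0, True), (9.0, True)]
NEW_PATIENT = 5.0  # a borderline case that falls inside the gap


def predict_for_seeds(seeds):
    """Return {seed: prediction for NEW_PATIENT}, one search run per seed."""
    return {seed: NEW_PATIENT >= evolve_threshold(SAMPLES, seed) for seed in seeds}


if __name__ == "__main__":
    for seed, prediction in predict_for_seeds(range(5)).items():
        print(f"seed={seed}: predicted positive = {prediction}")
```

Running the same seed twice reproduces the same threshold, which is why fixing and reporting the seed is a common first step towards reproducibility. It does not remove the underlying instability, since a different seed may still yield a different prediction for the same patient.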

URI

https://studenttheses.uu.nl/handle/20.500.12932/34863

Collections

Theses