Exploring AI-Assisted Triage: Comparing Classical Machine Learning and GPT-4o for Clinical Decision Making

Braam, Emma

Master_Thesis_AI_EmmaBraam_6412300.pdf (2.935Mb)

Publication date

2025

Summary

Accurate and efficient triage is essential for allocating limited medical resources in the emergency department (ED). Triage decisions are inherently hard, however: they require rapid yet justifiable judgments under time pressure, with incomplete patient data and uncertainty. Large language models (LLMs) such as GPT-4o may offer new potential in these circumstances, but concerns about their transparency and reliability persist, particularly in high-stakes settings such as triage. One promising direction is to constrain LLMs to structured, rule-based reasoning, ideally in a way that reflects how triagists reason. In triage, clinicians often rely on default reasoning: they apply general assumptions unless these are contradicted by specific evidence. The BOID framework (Beliefs, Obligations, Intentions, Desires) builds on this by representing different mental attitudes and resolving conflicts between them through structured prioritization. Integrating BOID into LLMs could allow users to trace which default rules, from which mental attitudes, were prioritized in a particular triage decision. While this thesis does not implement a full BOID-LLM system, we take a first step by focusing on the O-component (obligations). We treat the triage decisions in our real-world Korean Triage and Acuity Scale (KTAS) data as obligation-driven and explore whether GPT-4o can simulate obligation-based reasoning without the support of retrieval mechanisms. We compare GPT-4o’s performance under multiple prompt engineering techniques with that of classical machine learning (ML) models, including Decision Trees (DT), Random Forests (RF), and eXtreme Gradient Boosting (XGBoost). We further extract triage decision rules from GPT-4o and compare them with feature-importance insights obtained by applying SHapley Additive exPlanations (SHAP) to our best-performing classical ML model.
Finally, we reflect on ethical concerns, including reliability, fairness, transparency, and data privacy. Our results show that RFs slightly outperformed DTs and XGBoost. Among the prompt engineering techniques, including SHAP-derived results from our best-performing classical ML model in the prompt improved GPT-4o’s performance. However, even GPT-4o’s best configuration (Weighted Kappa = 0.6186) fell short of all classical ML models. In real-world triage, this would mean that a significant proportion of patients would still be misprioritized. Furthermore, GPT-4o seemed to rely on general medical knowledge rather than on the explicit instructions in the prompt. In addition, its extracted triage rules were inconsistent, and there was a gap between how it claimed to reason and what it actually predicted. These results raise ethical concerns about using LLMs like GPT-4o for high-stakes clinical tasks such as triage decision making, and they emphasize the need for a hybrid BOID-LLM system that combines the explainability and structure of default logic with LLMs to support safe and reliable decision making in triage.

URI

https://studenttheses.uu.nl/handle/20.500.12932/48926

Collections

Theses