Building a Natural Conversational Agent for Healthcare by Examining Empathetic Language
Summary
Lifestyle-related diseases are on the rise, driven primarily by unhealthy habits such as smoking, poor diet, lack of exercise, and excessive alcohol consumption (Balwan & Kour, 2021). At the same time, Conversational Artificial Intelligence (Conversational AI), in the form of chatbots, has gained popularity and emerged as a powerful tool, particularly in the healthcare sector (Amiri & Karahanna, 2022). This thesis explores how to design and evaluate a natural and intuitive chatbot for healthcare. Specifically, it examines how incorporating empathy into the chatbot’s responses affects user experience during conversations about lifestyle changes. It also investigates how this empathetic communication might affect users’ willingness to integrate a healthcare conversational agent into their daily routines (de Boer et al., 2023). The objectives are to assess the impact of empathetic versus neutral tones in messages and to understand user expectations in human-computer interaction.
The methodology involves two experiments.
1) An initial evaluation of different Large Language Models (LLMs), both general and domain-specific. This test assessed the quality of the answers that the LLMs generated to medical questions, using metrics that capture different aspects of human conversation. The evaluation was conducted with G-Eval (Y. Liu et al., 2023) on the MASH-QA dataset (Zhu et al., 2020). The selected models were GPT-4 (OpenAI, 2024), Llama3 (Meta, 2024), MedAlpaca (Han et al., 2023), and Meditron (Chen et al., 2023). The metrics were fluency, naturalness, coherence and groundedness (Y. Liu et al., 2023; Zhong et al., 2022). The models obtained similar average scores, ranging from 0.812 to 0.827, which indicates uniformly high capability. Because this project focuses on natural conversation, I chose the model with the highest naturalness score, MedAlpaca (0.826).
2) A user experiment, conducted as a survey, to evaluate the chatbot’s performance in realistic scenarios. Participants enacted four scenarios in which they requested lifestyle advice on a behaviour change: exercising more, eating healthier, quitting smoking and reducing alcohol intake. The primary independent variable is the level of empathetic language used by the chatbot. A subsequent questionnaire, based on the Chatbot Usability Questionnaire (Holmes et al., 2019), focuses on message naturalness and the impact of empathy on natural language generation. Participants consistently favoured the empathetic chatbot across all scenarios, and the difference was statistically significant (p < 0.001).
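The model-selection step of the first experiment can be sketched as follows. This is a minimal illustration, not the code used in the thesis: it assumes each model already has one score per metric on a 0 to 1 scale, the function names `average_scores` and `select_model` are hypothetical, and the numbers are placeholders rather than the reported results.

```python
from statistics import mean

# The four metrics named in the summary.
METRICS = ("fluency", "naturalness", "coherence", "groundedness")


def average_scores(scores):
    """Return each model's mean score over the four metrics."""
    return {
        model: mean(per_metric[m] for m in METRICS)
        for model, per_metric in scores.items()
    }


def select_model(scores, metric="naturalness"):
    """Return the model with the highest score on a single metric."""
    return max(scores, key=lambda model: scores[model][metric])


# Placeholder values for illustration only; these are NOT the thesis results.
example_scores = {
    "model_a": {"fluency": 0.80, "naturalness": 0.70,
                "coherence": 0.90, "groundedness": 0.80},
    "model_b": {"fluency": 0.70, "naturalness": 0.90,
                "coherence": 0.80, "groundedness": 0.60},
}

averages = average_scores(example_scores)
chosen = select_model(example_scores)
```

In this toy input, `model_a` has the higher average but `model_b` is selected, which mirrors the choice described above: when averages are close, the selection is made on the metric most relevant to the project rather than on the overall mean.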