Improving Rare Disease Diagnosis with BERT
Marcel Santoso, Marcel
A rare disease is an illness that affects fewer than one in every 2,000 individuals. There are more than 6,000 recognized rare diseases in the European Union, and collectively they affect thirty million people there. Many doctors do not have sufficient experience and knowledge to diagnose such a diverse and rare group of diseases. As a result, rare disease patients often wait for years before receiving a definite diagnosis. Electronic health records (EHRs) of diagnosed patients can guide the diagnosis process of current and future rare disease patients. However, extracting relevant clinical information from millions of EHRs is challenging, especially when most diagnosis information is recorded in unstructured text. Different clinicians may use different terms to describe the same disease and symptoms. Additionally, context, such as negation and cues for family history, may affect how a diagnosis is interpreted. BERT is one of the current state-of-the-art natural language processing (NLP) models that have been shown to understand linguistic context and perform NLP tasks well. This review explores how BERT can improve rare disease diagnosis by processing clinical notes in EHRs. The ability of BERT to learn contextualized embeddings from data helps it to reliably identify words that are important for rare disease diagnosis, such as symptoms and clinical signs. BERT can also predict the most probable diagnosis given all the information recorded in clinical notes. This information can help doctors restrict their diagnosis search space and expedite the diagnosis of rare diseases. The use of contextualized embeddings also allows BERT to be trained with imperfect labels in the fine-tuning phase, which removes the need for labeled rare disease datasets in the fine-tuning process. BERT therefore shows potential for use in diagnosis support. However, class imbalance and limited training data for certain diseases must be addressed to improve BERT's performance further.