Reporting and handling of unrecorded values in automatically text-extracted study variables from electronic health records in epidemiological studies: A Literature Review
Summary
Background: Electronic health records (EHRs) provide valuable patient data for research, and natural language processing (NLP) helps convert their unstructured free text into analyzable formats. However, values that are not recorded in the free text remain a problem and can introduce bias. This paper reviews how current epidemiological studies handle and report unrecorded values in NLP-extracted data from EHRs.
Methods: Based on a previous review article, a total of 30 recent observational studies using NLP techniques to extract data from EHRs were included. Data were extracted on the intended use of NLP, the relevant text-extracted variables, the role of each variable in the analysis (e.g., determinant, outcome), and the variables directly related to the text-extracted variable in the studies' analyses. We also collected explicit mentions of how studies referred to text-extracted variables, how they addressed negated and unrecorded values, and which limitations of NLP techniques they reported.
Results: Fourteen of the 30 studies used both text and structured data for their text-extracted variable, while the other 16 used text only. The purpose of the NLP techniques varied and included creating new variables and identifying additional cases. Reporting of text-extracted variables differed: 11 variables were referred to as text-extracted variables in the analysis, while the remaining 21 were referred to as the actual variables themselves. Only six studies reported how they handled negated values from the EHRs, and only six studies explicitly considered unrecorded values. Finally, 25 studies acknowledged limitations related to NLP techniques, reporting challenges in accuracy and data quality.
Conclusion: We highlight the significant variability in the reporting and handling of text-extracted variables from EHRs, particularly regarding unrecorded values, and emphasize the need for standardized guidelines to improve consistency and accuracy in NLP-assisted epidemiological research.