Reporting and handling of unrecorded values in automatically text-extracted study variables from electronic health records in epidemiological studies: A Literature Review

Brink, Jasper van den

dc.rights.license	CC-BY-NC-ND
dc.contributor.advisor	Schuit, E.
dc.contributor.author	Brink, Jasper van den
dc.date.accessioned	2025-02-13T00:01:11Z
dc.date.available	2025-02-13T00:01:11Z
dc.date.issued	2025
dc.identifier.uri	https://studenttheses.uu.nl/handle/20.500.12932/48498
dc.description.abstract	Background: Electronic health records (EHRs) provide valuable patient data for research, and natural language processing (NLP) helps convert their unstructured free text into analyzable formats. However, unrecorded values in the free text remain a problem and can introduce bias. This paper reviews how current epidemiological studies handle and report unrecorded values in NLP-extracted EHR data. Methods: Based on a previous review article, 30 recent observational studies using NLP techniques to extract data from EHRs were included. Data were extracted on the intended use of NLP, the relevant text-extracted variables, the role of each variable in the analysis (e.g., determinant, outcome), and the variables directly related to the text-extracted variable in the studies’ analyses. Explicit mentions were collected of how textual variables were referred to, how negated and unrecorded values were addressed, and which limitations of NLP techniques were reported. Results: Of the 30 studies, 14 used both text and structured data for their text-extracted variable, while the other 16 used text only. The purpose of the NLP techniques varied, for example creating new variables or identifying additional cases. Reporting of text-extracted variables differed: 11 variables were referred to as text-extracted variables in the analysis, while the remaining 21 were referred to as the actual variable itself. Only six studies reported how negated values from the EHRs were handled, and only six studies explicitly considered unrecorded values. Finally, 25 studies acknowledged limitations related to NLP techniques, reporting challenges in accuracy and data quality. Conclusion: We highlight the significant variability in the reporting and handling of text-extracted variables from EHRs, particularly regarding unrecorded values, and emphasize the need for standardized guidelines to improve consistency and accuracy in NLP-assisted epidemiological research.
dc.description.sponsorship	Utrecht University
dc.language.iso	EN
dc.subject	A literature review of 30 recent epidemiological studies that use automatic text-extraction techniques. Their reporting practices for text-extracted variables and, most notably, unreported values were analyzed and discussed.
dc.title	Reporting and handling of unrecorded values in automatically text-extracted study variables from electronic health records in epidemiological studies: A Literature Review
dc.type.content	Master Thesis
dc.rights.accessrights	Open Access
dc.subject.courseuu	Epidemiology
dc.thesis.id	36757

Files in this item

Name: FINAL_WritingAssignment_Jasper ...
Size: 434.9Kb
Format: PDF

This item appears in the following Collection(s)

Theses
