dc.description.abstract | The increasing volume and complexity of tabular data generated in clinical trials have outpaced traditional manual review workflows, which typically rely on univariate thresholds and two-dimensional visualizations. Individual anomalous measurements, so-called cellwise outliers, can evade such marginal checks and compromise entire records, underscoring the need for automated, scalable detection pipelines that also provide explainability to clinical data managers.
The primary objective of this thesis is to evaluate the classification performance of both univariate and multivariate cellwise anomaly detection methods on a tabular dataset spanning multiple clinical studies.
Clinical data from 50 placebo-arm studies, comprising 1,104 subjects and spanning January 2016 to September 2022 across vital signs, laboratory, ECG, and demographic domains, were injected with two mutually exclusive synthetic outlier types, small (±3 SD) and extreme (×10), each at a 1% frequency. The univariate approach employed the STAR_outlier algorithm to identify marginal deviations. In parallel, the multivariate workflow applied within-day last-observation-carried-forward and between-day iterative imputation, followed by a self-supervised LightGBM gradient boosting regression model that predicted each feature from all other parameters (including lagged and lead timepoints). Reconstruction errors were transformed into anomaly scores, and cellwise anomalies were flagged based on a quantile threshold.
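A minimal sketch of the injection and multivariate scoring steps is given below, assuming a purely numeric feature matrix; the function names, LightGBM hyperparameters, and the omission of the imputation and lag/lead feature steps are illustrative simplifications rather than the thesis implementation.

```python
import numpy as np
import pandas as pd
import lightgbm as lgb

rng = np.random.default_rng(0)

def inject_outliers(df, frac=0.01):
    """Inject two mutually exclusive synthetic cellwise outlier types into
    numeric columns: 'small' (value shifted by +/-3 SD) and 'extreme'
    (value multiplied by 10), each affecting `frac` of the cells per column."""
    out = df.copy()
    labels = pd.DataFrame("", index=df.index, columns=df.columns)
    for j, col in enumerate(df.columns):
        n = int(frac * len(df))
        pos = rng.permutation(len(df))
        small, extreme = pos[:n], pos[n:2 * n]     # disjoint cell sets
        sd = df[col].std()
        vals = df[col].to_numpy()
        out.iloc[small, j] = vals[small] + rng.choice([-3.0, 3.0], size=n) * sd
        out.iloc[extreme, j] = vals[extreme] * 10
        labels.iloc[small, j] = "small"
        labels.iloc[extreme, j] = "extreme"
    return out, labels

def flag_cells(df, quantile=0.99):
    """Self-supervised cellwise scoring: each column is predicted from all
    other columns with a LightGBM regressor, and cells whose absolute
    reconstruction error exceeds a per-column quantile threshold are flagged."""
    flags = pd.DataFrame(False, index=df.index, columns=df.columns)
    for col in df.columns:
        model = lgb.LGBMRegressor(n_estimators=200, learning_rate=0.05)
        model.fit(df.drop(columns=[col]), df[col])
        err = (df[col] - model.predict(df.drop(columns=[col]))).abs()
        flags[col] = err > err.quantile(quantile)  # quantile-based flagging
    return flags
```

In this kind of setup, the cellwise flags can then be compared against the injected labels to compute the cell-level classification metrics referred to above.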
When evaluated across studies, the multivariate LightGBM model consistently flagged extreme anomalies with high reliability but struggled to detect subtle deviations when extremes were present, prompting threshold adjustments that improved small-anomaly recall. Study-specific models modestly enhanced small-outlier detection but still fell short of operational requirements, and the univariate STAR_outlier method delivered intermediate results: it outperformed multivariate detection of minor anomalies in the presence of extreme outliers but did not match the multivariate model's sensitivity to the latter.
Importantly, the LightGBM model was intrinsically capable of detecting small outliers: evaluation on a dataset containing only small outliers yielded classification performance comparable to that obtained for extreme outliers.
In conclusion, while both automated multivariate and marginal univariate techniques can effectively flag gross cellwise anomalies in clinical trial data, the reliable detection of subtle anomalies alongside extreme values remains challenging. Future efforts should focus on refined threshold strategies or two-stage approaches, enriched feature engineering (including temporal-difference and rolling-window statistics), and targeted hyperparameter optimization to advance explainable, scalable anomaly detection in clinical data review.
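As one possible instance of the enriched feature engineering suggested above, the following sketch derives per-subject temporal-difference and rolling-window statistics; the column names (subject_id and the entries of value_cols), the window length, and the function itself are hypothetical and not part of the thesis pipeline.

```python
import pandas as pd

def add_temporal_features(df, value_cols, window=3):
    """Add per-subject temporal-difference and rolling-window statistics.
    Assumes `df` is sorted by visit date within each 'subject_id'
    (both column names are hypothetical)."""
    out = df.copy()
    grp = out.groupby("subject_id")
    for col in value_cols:
        out[f"{col}_diff"] = grp[col].diff()                    # change since previous visit
        out[f"{col}_roll_mean"] = grp[col].transform(
            lambda s: s.rolling(window, min_periods=1).mean())  # recent-visit mean
        out[f"{col}_roll_std"] = grp[col].transform(
            lambda s: s.rolling(window, min_periods=1).std())   # recent-visit variability
    return out
```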