Hidden uncertainty in data analysis: Understanding sources of variability in many-analyst projects
Summary
This study examines (1) how analytical decisions contribute to variability in many-analyst studies and (2) whether specific decisions can be identified as key drivers.
Several models of varying complexity were trained and validated on a synthetic multiverse dataset and then tested for generalization on the many-analyst dataset from Breznau et al. (2022). While non-linear models performed well on the multiverse dataset (XGBoost R2 = 0.96), none generalized to the many-analyst dataset (R2 ~ 0.0), possibly due to noise or the absence of key decisions in the synthetic data. SHAP values and feature importance indicated that choices about variables, especially the type of independent variables, were the most impactful.
Although the current models failed to explain variance in many-analyst settings, the findings suggest that efforts to explain variability in many-analyst projects should use complex models that capture non-linear relationships and should emphasize the choice of variables.
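For readers less familiar with the metric, the following minimal sketch illustrates what the R2 values reported above mean. It uses the standard definition of the coefficient of determination, not the study's own code. The function name `r_squared` and the example numbers are hypothetical and chosen only for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean outcome.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when y_true has no variance")
    return 1.0 - ss_res / ss_tot


# Hypothetical effect estimates (illustrative values, not data from the study).
observed = [1.0, -2.0, 4.0, 0.0, 2.0]

# A "model" that always predicts the mean outcome explains no variance.
mean_only = [sum(observed) / len(observed)] * len(observed)
baseline_r2 = r_squared(observed, mean_only)

# A model that reproduces every outcome explains all of the variance.
perfect_r2 = r_squared(observed, observed)

print(baseline_r2, perfect_r2)
```

Under this definition, an R2 near 0.0 on the many-analyst dataset means the models predicted about as well as a constant equal to the mean outcome, whereas an R2 of 0.96 means nearly all variance in the multiverse outcomes was accounted for.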