Improving Data Quality: A Review on Data-Centric AI and AI-Actionable Data
Summary
Artificial Intelligence (AI) has shown remarkable potential in recent years, particularly in
fields such as cancer diagnostics. These systems are increasingly being used for tasks like
tumor identification, cancer type classification, and predicting patient outcomes.
However, despite this potential, AI systems often fall short when applied in real-world
clinical settings. Performance tends to degrade because of poor generalization to new
environments, low-quality training data, and the underrepresentation of diverse patient
groups and cancer types.
This literature review explores a new paradigm known as Data-Centric AI (DCAI), which
shifts the focus from optimizing model architectures to improving the quality of training
data. After outlining the current challenges in AI-based cancer detection (e.g., data bias, label
inconsistency, limited institutional collaboration), we explore three key areas where DCAI
techniques are being applied: (1) representation and diversity, (2) label quality and data
preprocessing, and (3) accessibility, generalizability, and collaboration.
We analyze recent studies that apply DCAI techniques, such as synthetic data generation,
semi-supervised labeling, and federated learning, to address challenges in these areas.
The review concludes by highlighting the crucial role of data quality in building robust
AI models that generalize well across multiple clinical settings and in realizing the full
potential of AI in oncology.