Improving Data Quality: A Review on Data-Centric AI and AI-Actionable Data

Garcia Mondejar, Marta

Publication date

2025

Summary

Artificial Intelligence (AI) has shown remarkable potential in recent years, particularly in fields such as cancer diagnostics. AI systems are increasingly used for tasks such as tumor identification, cancer type classification, and patient outcome prediction. Despite this potential, they often fall short when applied in real-world clinical settings: performance suffers from poor generalization to new environments, low-quality training data, and the underrepresentation of diverse patient groups and cancer types. This literature review explores a new paradigm known as Data-Centric AI (DCAI), which shifts the focus from optimizing model architectures to improving the quality of training data. After outlining the current challenges in AI-based cancer detection (e.g., data bias, label inconsistency, limited institutional collaboration), we examine three key areas where DCAI techniques are being applied: (1) representation and diversity, (2) label quality and data preprocessing, and (3) accessibility, generalizability, and collaboration. We analyze recent studies that use techniques such as synthetic data generation, semi-supervised labeling, and federated learning to address challenges in these areas. The review concludes by highlighting the crucial role of data quality in building robust AI models that generalize well across multiple clinical settings and in realizing the full potential of AI in oncology.

URI

https://studenttheses.uu.nl/handle/20.500.12932/50108

Collections

Theses