dc.description.abstract | Abstract
Background: Neurodegenerative diseases like Alzheimer's disease (AD) and Amyotrophic Lateral Sclerosis (ALS) pose significant public health challenges due to their complex aetiologies involving genetic, environmental, and biological factors. Single-cell RNA sequencing (scRNA-seq) enables detailed analysis of cellular heterogeneity in these diseases. However, the high dimensionality and sparsity of scRNA-seq data complicate the classification of diseased versus healthy cells, necessitating systematic evaluation of machine learning models and feature engineering strategies.
Methods: We analysed scRNA-seq datasets from the dorsolateral prefrontal cortex of AD patients and controls, and from the primary motor cortex of C9orf72-associated ALS (C9ALS) patients, sporadic ALS (SALS) patients, and shared controls. Logistic Regression and Random Forest classifiers were trained to distinguish diseased from healthy cells using various feature extraction methods: random feature selection, dimensionality reduction (Most Variable Features and Principal Component Analysis), and a biologically focussed approach combining Differential Expression (DE) analysis and Weighted Gene Co-expression Network Analysis (WGCNA). Classifier performance was evaluated with five-fold cross-validation.
Results: Classification accuracy was exceptionally high across all datasets. The biologically focussed method achieved the highest performance, with Logistic Regression attaining peak test AUCs of 0.980 in SALS and 0.976 in C9ALS. Dimensionality reduction was also effective, particularly in AD, where fewer significant features limited the biologically focussed method. Classifiers identified 104 genes shared among AD, C9ALS, and SALS that are implicated in neurodegeneration. Pathway enrichment analysis of these genes highlighted associations with neurodegenerative pathways, mitochondrial dysfunction, and synaptic processes. Machine learning classifiers identified additional critical genes and pathways beyond those detected by DE analysis alone.
Conclusions: Accurate classification of single-cell transcriptomic data in AD and ALS is feasible, with performance significantly influenced by feature extraction methods. Biologically focussed approaches and dimensionality reduction techniques enhanced classifier accuracy and identified key transcriptomic features distinguishing diseased and healthy cells. These findings deepen our understanding of molecular mechanisms in neurodegenerative diseases and may inform the development of novel diagnostic and therapeutic strategies. Further validation and functional studies are needed to translate these insights into clinical applications. | |
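
The evaluation scheme described in the Methods (select features, train a classifier, score it by AUC under five-fold cross-validation) can be illustrated with a minimal, dependency-free sketch. This is not the authors' pipeline: the data are synthetic, the "most variable genes" filter and the gradient-descent logistic regression are simplified stand-ins, and every function name and parameter below is hypothetical. Choosing features inside each training fold, as done here, is an assumption of the sketch rather than a detail stated in the abstract.

```python
import math
import random


def auc(scores, labels):
    """Area under the ROC curve: the fraction of (diseased, healthy) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_folds(labels, k, rng):
    """Split cell indices into k folds while keeping the class balance in each fold."""
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for position, i in enumerate(idx):
            folds[position % k].append(i)
    return folds


def top_variable_genes(X, rows, n_keep):
    """Indices of the n_keep highest-variance genes, computed on the given rows only."""
    def variance(g):
        values = [X[i][g] for i in rows]
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / len(values)
    return sorted(range(len(X[0])), key=variance, reverse=True)[:n_keep]


def predict(w, b, x):
    """Logistic-regression probability that a cell is diseased."""
    z = b + sum(wj * xj for wj, xj in zip(w, x))
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def fit_logreg(X, y, epochs=200, lr=0.1):
    """Unregularised logistic regression fitted by batch gradient descent."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, target in zip(X, y):
            err = predict(w, b, x) - target
            grad_b += err
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def cross_validated_auc(X, y, k=5, n_keep=10, seed=0):
    """Per-fold test AUCs; features are chosen on the training rows to avoid leakage."""
    folds = stratified_folds(y, k, random.Random(seed))
    aucs = []
    for test_rows in folds:
        held_out = set(test_rows)
        train_rows = [i for i in range(len(y)) if i not in held_out]
        genes = top_variable_genes(X, train_rows, n_keep)
        X_train = [[X[i][g] for g in genes] for i in train_rows]
        w, b = fit_logreg(X_train, [y[i] for i in train_rows])
        scores = [predict(w, b, [X[i][g] for g in genes]) for i in test_rows]
        aucs.append(auc(scores, [y[i] for i in test_rows]))
    return aucs


# Synthetic stand-in for an expression matrix: 200 cells x 30 genes,
# where the first 5 genes are shifted upwards in "diseased" cells (label 1).
data_rng = random.Random(42)
y = [i % 2 for i in range(200)]
X = [[data_rng.gauss(1.5 * label if g < 5 else 0.0, 1.0) for g in range(30)] for label in y]

fold_aucs = cross_validated_auc(X, y)
mean_auc = sum(fold_aucs) / len(fold_aucs)
print("per-fold test AUC:", [round(a, 3) for a in fold_aucs])
print("mean test AUC:", round(mean_auc, 3))
```

In practice, a library such as scikit-learn would replace the hand-written folds, classifier, and AUC, and the feature step could be swapped for PCA or a DE/WGCNA-derived gene list; the structure of the loop would stay the same.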