Feature selection for biomarker discovery
Summary
This research is a comparative study of feature selection methods for biomarker
discovery. Ten machine learning techniques were considered for feature selection.
The main assumption behind the research was that certain biomarkers reflect the
perceived strenuousness of different exercise levels. Perceived exercise
intensity was measured with the Borg scale.
Taking the 10 most expressive biomarkers selected by each model yielded 39
distinct biomarkers out of the total of 64. The most frequently selected was
"factord", chosen by 7 models. Biomarkers "trp" and "CORT" were each selected
by 6 of the models, and "ifabp", "LEUCO" and "BICARB" by 5 of the models.
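
The tally described above can be sketched as follows. This is a minimal illustration, not the study's code: the model names and the short top-3 lists are hypothetical placeholders (the study used top-10 lists from its own models), and only the counting logic is the point.

```python
from collections import Counter


def selection_frequency(top_lists):
    """Count how many models selected each biomarker.

    top_lists maps a model name to the list of biomarkers it ranked highest.
    """
    counts = Counter()
    for selected in top_lists.values():
        # set() ensures each model contributes at most one vote per biomarker
        counts.update(set(selected))
    return counts


# Hypothetical top-3 lists for three placeholder models (NOT the study's results)
top_lists = {
    "model_a": ["factord", "trp", "CORT"],
    "model_b": ["factord", "ifabp", "trp"],
    "model_c": ["factord", "LEUCO", "BICARB"],
}

freq = selection_frequency(top_lists)

# Number of distinct biomarkers across all lists (the study reports 39 of 64)
n_distinct = len(freq)

# Biomarkers ordered by how many models chose them, ties broken by name
ranked = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
```

Counting one vote per model, rather than summing ranks or importances, keeps the tally comparable across techniques whose importance scores are on different scales.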
In general, the predictive power of the applied machine learning techniques does
not vary much. The highest accuracy, 78%, was achieved by logistic regression.
In terms of the area under the ROC curve, the best result was achieved by the
full logistic regression model, with an AUC of 0.72.
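
For readers less familiar with the two metrics, the sketch below shows how they differ. It assumes a binary outcome coded 0/1 (an assumption; the summary does not state how the Borg scale was encoded) and uses made-up labels and scores. Accuracy depends on the decision threshold, while AUC is the fraction of positive/negative pairs that the scores rank correctly, which is why the two metrics need not agree on which model is best.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a random positive case receives a higher
    score than a random negative case; ties count as half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities (NOT the study's data)
y_true = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.4, 0.6, 0.7, 0.3, 0.2]

# Accuracy requires a threshold; 0.5 is an illustrative choice
y_pred = [1 if s >= 0.5 else 0 for s in scores]

acc = accuracy(y_true, y_pred)
auc = roc_auc(y_true, scores)
```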
With feature selection, however, better performance can be achieved than with
the models that use all the predictors. Recursive feature elimination on the
random forest model yielded 81% accuracy, and the Lasso on logistic regression
yielded an even higher 84% accuracy.
All in all, considering the criteria for selecting candidate models, logistic
regression offers a balanced mix of model performance and interpretability.