Feature selection for biomarker discovery
This research is a comparative study of feature selection methods for biomarker discovery. Ten machine learning techniques were considered for feature selection. The main assumption behind the research was that certain biomarkers reflect the perceived strenuousness of different exercise levels. Perceived exercise intensity was measured using the Borg scale. Taking the 10 most expressive biomarkers selected by each model, 39 distinct biomarkers were selected out of the total of 64. The most frequently selected biomarker was "factord", chosen by 7 models. The biomarkers "trp" and "CORT" were each selected by 6 of the models, and "ifabp", "LEUCO" and "BICARB" were each selected by 5. In general, the predictive power of the applied machine learning techniques does not vary much. The highest accuracy, 78%, was achieved by logistic regression. In terms of the area under the ROC curve, the best result was achieved by the full logistic regression model, with an AUC of 0.72. Applying feature selection, however, gives better performance than the models that use all the predictors. Recursive feature elimination on the random forest model yielded 81% accuracy, and the Lasso on logistic regression yielded an even higher 84% accuracy. All in all, considering the criteria for selecting candidate models, logistic regression offers a balanced mix of model performance and interpretability.
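The tally described above (take each model's top-ranked biomarkers, then count how many models selected each one) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the study's actual pipeline: the helper names `top_k` and `selection_frequency`, the model and marker names, and all importance scores are hypothetical toy values, and ranking by absolute score is an assumed convention.

```python
from collections import Counter


def top_k(scores, k):
    """Return the k feature names with the largest absolute importance score."""
    ranked = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
    return ranked[:k]


def selection_frequency(model_scores, k):
    """Count, for each feature, how many models place it in their top k."""
    counts = Counter()
    for scores in model_scores.values():
        counts.update(top_k(scores, k))
    return counts


# Toy, made-up importance scores (NOT results from the study).
# Model and marker names are hypothetical placeholders.
toy_scores = {
    "model_1": {"marker_a": 0.9, "marker_b": 0.5, "marker_c": 0.1, "marker_d": 0.3},
    "model_2": {"marker_a": -0.7, "marker_b": 0.2, "marker_c": 0.6, "marker_d": 0.1},
    "model_3": {"marker_a": 0.4, "marker_b": 0.8, "marker_c": 0.05, "marker_d": 0.3},
}

frequency = selection_frequency(toy_scores, k=2)
for name, n_models in frequency.most_common():
    print(f"{name}: selected by {n_models} of {len(toy_scores)} models")
```

With 10 models and `k=10`, the number of keys in the resulting counter corresponds to the count of distinct selected biomarkers, and the largest counts identify the markers that most models agree on.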