dc.description.abstract | Enhancers are cis-regulatory elements that play a central role in transcriptional gene regulation and cell-type-specific gene expression. More than 95% of the variants associated with common diseases in genome-wide association studies (GWAS) lie in non-coding regions, most of which are thought to be enhancers. Enhancer variants are causally implicated in rheumatoid arthritis, Alzheimer's disease, diabetes, and cancer. A genome-wide catalog of enhancer elements is therefore crucial for understanding health and disease and for designing targeted therapies. Although markers of enhancer identity such as P300 binding, histone modifications, and sequence motifs have been discovered, no known combination of markers unequivocally identifies enhancers. The problem is made harder by the extreme cell-type specificity of enhancers, exemplified by the mere ~6% overlap between the enhancers of different cell types in the VISTA database. Computational methods are needed to overcome these issues and build enhancer catalogs. Over the years, methods have evolved from unsupervised approaches to classical machine learning, deep learning, and now large language model-based approaches, in which models are pre-trained on massive amounts of biological sequence data and then fine-tuned for enhancer prediction. Performance has gradually improved, but many challenges remain: high-quality training data of validated enhancers are sparse; cell-type-specific prediction is in its infancy even though enhancers are highly cell-type specific; experimental validation is complicated and forms a major bottleneck; and methods report different metrics and thresholds and use different validation datasets, preventing fair comparison and hindering methodological progress. Other reviews of computational enhancer prediction have focused chiefly on specific subsets of these methods or on current developments.
Here, I take a different perspective and review the chronological evolution of enhancer prediction methods. I show that deep learning-based classifiers have improved predictive performance and that newer generative and pre-trained methods hold great promise, but that the field is limited by i) the lack of standardization in reporting model performance, ii) the lack of experimental validation of genome-wide predictions, and iii) a narrow focus on only one or a few cell types, which runs counter to enhancer biology. Finally, I discuss future perspectives informed by the rise of multi-modal foundation models and generative models in the broader ML field. | |