Unveiling Cryptic Regulatory Elements in 5’UTRs with DNABert-2: A Comparative Analysis with CNN Models
Summary
5’ Untranslated Regions (UTRs) play a pivotal role in the post-transcriptional regulation of gene expression; however, investigating these non-coding regions poses significant challenges. Deep learning methods can help unveil the complex nature of regulatory elements, including those of the 5’UTR. In this study, we propose an explainable deep-learning-based approach to uncover functionally significant features in 5’UTRs from existing long-read RNA-seq data. We trained two models, DNABert-2 and a Convolutional Neural Network (CNN), on two distinct classification tasks: discriminating brain-specific 5’UTRs from those of other tissues, and distinguishing general 5’UTRs from randomly selected genomic sequences. We attempted to interpret the models’ decision-making process by performing in silico mutagenesis (ISM) and visualizing DNABert-2’s attention scores. Our results showed that neither model was able to successfully discriminate brain-specific 5’UTRs from 5’UTRs of other tissues, while both DNABert-2 and the CNN showed excellent performance in correctly classifying general 5’UTRs against random genomic sequences. The CNN relied heavily on the GC-content of the sequence to predict class probability, whereas DNABert-2 was able to pick up more complex cues within the 5’UTR. We were eventually able to partially interpret DNABert-2’s predictions and uncover functionally known 5’UTR motifs; however, further model training and interpretation are needed to support the validity of our findings.
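The in silico mutagenesis (ISM) procedure mentioned above can be sketched as follows: every position of a sequence is substituted with each alternative base, the mutated sequence is re-scored by the trained model, and the change in predicted class probability is recorded. This is a minimal illustrative sketch, not the study's actual implementation; the `score` function below is a hypothetical stand-in (here simply GC-content, mimicking the signal the CNN relied on) for a real model such as DNABert-2 or the CNN.

```python
BASES = "ACGT"


def score(seq: str) -> float:
    # Hypothetical placeholder scorer: fraction of G/C bases. In practice this
    # would be the trained model's predicted class probability for `seq`.
    return sum(base in "GC" for base in seq) / len(seq)


def ism(seq: str, score_fn=score) -> list[list[float]]:
    """Return a (positions x 4) matrix of score deltas, one column per base
    in BASES, where each entry is score(mutated) - score(reference)."""
    ref_score = score_fn(seq)
    deltas = []
    for i, ref_base in enumerate(seq):
        row = []
        for alt in BASES:
            if alt == ref_base:
                row.append(0.0)  # reference base: no change by definition
            else:
                mutated = seq[:i] + alt + seq[i + 1:]
                row.append(score_fn(mutated) - ref_score)
        deltas.append(row)
    return deltas
```

Positions whose substitutions produce large score deltas are interpreted as important for the model's decision; with a GC-content scorer, every G/C-to-A/T substitution lowers the score by the same amount, which is exactly the kind of uniform signal that suggests a model is keying on base composition rather than motifs.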