Unveiling Cryptic Regulatory Elements in 5’UTRs with DNABert-2: A Comparative Analysis with CNN Models

Babukhian, Miriam

dc.rights.license	CC-BY-NC-ND
dc.contributor.advisor	Kenna, Kevin
dc.contributor.author	Babukhian, Miriam
dc.date.accessioned	2025-02-27T00:02:10Z
dc.date.available	2025-02-27T00:02:10Z
dc.date.issued	2025
dc.identifier.uri	https://studenttheses.uu.nl/handle/20.500.12932/48558
dc.description.abstract	5’ Untranslated Regions (UTRs) play a pivotal role in the post-transcriptional regulation of gene expression; however, investigating these non-coding regions poses significant challenges. Deep learning methods can help unveil the complex nature of regulatory elements, including those of the 5’UTR. In this study, we propose an explainable deep-learning-based approach to uncover functionally significant features in 5’UTRs from existing long-read RNA-seq data. We trained two models, DNABert-2 and a Convolutional Neural Network (CNN), on two distinct classification tasks: discriminating brain-specific 5’UTRs from those of other tissues, and distinguishing general 5’UTRs from randomly selected genomic sequences. We attempted to interpret the models’ decision-making processes by performing in silico mutagenesis (ISM) and visualizing DNABert-2’s attention scores. Neither model was able to discriminate brain-specific 5’UTRs from 5’UTRs of other tissues, whereas both DNABert-2 and the CNN showed excellent performance in classifying general 5’UTRs against random genomic sequences. The CNN relied heavily on the GC content of the sequence to predict class probability, while DNABert-2 picked up more complex cues within the 5’UTR. We were able to partially interpret DNABert-2’s predictions and recover 5’UTR motifs with known function; however, further model training and interpretation are needed to support the validity of our findings.
dc.description.sponsorship	Utrecht University
dc.language.iso	EN
dc.subject	The aim of this study was to unveil hidden regulatory elements within the 5’ Untranslated Region (UTR). To this end, we proposed an explainable deep-learning-based approach to uncover functionally significant features in 5’UTRs from existing long-read RNA-seq data. Two models, DNABert-2 and a Convolutional Neural Network (CNN), were used in the analysis.
dc.title	Unveiling Cryptic Regulatory Elements in 5’UTRs with DNABert-2: A Comparative Analysis with CNN Models
dc.type.content	Master Thesis
dc.rights.accessrights	Open Access
dc.subject.keywords	5’Untranslated Region; DNABert-2; Convolutional Neural Networks; Explainable AI
dc.subject.courseuu	Bioinformatics and Biocomplexity
dc.thesis.id	29560

Files in this item

Name: Babukhian-M-report_major_402.pdf
Size: 2.683Mb
Format: PDF

This item appears in the following Collection(s)

Theses