        Unveiling Cryptic Regulatory Elements in 5’UTRs with DNABert-2: A Comparative Analysis with CNN Models

        View/Open
        Babukhian-M-report_major_402.pdf (2.683Mb)
        Publication date
        2025
        Author
        Babukhian, Miriam
        Summary
        5’ untranslated regions (UTRs) play a pivotal role in the post-transcriptional regulation of gene expression, but investigating these non-coding regions poses significant challenges. Deep learning methods can help unveil the complex nature of regulatory elements, including those of the 5’UTR. In this study, we propose an explainable deep-learning-based approach to uncover functionally significant features in 5’UTRs from existing long-read RNA-seq data. We trained two models, DNABert-2 and a convolutional neural network (CNN), on two distinct classification tasks: discriminating brain-specific 5’UTRs from those of other tissues, and distinguishing general 5’UTRs from randomly selected genomic sequences. We attempted to interpret the models’ decision-making by performing in silico mutagenesis (ISM) and visualizing DNABert-2’s attention scores. Neither model was able to discriminate brain-specific 5’UTRs from 5’UTRs of other tissues, whereas both DNABert-2 and the CNN showed excellent performance in classifying general 5’UTRs against random genomic sequences. The CNN relied heavily on the GC content of the sequence to predict class probability, while DNABert-2 picked up more complex cues within the 5’UTR. We were ultimately able to partially interpret DNABert-2’s predictions and recover functionally known 5’UTR motifs; however, further model training and interpretation are needed to support the validity of our findings.
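The in silico mutagenesis (ISM) step mentioned in the summary can be illustrated with a minimal sketch. This is not the thesis code: the function names `gc_content` and `in_silico_mutagenesis` are hypothetical, and a simple GC-fraction scorer stands in for a trained classifier purely so the example runs on its own. In practice, `score_fn` would presumably wrap a model's predicted class probability.

```python
from typing import Callable, Dict, List

BASES = "ACGT"


def gc_content(seq: str) -> float:
    """Toy stand-in for a trained model (assumption): fraction of G/C bases."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def in_silico_mutagenesis(
    seq: str, score_fn: Callable[[str], float]
) -> List[Dict[str, float]]:
    """For each position, substitute every alternative base and record
    the change in score relative to the unmutated sequence."""
    reference = score_fn(seq)
    effects: List[Dict[str, float]] = []
    for i, original in enumerate(seq):
        row: Dict[str, float] = {}
        for alt in BASES:
            if alt == original:
                continue  # skip the reference base itself
            mutant = seq[:i] + alt + seq[i + 1:]
            row[alt] = score_fn(mutant) - reference
        effects.append(row)
    return effects


# One dict per position, keyed by the substituted base.
effects = in_silico_mutagenesis("GCAT", gc_content)
for position, row in enumerate(effects):
    print(position, row)
```

Positions where substitutions produce large score changes are candidates for functionally relevant bases. With a GC-fraction scorer, every G/C to A/T substitution lowers the score by the same amount, which mirrors the kind of GC-content dependence the summary reports for the CNN.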
        URI
        https://studenttheses.uu.nl/handle/20.500.12932/48558
        Collections
        • Theses