        Natural language processing strategies for discovery of cell type-specific DNA regulatory elements

        View/Open
        Final_Report_Rafaella.pdf (1.284Mb)
        Publication date
        2024
        Author
        Buzatu, Rafaella
        Summary
        Understanding the gene transcription rules encoded in non-coding DNA is essential for unraveling the genetic code that establishes cellular fate. In this study, we aim to narrow down the regulatory regions and motifs within the central nervous system (CNS) that determine cell specificity. While ATAC-seq data has proven effective for defining relevant regions of open chromatin, further analysis is required to gain insight into specific regulatory elements. To that end, we propose a strategy that uses natural language processing techniques to identify DNA transcription factor (TF) binding sites relevant to each cell type. We employ topic modelling to co-cluster ATAC-seq peak sequences and cell types; as a result, we can retrieve ‘topics’ consisting of functionally related non-coding DNA regions that provide a starting point for further analysis and for identifying cell-specific feature combinations. Furthermore, we fine-tune a BigBird language model, pre-trained on the human genome, to distinguish between GABAergic, glutamatergic, and non-neuronal cells. The Byte-Pair Encoding tokenization method allows us to extract the DNA motifs most important for the class predictions, together with their corresponding attention scores, which can be mapped back to the peak sequences to identify TF binding sites. We show that this method identifies known regulatory elements, and we propose new strategies to extract more meaningful and specific information from the language models.
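The step of mapping token-level attention scores back to peak sequences can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names `token_scores_to_positions` and `top_window`, the example tokens, and the scores are all hypothetical, and the sketch assumes that the tokens are contiguous, non-overlapping substrings that together cover the peak sequence, with one already-aggregated attention score per token.

```python
from typing import List, Tuple


def token_scores_to_positions(tokens: List[str], scores: List[float]) -> List[float]:
    """Spread each token's score over the bases that the token covers.

    Assumption: tokens are contiguous substrings of the peak sequence,
    in order, with no special tokens mixed in.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    per_base: List[float] = []
    for token, score in zip(tokens, scores):
        per_base.extend([score] * len(token))
    return per_base


def top_window(per_base: List[float], width: int) -> Tuple[int, int, float]:
    """Return (start, end, mean score) of the highest-scoring window.

    Uses a sliding sum; on ties the leftmost window is kept.
    """
    if width <= 0 or width > len(per_base):
        raise ValueError("width must be between 1 and the sequence length")
    window_sum = sum(per_base[:width])
    best_sum, best_start = window_sum, 0
    for start in range(1, len(per_base) - width + 1):
        # Slide the window one base to the right.
        window_sum += per_base[start + width - 1] - per_base[start - 1]
        if window_sum > best_sum:
            best_sum, best_start = window_sum, start
    return best_start, best_start + width, best_sum / width


# Hypothetical example: made-up tokens and scores, not data from the thesis.
tokens = ["ACG", "TGACTCA", "GG", "TTTA"]
scores = [0.05, 0.60, 0.10, 0.25]
sequence = "".join(tokens)

per_base = token_scores_to_positions(tokens, scores)
start, end, mean_score = top_window(per_base, width=7)
candidate_site = sequence[start:end]  # highest-attention 7-bp window
```

Because the per-base scores use the same coordinates as the peak sequence, a window found this way could then be compared against known TF motif databases. How attention is aggregated across heads and layers is left open here and would need to follow the thesis itself.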
        URI
        https://studenttheses.uu.nl/handle/20.500.12932/45863
        Collections
        • Theses