dc.rights.license: CC-BY-NC-ND
dc.contributor: Wang Y., Kenna K.
dc.contributor.advisor: Kenna, Kevin
dc.contributor.author: Buzatu, Rafaella
dc.date.accessioned: 2024-02-01T01:01:01Z
dc.date.available: 2024-02-01T01:01:01Z
dc.date.issued: 2024
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/45863
dc.description.abstract: Understanding the gene transcription rules present in non-coding DNA is essential for unraveling the genetic code that establishes cellular fate. In this study, we aim to narrow down the regulatory regions and motifs within the central nervous system (CNS) that determine cell specificity. While ATAC-seq data has proven effective for defining relevant regions of open chromatin, further analysis is required to obtain insights into specific regulatory elements. To that end, we propose a strategy involving natural language processing techniques to identify DNA transcription factor (TF) binding sites relevant to each cell type. We employ topic modelling for co-clustering of ATAC-seq peak sequences and cell types; as a result, we can retrieve ‘topics’ consisting of functionally related non-coding DNA regions that provide a starting point for further analysis and identification of cell-specific feature combinations. Furthermore, we fine-tune a BigBird language model, pre-trained on the human genome, to distinguish between GABAergic, glutamatergic, and non-neuronal cells. The Byte-Pair Encoding tokenization method allows us to extract the DNA motifs most important for the class predictions, as well as their corresponding attention scores, which can be mapped back to the peak sequences to identify TF binding sites. We show that this method allows identification of known regulatory elements and propose new strategies to extract more meaningful and specific information from the language models.
dc.description.sponsorship: Utrecht University
dc.language.iso: EN
dc.subject: This paper explores the use of topic modelling and transfer learning with large DNA language models to identify CNS cell type-specific features of non-coding DNA.
dc.title: Natural language processing strategies for discovery of cell type-specific DNA regulatory elements
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.courseuu: Bioinformatics and Biocomplexity
dc.thesis.id: 21541

