Natural language processing strategies for discovery of cell type-specific DNA regulatory elements

Buzatu, Rafaella

dc.rights.license	CC-BY-NC-ND
dc.contributor	Wang Y., Kenna K.
dc.contributor.advisor	Kenna, Kevin
dc.contributor.author	Buzatu, Rafaella
dc.date.accessioned	2024-02-01T01:01:01Z
dc.date.available	2024-02-01T01:01:01Z
dc.date.issued	2024
dc.identifier.uri	https://studenttheses.uu.nl/handle/20.500.12932/45863
dc.description.abstract	Understanding the gene transcription rules present in non-coding DNA is essential for unraveling the genetic code that establishes cellular fate. In this study, we aim to narrow down the regulatory regions and motifs within the central nervous system (CNS) that determine cell specificity. While ATAC-seq data has proven effective for defining relevant regions of open chromatin, further analysis is required to gain insight into specific regulatory elements. To that end, we propose a strategy that uses natural language processing techniques to identify DNA transcription factor (TF) binding sites relevant to each cell type. We employ topic modelling to co-cluster ATAC-seq peak sequences and cell types; as a result, we can retrieve ‘topics’ consisting of functionally related non-coding DNA regions that provide a starting point for further analysis and identification of cell-specific feature combinations. Furthermore, we fine-tune a BigBird language model, pre-trained on the human genome, to distinguish between GABAergic, glutamatergic, and non-neuronal cells. The Byte-Pair Encoding tokenization method allows us to extract the DNA motifs most important for the class predictions, together with their corresponding attention scores, which can be mapped back to the peak sequences to identify TF binding sites. We show that this method allows identification of known regulatory elements and propose new strategies to extract more meaningful and specific information from the language models.
dc.description.sponsorship	Utrecht University
dc.language.iso	EN
dc.subject	This paper explores the use of topic modelling and transfer learning with large DNA language models to identify CNS cell type-specific features of non-coding DNA.
dc.title	Natural language processing strategies for discovery of cell type-specific DNA regulatory elements
dc.type.content	Master Thesis
dc.rights.accessrights	Open Access
dc.subject.courseuu	Bioinformatics and Biocomplexity
dc.thesis.id	21541

Files in this item

Name: Final_Report_Rafaella.pdf
Size: 1.284Mb
Format: PDF

This item appears in the following Collection(s)

Theses
