Deep neural networks trained on DNA sequences to identify mutations that lead to Amyotrophic Lateral Sclerosis (ALS)

Josyula, A.V.

dc.rights.license	CC-BY-NC-ND
dc.contributor.advisor	Dragan Doder, Albert Ali Salah, K.P. Kenna
dc.contributor.author	Josyula, A.V.
dc.date.accessioned	2021-05-19T18:00:12Z
dc.date.available	2021-05-19T18:00:12Z
dc.date.issued	2021
dc.identifier.uri	https://studenttheses.uu.nl/handle/20.500.12932/39427
dc.description.abstract	Amyotrophic lateral sclerosis (ALS) is a terminal disease whose onset may largely be determined by mutations in the non-coding region of deoxyribonucleic acid (DNA). These mutations disrupt transcription factors, resulting in aberrant regulation of gene expression within motor neurons and, ultimately, neuromuscular degeneration. At the root of such mutations lie motifs (Lanchatin, 2017): short, conserved sub-sequences of DNA in which mutations play a key role in regulating transcription. In this project, we build a box of motifs using deep learning that can identify active DNA sequences containing damage-causing mutations. I build two deep learning networks: (1) a convolutional neural network (CNN) and (2) a hybrid model that combines a CNN with a long short-term memory (LSTM) network. These architectures are trained on active motif reference sequences and inactive reference DNA sequences from the non-coding region of the human reference genome. To determine how well the deep learning models identify mutations, I first train both architectures on blood, where mutations are known to affect transcription. Next, we zoom in on the same region and train the models on regions around the transcription start site (TSS). These are the regions where mutations typically have the strongest effect, since they are the sites where transcription is initiated. To evaluate model performance, I use a test set that comprises two sources. (1) Genotype-Tissue Expression (GTEx) data contain motifs that could affect transcription as observed in people. Transcription is the process in which DNA is copied into RNA, which is then translated into protein; disrupting transcription therefore leads to aberrant protein synthesis.
These motifs are derived using a traditional standard framework, expression quantitative trait loci (eQTL), which comprise a list of effects of certain mutations across species that are known to affect single-cell gene expression. (2) Project MinE data consist of motifs observed in patients and controls, in which some mutations may disrupt transcription, leading to aberrant protein regulation and ultimately to ALS. Testing both model architectures on GTEx and MinE indicates how reliably deep neural networks identify motif mutations that are likely to disrupt transcription. Having determined the performance of the deep learning models on blood, we then assess how well the models identify ALS mutations by training them on non-coding DNA sequences from lower motor neurons that are intrinsic to complex neuropsychiatric diseases, and testing them on Project MinE data. Although previous deep learning models trained on motifs (Yue & Wang, 2018; Beer & Tavazoie, 2004; Alipanahi et al., 2015; Salekin & Zhang, 2017) show some success in predicting significant mutations that affect gene expression, we see in this project that both models underperform in predicting significant mutations on the imbalanced GTEx and MinE datasets. The CNN model trained on blood has an average area under the curve (AUC) of 0.42, and the hybrid model trained on blood has an average AUC of 0.41. Similarly, the F1 score of the CNN trained on blood is 0.07, and the F1 score of the hybrid model trained on blood is also 0.07. These low AUC and F1 values show that both models underperform. The CNN and hybrid models trained on lower motor neuron data predict 12.39% and 6.40%, respectively, of the active mutations in Project MinE.
dc.description.sponsorship	Utrecht University
dc.format.extent	1148410
dc.format.mimetype	application/pdf
dc.language.iso	en
dc.title	Deep neural networks trained on DNA sequences to identify mutations that lead to Amyotrophic Lateral Sclerosis (ALS)
dc.type.content	Master Thesis
dc.rights.accessrights	Open Access
dc.subject.keywords	Deep neural networks, CNN, CNN+LSTM, Blood lymphoblastoid, transcription start site, Lower motor neuron, DNA, transcription, ALS, motif, motif mutations
dc.subject.courseuu	Artificial Intelligence
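
The abstract reports AUC and F1 on imbalanced test sets. As a reading aid, the following is a minimal sketch of how these two metrics can be computed for binary labels. It is not the thesis's evaluation code: the function names `f1_score` and `roc_auc`, the 0.5 decision threshold, and the toy labels and scores are all hypothetical and chosen only for illustration.

```python
def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are both 0 (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive example receives the higher score; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 2 positives among 10 examples (imbalanced).
toy_labels = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
toy_scores = [0.2, 0.9, 0.4, 0.1, 0.6, 0.3, 0.7, 0.5, 0.8, 0.35]
# Assumed decision threshold of 0.5 to turn scores into class predictions.
toy_preds = [1 if s >= 0.5 else 0 for s in toy_scores]

print("AUC:", roc_auc(toy_labels, toy_scores))
print("F1: ", f1_score(toy_labels, toy_preds))
```

An AUC of 0.5 corresponds to random ranking, so values below 0.5, such as the 0.42 and 0.41 reported above, mean that positives are on average ranked below negatives. F1 ignores true negatives, which is why it can stay close to zero on an imbalanced test set even when most examples are classified correctly.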
