Finding Patterns in DNA Sequences for Prediction of Errors

Au Yeung, L.Y.

Publication date

2018

Summary

With DNA analysis it is possible to determine whether a suspect can be placed at a crime scene. In forensics the DNA sample obtained is often of low quality: it may come from a trace of saliva or a few small hairs left at the crime scene. DNA amplification is therefore necessary. After amplification the DNA can be sequenced, a process that determines the order of the DNA sequence, which is a string made up of As, Cs, Gs and Ts. These methods are, however, prone to errors, which makes it more difficult to analyze the DNA correctly. For instance, in a mixed sample containing DNA from more than one individual, it is harder to determine whether a sequence is erroneous or genuine, that is, belonging to an individual. We used sequential pattern mining to determine whether it is possible to predict which reads should be considered non-genuine sequences. This was done by taking segments of the DNA sequence before and after the position where an error was made. Sequential pattern mining is a data mining method for finding relevant patterns in a database of sequences. A pattern is considered relevant if it exceeds a user-defined threshold, which is based on the number of times the pattern appears in the sequences in the database. We found several patterns that were associated with a particular position in the DNA sequence, and this position showed a bias towards one type of error. Using a pattern together with its associated position, it might be possible to determine the number of non-genuine sequences in a new sample. However, many other factors might contribute to whether a pattern shows this type of error, for instance read orientation, error ratio and sequence length.
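The two steps described in the summary, taking segments around an error position and keeping patterns that meet a frequency threshold, can be illustrated with a minimal Python sketch. It is not the method used in the thesis. It simplifies sequential pattern mining to contiguous patterns of a fixed length, it assumes support is counted once per sequence and compared with "greater than or equal to", and the function names, the toy reads and the error positions are all hypothetical.

```python
from collections import Counter


def flanking_segments(read, error_pos, width):
    """Return up to `width` bases before and after a 0-based error position."""
    before = read[max(0, error_pos - width):error_pos]
    after = read[error_pos + 1:error_pos + 1 + width]
    return before, after


def frequent_patterns(sequences, length, min_support):
    """Find contiguous patterns of `length` bases that occur in at least
    `min_support` sequences (assumed convention: counted once per sequence)."""
    support = Counter()
    for seq in sequences:
        # A set ensures a pattern repeated within one sequence counts once.
        distinct = {seq[i:i + length] for i in range(len(seq) - length + 1)}
        support.update(distinct)
    return {p: n for p, n in support.items() if n >= min_support}


# Hypothetical toy data: (read, 0-based position of the error in that read).
reads = [("ACGTTAGCA", 4), ("TTCGTAAGC", 4), ("GACGTCAGT", 4)]

# Collect the segment preceding each error, then mine those segments.
befores = [flanking_segments(read, pos, 4)[0] for read, pos in reads]
patterns = frequent_patterns(befores, length=3, min_support=2)
print(patterns)
```

A full sequential pattern miner would also allow gaps between the elements of a pattern and patterns of varying length; the contiguous version is used here only to keep the thresholding step easy to follow.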

URI

https://studenttheses.uu.nl/handle/20.500.12932/29041

Collections

Theses