Bridging the gap between Big Genome Data Analysis and Database Management Systems

Cijvat, C.P.

dc.rights.license	CC-BY-NC-ND
dc.contributor.advisor	Philippi, H.
dc.contributor.advisor	Zhang, Y.
dc.contributor.author	Cijvat, C.P.
dc.date.accessioned	2014-02-18T18:00:35Z
dc.date.available	2014-02-18T18:00:35Z
dc.date.issued	2014
dc.identifier.uri	https://studenttheses.uu.nl/handle/20.500.12932/16150
dc.description.abstract	The bioinformatics field has encountered a data deluge in recent years, due to the increasing speed and decreasing cost of DNA sequencing technology. Today, sequencing the DNA of a single genome takes only about a week, and it can result in up to a terabyte of data. The sequencing data are usually stored in files, and specialized tools have been designed to analyze and manage them. Despite these tools, bioinformaticians still face many data management hurdles when analyzing these files, which often makes tasks excessively time-consuming. In this thesis, we map the needs of bioinformaticians by defining a set of use cases that reflect the everyday analysis applied to genetic data. We propose an approach based on a modern DBMS to analyze and manage genetic data file repositories, and we identify the pros and cons of this method compared to the traditional file-based approach. Additionally, we experimented with a novel in-situ approach, where the DBMS applies Just-In-Time ETL (Extract-Transform-Load) to the original files instead of loading all data from these files up front. A major advantage of this approach is that it greatly reduces the data-to-query time, since not all data are loaded into the DBMS during initialization. Other advantages include lower storage requirements and reduced data duplication. With this project, we have taken a first step towards adapting state-of-the-art database technology to accelerate genetic data analytics. The preliminary results presented in this thesis are highly promising and open up many new research opportunities.
dc.description.sponsorship	Utrecht University
dc.format.extent	3376627
dc.format.mimetype	application/pdf
dc.language.iso	en_US
dc.title	Bridging the gap between Big Genome Data Analysis and Database Management Systems
dc.type.content	Master Thesis
dc.rights.accessrights	Open Access
dc.subject.keywords	Bioinformatics, DBMS, Database Management System, Genetic Data, BAM files, DNA Sequencing
dc.subject.courseuu	Computing Science

Files in this item

Name: thesis.pdf
Size: 3.220 MB
Format: PDF


This item appears in the following Collection(s)

Theses
