Show simple item record

dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Philippi, H.
dc.contributor.advisor: Zhang, Y.
dc.contributor.author: Cijvat, C.P.
dc.date.accessioned: 2014-02-18T18:00:35Z
dc.date.available: 2014-02-18T18:00:35Z
dc.date.issued: 2014
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/16150
dc.description.abstract: The bioinformatics field has faced a data deluge in recent years, driven by the increasing speed and decreasing cost of DNA sequencing technology. Today, sequencing a single genome takes only about a week and can produce up to a terabyte of data. The sequencing data are usually stored in files, and specialized tools have been designed to analyze and manage them. Despite these tools, bioinformaticians still face many data management hurdles when analyzing these files, which often makes tasks excessively time-consuming. In this thesis, we accurately map the needs of bioinformaticians by defining a set of use cases that reflect the everyday analyses applied to genetic data. We propose an approach based on a modern DBMS to analyze and manage genetic data file repositories, and we identify its pros and cons compared to the traditional file-based approach. Additionally, we experiment with a novel in-situ approach, in which the DBMS applies Just-In-Time ETL (Extract-Transform-Load) to the original files instead of loading all data from these files up front. A major advantage of this approach is that it greatly reduces the data-to-query time, since not all data are loaded into the DBMS during initialization. Other advantages include lower storage requirements and reduced data duplication. With this project, we have taken the first step towards adapting state-of-the-art database technology to accelerate genetic data analytics. The preliminary results presented in this thesis are highly promising and open up a plethora of new research opportunities.
dc.description.sponsorship: Utrecht University
dc.format.extent: 3376627
dc.format.mimetype: application/pdf
dc.language.iso: en_US
dc.title: Bridging the gap between Big Genome Data Analysis and Database Management Systems
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.keywords: Bioinformatics, DBMS, Database Management System, Genetic Data, BAM files, DNA Sequencing
dc.subject.courseuu: Computing Science

