
dc.rights.license CC-BY-NC-ND
dc.contributor.advisor Burriel Coll, V
dc.contributor.advisor Velegrakis, Y
dc.contributor.author Geffen, Y.M. van
dc.date.accessioned 2020-02-20T19:04:05Z
dc.date.available 2020-02-20T19:04:05Z
dc.date.issued 2019
dc.identifier.uri https://studenttheses.uu.nl/handle/20.500.12932/34905
dc.description.abstract This thesis is about data quality and automation of retrieval within the domain of genomic information systems. In recent years, large-scale genomic studies have become common due to lower cost and improved tools and software for analysis. With the relative ease of performing these studies, the pool of genomic research data has grown massively, to the point that information systems such as the GWAS Catalog and Ensembl are used to collect, manage, and distribute study results. Researchers and practitioners have to make sense of the data contained in these systems manually. This boils down to choosing which data is relevant to them and which is not, with the end goal of generating new knowledge. Apart from taking a lot of time, manual evaluation introduces errors. Automation is necessary to reduce errors and save valuable time. We explored the genetic information system domain using a bottom-up approach. The SILE method was used as a framework. The study focusses on the Identification step within this framework. An exploratory analysis was performed on the data contained in both the GWAS Catalog and the Ensembl genome browser. With the knowledge gained from this analysis, a solution is proposed to automate the selection process within these information systems. This solution involves a combined classification and regression model that ranks entries within the information system by relevance. We built these models by identifying relevant entries by hand and training the models on this manually created data set. The models were then able to identify relevant entries with high certainty in a previously unseen validation set. It is shown that an understanding of the domain with regard to data quality is key to developing automated solutions. Important factors here are the differences in entries between phenotypes and over time.
Another important factor to consider is the difference between theoretically ideal measures and the availability of these measures in practice. This study provides a basis for automating the retrieval of relevant entries within the genomic information system domain.
dc.description.sponsorship Utrecht University
dc.language.iso en
dc.title The Needle in a Haystack - How to find relevant information in Genomic Information Systems
dc.type.content Master Thesis
dc.rights.accessrights Open Access
dc.subject.keywords Genomic Information Systems; genetics; information; data quality; information quality; regression model; classification model; Ensembl; GWAS; GWAS Catalog; Genome-wide association study; DNA; RNA; synthesis
dc.subject.courseuu Business Informatics

