dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Velegrakis, Ioannis
dc.contributor.author: Grandi, Daniele Di
dc.date.accessioned: 2022-09-14T00:00:35Z
dc.date.available: 2022-09-14T00:00:35Z
dc.date.issued: 2022
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/42771
dc.description.abstract: In recent years, Information Extraction (IE) has become an increasingly important field due to the vast amount of data being produced at an ever-increasing rate. However, an estimated 80-90% of the data produced by companies is unstructured - such as PDF documents and natural language texts - meaning that, unless refined, valuable information and interesting patterns can remain hidden. Extracting information from unstructured data is difficult because of its intrinsic uncertainty. This thesis proposes a novel programming language - ProbQL - a rule-based query language for writing queries that extract information from a PDF document or a text file. The main idea of the language is to split the given document (or text) into a list of items and then score each item with the probability that it is the correct data to extract, based on rules defined by the end user that describe properties the desired data is expected to have. The language has been tested on the task of extracting 5 types of data from PDF documents - medicine reimbursement reports - obtaining an average extraction accuracy of 86% (with a minimum of 78.3% and a maximum of 99.1%). To the best of our knowledge, ProbQL is the first probability-based language for extracting information from PDF documents and texts, which allows it to deal with the uncertainty of a desired piece of data being located in different positions depending on the type of PDF document considered.
dc.description.sponsorship: Utrecht University
dc.language.iso: EN
dc.subject: This thesis proposes a novel programming language - ProbQL - a rule-based query language for writing queries that extract information from a PDF document or a text file. The main idea of the language is to split the given document into a list of items and then score each item with the probability that it is the correct data to extract, based on rules defined by the end user that describe properties the desired data is expected to have.
dc.title: ProbQL: A Probabilistic Query Language for Information Extraction from PDF Reports and Natural Language Written Texts
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.keywords: Information extraction; PDF data; Text extraction; Query language; Probability language
dc.subject.courseuu: Computing Science
dc.thesis.id: 10594
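
The split-then-score idea described in the abstract can be illustrated with a minimal Python sketch. This is not ProbQL syntax and not the thesis implementation: the names `split_items`, `score`, `extract`, and `WeightedRule`, the whitespace tokenisation, the example rules, and the weighted-average combination of rule outputs are all assumptions made for illustration only.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A rule maps a candidate item to a degree of belief in [0, 1].
Rule = Callable[[str], float]


@dataclass
class WeightedRule:
    """A user-defined rule with a relative weight (hypothetical helper)."""
    rule: Rule
    weight: float = 1.0


def split_items(text: str) -> List[str]:
    """Split the text into candidate items; here simply whitespace tokens."""
    return text.split()


def score(item: str, rules: List[WeightedRule]) -> float:
    """Combine rule outputs with a weighted average (an assumed scheme)."""
    total_weight = sum(r.weight for r in rules)
    if total_weight == 0:
        return 0.0
    return sum(r.weight * r.rule(item) for r in rules) / total_weight


def extract(text: str, rules: List[WeightedRule]) -> Tuple[str, float]:
    """Return the highest-scoring item together with its score."""
    items = split_items(text)
    if not items:
        raise ValueError("no items to score")
    scored = [(item, score(item, rules)) for item in items]
    return max(scored, key=lambda pair: pair[1])


# Example rules for a monetary amount (illustrative only).
def looks_like_decimal(item: str) -> float:
    return 1.0 if re.fullmatch(r"\d+[.,]\d{2}", item) else 0.0


def contains_digit(item: str) -> float:
    return 1.0 if re.search(r"\d", item) else 0.0


rules = [WeightedRule(looks_like_decimal, weight=2.0),
         WeightedRule(contains_digit, weight=1.0)]
best_item, best_score = extract("Invoice 2022 total 123.45 EUR", rules)
print(best_item, best_score)
```

In this sketch "2022" satisfies only the weaker rule, while "123.45" satisfies both and is therefore ranked first; a real system would split a PDF into layout-aware items rather than whitespace tokens.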

