ProbQL: A Probabilistic Query Language for Information Extraction from PDF Reports and Natural Language Written Texts
Summary
In recent years, Information Extraction (IE) has become an increasingly important field because of the vast amount of data being produced at an ever-increasing rate. However, it has been estimated that about 80-90% of the data produced by companies is unstructured, such as PDF documents and natural language texts, meaning that, unless it is refined, valuable information and interesting patterns can remain hidden. The difficulty with unstructured data is that extracting information from it is a hard task, because of its intrinsically uncertain nature.
This thesis proposes a novel programming language, ProbQL, a rule-based query language for writing queries that extract information from a PDF document or a text file. The main idea of the language is to split the given document (or text) into a list of items and then score each item with the probability that it is the correct data to extract. The score is based on rules defined by the end user, which describe properties that the desired data is expected to have. The language was tested on the task of extracting 5 types of data from PDF documents (medicine reimbursement reports), obtaining an average extraction accuracy of 86% (with a minimum of 78.3% and a maximum of 99.1%). To the best of our knowledge, ProbQL is the first probability-based language for extracting information from PDF documents and texts. Being probability-based allows it to deal with the uncertainty of a desired piece of data being located in different positions depending on the type of PDF document considered.
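The split-then-score idea can be illustrated with a minimal Python sketch. This is not ProbQL syntax and not the thesis implementation: the `Rule` class, the whitespace-based `split_into_items`, the two example rules, their confidence values, and the noisy-OR combination of rule confidences are all assumptions made purely for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Rule:
    """Hypothetical rule: a property the desired data is expected to have."""
    name: str
    predicate: Callable[[str], bool]
    confidence: float  # assumed value in [0, 1]


def split_into_items(text: str) -> List[str]:
    # Assumption: items are whitespace-separated tokens.
    return text.split()


def score(item: str, rules: List[Rule]) -> float:
    # Assumption: matching rules are combined with a noisy-OR,
    # i.e. 1 minus the product of (1 - confidence) over matching rules.
    miss = 1.0
    for rule in rules:
        if rule.predicate(item):
            miss *= 1.0 - rule.confidence
    return 1.0 - miss


def rank(text: str, rules: List[Rule]) -> List[Tuple[str, float]]:
    # Score every item and sort from most to least likely.
    scored = [(item, score(item, rules)) for item in split_into_items(text)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Illustrative rules for finding a monetary amount (hypothetical).
rules = [
    Rule("looks_like_amount",
         lambda s: re.fullmatch(r"\d+\.\d{2}", s) is not None, 0.6),
    Rule("has_digits", lambda s: any(c.isdigit() for c in s), 0.3),
]

text = "Reimbursement total 125.40 issued 2021"
ranked = rank(text, rules)
best_item, best_score = ranked[0]
print(best_item, round(best_score, 2))
```

In this sketch the token `125.40` satisfies both rules and therefore ranks above `2021`, which satisfies only one. A real query would operate on items extracted from a PDF layout rather than on plain tokens.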