ProbQL: A Probabilistic Query Language for Information Extraction from PDF Reports and Natural Language Written Texts

Grandi, Daniele Di

Publication date

2022

Summary

In recent years, Information Extraction (IE) has become an increasingly important field because of the vast amount of data being produced at an ever-increasing rate. It has been estimated, however, that about 80-90% of the data produced by companies is unstructured, such as PDF documents and natural language texts, which means that, unless it is refined, valuable information and interesting patterns can remain hidden. Extracting information from unstructured data is difficult because of its intrinsically uncertain nature. This thesis proposes ProbQL, a novel rule-based query language for writing queries that extract information from a PDF document or a text file. The main idea of the language is to split the given document (or text) into a list of items and then assign each item a probability of being the correct data to extract. The probability is based on rules, defined by the end user, that describe properties the desired data is expected to have. The language was tested on the task of extracting 5 types of data from PDF documents (medicine reimbursement reports), obtaining an average extraction accuracy of 86% (minimum 78.3%, maximum 99.1%). To the best of our knowledge, ProbQL is the first probability-based language for extracting information from PDF documents and texts. This makes it well suited to handling the uncertainty that arises when a desired piece of data appears in different positions depending on the type of PDF document considered.
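The split-then-score idea described in the summary can be illustrated with a minimal Python sketch. This is not ProbQL syntax and not the thesis implementation: the whitespace splitting, the weighted rules, the normalization step, and every name used (Rule, split_into_items, score_items, best_item, EXAMPLE_RULES) are assumptions made purely for illustration.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical rule representation (an assumption, not ProbQL syntax):
# a predicate over an item plus a weight expressing how much the user
# trusts that property.
Rule = Tuple[Callable[[str], bool], float]


def split_into_items(text: str) -> List[str]:
    """Split the text into items. Whitespace tokens are an assumed granularity."""
    return text.split()


def score_items(items: List[str], rules: List[Rule]) -> List[Tuple[str, float]]:
    """Score each item with a value in [0, 1].

    Assumed scheme: sum the weights of the rules an item satisfies, then
    normalize so that the scores of all items sum to 1.
    """
    raw = [sum(weight for pred, weight in rules if pred(item)) for item in items]
    total = sum(raw)
    if total == 0:
        # No item satisfies any rule: every item gets probability 0.
        return [(item, 0.0) for item in items]
    return [(item, r / total) for item, r in zip(items, raw)]


def best_item(text: str, rules: List[Rule]) -> Tuple[str, float]:
    """Return the highest-scoring item together with its score."""
    scored = score_items(split_into_items(text), rules)
    return max(scored, key=lambda pair: pair[1])


# Illustrative rules for "find the amount": these are invented examples.
EXAMPLE_RULES: List[Rule] = [
    (lambda s: re.fullmatch(r"\d+\.\d{2}", s) is not None, 2.0),  # looks like 125.50
    (lambda s: any(ch.isdigit() for ch in s), 1.0),               # contains a digit
]

if __name__ == "__main__":
    print(best_item("Invoice 2022-03-14 total 125.50 EUR", EXAMPLE_RULES))
```

With the example rules above, the item "125.50" satisfies both rules and the date satisfies only one, so the amount receives the highest normalized score (0.75 against 0.25). How ProbQL itself combines rules into a probability is not stated in this summary, so the weighted-sum normalization should be read only as one plausible choice.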

URI

https://studenttheses.uu.nl/handle/20.500.12932/42771

Collections

Theses