        Utrecht University Student Theses Repository
        ProbQL: A Probabilistic Query Language for Information Extraction from PDF Reports and Natural Language Written Texts

        View/Open
        ProbQL A Probabilistic Query Language for Information Extraction from PDF Reports and Natural Language Written Texts_Daniele Di Grandi_7035616.pdf (1.245Mb)
        Publication date
        2022
        Author
        Grandi, Daniele Di
        Summary
        In recent years, Information Extraction (IE) has become an increasingly important field because of the vast amount of data being produced at an ever-increasing rate. However, it has been estimated that about 80-90% of the data produced by companies is unstructured - such as PDF documents and natural language texts - meaning that, unless it is refined, valuable information and interesting patterns can remain hidden. Extracting information from unstructured data is difficult because of its inherent uncertainty. This thesis proposes a novel programming language, ProbQL: a rule-based query language for writing queries that extract information from a PDF document or a text file. The main idea of the language is to split the given document (or text) into a list of items and then score each item with the probability that it is the correct data to extract, based on rules defined by the end user that describe properties the desired data is expected to have. The language was tested on the task of extracting 5 types of data from PDF documents (medicine reimbursement reports), obtaining an average extraction accuracy of 86% (with a minimum of 78.3% and a maximum of 99.1%). To the best of our knowledge, ProbQL is the first probability-based language for extracting information from PDF documents and texts, which allows it to deal with the uncertainty of a desired piece of data being located in different positions depending on the type of PDF document considered.
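The split-then-score idea in the summary can be illustrated with a small Python sketch. This is not ProbQL syntax and not the thesis's scoring method, neither of which is given here. The whitespace tokenisation, the weighted-rule format, the normalisation of scores into probabilities, and all names (`split_into_items`, `score_items`, `extract`, `sample_rules`) are assumptions made for illustration only.

```python
import re
from typing import Callable, List, Tuple

# A rule pairs a predicate over an item with a weight (hypothetical format).
Rule = Tuple[Callable[[str], bool], float]


def split_into_items(text: str) -> List[str]:
    """Split the text into whitespace-separated items (assumed granularity)."""
    return text.split()


def score_items(items: List[str], rules: List[Rule]) -> List[Tuple[str, float]]:
    """Score each item by the summed weights of the rules it satisfies,
    then normalise so that the scores add up to 1 (assumed scoring scheme)."""
    raw = [sum(weight for matches, weight in rules if matches(item)) for item in items]
    total = sum(raw)
    if total == 0:
        # No rule matched any item: every item gets probability 0.
        return [(item, 0.0) for item in items]
    return [(item, score / total) for item, score in zip(items, raw)]


def extract(text: str, rules: List[Rule]) -> Tuple[str, float]:
    """Return the highest-scoring item and its score; assumes non-empty text."""
    return max(score_items(split_into_items(text), rules), key=lambda pair: pair[1])


# Illustrative input: the user believes the desired value is a price,
# i.e. numeric and most likely written with two decimal places.
sample_text = "Report 2022 reimbursement price 12.50 EUR"
sample_rules: List[Rule] = [
    (lambda s: re.fullmatch(r"\d+\.\d{2}", s) is not None, 2.0),  # two decimals
    (lambda s: re.fullmatch(r"[\d.]+", s) is not None, 1.0),      # numeric at all
]

best_item, best_probability = extract(sample_text, sample_rules)
```

With this sample input, `2022` satisfies only the second rule and `12.50` satisfies both, so the normalised scores are 0.25 and 0.75 and `extract` returns `("12.50", 0.75)`. The sketch only shows why a probability-style score helps when the same kind of value can appear in different positions across document layouts.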
        URI
        https://studenttheses.uu.nl/handle/20.500.12932/42771
        Collections
        • Theses
