        Utrecht University Student Theses Repository

        Gaining Business Value from Unstructured Data

        Master Thesis Joris de Boer - 5930758.pdf (2.990Mb)
        Publication date
        2025
        Author
        Boer, Joris de
        Summary
        This research investigates how organizations can gain more business value from their unstructured data. Organizations often manage their product in a structured form, while unstructured forms are disregarded. Unstructured data is an untapped resource that can make up 80% of an organization's data. It contains a great deal of knowledge that risks being forgotten or ignored, which forces organizations to investigate and document the same things again and again. First, a structured literature review was performed to understand the background. The literature showed that organizations lack a standardized way of ensuring the quality of their unstructured data. When organizations focus iteratively on interpretability, relevancy, and accuracy, they are bound to improve that quality. This research recommends deploying data curation teams that ensure the quality of metadata in order to improve the accessibility, sharing, and management of data. For managing data, the literature suggests applying domain-specific methods to provide structure. To determine the impact of processing techniques on data quality, the quality metrics obtained from classifying differently processed datasets were compared. Three methods were investigated: two that apply the processing techniques in a different order (the methods of Barbantan and Lim) and one with a different set of steps altogether (the method of Sanchez-Segura). The comparison shows an overall lack of significant differences, indicating that the implemented processing techniques are not the sole reason for differences in quality metrics. Slight improvements in accuracy and precision were observed with the RF and SVM classifiers in the similarly structured methods, but recall and F1-scores varied widely for the NB and DT classifiers. Further research is necessary to fully understand the potential impact of different processing techniques.
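The four quality metrics named in the summary (accuracy, precision, recall, and F1-score) can all be derived from the counts of true and false positives and negatives. The following is a minimal sketch, assuming binary labels. The function name `quality_metrics`, the labels, and the two prediction lists are invented for illustration and are not taken from the thesis or its datasets.

```python
from typing import Dict, Sequence


def quality_metrics(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    total = tp + fp + fn + tn
    # Each ratio falls back to 0.0 when its denominator is zero.
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical ground truth and predictions from a classifier trained on
# two differently processed versions of the same dataset (made-up values).
labels = [1, 1, 1, 0, 0, 0]
predictions_a = [1, 1, 0, 0, 0, 1]
predictions_b = [1, 0, 0, 0, 0, 0]

metrics_a = quality_metrics(labels, predictions_a)
metrics_b = quality_metrics(labels, predictions_b)
```

In this made-up example both prediction lists reach the same accuracy while their recall and F1-score differ, which is why comparing several metrics side by side is more informative than comparing accuracy alone.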
        URI
        https://studenttheses.uu.nl/handle/20.500.12932/48661
        Collections
        • Theses