Gaining Business Value from Unstructured Data

Boer, Joris de

Publication date

2025

Summary

This research investigates how organizations can gain more business value from their unstructured data. While organizations often manage their products in a structured form, unstructured forms of data are disregarded. Unstructured data is an untapped resource that can make up as much as 80% of an organization's data. It contains a great deal of knowledge that is at risk of being forgotten or ignored, which forces organizations to investigate and document the same things again and again. First, a structured literature review was performed to understand the background. The literature showed that organizations lack a standardized way of ensuring the quality of their unstructured data. When organizations focus on interpretability, relevancy, and accuracy in an iterative manner, they can be expected to improve that quality. Within the context of this research, the recommendation is to set up data curation teams that ensure the quality of metadata, in order to improve the accessibility, sharing, and management of data. For managing data, the literature suggests applying domain-specific methods to provide structure. To determine the impact of processing techniques on data quality, quality metrics were compared across classifiers applied to differently processed datasets. Three methods were investigated: two that differ in the order of their processing techniques (the methods of Barbantan and Lim) and one with a different set of steps altogether (the method of Sanchez-Segura). The comparison shows an overall lack of significant differences, indicating that the implemented processing techniques are not the sole reason for differences in quality metrics. Slight improvements in accuracy and precision were observed with the random forest (RF) and support vector machine (SVM) classifiers for the two similarly structured methods, while large variations in recall and F1-score were found for the naive Bayes (NB) and decision tree (DT) classifiers. Further research is needed to fully understand the potential impact of different processing techniques.
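The quality metrics named in the summary (accuracy, precision, recall, and F1-score) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function name `binary_metrics`, the binary labels, and the two prediction lists are hypothetical, chosen only to show how two classification results can share the same accuracy while differing in recall and F1-score.

```python
from collections import Counter


def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    counts = Counter()
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            # Predicted positive: correct (tp) or a false alarm (fp).
            counts["tp" if truth == positive else "fp"] += 1
        else:
            # Predicted negative: a miss (fn) or correct (tn).
            counts["fn" if truth == positive else "tn"] += 1

    tp, fp, fn, tn = (counts[k] for k in ("tp", "fp", "fn", "tn"))
    total = tp + fp + fn + tn

    # Guard every ratio against an empty denominator.
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical ground truth and predictions obtained from two
# differently processed versions of the same dataset (illustrative only).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
pred_a = [1, 1, 0, 0, 0, 1, 1, 0]
pred_b = [1, 0, 0, 0, 0, 0, 1, 0]

metrics_a = binary_metrics(y_true, pred_a)
metrics_b = binary_metrics(y_true, pred_b)

for name, metrics in (("A", metrics_a), ("B", metrics_b)):
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

In this made-up example both results reach the same accuracy, yet the second has perfect precision and noticeably lower recall and F1-score, which is why a comparison of this kind reports all four metrics rather than accuracy alone.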

URI

https://studenttheses.uu.nl/handle/20.500.12932/48661

Collections

Theses