Gaining Business Value from Unstructured Data
Summary
This research investigates how organizations can gain more business value
from their unstructured data. While organizations often manage their product
data in a structured form, unstructured data is frequently disregarded.
Unstructured data is an untapped resource that can make up as much as 80% of
an organization's data. It contains a great deal of knowledge that is at risk
of being forgotten or ignored, forcing organizations to investigate and
document the same matters again and again.
First, a structured literature review was performed to understand the
background. The literature shows that organizations lack a standardized way
of ensuring the quality of their unstructured data. Organizations that focus
iteratively on interpretability, relevancy, and accuracy can be expected to
improve that quality. In the context of this research, it is recommended to
establish data curation teams that ensure the quality of metadata in order to
improve the accessibility, sharing, and management of data. For managing
data, the literature suggests applying domain-specific methods to provide
structure.
To determine the impact of processing techniques on data quality, quality
metrics were compared across classifiers trained on differently processed
datasets. Three methods were investigated: two that apply similar processing
techniques in a different order - the methods of Barbantan and Lim - and one
with a different set of steps altogether - the method of Sanchez-Segura. The
results of the comparison show an overall lack of significant differences,
indicating that the implemented processing techniques are not the sole reason
for differences in quality metrics. Slight improvements in accuracy and
precision were observed with the random forest (RF) and support vector
machine (SVM) classifiers in the two similarly structured methods, whereas
recall and F1-scores varied widely for the naive Bayes (NB) and decision tree
(DT) classifiers. Further research is necessary to fully understand the
potential impact of different processing techniques.
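
To illustrate the quality metrics named above, the following is a minimal
sketch of how accuracy, precision, recall, and F1-score can be computed and
compared for two differently processed datasets. It is not the implementation
used in this research: the labels, the predictions, and the names
`quality_metrics`, `processing_order_a`, and `processing_order_b` are
hypothetical and chosen only for illustration.

```python
def quality_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    # Guard against division by zero when a class is never predicted or present.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical ground truth and classifier predictions (illustration only).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
predictions = {
    "processing_order_a": [1, 0, 1, 0, 1, 0, 1, 0],
    "processing_order_b": [1, 1, 1, 0, 1, 1, 1, 0],
}

results = {name: quality_metrics(y_true, y_pred)
           for name, y_pred in predictions.items()}

for name, metrics in results.items():
    summary = ", ".join(f"{k}={v:.2f}" for k, v in metrics.items())
    print(f"{name}: {summary}")
```

In this toy example both prediction sets reach the same accuracy while their
precision and recall differ, which is why the comparison considers several
metrics side by side rather than accuracy alone.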