Row Merge: Data reduction through expansion into possible worlds for data sustainability

Mahhov, Peter

dc.rights.license	CC-BY-NC-ND
dc.contributor.advisor	Velegrakis, Ioannis
dc.contributor.author	Mahhov, Peter
dc.date.accessioned	2025-02-06T00:01:35Z
dc.date.available	2025-02-06T00:01:35Z
dc.date.issued	2025
dc.identifier.uri	https://studenttheses.uu.nl/handle/20.500.12932/48465
dc.description.abstract	We have created Row Merge, a new way to save storage space in large datasets while minimizing the loss of information. In recent years, the amount of data collected in all aspects of life has grown beyond our ability not only to process it but even to store it. Existing data reduction methods mainly revolve around either calculating summary statistics or deleting less important components, both of which permanently erase attribute relationships. Our approach merges similar rows by replacing differing cells with null values, so that information that would have been lost had the tuples been deleted is instead preserved as uncertain. A database reduced in this manner never gives false negative answers to queries compared to the original; however, Row Merge does introduce false positive query answers. In this work, we have developed the theory behind the informational value of incomplete databases and designed several data reduction algorithms using this principle. We have created quantitative metrics that evaluate the amount of information in an incomplete table without requiring the original, and we have evaluated the quality and performance of several novel approaches on both real and synthetic datasets. We have identified the ideal use cases for each new algorithm and shown that Row Merge surpasses deletion in preserving information after data reduction in all the real-world datasets we tested.
dc.description.sponsorship	Utrecht University
dc.language.iso	EN
dc.subject	This work introduces Row Merge, a novel data reduction method designed to save storage space in structured datasets while minimizing information loss. By merging similar rows and replacing differing cells with null values, it preserves uncertain information that would be lost through deletion. The research establishes quantitative metrics to evaluate information retention in incomplete databases, creates new reduction algorithms, and tests their effectiveness on real and synthetic datasets.
dc.title	Row Merge: Data reduction through expansion into possible worlds for data sustainability
dc.type.content	Master Thesis
dc.rights.accessrights	Open Access
dc.subject.keywords	Incomplete databases; Data reduction; Information preservation; Algorithm design
dc.subject.courseuu	Computing Science
dc.thesis.id	42739
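
The merging idea described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's algorithm: the function names `merge_rows` and `may_match`, the use of `None` as the null value, and the sample rows are all assumptions made for illustration, and how similar rows are selected for merging is not covered here.

```python
def merge_rows(a, b):
    """Merge two rows of equal length: keep cells that agree,
    replace differing cells with None (null, i.e. unknown)."""
    if len(a) != len(b):
        raise ValueError("rows must have the same number of cells")
    return tuple(x if x == y else None for x, y in zip(a, b))


def may_match(row, conditions):
    """Return True if the row could satisfy every equality condition
    {column_index: value} in at least one possible world.
    A null cell is treated as able to take any value."""
    return all(row[i] is None or row[i] == v for i, v in conditions.items())


# Hypothetical sample data: two rows that differ only in the last cell.
row_a = ("Alice", "Utrecht", 30)
row_b = ("Alice", "Utrecht", 31)
merged = merge_rows(row_a, row_b)  # one stored row instead of two

# Queries that matched an original row still match the merged row
# (no false negatives), but a query that matched neither original
# row can now match as well (a false positive).
still_found = may_match(merged, {2: 30}) and may_match(merged, {2: 31})
false_positive = may_match(merged, {2: 45})
```

In this sketch the merged row stands for every table obtainable by filling its null cells, which is why a query on the nulled column can only gain answers, never lose them.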

Files in this item

Name: 07_01_25_Peter_Data_Sustainabi ...
Size: 913.3Kb
Format: PDF

This item appears in the following Collection(s)

Theses
