Row Merge: Data reduction through expansion into possible worlds for data sustainability
Summary
We have created Row Merge, a new way to save storage space in large datasets while minimizing the loss of information. In recent years, the amount of data collected in all aspects of life has grown beyond our ability not only to process it all, but even to store it. Existing data reduction methods mainly either compute summary statistics or delete less important components, both of which permanently erase relationships between attributes. Our approach merges similar rows by replacing the cells in which they differ with null values, so that information that would have been lost had the tuples been deleted is instead preserved as uncertain. Compared to the original, a database reduced in this manner never gives false negative answers to queries. However, Row Merge does introduce false positive query answers.
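The merging step and its effect on query answers can be illustrated with a minimal Python sketch. This is not one of the algorithms developed in this work. The names `merge_rows`, `possibly_matches`, and `select` are hypothetical, and the query semantics is an assumption: a null cell is treated as "could be any value", so a row is returned whenever it possibly satisfies an equality condition.

```python
from typing import List, Mapping, Optional, Tuple

# A row is a tuple of cells; None stands for a null (unknown) value.
Row = Tuple[Optional[str], ...]


def merge_rows(a: Row, b: Row) -> Row:
    """Merge two rows: keep cells that agree, null out cells that differ."""
    if len(a) != len(b):
        raise ValueError("rows must have the same number of attributes")
    return tuple(x if x == y else None for x, y in zip(a, b))


def possibly_matches(row: Row, conditions: Mapping[int, str]) -> bool:
    """Assumed semantics: a null cell may hold any value, so it never rules a row out."""
    return all(row[i] is None or row[i] == v for i, v in conditions.items())


def select(table: List[Row], conditions: Mapping[int, str]) -> List[Row]:
    """Return every row that possibly satisfies all equality conditions."""
    return [row for row in table if possibly_matches(row, conditions)]


# Illustrative data only: columns are (name, city, job).
original: List[Row] = [
    ("Alice", "Paris", "Engineer"),
    ("Bob", "Paris", "Engineer"),
]

# Two similar rows collapse into one; only the differing cell becomes null.
reduced: List[Row] = [merge_rows(original[0], original[1])]

# name = "Alice": matched in the original, and still matched after reduction
# (no false negative).
alice_original = select(original, {0: "Alice"})
alice_reduced = select(reduced, {0: "Alice"})

# name = "Carol": no match in the original, but the merged row cannot rule
# Carol out, so it is returned (a false positive).
carol_original = select(original, {0: "Carol"})
carol_reduced = select(reduced, {0: "Carol"})
```

In this sketch the reduced table stores one row instead of two. Cells that were not nulled out still constrain answers: a query for city = "Rome" returns nothing from either table.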
In this work, we have developed the theory behind the informational value of incomplete databases and designed several data reduction algorithms based on this principle. We have created quantitative metrics that evaluate the amount of information in an incomplete table without requiring access to the original, and we have evaluated the quality and performance of several novel approaches on both real and synthetic datasets. We have identified the ideal use cases for each new algorithm and shown that Row Merge surpasses deletion in preserving information after data reduction in all the real-world datasets we tested.