Using Discretization and Resampling for Privacy Preserving Data Analysis: An experimental evaluation
Data analysis allows useful patterns and information to be extracted from data. However, most data that is stored and processed contains personal information about individuals. The analysis of such data is therefore increasingly restricted by laws and regulations, and subject to pressure from public opinion. This calls for an approach that allows data analysis to be performed while protecting the privacy of the individuals represented in the data. Such an approach would make storing, processing, exchanging and publishing data more feasible and less restricted by regulations. This thesis contributes to the field of Privacy Preserving Data Mining by addressing the research question: How can data be accurately summarized by as few instances as possible to support data analysis, while preserving the privacy of individuals? It does so by introducing a novel approach to data anonymization that provides privacy guarantees while largely preserving the utility of continuous data. The existing concept of Density Estimation Trees (DETs) is used for the multidimensional discretization of continuous attributes. This research proposes to achieve the privacy model k-anonymity by using k as a minimum leaf-size constraint and as a stopping rule during the construction of DETs. The discretization of continuous instances therefore yields a number of equivalence classes, where each equivalence class is defined by one of the DET's leaf nodes and contains at least k instances. The proposed approach is validated experimentally on fifteen real-world, synthetic or mixed data sets containing continuous attributes. The preservation of data utility is measured by comparing a classifier's performance on the continuous data with its performance on the anonymized data. The privacy level is expressed by k within the context of k-anonymity, which also serves as an input parameter for the DET.
The results of the evaluation show that for only three of the fifteen data sets there is a significant difference in classification accuracy between the continuous and the anonymized attributes. In addition, in ten of the fifteen cases, the highest classification accuracy is achieved with a k-value of at least 10. It can be concluded that, in most cases, the introduced anonymization approach succeeds in creating an accurate representation of the continuous attributes that preserves data utility. Moreover, it does so while providing privacy guarantees through k-anonymity for relatively high k-values.
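The role of k as a minimum leaf constraint and stopping rule can be illustrated with a minimal sketch. It is not the thesis's implementation: as a simplifying assumption, it splits the widest attribute at its median (in the style of a k-d tree) instead of using the density-based splitting criterion of a real DET. The function names `partition` and `equivalence_classes` are hypothetical. What the sketch does show is that refusing any split that would leave fewer than k instances on either side yields leaves, and hence equivalence classes, of at least k instances each.

```python
from statistics import median


def partition(points, k):
    """Recursively split a list of numeric tuples into leaves of >= k points."""
    # Stopping rule: two children of at least k points need 2k points in total.
    if len(points) < 2 * k:
        return [points]
    dims = range(len(points[0]))
    # Assumption: split the widest attribute at its median. A real DET would
    # instead pick the split that minimizes a density estimation loss.
    d = max(dims, key=lambda j: max(p[j] for p in points) - min(p[j] for p in points))
    cut = median(p[d] for p in points)
    left = [p for p in points if p[d] < cut]
    right = [p for p in points if p[d] >= cut]
    # Minimum leaf constraint: reject a split that breaks k-anonymity.
    if len(left) < k or len(right) < k:
        return [points]
    return partition(left, k) + partition(right, k)


def equivalence_classes(points, k):
    """Return one (bounding box, instance count) pair per leaf."""
    if k < 1 or len(points) < k:
        raise ValueError("need k >= 1 and at least k instances")
    classes = []
    for leaf in partition(points, k):
        # Each instance in the leaf is generalized to the leaf's bounding box.
        box = tuple(
            (min(p[j] for p in leaf), max(p[j] for p in leaf))
            for j in range(len(leaf[0]))
        )
        classes.append((box, len(leaf)))
    return classes


# Usage: 50 two-dimensional instances, anonymized with k = 5.
data = [((i % 7) * 1.5, i * 0.3) for i in range(50)]
for box, count in equivalence_classes(data, 5):
    print(count, box)
```

Every printed class covers at least five instances, so releasing only the bounding boxes (or one representative per box) in place of the original values would satisfy k-anonymity with respect to these attributes under the sketch's assumptions.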