Utilising Autoencoder Latent Representations to Pseudonymise Data whilst Retaining Data Utility
Summary
Autoencoders are small neural networks, each consisting of an encoder-decoder pair, that learn to compress data into a lower-dimensional latent representation. This thesis outlines the benefits and drawbacks of using latent representations as a utility-preserving data pseudonymisation method for machine learning. We review the existing anonymisation literature and EU legislation, then run experiments on decoding latent representations, on their data utility and on other properties of latent representations. We find that, unless original data leaks together with its latent representation, it is difficult for an adversary to generate a well-performing reconstruction of the encoded dataset. The protection is stronger when the latent representation is randomly permuted, and this permutation is not easily reversed by a clustering algorithm. A latent representation preserves its data utility well for classification algorithms, even when permuted. Our experiments indicate that a dataset can be represented by multiple well-performing latent representations, which makes it difficult for an adversary to discern which dataset was originally encoded. Because autoencoders are quick to train, encoding is a fast way to pseudonymise data whilst retaining data utility for classification algorithms. As with any pseudonymisation method, the data holder can still obtain a reconstruction of the data. However, latent representations would likely not be considered anonymised data under the GDPR. Furthermore, regression algorithms perform worse than classification algorithms on latent representations. Finally, despite the popularity of mean squared error, we find that this loss function does not maximise data utility in latent representations.
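The pipeline summarised above (encode, permute, and let the data holder reverse both steps) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the setup used in the thesis: the toy data, the purely linear autoencoder trained by gradient descent on mean squared error, the latent size of 3, and in particular the reading of "randomly permuted" as one fixed secret shuffle of the latent dimensions. The names `secret_key` and `inverse_key` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumption): 200 samples, 8 features near a 3-dimensional subspace,
# standardised so that every feature has zero mean and unit variance.
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 8))
X = (X - X.mean(axis=0)) / X.std(axis=0)

n_features, n_latent = X.shape[1], 3

# Linear autoencoder (assumption): encoder and decoder are single weight matrices.
W_enc = rng.normal(scale=0.1, size=(n_features, n_latent))
W_dec = rng.normal(scale=0.1, size=(n_latent, n_features))

def mse(a, b):
    """Mean squared error between two arrays of equal shape."""
    return float(np.mean((a - b) ** 2))

# Plain gradient descent on the reconstruction error (constant factors are
# absorbed into the learning rate).
lr = 0.01
for _ in range(3000):
    Z = X @ W_enc                 # encode
    err = Z @ W_dec - X           # reconstruction residual
    grad_dec = Z.T @ err / len(X)
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

# Pseudonymise: encode, then shuffle the latent dimensions with a secret key.
Z = X @ W_enc
secret_key = rng.permutation(n_latent)   # kept by the data holder
Z_perm = Z[:, secret_key]                # this is what would be shared

# Data holder: undo the shuffle with the inverse permutation, then decode.
inverse_key = np.argsort(secret_key)
X_restored = Z_perm[:, inverse_key] @ W_dec

print("reconstruction MSE (data holder):", mse(X, X_restored))
```

With only three latent dimensions there are just six possible shuffles, so the sketch shows the mechanics of the data holder's side only and says nothing about how hard the permutation is to reverse in a realistic setting.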