Improving the Quality of Synthetic Data Generation with Application in Algorithmic Fairness
Summary
The development of machine learning algorithms has greatly influenced decision-making
at various levels. However, these algorithms tend to incorporate biases.
Racial profiling in legal and financial systems is one of the best-known examples of inequality
stemming from algorithmic decisions. Previous research has shown that one of
the reasons for racial bias is imbalanced data. This research focuses on generating
synthetic data using Generative Adversarial Networks (GANs) to reduce bias. Inspired
by GANs, this paper proposes the Intag framework. This framework contains
a modified version of Pate-GAN for synthetic data generation. The main modification
from the original Pate-GAN is that the hard privacy constraint is dropped.
Other changes include adapting the architecture of the network so that the number
of hidden layers depends on the dimension of the input data. Moreover, the framework
incorporates undersampling techniques to ensure that the synthetic data samples
are of the highest quality. The framework’s performance is evaluated on the basis
of machine learning utility by checking the quality of the synthetic data generated
by different methods. It is shown that the modified Pate-GAN achieves the best
results. Furthermore, the framework improves the values of statistical parity and disparate
impact, the two measures of fairness used in this study. We conclude that our
proposed modification to Pate-GAN, and the framework in general, can be used for
synthetic data generation. Moreover, it could be used as an aid for data generation
to improve fairness in the case of an imbalanced dataset.
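
The two fairness measures named above can be illustrated with a minimal sketch. It assumes the commonly used definitions, which this summary does not spell out: statistical parity is taken as the difference in positive-prediction rates between an unprivileged and a privileged group, and disparate impact as the ratio of those rates. The function names, the 0/1 group encoding, and the toy data are hypothetical and are not taken from the Intag framework.

```python
# Illustrative sketch only: common definitions of the two fairness measures,
# assumed here because the summary does not define them.

def positive_rate(y_pred, group, value):
    """Share of positive predictions (1) among members of one group."""
    selected = [p for p, g in zip(y_pred, group) if g == value]
    if not selected:
        raise ValueError(f"no samples found for group {value!r}")
    return sum(selected) / len(selected)


def statistical_parity_difference(y_pred, group, unprivileged=0, privileged=1):
    """P(pred = 1 | unprivileged) - P(pred = 1 | privileged); 0 means parity."""
    return (positive_rate(y_pred, group, unprivileged)
            - positive_rate(y_pred, group, privileged))


def disparate_impact(y_pred, group, unprivileged=0, privileged=1):
    """P(pred = 1 | unprivileged) / P(pred = 1 | privileged); 1 means parity."""
    privileged_rate = positive_rate(y_pred, group, privileged)
    if privileged_rate == 0:
        raise ZeroDivisionError("privileged group has no positive predictions")
    return positive_rate(y_pred, group, unprivileged) / privileged_rate


# Toy predictions (hypothetical): group 0 is unprivileged, group 1 is privileged.
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
group = [0, 0, 0, 0, 1, 1, 1, 1]

spd = statistical_parity_difference(y_pred, group)
di = disparate_impact(y_pred, group)
print(f"statistical parity difference: {spd:.3f}")
print(f"disparate impact: {di:.3f}")
```

Under these assumed definitions, an improvement in fairness corresponds to the statistical parity difference moving toward 0 and the disparate impact moving toward 1.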