“Use your words”: Towards Gender Fairness for Multimodal Depression Detection
Summary
Depression is a prevalent mental health disorder that affects both patients and society. Identifying at-risk individuals early, accurately, and without human intervention is an important task, as it enables timely and appropriate treatment. In recent years, numerous models have proven successful at detecting depression from audiovisual cues. However, the growing use of machine learning (ML) systems for this task has raised concerns about potential biases within these systems.
This thesis explores gender fairness in multimodal depression detection using the D-Vlog dataset, which comprises vlogs collected from social media (YouTube). It addresses the gender bias observed in previous models, particularly the performance disparity between genders. While previous studies have effectively used textual data to detect depression from social media, no research has yet applied this approach to the D-Vlog dataset. This study therefore integrates the textual modality, experiments with various fusion strategies (illustrated in the sketch below), and evaluates multiple bias mitigation techniques, aiming to improve both the fairness and performance of depression detection models developed on the D-Vlog dataset.
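To ground the terminology: early (feature-level) fusion combines the features of all modalities before classification, whereas late (decision-level) fusion combines per-modality predictions. The minimal sketch below illustrates both under assumed feature dimensions, synthetic data, and a simple classifier; it is an illustration, not the thesis's actual architecture.

```python
# Hypothetical sketch: early vs. late fusion of acoustic, visual, and textual
# features. Feature dimensions and the classifier are illustrative choices.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
acoustic = rng.normal(size=(n, 25))   # e.g. low-level audio descriptors
visual = rng.normal(size=(n, 136))    # e.g. facial-landmark coordinates
textual = rng.normal(size=(n, 384))   # e.g. sentence embeddings
y = rng.integers(0, 2, size=n)        # binary depressed / non-depressed labels

# Early (feature-level) fusion: concatenate modalities, train one classifier.
early_clf = LogisticRegression(max_iter=1000)
early_clf.fit(np.hstack([acoustic, visual, textual]), y)

# Late (decision-level) fusion: one classifier per modality, then average the
# predicted probabilities and threshold the result.
modality_clfs = [
    LogisticRegression(max_iter=1000).fit(X, y)
    for X in (acoustic, visual, textual)
]
late_proba = np.mean(
    [clf.predict_proba(X)[:, 1]
     for clf, X in zip(modality_clfs, (acoustic, visual, textual))],
    axis=0,
)
late_pred = (late_proba >= 0.5).astype(int)
```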
The methodology involves extracting the textual modality from the vlogs in the form of transcripts, followed by preprocessing steps that produce both word and sentence embeddings and guard against potential data leakage. A modality-based analysis then examines the impact of the textual modality on performance and fairness by training uni- and multimodal models with different modality combinations and fusion approaches. Finally, several bias mitigation methods are applied and their effects on fairness and performance are assessed.
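As one concrete (assumed) instantiation of the preprocessing step, the sketch below derives sentence embeddings from transcripts with a pretrained encoder and splits the data by speaker so that no individual's vlogs appear in both the training and test sets, a common way to prevent identity leakage. The file name, column names, and encoder choice are illustrative, not the thesis's exact implementation.

```python
# Hypothetical sketch: sentence embeddings plus a speaker-level split.
# Assumes a CSV with columns "speaker_id", "transcript", and "label";
# the file layout is illustrative, not the actual D-Vlog format.
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.model_selection import GroupShuffleSplit

df = pd.read_csv("dvlog_transcripts.csv")

# Encode each transcript into a single sentence-level embedding.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(df["transcript"].tolist(), show_progress_bar=True)

# Group by speaker so that all vlogs from one person land in the same split,
# preventing identity leakage between train and test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(
    splitter.split(embeddings, df["label"], groups=df["speaker_id"])
)

X_train, X_test = embeddings[train_idx], embeddings[test_idx]
y_train, y_test = df["label"].values[train_idx], df["label"].values[test_idx]
```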
Experimental results reveal that incorporating the textual modality boosts the performance of both uni- and multimodal depression detection models, though a trade-off between performance and fairness is observed. Moreover, the choice of modality and of specific feature embeddings can introduce additional gender bias into the model. In line with previous studies, the bias mitigation techniques did not consistently reduce the existing gender bias.
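To make the performance-fairness trade-off measurable, one widely used notion is equal opportunity: the true-positive rate should not differ across gender groups. Below is a minimal, self-contained sketch (with made-up toy predictions) of how such a gap could be computed alongside F1; it is an illustration, not the thesis's exact metric suite.

```python
# Hypothetical sketch: performance vs. gender fairness on held-out predictions.
import numpy as np
from sklearn.metrics import f1_score, recall_score

def equal_opportunity_gap(y_true, y_pred, gender):
    """Absolute gap in true-positive rates between the two gender groups;
    0 means the model detects depression equally well for both groups."""
    groups = np.unique(gender)
    tprs = [recall_score(y_true[gender == g], y_pred[gender == g])
            for g in groups]
    return abs(tprs[0] - tprs[1])

# Toy example with made-up labels and predictions (illustrative only).
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
gender = np.array(["f", "f", "f", "f", "m", "m", "m", "m"])

print(f"F1: {f1_score(y_true, y_pred):.3f}")
print(f"Equal-opportunity gap: {equal_opportunity_gap(y_true, y_pred, gender):.3f}")
```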
Despite the promising results, the study faces several limitations. The D-Vlog dataset's collection and annotation process presents challenges such as self-disclosure bias, sampling bias, and label noise. Additionally, the model may be subject to conversational-topic bias stemming from the collection process, despite the preprocessing steps taken to mitigate this effect.
This research provides a comprehensive assessment of the impact of incorporating the textual modality and various fusion approaches on the performance, bias, and fairness of depression detection models trained on the D-Vlog dataset. Furthermore, the research enhances the reproducibility of the experiments by open-sourcing the repository containing the re-implemented code for the D-Vlog model, addressing a gap left by previous studies that did not release their code.
Future research directions include integrating existing video-language models or models trained specifically on multiple modalities, performing cross-corpus validation with a clinically labelled dataset, and conducting a more in-depth analysis of textual features.