SAURON: Leveraging Semantically Similar Utterances to Enhance Writing Style Embedding Models
Summary
This thesis investigates how transformer-based models, which are widely used in Natural Language Processing (NLP), can be enhanced for the task of writing style representation. I propose a novel approach in which a RoBERTa model (Liu et al., 2019) is trained on the Contrastive Authorship Verification (CAV) task using semantically similar utterances: pairs of utterances that convey the same semantic content but differ in their stylistic expression. This setup encourages the model to focus on style rather than content, fostering a more discerning representation of stylistic nuances. The training data comprised a broad array of conversations from the online platform Reddit, providing wide coverage of authors and topics.
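As a concrete illustration, the minimal sketch below shows one way such a training signal can be built; it is not the thesis implementation. The anchor and positive utterances share an author, while the negative conveys the same content as the anchor but comes from a different author, so content cues alone cannot solve the task. The roberta-base checkpoint, the example utterances, the mean pooling, and the triplet margin are all illustrative assumptions.

```python
# Hedged sketch of CAV-style contrastive training with a semantically
# similar hard negative. All names and hyperparameters are illustrative.
import torch
from transformers import RobertaModel, RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
encoder = RobertaModel.from_pretrained("roberta-base")

def embed(texts):
    """Mean-pool the last hidden states into one style embedding per text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)         # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)          # (B, H)

# Hypothetical triplet: anchor and positive share an author but differ in
# topic; the negative restates the anchor's content in another author's style.
anchor = ["gonna grab food later, u want anything??"]
positive = ["ngl that movie was kinda mid, wouldnt rewatch"]
sem_negative = ["I will get some food later; would you like anything?"]

loss_fn = torch.nn.TripletMarginLoss(margin=0.5)
loss = loss_fn(embed(anchor), embed(positive), embed(sem_negative))
loss.backward()  # an optimizer step over batched triplets would follow
```

In the full setup, such examples are batched from the Reddit corpus and the encoder is optimized end to end.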
Model performance was assessed with the STyle EvaLuation (STEL) framework (Wegmann and Nguyen, 2021). The STEL results showed how accurately the models captured writing style and isolated the impact of introducing semantically similar pairings. While incorporating semantically similar utterances substantially improved performance over models without any form of content control, relying on them exclusively was not the most effective approach. Instead, combining this technique with conversation-based sampling of training examples further enhanced the models’ performance. The research also highlighted effective strategies for preparing input data, such as maintaining diversity in authorship and topics.
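To make the sampling idea concrete, the following is a rough sketch under stated assumptions, not the thesis code: conversations are assumed to be stored as a mapping from conversation ID to (author, utterance) pairs, and drawing both utterances of a CAV example from one conversation roughly holds the topic fixed, which is the content control that complements semantically similar pairs.

```python
# Hedged sketch of conversation-based sampling for CAV examples.
# The data layout and helper name are illustrative assumptions.
import random

def sample_cav_pair(conversations, same_author):
    """Sample one (utt_a, utt_b, label) example from a single conversation."""
    convo = random.choice(list(conversations.values()))
    if same_author:
        author = random.choice([a for a, _ in convo])
        pool = [u for a, u in convo if a == author]
        if len(pool) < 2:
            return sample_cav_pair(conversations, same_author)  # retry
        utt_a, utt_b = random.sample(pool, 2)
    else:
        (a1, utt_a), (a2, utt_b) = random.sample(convo, 2)
        if a1 == a2:
            return sample_cav_pair(conversations, same_author)  # retry
    return utt_a, utt_b, int(same_author)

# Usage: alternate positive and negative pairs when building a batch.
conversations = {
    "t3_abc": [("alice", "lol same here tbh"),
               ("bob", "I had a similar experience."),
               ("alice", "fr it was wild"),
               ("bob", "Indeed, quite remarkable.")],
}
print(sample_cav_pair(conversations, same_author=True))
```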
The final model, dubbed SAURON (Stylistic AUthorship RepresentatiON), considerably improved upon previous iterations. This contributes to progress on style-content disentanglement and paves the way for more nuanced and robust style representations.