SAURON: Leveraging Semantically Similar Utterances to Enhance Writing Style Embedding Models

Koornstra, Tim

dc.rights.license	CC-BY-NC-ND
dc.contributor.advisor	Nguyen, Dong
dc.contributor.author	Koornstra, Tim
dc.date.accessioned	2023-07-20T00:02:19Z
dc.date.available	2023-07-20T00:02:19Z
dc.date.issued	2023
dc.identifier.uri	https://studenttheses.uu.nl/handle/20.500.12932/44226
dc.description.abstract	This thesis investigates how transformer-based models, widely used in Natural Language Processing (NLP), can be improved for the task of writing style representation. I propose a novel approach in which a RoBERTa model (Liu et al., 2019) is trained on the Contrastive Authorship Verification (CAV) task using semantically similar utterances: pairs of utterances that convey the same semantic information but differ in their stylistic expression. This encourages the model to focus on style rather than content, yielding a representation that is more sensitive to stylistic nuances. The training data comprised a broad array of conversations from the online platform Reddit, covering a wide range of authors and topics. The models were evaluated with the STyle EvaLuation (STEL) framework (Wegmann and Nguyen, 2021), which measures how accurately they capture writing style and thereby shows the impact of introducing semantically similar pairings. While incorporating semantically similar utterances greatly improved performance over models without any form of content control, relying solely on semantically similar utterances was not the most effective approach. Instead, the findings suggest that combining this technique with conversation-based sampling of examples can further enhance the models’ performance. The research also identified several effective strategies for preparing input data, such as maintaining diversity in authorship and topics. The final model, named SAURON (Stylistic AUthorship RepresentatiON), considerably improved upon previous iterations. This work contributes to progress on style-content disentanglement and paves the way for more nuanced and robust style representations.
dc.description.sponsorship	Utrecht University
dc.language.iso	EN
dc.subject	In this thesis I try to create a general style representation model by using semantically similar utterances on the authorship verification task.
dc.title	SAURON: Leveraging Semantically Similar Utterances to Enhance Writing Style Embedding Models
dc.type.content	Master Thesis
dc.rights.accessrights	Open Access
dc.subject.keywords	transformers;AI;Machine Learning;Natural Language Processing;Artificial Intelligence;RoBERTa;writing style;style;representation;representation learning
dc.subject.courseuu	Artificial Intelligence
dc.thesis.id	19500

Files in this item

Name: Tim_Koornstra-6435777-Master_T ...
Size: 1.590Mb
Format: PDF


This item appears in the following Collection(s)

Theses
