
dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Nguyen, Dong
dc.contributor.author: Koornstra, Tim
dc.date.accessioned: 2023-07-20T00:02:19Z
dc.date.available: 2023-07-20T00:02:19Z
dc.date.issued: 2023
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/44226
dc.description.abstract: This thesis investigates the potential for enhancing transformer-based models, widely used in Natural Language Processing (NLP), for the task of writing style representation. I propose a novel approach wherein a RoBERTa model (Liu et al., 2019) is trained on the Contrastive Authorship Verification (CAV) task using semantically similar utterances: pairs of utterances that encapsulate the same semantic information but differ in their stylistic expression. This methodology encourages the model to concentrate on style rather than content, fostering a more discerning representation of stylistic nuances. The training data comprised a broad array of conversations from the online platform Reddit, providing a wide representation of authorship and topics. To assess the performance of the models, the STyle EvaLuation (STEL) framework (Wegmann and Nguyen, 2021) was used. The STEL results helped ascertain the models’ ability to accurately capture writing style and delineate the impact of introducing semantically similar pairings. While incorporating semantically similar utterances greatly improved performance over models without any form of content control, relying solely on them was not the most effective approach; the findings suggest that combining this technique with conversation-based sampling of examples further enhances performance. Additionally, the research highlighted effective strategies for preparing input data, such as maintaining diversity in authorship and topics. The final model, named SAURON (Stylistic AUthorship RepresentatiON), considerably improved upon previous iterations. This advancement contributes to style-content disentanglement research and paves the way for more nuanced and robust style representations.
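
The core recipe described in the abstract (a shared RoBERTa encoder trained contrastively on utterance pairs, pulling same-author pairs together and pushing different-author pairs apart in embedding space) can be sketched roughly as follows. This is a minimal illustration only, not the thesis's actual setup: the model checkpoint, mean pooling, the cosine-embedding loss, and all hyperparameters are assumptions made for exposition.

# Hypothetical sketch of contrastive authorship verification (CAV) training.
# The checkpoint, pooling, loss, and hyperparameters are illustrative assumptions.
import torch
from torch import nn
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base")

def embed(texts):
    # Mean-pool the last hidden states into one style embedding per text.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state           # (batch, tokens, hidden)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # (batch, tokens, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # (batch, hidden)

# Toy pairs of semantically similar utterances; label 1 = same author, 0 = different.
utterances_a = ["i'll be there in five mins!!", "gonna grab some food, want anything?"]
utterances_b = ["I will arrive in five minutes.", "gonna get food, u want smth??"]
labels = torch.tensor([0.0, 1.0])

# CosineEmbeddingLoss pulls same-author embeddings together and pushes
# different-author embeddings apart; it expects targets in {-1, +1}.
loss_fn = nn.CosineEmbeddingLoss(margin=0.5)
optimizer = torch.optim.AdamW(encoder.parameters(), lr=2e-5)

emb_a = embed(utterances_a)
emb_b = embed(utterances_b)
loss = loss_fn(emb_a, emb_b, labels * 2 - 1)
loss.backward()
optimizer.step()

In the thesis itself, such pairs are drawn from Reddit conversations and the resulting embeddings are evaluated with the STEL framework; the sketch above shows only the general shape of the contrastive objective.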
dc.description.sponsorship: Utrecht University
dc.language.iso: EN
dc.subject: In this thesis I try to create a general style representation model by using semantically similar utterances on the authorship verification task.
dc.title: SAURON: Leveraging Semantically Similar Utterances to Enhance Writing Style Embedding Models
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.keywords: transformers; AI; Machine Learning; Natural Language Processing; Artificial Intelligence; RoBERTa; writing style; style; representation; representation learning
dc.subject.courseuu: Artificial Intelligence
dc.thesis.id: 19500

