        •   Utrecht University Student Theses Repository Home

        SAURON: Leveraging Semantically Similar Utterances to Enhance Writing Style Embedding Models

        Tim_Koornstra-6435777-Master_Thesis.pdf (1.590Mb)
        Publication date
        2023
        Author
        Koornstra, Tim
        Summary
        This thesis investigates how transformer-based models, widely used in Natural Language Processing (NLP), can be improved for the task of writing style representation. I propose a novel approach in which a RoBERTa model (Liu et al., 2019) is trained on the Contrastive Authorship Verification (CAV) task using semantically similar utterances: pairs of utterances that convey the same semantic information but differ in their stylistic expression. This encourages the model to concentrate on style rather than content, fostering a more discerning representation of stylistic nuances. The training data comprised a broad array of conversations from the online platform Reddit, covering a wide range of authors and topics. The models were evaluated with the STyle EvaLuation (STEL) framework (Wegmann and Nguyen, 2021). The STEL results were used to assess how accurately the models capture writing style and to isolate the impact of introducing semantically similar pairings. While incorporating semantically similar utterances greatly improved performance over models without any form of content control, relying solely on semantically similar utterances turned out not to be the most efficient approach. Instead, the findings suggest that combining this technique with conversation-based sampling of examples can further enhance the models’ performance. The research also identified several effective strategies for preparing input data, such as maintaining diversity in authorship and topics. The final model, named SAURON (Stylistic AUthorship RepresentatiON), considerably improved upon previous iterations. This result contributes to progress on style-content disentanglement tasks and paves the way for more nuanced and robust style representations.
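The summary does not state the exact training objective, so the following is only a minimal sketch of the general idea behind a contrastive authorship objective, not the thesis's implementation. It assumes a common margin-based contrastive loss on cosine distance. The function names, the margin value, and the toy two-dimensional "embeddings" are all hypothetical and stand in for vectors that a style model such as RoBERTa would produce.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(emb_a, emb_b, same_author, margin=0.5):
    """Margin-based contrastive loss on cosine distance (assumed formulation).

    Same-author pairs are pulled together; different-author pairs are pushed
    apart until their distance reaches `margin` (a hypothetical value).
    """
    distance = 1.0 - cosine_similarity(emb_a, emb_b)
    if same_author:
        return distance ** 2
    return max(0.0, margin - distance) ** 2


# Toy embeddings (hypothetical, for illustration only).
anchor = [1.0, 0.0]
same_author_other_topic = [0.8, 0.6]    # same style, different content
other_author_paraphrase = [1.0, 0.0]    # same content, different author

# Small loss: the same-author pair is already close in embedding space.
loss_same = contrastive_loss(anchor, same_author_other_topic, same_author=True)

# Large loss: a paraphrase by another author is embedded identically to the
# anchor, so the model is penalised for encoding content instead of style.
loss_diff = contrastive_loss(anchor, other_author_paraphrase, same_author=False)
```

In this toy setup the different-author paraphrase incurs the larger penalty, which illustrates why pairing utterances with matching content but different authors can push an embedding model away from content cues and toward stylistic ones.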
        URI
        https://studenttheses.uu.nl/handle/20.500.12932/44226
        Collections
        • Theses