Dutch style model trained using authorship verification
Summary
Currently, no models have been trained specifically to create representations
of Dutch linguistic style, nor has a task been developed to evaluate and
verify such embeddings. In this thesis, I construct a model
that creates a style representation for Dutch, and I create evaluation data to test
whether the resulting representation truly captures style. To create these embeddings,
RobBERT-base is fine-tuned using the contrastive authorship verification task.
To find the best-performing model, two datasets are constructed, and
experiments are run with the loss function as well as the value of the margin.
The performance of the fine-tuned models is in line with results found in
similar research on English style. For the evaluation, the STEL
framework is adapted into a Dutch version. Some categories are copied from the
English variant and translated to properly reflect Dutch style. Other categories
are novel in this version. There are two versions of the STEL task, one of
which controls for content to ensure that the embedding bases its decision
on style. The performance of the embeddings on the STEL task is similar to
the results found in research on the English equivalent, and it shows that on
most tasks the fine-tuned model learns to perform better than the baseline
model on the tasks that control for content. Therefore,
this thesis concludes that it is possible to take methods devised for creating
and evaluating English style representations and transform them into a Dutch
version that shows results similar to the originals.
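The contrastive authorship-verification objective mentioned above can be sketched as a margin-based pairwise loss over style embeddings. The following is a minimal illustrative sketch, assuming a Hadsell-style contrastive formulation; the function name, the default margin value, and the exact form of the loss are assumptions for illustration, not the specific loss tuned in the thesis.

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, same_author, margin=0.5):
    """Illustrative contrastive loss on one pair of style embeddings.

    Same-author pairs are pulled together (squared distance);
    different-author pairs are pushed at least `margin` apart.
    """
    d = np.linalg.norm(np.asarray(emb_a) - np.asarray(emb_b))
    if same_author:
        return d ** 2                      # pull same-author pair together
    return max(0.0, margin - d) ** 2       # push different-author pair apart
```

Under this formulation, a different-author pair contributes no loss once its embedding distance exceeds the margin, which is why the margin value itself becomes a hyperparameter worth tuning.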