What replication and localisation teach us: the case of semantic similarity measures
Summary
Many tasks in the field of Natural Language Processing make use of so-called
semantic similarity measures, which quantify the degree to which two concepts
are semantically similar. To decide which semantic similarity measure to use
for a Natural Language Processing task, the measures are generally evaluated
against human judgement. However, because human judgement is subjective,
gold standards are created by asking a group of people to indicate the
similarity of meaning of a set of word pairs. The correlation between these
gold standards and the output of each semantic similarity measure indicates
which measure agrees best with human judgement.
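The evaluation described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in this thesis: the choice of Spearman's rank correlation is an assumption (the summary does not name a coefficient), and the gold-standard ratings and measure scores below are hypothetical values invented for the example.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the whole run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the rank vectors."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical data: mean human ratings for four word pairs (the gold
# standard) and the scores two imaginary measures assign to the same pairs.
gold = [3.9, 3.1, 0.4, 1.2]
measure_a = [0.95, 0.60, 0.10, 0.30]
measure_b = [0.20, 0.80, 0.50, 0.40]

for name, scores in [("measure_a", measure_a), ("measure_b", measure_b)]:
    print(name, round(spearman(gold, scores), 3))
```

In this toy setting the measure with the higher coefficient orders the word pairs more like the human raters do. A rank correlation is used here because the gold standard and the measures may score on different scales.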
Most research, for example Patwardhan and Pedersen (2006) and Pedersen
(2010), has focused on English, using the English lexical semantic database
WordNet (Miller, 1995) to compute the scores for the semantic similarity
measures. The main focus of this thesis is on gaining a better understanding
of the workings of semantic similarity measures by also using a different
lexical semantic database in a different language: Cornetto (Vossen, 2006;
Vossen et al., 2007, 2008) for Dutch.
In order to get a better understanding of these measures, we first inspect the
previous English experiments and try to replicate them to be sure that we fully
understand the process. Furthermore, we will create a Dutch gold standard
and inspect the correlations between the output from the semantic similarity
measures using the Dutch lexical semantic database Cornetto and the newly
created Dutch gold standard.
For English, we will show that a group of semantic similarity measures
approaches human judgement in a similar way. Moreover, we will stress the
importance of addressing every detail of the process that leads to the results by
showing that even if the main properties are kept stable, variations in minor
properties can lead to completely different outcomes. Furthermore, we will
present our gold standard for Dutch and how it was created. In addition, we
will show that not only the properties of a semantic similarity measure
determine its performance, but that the structure of the lexical semantic database
also plays a crucial role.