Human evaluation of automatically generated classifiers in Mandarin Chinese
Summary
One of the difficulties of automatically generating Mandarin Chinese text is the choice of classifier (a linguistic unit that is obligatory in numeral expressions) in a given context. Several algorithms for classifier choice have recently been developed and assessed using a corpus-based evaluation. The best-scoring algorithm was a BERT classification model. However, evaluating classifiers against a corpus yields a conservative score: every classifier that does not match the corpus is counted as incorrect, while native speakers might accept several different classifiers as correct. Since the ultimate goal of natural language generation (NLG) should be the generation of texts that are useful to humans, we decided to perform a human evaluation in addition to the corpus-based one. We conducted two experiments; the first was a standard NLG evaluation, and the second was a more linguistically motivated experiment focusing only on true classifiers (a specific subset of Mandarin classifiers). We found that, according to human readers, BERT consistently performs better than the other models, agreeing with the corpus-based evaluation. However, we found no difference in the evaluation scores between BERT and the human-produced sentences in the corpus. This is remarkable, because the corpus-based evaluation suggests a large gap between BERT’s score and the score of the corpus sentences. This result suggests that human readers are more accepting of variation in classifier choice than previously thought.
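To make the conservativeness of corpus-based scoring concrete, the following minimal sketch (not taken from the paper; all data and function names are invented for illustration) contrasts strict exact-match scoring against the corpus classifier with a more permissive scoring that accepts any classifier a human judge would allow:

```python
def exact_match_accuracy(predicted, gold):
    """Corpus-based score: a prediction counts only if it matches the corpus classifier."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

def acceptability_accuracy(predicted, acceptable_sets):
    """Human-style score: a prediction counts if it falls in the set judged acceptable."""
    correct = sum(p in ok for p, ok in zip(predicted, acceptable_sets))
    return correct / len(acceptable_sets)

# Hypothetical example: the corpus uses 个 twice and 只 once,
# but native speakers would also accept 名 in the first context.
gold = ["个", "个", "只"]
acceptable = [{"个", "名"}, {"个"}, {"只"}]
predicted = ["名", "个", "只"]

print(exact_match_accuracy(predicted, gold))         # 0.67 under strict matching
print(acceptability_accuracy(predicted, acceptable))  # 1.0 when variation is allowed
```

Under this toy setup, the same predictions score lower against the corpus than against human acceptability judgments, which is the gap the human evaluation in this work is designed to probe.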