Validation of a Bayesian mixture model for language contact with the use of synthetic language data
Summary
Speaker communities typically have some level of interaction and are not
completely isolated. When individuals who speak different languages
come into contact, it is probable that their respective languages undergo
a process of convergence.
Ranacher et al. (2021) have developed a method, sBayes, to estimate the
relative role of language contact, as opposed to inheritance and universal
preference, in creating similarities between languages. The model promises
to identify contact areas from empirical data using (Bayesian) inference.
However, validating the approach is difficult because it relies on
empirical data from real-world languages, for which the actual
contributions of language contact, inheritance and universal preference
are, by definition, unknown.
To validate the sBayes model further, we need a dataset for which the
contributions of contact, inheritance and universal preference are known
before the model is run. These known values can then be compared with
the output of sBayes.
For this purpose, we created synthetic language datasets using an
agent-based model to test the accuracy of sBayes. Using these datasets,
we conducted two experiments. The first validates sBayes' ability to
detect isolated causal explanations per language feature. The second
tests how well sBayes fits an artificial language dataset and how well it
determines language areas (clusters) and overall causality counts.
Our results suggest that synthetic language data can successfully be used
to validate the sBayes model. In our simulations, sBayes identifies
clearly isolated causalities with a combined mean squared error of 0.05.
In a simulated real-life situation, the model finds a similar number of
contact areas. In addition, the overall distribution of feature state
causality in our synthetic data matches that of a benchmark experiment.
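
As an illustration of how a combined mean squared error of this kind could
be computed, the following minimal sketch pools the squared differences
between known and estimated causal weights over all features and all three
effects (contact, inheritance, universal preference). The function name
combined_mse and all numeric values are hypothetical and chosen for
illustration only; they are not taken from our experiments or from the
sBayes code base.

```python
def combined_mse(true_weights, estimated_weights):
    """Mean squared error pooled over all features and all effects.

    Each argument is a list of rows, one row per language feature, holding
    the weights for (contact, inheritance, universal preference).
    """
    if len(true_weights) != len(estimated_weights):
        raise ValueError("both inputs need the same number of features")

    squared_errors = []
    for true_row, est_row in zip(true_weights, estimated_weights):
        if len(true_row) != len(est_row):
            raise ValueError("rows must have the same number of effects")
        squared_errors.extend((t - e) ** 2 for t, e in zip(true_row, est_row))

    return sum(squared_errors) / len(squared_errors)


# Hypothetical ground truth: each feature is explained by exactly one effect.
true_w = [
    [1.0, 0.0, 0.0],  # feature 1: contact only
    [0.0, 1.0, 0.0],  # feature 2: inheritance only
    [0.0, 0.0, 1.0],  # feature 3: universal preference only
]

# Hypothetical model estimates for the same three features.
est_w = [
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.1, 0.8],
]

mse = combined_mse(true_w, est_w)
print(f"combined MSE: {mse:.4f}")
```

A value close to zero indicates that the estimated weights closely recover
the known causal explanation for each feature.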