Content words as measure of structure in the science space: Towards a classification of publications based on noun phrase occurrence
Summary
Recent developments in the production of scientific knowledge, namely an increase in interdisciplinary scientific research and in interaction between academia and industry, pose challenges to citation-based methods for studying the structure of the science space. As an alternative, we ask whether it is possible to construct maps of science and find disciplinary similarity structures based not on citations, but on the content of publications’ titles and abstracts. We present a theoretical framework in which disciplines are distinguished from one another by their associated cognitive elements. Specifically, each discipline is assumed to have its own vocabulary, which it uses when communicating research results in the form of publications. Linking this framework to text processing methods, we elaborate on how the occurrence of noun phrases within disciplines may be used to represent disciplines and documents as term-occurrence vectors in a high-dimensional vector space model.
From the Web of Science database, we collect over seven million publications spread over 33 disciplines. By comparing the angles between these disciplines’ term-occurrence vectors, we construct a discipline similarity structure and use this structure to generate maps and a clustering solution. We find that both the structure and the clusters are highly stable over time. We explore two different ways of computing a relevance score for noun phrases. The first may be useful for finding discipline-specific cognitive content; the second is highly effective at removing low-relevance noun phrases from the vector space model while preserving the similarity structure. The effect of this pruning of low-relevance terms is further explored in the final experimental step of the research, where we divide the sample into a training sample and a test sample and classify 1.4 million test publications based on their highest similarity to the training sample disciplines. We find encouraging classification performance, which is nearly the same with and without pruning of low-relevance terms.
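To make the vector comparison and the classification step concrete, the following is a minimal Python sketch, not the implementation used in this research. It assumes that noun phrases have already been extracted from titles and abstracts, represents each discipline and document as a plain count vector, and assigns a document to the discipline whose vector forms the smallest angle with it (the highest cosine similarity). The function names, the two toy disciplines, and all noun phrases are invented for illustration; relevance scoring and pruning are left out.

```python
import math
from collections import Counter


def term_vector(noun_phrases):
    """Build a term-occurrence vector (noun phrase -> count)."""
    return Counter(noun_phrases)


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-occurrence vectors."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector has no defined angle; treat as dissimilar
    return dot / (norm_u * norm_v)


def classify(doc_vector, discipline_vectors):
    """Return the discipline whose vector is most similar to the document."""
    return max(
        discipline_vectors,
        key=lambda name: cosine_similarity(doc_vector, discipline_vectors[name]),
    )


# Toy "training" disciplines (hypothetical noun phrases, not real data).
discipline_vectors = {
    "physics": term_vector(
        ["quantum state", "magnetic field", "quantum state", "energy level"]
    ),
    "economics": term_vector(
        ["market price", "labor supply", "market price", "interest rate"]
    ),
}

# A toy "test" publication, already reduced to its noun phrases.
new_doc = term_vector(["magnetic field", "energy level", "interest rate"])
predicted = classify(new_doc, discipline_vectors)
print(predicted)
```

In this toy case the document shares two noun phrases with the first discipline and one with the second, so it is assigned to the first. Sparse count dictionaries are used here only to keep the sketch dependency-free; at the scale of millions of publications a sparse matrix representation would be the more practical choice.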
Our results indicate that we can indeed derive a stable and meaningful structure of the science space from the text of publications’ titles and abstracts. The classification shows that this structure can subsequently be used to place new publications within it with encouraging accuracy. This is an important conclusion, as methods for mapping the science space have so far been mostly restricted to citation data. These results open new avenues for research, potentially into the systematic assessment of novelty and new combinations in science.