Content words as measure of structure in the science space
Towards a classification of publications based on noun phrase occurrence

Lamers, W.S.

dc.rights.license	CC-BY-NC-ND
dc.contributor.advisor	Heimeriks, Dr G.J.
dc.contributor.advisor	Hoekman, Dr J.
dc.contributor.author	Lamers, W.S.
dc.date.accessioned	2015-11-25T18:00:28Z
dc.date.available	2015-11-25T18:00:28Z
dc.date.issued	2015
dc.identifier.uri	https://studenttheses.uu.nl/handle/20.500.12932/30389
dc.description.abstract	Recent developments in the production of scientific knowledge, namely an increase in interdisciplinary scientific research and interaction between academia and industry, pose challenges to citation-based methods for studying the structure of the science space. As an alternative, we ask whether it is possible to construct maps of science and find disciplinary similarity structures based not on citations, but on the content of publications’ titles and abstracts. We present a theoretical framework in which we define disciplines as being distinct from one another based on their associated cognitive elements. Specifically, each discipline will have its own unique vocabulary used when communicating its research results in the form of publications. Linking this framework to text processing methods, we elaborate on how the occurrence of noun phrases within disciplines may be used to represent disciplines and documents as term-occurrence vectors in a high-dimensional vector space model. From the Web of Science database, we collect over seven million publications spread over 33 disciplines. Comparing the angles between these disciplines’ term-occurrence vectors, we construct a discipline similarity structure and use this structure to generate maps and clustering solutions. We find that both the structure and the clusters are highly stable over time. We explore two different ways of computing a relevance score for noun phrases. The first may be useful for finding discipline-specific cognitive content; the second is highly effective at removing low-relevance noun phrases from the vector space model while preserving the similarity structure. The effect of this pruning of low-relevance terms is further explored in the final experimental step of the research, where we divide the sample into a test and a training sample and classify 1.4 million test publications based on their highest similarity to the training sample disciplines.
We find encouraging classification performance, nearly the same with and without pruning of low-relevance terms. Our results indicate that we can indeed derive a stable and meaningful structure of the science space from publications’ title and abstract text. The classification shows that this structure can subsequently be used to place new publications within it with encouraging accuracy. This is an important conclusion, as so far methods for mapping the science space have been mostly restricted to citation data. These results open new avenues for research, potentially into the systematic assessment of novelty and new combinations in science.
dc.description.sponsorship	Utrecht University
dc.format.extent	2376477
dc.format.extent	2440400
dc.format.mimetype	application/pdf
dc.format.mimetype	application/pdf
dc.language.iso	en
dc.title	Content words as measure of structure in the science space: Towards a classification of publications based on noun phrase occurrence
dc.type.content	Master Thesis
dc.rights.accessrights	Open Access
dc.subject.keywords	bibliometrics, vector space model, classification, disciplines
dc.subject.courseuu	Innovation Sciences
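
The approach summarized in the abstract (term-occurrence vectors, similarity from the angle between vectors, and assignment of a publication to its most similar discipline) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the toy disciplines, and the noun phrases are hypothetical, noun-phrase extraction is assumed to have been done already, and plain cosine similarity on raw counts is assumed as the angle-based measure, with no relevance scoring or pruning.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-occurrence vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no defined angle; treat as dissimilar
    return dot / (norm_a * norm_b)


def build_discipline_vectors(training):
    """Sum noun-phrase occurrences over all training documents per discipline."""
    vectors = {}
    for discipline, phrases in training:
        vectors.setdefault(discipline, Counter()).update(phrases)
    return vectors


def classify(phrases, discipline_vectors):
    """Return the discipline whose vector is most similar to the document."""
    doc = Counter(phrases)
    return max(
        discipline_vectors,
        key=lambda d: cosine_similarity(doc, discipline_vectors[d]),
    )


# Hypothetical toy data: (discipline, noun phrases found in title and abstract).
training = [
    ("physics", ["quantum state", "energy level", "quantum state"]),
    ("physics", ["energy level", "magnetic field"]),
    ("biology", ["gene expression", "cell membrane"]),
    ("biology", ["gene expression", "protein structure"]),
]

vectors = build_discipline_vectors(training)
label = classify(["magnetic field", "quantum state"], vectors)
print(label)
```

In this sketch a discipline vector is simply the sum of its documents' noun-phrase counts, so frequent generic phrases would dominate at scale; the relevance scoring mentioned in the abstract addresses that and is deliberately left out here.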

Files in this item

Name: Lamers.pdf
Size: 2.266Mb
Format: PDF

Name: Wout Lamers master's thesis ...
Size: 2.327Mb
Format: PDF

This item appears in the following Collection(s)

Theses
