EXTRACTING BOOK TITLES FROM HISTORICAL NEWSPAPER ARCHIVES: A NAMED ENTITY RECOGNITION APPROACH

Bijl, Niels

dc.rights.license	CC-BY-NC-ND
dc.contributor.advisor	Eijnatten, J. van
dc.contributor.author	Bijl, Niels
dc.date.accessioned	2024-07-31T23:03:07Z
dc.date.available	2024-07-31T23:03:07Z
dc.date.issued	2024
dc.identifier.uri	https://studenttheses.uu.nl/handle/20.500.12932/47009
dc.description.abstract	This thesis introduces a novel method for extracting book titles from historical newspaper archives digitized with Optical Character Recognition (OCR), using Named Entity Recognition (NER), a task not extensively explored in the existing literature. Identifying the books highlighted by reviewers and journalists offers insight into the evolving cultural and literary tastes of society. Using a dataset from the Leeuwarder Courant, the study applies several NER models, including BiLSTM-CRF and transformer-based models. The transformer-based models performed best, achieving an F1 score of 84.3% on the test dataset, which demonstrates their effectiveness in extracting the text spans that represent book titles. Beyond NER-level performance, a second evaluation measured how well the best NER model could identify the actual books discussed. This was done by matching the extracted title text to the titles in the Nederlandse Bibliografie Totaal (NBT), a comprehensive compilation of all books published by Dutch publishers. Despite the high NER performance, the matching process yielded a lower F1 score of 59.4%. This gap arose primarily because the training data had not been labeled specifically for NER, which made its repurposing as an NER dataset inadequate. Consequently, the model often missed subtitles, resulting in incomplete title extraction. Further analysis showed that even with perfect NER predictions, matching titles to the NBT achieved an F1 score of only 65.5%. This finding highlights the need for information beyond the main title, such as subtitles, authors, and potentially publishers, to improve the accuracy of title matching to the NBT.
dc.description.sponsorship	Utrecht University
dc.language.iso	EN
dc.subject	This thesis introduces a novel method for extracting book titles from historical newspaper archives digitized with Optical Character Recognition (OCR), using Named Entity Recognition (NER).
dc.title	EXTRACTING BOOK TITLES FROM HISTORICAL NEWSPAPER ARCHIVES: A NAMED ENTITY RECOGNITION APPROACH
dc.type.content	Master Thesis
dc.rights.accessrights	Open Access
dc.subject.courseuu	Applied Data Science
dc.thesis.id	34945
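
The abstract describes a second evaluation step in which extracted title text is matched against a bibliographic catalogue and scored with F1. The sketch below illustrates that idea only; it is not the thesis's implementation. It assumes exact matching on normalized strings, and the function names (`normalize`, `match_titles`, `f1_score`), the `nbt-…` identifiers, and the toy titles are all invented for illustration.

```python
import re
import unicodedata


def normalize(title: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def match_titles(extracted: list[str], catalogue: dict[str, str]) -> set[str]:
    """Return the catalogue ids whose normalized title equals an extracted title.

    Assumption: exact match after normalization. If two catalogue titles
    normalize to the same string, only the last one is kept in the index.
    """
    index = {normalize(title): book_id for book_id, title in catalogue.items()}
    return {index[normalize(t)] for t in extracted if normalize(t) in index}


def f1_score(predicted: set[str], gold: set[str]) -> float:
    """F1 over two sets of catalogue ids (harmonic mean of precision and recall)."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical catalogue entries and extracted spans (invented toy data).
catalogue = {
    "nbt-1": "De Oude Molen",
    "nbt-2": "Het Stille Water: een familiegeschiedenis",
    "nbt-3": "Winter in Friesland",
}
extracted = ["De oude molen", "Het Stille Water", "Winter in Friesland."]

predicted = match_titles(extracted, catalogue)
score = f1_score(predicted, set(catalogue))
```

In this toy data the second span lacks its subtitle, so it finds no catalogue entry and recall drops. That mirrors the failure mode the abstract reports, where missed subtitles lead to incomplete title extraction and weaker matching.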

Files in this item

Name: Thesis Final Niels Bijl.pdf
Size: 1.319Mb
Format: PDF

This item appears in the following Collection(s)

Theses