dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Eijnatten, J. van
dc.contributor.author: Bijl, Niels
dc.date.accessioned: 2024-07-31T23:03:07Z
dc.date.available: 2024-07-31T23:03:07Z
dc.date.issued: 2024
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/47009
dc.description.abstract: This thesis introduces a novel method for extracting book titles from Optical Character Recognition scanned historical newspaper archives using Named Entity Recognition (NER), a task not extensively explored in existing literature. By identifying books highlighted by reviewers and journalists, we can gain insights into the evolving cultural and literary tastes of society. Utilizing a dataset from the Leeuwarder Courant, the study applies various NER models, including BiLSTM-CRF and transformer-based models. The transformer-based models outperformed others, achieving an F1 score of 84.3% on the test dataset, demonstrating the effectiveness of these models in extracting text representing book titles from newspaper archives. In addition to assessing performance on a NER level, an evaluation was conducted to measure how well the best NER model could identify the actual discussed books. This was achieved by matching the extracted book title text to the titles in the Nederlandse Bibliografie Totaal (NBT), a comprehensive compilation of all books published by Dutch publishers. Despite high NER performance, the matching process yielded a suboptimal F1 score of 59.4%. This gap was primarily due to the training data not being specifically labeled for NER purposes, making its repurposing as a NER dataset inadequate. Consequently, the model often missed subtitles, resulting in incomplete title extraction. Further analysis showed that even with perfect NER predictions, matching titles to the NBT achieved an F1 score of only 65.5%. This finding highlights the need for additional information besides the main title, such as subtitles, authors, and potentially publishers, to improve the accuracy of title matching to the NBT.
dc.description.sponsorship: Utrecht University
dc.language.iso: EN
dc.subject: This thesis introduces a novel method for extracting book titles from Optical Character Recognition scanned historical newspaper archives using Named Entity Recognition.
dc.title: EXTRACTING BOOK TITLES FROM HISTORICAL NEWSPAPER ARCHIVES: A NAMED ENTITY RECOGNITION APPROACH
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.courseuu: Applied Data Science
dc.thesis.id: 34945

