        EXTRACTING BOOK TITLES FROM HISTORICAL NEWSPAPER ARCHIVES: A NAMED ENTITY RECOGNITION APPROACH

        File
        Thesis Final Niels Bijl.pdf (1.319 MB)
        Publication date
        2024
        Author
        Bijl, Niels
        Summary
        This thesis introduces a novel method for extracting book titles from historical newspaper archives digitized with Optical Character Recognition (OCR), using Named Entity Recognition (NER), a task that has received little attention in the existing literature. Identifying the books that reviewers and journalists highlighted offers insight into the evolving cultural and literary tastes of society. Using a dataset from the Leeuwarder Courant, the study applies several NER models, including BiLSTM-CRF and transformer-based models. The transformer-based models performed best, achieving an F1 score of 84.3% on the test dataset, which demonstrates their effectiveness at extracting the text of book titles from newspaper archives. Beyond NER-level performance, a second evaluation measured how well the best NER model could identify the actual books being discussed. This was done by matching the extracted title text to the titles in the Nederlandse Bibliografie Totaal (NBT), a comprehensive compilation of all books published by Dutch publishers. Despite the high NER performance, the matching process yielded a considerably lower F1 score of 59.4%. This gap arose primarily because the training data had not been labeled specifically for NER, which made it inadequate when repurposed as an NER dataset. Consequently, the model often missed subtitles, resulting in incomplete title extraction. Further analysis showed that even with perfect NER predictions, matching titles to the NBT achieved an F1 score of only 65.5%. This finding highlights the need for information beyond the main title, such as subtitles, authors, and potentially publishers, to improve the accuracy of title matching to the NBT.
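The matching step and its F1 evaluation described in the summary can be illustrated with a minimal sketch. The summary does not say how the thesis matches titles to the NBT, so the fuzzy string matching below (Python's standard `difflib.SequenceMatcher`), the `threshold` value, the helper names, and the toy titles are all assumptions made for illustration only.

```python
from difflib import SequenceMatcher


def normalize(title):
    """Lowercase and collapse whitespace so spacing noise matters less."""
    return " ".join(title.lower().split())


def best_match(extracted, catalogue, threshold=0.8):
    """Return the catalogue title most similar to `extracted`, or None.

    Assumption: similarity is difflib's ratio and 0.8 is an arbitrary cutoff.
    """
    query = normalize(extracted)
    best_title, best_score = None, 0.0
    for title in catalogue:
        score = SequenceMatcher(None, query, normalize(title)).ratio()
        if score > best_score:
            best_title, best_score = title, score
    return best_title if best_score >= threshold else None


def precision_recall_f1(predicted, gold):
    """Set-based precision, recall, and F1 over matched catalogue titles."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical toy catalogue standing in for the NBT (invented titles).
catalogue = [
    "De stille molen: een Friese familiegeschiedenis",
    "Het verloren eiland",
    "Wind over het wad",
]
# Hypothetical NER output: one title lacks its subtitle, one has an OCR error.
extracted = ["De stille molen", "Het verloren eilancl", "Wind over het wad"]

matches = [best_match(title, catalogue) for title in extracted]
predicted = {match for match in matches if match is not None}
precision, recall, f1 = precision_recall_f1(predicted, set(catalogue))
print(matches)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy setup the title extracted without its subtitle finds no catalogue match, while the title with a small OCR error still does, which mirrors the kind of failure the summary attributes to missed subtitles.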
        URI
        https://studenttheses.uu.nl/handle/20.500.12932/47009