        •   Utrecht University Student Theses Repository Home
        • UU Theses Repository

        A Web Crawler for Automated Document Retrieval in Health Policy

        MichelleDonovan_thesis_ADS_v2.pdf (756.4Kb)
        Publication date
        2021
        Author
        Donovan, M.J.
        Summary
        Document retrieval in health policy research is labor-intensive and inefficient. To investigate the efficacy and transparency of health policy processes such as drug approval, reports are manually collected from the websites of health regulatory bodies. This paper discusses the configuration of a web crawler to automate this process. It details the use of Apache Nutch to crawl the European Medicines Agency (EMA) website and retrieve European Public Assessment Reports (EPARs). The crawler is designed for the EMA context, but Nutch provides capabilities for wider applications, which are also documented. The crawler successfully gathered the correct URLs, creating a database of the target reports. The scalability of the crawler is apparent in terms of Nutch's capabilities; however, some of the configurations remain context-specific. The extensible nature of the crawler properties, although valuable, requires extensive knowledge to implement. This paper provides a detailed description of how to crawl EMA and guidance on how this configuration can be applied to other contexts. Future research into the range of Nutch capabilities is recommended to ensure the tool is used to its full capacity.
        URI
        https://studenttheses.uu.nl/handle/20.500.12932/41197
        Collections
        • Theses