
dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Velegrakis, Y. I.
dc.contributor.advisor: Vreman, R. A.
dc.contributor.author: Donovan, M.J.
dc.date.accessioned: 2021-08-25T18:00:18Z
dc.date.available: 2021-08-25T18:00:18Z
dc.date.issued: 2021
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/41197
dc.description.abstract: Document retrieval in Health Policy Research is labor-intensive and inefficient. To investigate the efficacy and transparency of health policy processes such as drug approval, reports are manually collected from the websites of health regulatory bodies. This paper discusses the configuration of a web crawler to automate this process. It details the use of Apache Nutch to crawl the European Medicines Agency (EMA) website and retrieve European Public Assessment Reports (EPARs). The crawler is designed for the EMA context, but Nutch provides capabilities for wider applications, which are also documented. The crawler successfully gathered the correct URLs, creating a database of the target reports. The scalability of this web crawler is apparent from Nutch's capabilities; however, some of the configurations remain context-specific. The extensible nature of the crawler properties, although valuable, requires extensive knowledge to implement. This paper provides a detailed description of how to crawl EMA and provides guidance on how this configuration can be applied to other contexts. Future research into the range of Nutch capabilities is recommended to ensure the tool is being used to its full capacity.
dc.description.sponsorship: Utrecht University
dc.format.extent: 774643
dc.format.mimetype: application/pdf
dc.language.iso: en
dc.title: A Web Crawler for Automated Document Retrieval in Health Policy
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.courseuu: Applied Data Science

