A Web Crawler for Automated Document Retrieval in Health Policy

Donovan, M.J.

dc.rights.license	CC-BY-NC-ND
dc.contributor.advisor	Velegrakis, Y. I.
dc.contributor.advisor	Vreman, R. A.
dc.contributor.author	Donovan, M.J.
dc.date.accessioned	2021-08-25T18:00:18Z
dc.date.available	2021-08-25T18:00:18Z
dc.date.issued	2021
dc.identifier.uri	https://studenttheses.uu.nl/handle/20.500.12932/41197
dc.description.abstract	Document retrieval in health policy research is labor-intensive and inefficient. To investigate the efficacy and transparency of health policy processes such as drug approval, researchers manually collect reports from the websites of health regulatory bodies. This paper discusses the configuration of a web crawler to automate this process. It details the use of Apache Nutch to crawl the European Medicines Agency (EMA) website and retrieve European Public Assessment Reports (EPARs). The crawler is designed for the EMA context, but Nutch provides capabilities for wider applications, which are also documented. The crawler successfully gathered the correct URLs, creating a database of the target reports. Nutch's capabilities make the crawler scalable; however, some of the configurations remain context-specific. The extensible nature of the crawler properties, although valuable, requires extensive knowledge to implement. This paper provides a detailed description of how to crawl EMA and guidance on how this configuration can be applied to other contexts. Future research into the range of Nutch capabilities is recommended to ensure the tool is used to its full capacity.
dc.description.sponsorship	Utrecht University
dc.format.extent	774643
dc.format.mimetype	application/pdf
dc.language.iso	en
dc.title	A Web Crawler for Automated Document Retrieval in Health Policy
dc.type.content	Master Thesis
dc.rights.accessrights	Open Access
dc.subject.courseuu	Applied Data Science

Files in this item

Name: MichelleDonovan_thesis_ADS_v2.pdf
Size: 756.4Kb
Format: PDF

This item appears in the following Collection(s)

Theses