A Web Crawler for Automated Document Retrieval in Health Policy
Summary
Document retrieval in health policy research is labor-intensive and inefficient.
To investigate the efficacy and transparency of health policy processes
such as drug approval, researchers manually collect reports from the websites
of health regulatory bodies. This paper discusses the configuration of a web
crawler to automate this process. It details the use of Apache Nutch to crawl
the European Medicines Agency (EMA) website and retrieve European Public
Assessment Reports (EPARs). The crawler is designed to be successful in the
context of EMA, but Nutch provides capabilities for wider applications, which
are also documented. The crawler was successful in gathering the correct URLs,
creating a database of the target reports. The scalability of this web crawler
is apparent in Nutch's capabilities; however, some of the configurations
remain context-specific. The extensible nature of the crawler properties,
although valuable, requires extensive knowledge to implement. This paper provides
a detailed description of how to crawl EMA and guidance on
how this configuration can be applied to other contexts. Future research into
the range of Nutch capabilities is recommended to ensure the tool is being
used to its full capacity.