Geodata source retrieval in PDOK by multilingual/semantic query expansion
Summary
The volume of geodata available on Spatial Data Infrastructures (SDIs) continues to grow, and this abundance makes geodata increasingly difficult to discover and access in distributed environments. It is difficult for end-users to find relevant content offered by different data providers. The problem becomes more challenging when search engines must support natural language, since the findability of datasets depends on the search techniques used and on the clarity of the queries. The keywords used by users often differ from the keywords recorded in metadata, even though the submitted keywords may be semantically related to the metadata content. Natural Language Processing (NLP) techniques can therefore be combined with search engine technology to capture the semantic and linguistic content of metadata, helping users who face language limitations or who work in specific domains. When a query performs poorly, the business logic behind the search engine reformulates and enriches it with synonyms and relations gathered from online data resources, which affects the recall and precision of geodata retrieval. This approach is a common technique and has been implemented for open-domain search engines such as Google and for location-based services. However, spatial search and NLP techniques for current Catalogue Services (CSs) are ongoing research topics and still require much work before users can take full advantage of existing open government datasets. To address the limitations of search on current SDIs and to bridge the gap between what users have in mind and the content documented in metadata, in this research we examined query expansion using WordNet and the Google Translate API to generate additional semantically related keywords. In this work, we proposed a corpus-based methodology for query keyword extraction.
The corpus is gathered from real users’ questions in natural language. The extracted keywords are then enriched using the WordNet and Google Translate APIs. The evaluation compares the resulting sets of query keywords against a manual gold standard and a baseline. We carried out the empirical evaluation by developing three retrieval platforms. Our study shows that the semantic keywords produced by the local multilingual WordNet platform help users by offering good alternative formulations of their queries. This approach also improves the precision and recall of geo-dataset retrieval by 1% and 22%, respectively.
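
The query expansion and evaluation steps summarised above can be illustrated with a minimal Python sketch. It is not the implementation used in this study. The synonym table stands in for WordNet, the Dutch-to-English table stands in for a translation API, and the catalogue, the function names and all entries are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical stand-ins for the external resources (assumptions, not the
# study's data): a tiny synonym table in place of WordNet and a tiny
# Dutch-to-English table in place of a translation API.
SYNONYMS: Dict[str, Set[str]] = {
    "road": {"street", "highway"},
    "building": {"edifice"},
}
TRANSLATIONS: Dict[str, str] = {
    "weg": "road",
    "gebouw": "building",
}

# Hypothetical catalogue: dataset identifier -> keywords recorded in metadata.
CATALOGUE: Dict[str, Set[str]] = {
    "ds-roads": {"highway", "transport"},
    "ds-buildings": {"building", "address"},
    "ds-water": {"river", "lake"},
}


def expand_query(keywords: List[str]) -> Set[str]:
    """Return the original keywords plus their translations and synonyms."""
    expanded: Set[str] = set()
    for keyword in keywords:
        term = keyword.lower()
        expanded.add(term)
        # Translate first, so that synonyms are looked up for the English term.
        english = TRANSLATIONS.get(term, term)
        expanded.add(english)
        expanded |= SYNONYMS.get(english, set())
    return expanded


def retrieve(terms: Set[str]) -> Set[str]:
    """Return the datasets whose metadata keywords share a term with the query."""
    return {ds for ds, meta in CATALOGUE.items() if meta & terms}


def precision_recall(retrieved: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Compute precision and recall of a result set against a gold standard."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    gold = {"ds-roads"}  # hypothetical gold-standard answer for the query "weg"
    baseline = retrieve({"weg"})
    expanded = retrieve(expand_query(["weg"]))
    print("baseline:", baseline, precision_recall(baseline, gold))
    print("expanded:", expanded, precision_recall(expanded, gold))
```

In this toy setting the unexpanded Dutch keyword matches no metadata, while the expanded query reaches the roads dataset through the translated term and one of its synonyms. A real system would also have to handle word-sense ambiguity, which can lower precision when unrelated synonyms are added.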