Data integration: a water quality case study
Summary
Data integration is becoming increasingly important as more information is shared through linked data and Spatial Data Infrastructures. To investigate the issues involved, this research uses the Water Quality Register (WQR) of the Informatiehuis Water as a case study. In this case study, the data residing in three separate data sources (the Water Framework Directive (WFD) Database, the Bulkdatabase and Limnodata) is to be integrated into a single register, the WQR. A full integration requires harmonisation steps at the data model level (schema mapping and transformation) and at the instance level (instance matching).
Schema mapping involves defining correspondences between equivalent elements in two or more data models (schemas), defined in, for example, the Unified Modelling Language (UML) using class diagrams or the XML Schema Definition language (XSD). Correspondences are created between a source schema and a target schema. Because existing documentation is lacking, the schemas of the data sources are documented during this research using reverse engineering techniques. This documentation shows that none of the sources adheres fully to a known standard, and that referential integrity and the quality of the data contents are lacking.
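The idea of a correspondence between a source and a target element can be illustrated with a minimal Python sketch. It is not the mapping used in the case study (which relies on XSLT, XQuery and Altova MapForce); the field names `MPN_CODE`, `X_RD` and `Y_RD` and the target names are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class Correspondence:
    """One correspondence: a source element, its equivalent target element,
    and an optional value transformation (identity by default)."""
    source_field: str
    target_field: str
    transform: Callable[[Any], Any] = lambda value: value


def apply_mapping(record: Dict[str, Any],
                  correspondences: List[Correspondence]) -> Dict[str, Any]:
    """Translate one source record into the target schema.
    Source fields without a correspondence are dropped."""
    target: Dict[str, Any] = {}
    for c in correspondences:
        if c.source_field in record:
            target[c.target_field] = c.transform(record[c.source_field])
    return target


# Hypothetical source field names; the real schemas are not reproduced here.
MAPPING = [
    Correspondence("MPN_CODE", "identifier"),
    Correspondence("X_RD", "x", float),
    Correspondence("Y_RD", "y", float),
]

if __name__ == "__main__":
    source_record = {"MPN_CODE": "NL01", "X_RD": "1200.5", "Y_RD": "4800", "REMARK": "n/a"}
    print(apply_mapping(source_record, MAPPING))
```

Keeping the correspondences as data, separate from the code that applies them, mirrors the role of a schema mapping language: the mapping can be documented and reviewed independently of its execution.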
Because none of the existing schemas is suitable for data integration, a target model for the WQR is developed based on INSPIRE (themes Hydrography, Environmental Monitoring Facilities and Area Regulation, Restriction zones and Reporting Units), ISO 19156 (“Observations and Measurements”), the WISE reporting sheets and Aquo. The conceptual target schema in UML is converted to an application schema in XSD. A number of schema mapping languages exist for documenting the correspondences, but only a few of them have practical tooling available. As part of the research, three options are described further and their applicability to the case study is examined: Rule Interchange Format (RIF), Ontology Mapping Language (OML) and XSLT. For the case study, XSLT (and XQuery) in combination with Altova MapForce is chosen as the most suitable option for implementation.
The second part of a full integration is instance matching. The key spatial object in the case study is the monitoring location. During instance matching, inconsistencies caused by double entries within a data source (conflation) and by overlap between data sources (equivalence) are detected and resolved. This is done by matching the locations against each other using their geometry, geographical name and identifier. The resulting matches are used to create a single reference set of unique monitoring locations. Both the INSPIRE Geographical Names schema and the Gazetteer schema are investigated for their suitability as the schema for the reference set. Preference is given to the Geographical Names schema because it allows for more semantic detail. Adaptations to the Geographical Names schema are suggested to make it more suitable as a reference schema.
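The matching on geometry, geographical name and identifier can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the research: the way the three criteria are combined, the thresholds `max_distance_m` and `min_name_similarity`, and the `Location` structure are all hypothetical, and coordinates are assumed to be in a projected system with metre units.

```python
import math
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List


@dataclass(frozen=True)
class Location:
    source: str       # name of the data source the entry comes from
    identifier: str   # source identifier, may be empty
    name: str         # geographical name
    x: float          # projected coordinates, assumed to be in metres
    y: float


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two names (0.0 to 1.0)."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def is_match(a: Location, b: Location,
             max_distance_m: float = 25.0,
             min_name_similarity: float = 0.8) -> bool:
    """Assumed rule: equal non-empty identifiers match outright; otherwise
    the locations must be both close together and similarly named."""
    if a.identifier and a.identifier == b.identifier:
        return True
    distance = math.hypot(a.x - b.x, a.y - b.y)
    return (distance <= max_distance_m
            and name_similarity(a.name, b.name) >= min_name_similarity)


def build_reference_set(locations: List[Location]) -> List[Location]:
    """Keep the first occurrence of each group of matching locations."""
    reference: List[Location] = []
    for loc in locations:
        if not any(is_match(loc, kept) for kept in reference):
            reference.append(loc)
    return reference
```

The greedy pass compares every location with the entries already kept, which is quadratic in the number of locations; for large sources a spatial index would be the obvious refinement.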
Based on the user requirements from the case study, a hybrid approach is tested for data integration. This hybrid approach combines the use of a harmonised database (Water Database) for storing data collected after the formation of the Water Database, with the use of a mediated schema approach for queries involving data existing in the original data sources prior to the formation of the Water Database (historic data).
The Water Database is built using the WQR target schema and filled through an Extract, Load and Transform process with relevant data (surface water bodies and monitoring programs) from the WFD Database and the monitoring locations from the reference set.
The integration solution, the WQR mediated schema, uses the Water Database as a new source alongside the existing Bulkdatabase and Limnodata data sources. In a proof of concept, the WQR mediated schema retrieves information from these data sources using XSLT and XQuery. The mediated schema uses the INSPIRE Geographical Names reference set of monitoring locations as the central reference for geographic queries. The proof of concept is functional but not practical because of long response times, which result from the use of file-based XML data sources. Suggestions to improve performance are given but not tested.
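The mediated schema principle, in which one query against the reference set is fanned out to several sources, can be sketched in Python. The proof of concept itself uses XSLT and XQuery on XML files; the in-memory rows, the field names and the functions `make_wrapper` and `mediated_query` below are hypothetical and only illustrate the principle.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]
Wrapper = Callable[[str], Iterable[Record]]


def make_wrapper(rows: List[Record], id_field: str, value_field: str,
                 reference_ids: Dict[str, str]) -> Wrapper:
    """Wrap one source. `reference_ids` maps the source's own location
    identifiers to identifiers in the shared reference set."""
    def query(reference_id: str) -> Iterable[Record]:
        for row in rows:
            if reference_ids.get(str(row[id_field])) == reference_id:
                # Translate the source row into the mediated (target) form.
                yield {"location": reference_id, "value": row[value_field]}
    return query


def mediated_query(reference_id: str,
                   wrappers: Dict[str, Wrapper]) -> List[Record]:
    """Send one query to every source and merge the translated results."""
    results: List[Record] = []
    for source_name, wrapper in wrappers.items():
        for record in wrapper(reference_id):
            results.append({"source": source_name, **record})
    return results


if __name__ == "__main__":
    # Hypothetical rows and identifiers for two of the sources.
    bulk = make_wrapper([{"mpn": "B1", "waarde": 7.9}], "mpn", "waarde", {"B1": "REF-1"})
    limno = make_wrapper([{"loc": "L9", "val": 8.1}], "loc", "val", {"L9": "REF-1"})
    print(mediated_query("REF-1", {"Bulkdatabase": bulk, "Limnodata": limno}))
```

Each wrapper scans its source in full for every query, which is comparable to reading file-based XML on each request and hints at why response times grow with source size.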