An improved quality pipeline for Google Location History data
Summary
New digital tools give researchers of human behavior a new way to collect the data they need: digital trace data. Data donation can be an option for collecting such data without ethical risk. Data Download Packages (DDPs) are popular for data donation in human behavior research because of their secrecy, integrity, availability, and controllability. However, errors can occur at every step of data collection and wrangling. Google manages the data it obtains from user devices in accordance with the General Data Protection Regulation (GDPR) and creates DDPs according to its own standard. One popular format among them is the Google Semantic Location History (GSLH) format, delivered as a JSON file. Errors in the current Google DDPs cause problems for human behavior researchers who use a data donation system, because they expect the participants' geo-data to be measured accurately as the foundation of their analysis.
In this paper, the key question is how to handle data errors in the Google Semantic Location History format. We examine the quality of the logs for two activity types, "Walking" and "Cycling", in the JSON file of GSLH DDPs to determine whether these activity types are classified reasonably given the data context. Our workflow pipeline, which consists of flagging and imputation, is designed to limit clear data errors, improve data quality, and ultimately strengthen the validity of analysis results in human behavior research. Through this processing pipeline, we find a large proportion (22.6%) of activity-type errors in our pilot data source. Imputation appears feasible, although more resources are needed to settle the parameters and test validity.
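The flag-then-impute idea can be illustrated with a minimal Python sketch. It is not the pipeline used in this paper. It assumes that each activity segment has already been reduced to a record with a hypothetical `activity_type`, `distance_m`, and `duration_s`, that a segment is flagged when its implied average speed falls outside a plausible range for its label, and that the speed ranges shown are illustrative placeholders rather than validated parameters.

```python
# Illustrative speed ranges in km/h (assumed values, not parameters from the paper).
SPEED_RANGE_KMH = {"WALKING": (0.5, 8.0), "CYCLING": (5.0, 40.0)}


def speed_kmh(segment):
    """Average speed implied by distance and duration, or None if duration is not positive."""
    hours = segment["duration_s"] / 3600
    if hours <= 0:
        return None
    return segment["distance_m"] / 1000 / hours


def is_flagged(segment):
    """Flag a segment whose implied speed is implausible for its labelled activity type."""
    bounds = SPEED_RANGE_KMH.get(segment["activity_type"])
    if bounds is None:
        return False  # activity types outside the study are left untouched
    speed = speed_kmh(segment)
    if speed is None:
        return True  # a non-positive duration is treated as an error
    low, high = bounds
    return not (low <= speed <= high)


def impute_activity(segment):
    """Return the original label if unflagged, else the single plausible label, else None."""
    if not is_flagged(segment):
        return segment["activity_type"]
    speed = speed_kmh(segment)
    if speed is None:
        return None
    candidates = [
        activity
        for activity, (low, high) in SPEED_RANGE_KMH.items()
        if low <= speed <= high
    ]
    return candidates[0] if len(candidates) == 1 else None


def error_share(segments):
    """Fraction of segments that are flagged."""
    return sum(is_flagged(s) for s in segments) / len(segments)


# Made-up example records, not data from the pilot study.
segments = [
    {"activity_type": "WALKING", "distance_m": 1000, "duration_s": 900},
    {"activity_type": "WALKING", "distance_m": 10000, "duration_s": 1800},
    {"activity_type": "CYCLING", "distance_m": 500, "duration_s": 600},
    {"activity_type": "CYCLING", "distance_m": 100000, "duration_s": 3600},
]
imputed = [impute_activity(s) for s in segments]
```

In this sketch, a flagged segment that matches no plausible range is left as `None` rather than forced into a category, which keeps the imputation conservative until the thresholds have been validated.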