Cross-modal recipe analysis for fine-grained geographical mapping of food consumption from images and supermarket sales
The proposed study analyses the applicability of a cross-modal deep learning model, trained by Li et al. (2020b) on online recipe data (Salvador et al., 2017), to recognise ingredients and their proportional amounts in food images from social media, with the aim of geographically mapping food consumption within a city. This new method can provide insights into food consumption at fine spatial granularity, contributing to global health, helping to diminish world hunger, and revealing opportunities for local production and consumption. The study explores whether cross-modal analysis of images and ingredient sets can be used to estimate relative ingredient proportions from images taken in a specific geographical area. These estimated ingredient amounts are compared to relative consumption rates derived from a translated dataset of supermarket retail sales within the same region. Specifically, the research focuses on Baku, Azerbaijan, as a case study. To study consumption in this region, a new dataset, AzerFSQFood, is presented, containing food images from eating establishments around the city. A pretrained cross-modal neural network (Li et al., 2020b) is applied to this novel image dataset, and the results are compared to a newly translated supermarket sales dataset based on an existing Azerbaijani dataset from supermarkets in Baku (Zeynalov, 2020). The existing cross-modal neural network performs poorly on the AzerFSQFood dataset with respect to both ingredient detection and relative amount prediction. This indicates that existing cross-modal models are not yet readily applicable to novel datasets, and highlights the need for publicly available models and curated datasets. In addition, the study finds a significant correlation between Foursquare images and supermarket sales around Baku in terms of relative ingredient consumption.
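The comparison described above can be illustrated with a minimal sketch: normalise ingredient amounts from each source to relative proportions, then correlate them over the shared ingredients with a rank correlation. All ingredient names and numbers below are invented toy values for illustration; this is an assumed workflow, not the study's actual pipeline or data.

```python
# Hypothetical sketch: compare image-derived relative ingredient
# proportions against supermarket-sales-derived proportions using
# Spearman rank correlation (implemented here in pure Python).

def to_proportions(counts):
    """Normalise raw ingredient amounts to relative proportions."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def ranks(values):
    """Average 1-based ranks, with ties assigned their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of tied positions, 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Toy ingredient amounts (illustrative only, not real study data)
image_amounts = {"tomato": 120, "lamb": 80, "rice": 60, "onion": 40}
sales_amounts = {"tomato": 900, "lamb": 500, "rice": 700, "onion": 300}

shared = sorted(image_amounts.keys() & sales_amounts.keys())
p_img = to_proportions(image_amounts)
p_sales = to_proportions(sales_amounts)
rho = spearman([p_img[k] for k in shared], [p_sales[k] for k in shared])
print(f"Spearman rho over {len(shared)} shared ingredients: {rho:.2f}")
```

Spearman's rho is a natural choice here because the two sources measure consumption on incommensurable scales (image pixel proportions vs. sales volumes), so only the relative ordering of ingredients is directly comparable.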