Spatio-temporal data mining for vineyard yield estimation (A Slovenian case study using linear regressions and self-organizing maps)
Crop yield estimation is important for several reasons; for instance, it can be used to plan harvest and storage requirements. The main objective of this thesis was to use spatio-temporal data mining to estimate vineyard yield in Slovenia. To this end, an after-harvest yield model capable of estimating and verifying yield at any location was built. Data mining was used to discover and quantify spatio-temporal relations between vineyard yield and selected explanatory variables. The Goriška Brda wine district (western Slovenia) was selected as the study area, and yield was estimated both for all grape varieties in the district and for Rebula, the most common variety there. From a methodological point of view, the aim of this thesis was to explore different spatio-temporal data mining approaches. Thus, two types of regression, namely ordinary least squares (OLS) and geographically weighted regression (GWR), and one neural network method, self-organizing maps (SOM), were explored. The data available for this study derive mainly from the Slovenian Ministry of Agriculture and Environment (MAE). From these data, explanatory and yield variables were extracted. The explanatory variables were physiogeographical (slope, exposition, etc.), related to vineyard characteristics (distance between rows, distance between vines, etc.) and socio-economic (e.g. area of vineyards cultivated by a farmer). The dependent variable was the after-harvest grape yield per vine as declared by the farmers. The spatial unit of the research was the single vineyard field, and the data span five years, from 2007 to 2011. OLS and GWR results were compared to identify the method that better explains grape yield variation. Regression results were also compared with selected meteorological characteristics to estimate their effect on the accuracy of yield estimation.
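The difference between a global OLS fit and a GWR fit can be illustrated with a minimal sketch. It is not the implementation used in the thesis: the data are synthetic, the single predictor `x` merely stands in for an explanatory variable such as slope, and the Gaussian kernel and the `bandwidth` value are assumptions made for illustration.

```python
import math

def weighted_fit(x, y, w):
    """Weighted least squares for y = a + b*x; returns (a, b)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return my - b * mx, b

def r_squared(y, y_hat):
    """Coefficient of determination of predictions y_hat against y."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def ols_predict(x, y):
    """Global OLS: one intercept and one slope for the whole study area."""
    a, b = weighted_fit(x, y, [1.0] * len(x))
    return [a + b * xi for xi in x]

def gwr_predict(coords, x, y, bandwidth):
    """GWR: a separate weighted fit at every location, with Gaussian
    distance-decay weights (kernel and bandwidth are assumptions)."""
    preds = []
    for (ui, vi), xi in zip(coords, x):
        w = [math.exp(-0.5 * (math.hypot(ui - uj, vi - vj) / bandwidth) ** 2)
             for uj, vj in coords]
        a, b = weighted_fit(x, y, w)
        preds.append(a + b * xi)
    return preds

# Synthetic 6 x 6 grid of "fields"; the effect of x on y grows from west to east.
coords = [(i, j) for i in range(6) for j in range(6)]
x = [((3 * i + 7 * j) % 5) + 1.0 for i, j in coords]
y = [2.0 + (0.2 + 0.3 * i) * xi for (i, j), xi in zip(coords, x)]

r2_ols = r_squared(y, ols_predict(x, y))
r2_gwr = r_squared(y, gwr_predict(coords, x, y, bandwidth=1.5))
print(f"OLS R2 = {r2_ols:.2f}, GWR R2 = {r2_gwr:.2f}")
```

Because the synthetic relationship varies over space by construction, the local fits explain more variance here than the single global fit. On real declarations this need not hold, and in-sample R² favours the more flexible model, so such a comparison is only indicative.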
Next, an unsupervised SOM clustering was performed, i.e. the dependent variable (yield) was not taken into account when clustering. The resulting SOM clusters were projected into geographical space to check for spatial patterns. Further, yield variation within clusters was investigated to assess the value of the clustering. The results of all three data mining methods, both for all grape varieties and for the Rebula variety, indicate that yield cannot be properly estimated using the selected methods and/or the selected explanatory variables. In the regression comparison, GWR performed better than OLS, although its R² was only approximately 0.15 to 0.25, depending on the year. The comparison of prediction accuracy with meteorological characteristics showed no relation either. The SOM results were similarly poor: clusters were barely identifiable and, although they did form certain geographical patterns, they did not reflect yield variation (i.e. average yield was similar for all clusters). Possible causes for these poor results are a) the absence of a clear yield pattern in the dataset, b) the suitability of the explanatory variables, and c) the data preparation and the parameterization of the methods. Nevertheless, the research provides tools to identify declaration errors that should be inspected and corrected to improve the accuracy of the MAE registers.
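The unsupervised SOM step, followed by a check of yield within clusters, can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's procedure: the 2 x 2 grid, the linear decay of the learning rate and neighbourhood width, the two synthetic feature groups and the synthetic yields are all hypothetical.

```python
import math
import random

def best_matching_unit(weights, v):
    """Grid node whose weight vector is closest to sample v."""
    return min(weights,
               key=lambda node: sum((a - b) ** 2 for a, b in zip(weights[node], v)))

def train_som(data, rows, cols, epochs=50, lr0=0.5, sigma0=1.5, seed=0):
    """Train a small rectangular SOM; yield is never seen during training."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Initialise every node with a randomly chosen sample.
    weights = {(r, c): list(rng.choice(data))
               for r in range(rows) for c in range(cols)}
    order = list(range(len(data)))
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                  # learning rate decays linearly
        sigma = max(sigma0 * (1.0 - frac), 0.3)  # neighbourhood shrinks
        rng.shuffle(order)
        for idx in order:
            v = data[idx]
            bmu = best_matching_unit(weights, v)
            for node, w in weights.items():
                d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-d2 / (2.0 * sigma * sigma))
                for k in range(dim):
                    w[k] += lr * h * (v[k] - w[k])
    return weights

def mean_yield_by_cluster(weights, features, yields):
    """Average yield of the samples mapped to each SOM node."""
    groups = {}
    for v, yld in zip(features, yields):
        groups.setdefault(best_matching_unit(weights, v), []).append(yld)
    return {node: sum(vals) / len(vals) for node, vals in groups.items()}

# Synthetic, already scaled features (e.g. slope, row spacing) in two groups.
features = [(0.10 + 0.01 * (k % 5), 0.10 + 0.01 * (k % 3)) for k in range(20)]
features += [(0.90 - 0.01 * (k % 5), 0.90 - 0.01 * (k % 3)) for k in range(20)]
# Synthetic yield per vine, deliberately unrelated to the features.
yields = [1.5 + 0.1 * (k % 4) for k in range(40)]

weights = train_som(features, rows=2, cols=2)
cluster_means = mean_yield_by_cluster(weights, features, yields)
for node, mean in sorted(cluster_means.items()):
    print(node, round(mean, 2))
```

If the per-cluster means barely differ, as reported for the real data, the clustering carries little information about yield even when the clusters themselves are well separated in feature space.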