Who can command the Random Forest and make the trees pull Data out of the earth?: Predicting soil types through Random Forest machine learning using open-source data
Summary
Military terrain analysis requires a detailed soil map to assess terrain accessibility during a mission, yet predicting soil classes where no data are available is difficult. In this study, the Random Forest algorithm was applied to predict six individual soil properties: sand, silt, clay, coarse fragments, organic content, and cation exchange capacity. These properties can be combined into a Unified Soil Classification System (USCS) class. Using open-source data points in combination with explanatory variables that are available for all of Europe, the model was trained to predict soil property values at a spatial resolution of 30 meters in areas where no soil samples are present. The predictors include satellite imagery, spectral indices, hydrological data, and a digital elevation model with its derivatives. The liquid limit and the plasticity index are both needed for the USCS but are not included in European soil samples, so they must be calculated from the other soil properties. To this end, a linear regression model based on clay content and cation exchange capacity was fitted on US data so that the two properties could be predicted in Europe. This model reached an R-squared of 0.842 for the liquid limit and 0.895 for the plasticity index. The study also uses an innovative validation method that ensures consistent validation statistics by generating subsets whose points are well distributed over the entire range of values and are also geographically dispersed. Hydrological predictors scored high in importance when predicting sand, silt, clay, and cation exchange capacity. Variation in coarse fragments was mostly explained by the digital elevation model and its derivatives, and soil nitrogen content was the most important predictor of organic content.
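The regression step that estimates an Atterberg limit from clay content and cation exchange capacity can be illustrated with a small least-squares fit. The following is only a minimal sketch: the function names (`fit_ols`, `predict`, `r_squared`), the sample values, and the coefficients used to generate the synthetic liquid limits are all hypothetical and are not taken from the study or from any US dataset.

```python
def fit_ols(rows, targets):
    """Least-squares fit of target = b0 + b1*x1 + b2*x2 via the normal equations."""
    # Design matrix with a leading 1.0 for the intercept term.
    X = [[1.0, *r] for r in rows]
    n = len(X[0])
    # Augmented normal-equation system [X^T X | X^T y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(n)]
         + [sum(x[i] * t for x, t in zip(X, targets))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def predict(coef, rows):
    """Apply the fitted intercept and slopes to new (clay, CEC) rows."""
    return [coef[0] + sum(c * v for c, v in zip(coef[1:], r)) for r in rows]


def r_squared(y_true, y_pred):
    """Share of variance explained: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up (clay content, cation exchange capacity) pairs, for illustration only.
samples = [(10.0, 5.0), (20.0, 12.0), (35.0, 18.0),
           (50.0, 30.0), (15.0, 20.0), (40.0, 10.0)]
# Synthetic liquid limits generated from an arbitrary linear rule (assumption).
liquid_limit = [12.0 + 0.5 * clay + 0.9 * cec for clay, cec in samples]

coef = fit_ols(samples, liquid_limit)
r2 = r_squared(liquid_limit, predict(coef, samples))
print(coef, r2)
```

Because the synthetic targets are noise-free, the fit recovers the generating coefficients almost exactly. With real laboratory measurements the R-squared would be lower, as the values reported in this summary show. The same fit would be repeated with the plasticity index as the target.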
Among the individual soil properties, the most variation was explained for clay content (68.5%), followed by silt content (57.9%) and sand content (48.4%). For cation exchange capacity and organic content, 40% and 44.5% of the variation were explained, respectively. The lowest R-squared was obtained for coarse fragments, where only 38.1% of the variation was explained. The Random Forest algorithm proved effective in predicting soil properties from a limited number of samples while maintaining a spatial resolution of 30 meters. Additionally, an improved method for determining the Atterberg limits was developed for use in areas where no data on the limits are available. Furthermore, a validation method was constructed that provided consistent statistics on explained variation.
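The idea behind the validation subsets, each covering the full range of values while staying geographically dispersed, can be sketched as follows. This summary does not describe the actual procedure, so the approach below is an assumption: points are ranked by value and cut into bins, each bin is ordered by a coarse grid cell, and the points are then dealt round-robin into folds. The function name `make_folds`, its parameters, and the random sample points are hypothetical.

```python
import random
from collections import defaultdict


def make_folds(points, k=5, n_bins=4, cell_size=1.0):
    """Assign each (x, y, value) point to one of k folds.

    Points are ranked by value and cut into n_bins equal-sized bins. Inside
    each bin they are ordered by coarse grid cell, then dealt round-robin,
    so spatial neighbours with similar values tend to land in different folds.
    """
    ordered = sorted(range(len(points)), key=lambda i: points[i][2])
    bin_size = -(-len(ordered) // n_bins)  # ceiling division
    folds = defaultdict(list)
    counter = 0
    for start in range(0, len(ordered), bin_size):
        value_bin = ordered[start:start + bin_size]
        # Order by grid cell so consecutive points are spatial neighbours.
        value_bin.sort(key=lambda i: (points[i][0] // cell_size,
                                      points[i][1] // cell_size))
        for i in value_bin:
            folds[counter % k].append(i)
            counter += 1
    return [folds[f] for f in range(k)]


# Made-up sample locations and values (e.g. a soil property in percent).
rng = random.Random(42)
points = [(rng.uniform(0, 10), rng.uniform(0, 10), rng.uniform(0, 100))
          for _ in range(40)]

folds = make_folds(points, k=5, n_bins=4, cell_size=2.0)
for number, fold in enumerate(folds):
    values = [points[i][2] for i in fold]
    print(number, len(fold), round(min(values), 1), round(max(values), 1))
```

Each fold then serves once as the hold-out set, and an R-squared computed per fold can be compared across folds to check that the statistic is stable.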