Who can command the Random Forest and make the trees pull Data out of the earth?: Predicting soil types through Random Forest machine learning using open-source data

Molleman, D.M.

dc.rights.license	CC-BY-NC-ND
dc.contributor.advisor	Oosterom, P.J.M.
dc.contributor.advisor	Balado, J.
dc.contributor.author	Molleman, D.M.
dc.date.accessioned	2021-08-23T18:00:24Z
dc.date.available	2021-08-23T18:00:24Z
dc.date.issued	2021
dc.identifier.uri	https://studenttheses.uu.nl/handle/20.500.12932/41034
dc.description.abstract	For military terrain analysis, a detailed soil map is needed to assess terrain accessibility during a mission. Predicting soil classes where no data is available is a difficult task. For this reason, the Random Forest algorithm has been applied to predict the individual soil properties sand, silt, clay, coarse fragments, organic content, and cation exchange capacity, which can be combined into a Unified Soil Classification System (USCS) soil class. Using open-source data points in combination with explanatory variables that are available for all of Europe, the model is trained to predict soil property values at a spatial resolution of 30 meters in areas where no soil samples are present. The predictors used include satellite imagery, spectral indices, hydrological data, and digital elevation models and their derivatives. As the liquid limit and the plasticity index are both needed for the USCS but are not included in European soil samples, they must be calculated from the other soil properties. A linear regression model based on clay content and cation exchange capacity was fitted on US data so that the two properties could be predicted in Europe. The linear regression model reached an R-squared of 0.842 for the liquid limit and 0.895 for the plasticity index. In this study, an innovative method for model validation is used that ensures consistent validation statistics by generating subsets that each contain points that are well distributed over the entire range of values and are also geographically dispersed. Hydrological predictors scored high in importance when predicting sand, silt, clay, and cation exchange capacity. Moreover, variation in coarse fragments was mostly explained by the digital elevation model and its derivatives, and the nitrogen content of a soil had the highest importance in predicting organic content.
In predicting the individual soil properties, the most variation was explained for clay content (68.5%), followed by silt and sand content (57.9% and 48.4%, respectively). For cation exchange capacity and organic content, explained variations of 40% and 44.5% were attained. The lowest R-squared statistic was reached for coarse fragments, where only 38.1% of the variation was explained. The Random Forest algorithm proved effective in predicting soil properties with limited samples available while maintaining a spatial resolution of 30 meters. Additionally, an improved method for determining the Atterberg limits was developed for use in areas where no data on the limits is available. Furthermore, a validation method was constructed that provided consistent statistics describing explained variation.
dc.description.sponsorship	Utrecht University
dc.format.extent	11374228
dc.format.mimetype	application/pdf
dc.language.iso	en
dc.title	Who can command the Random Forest and make the trees pull Data out of the earth?: Predicting soil types through Random Forest machine learning using open-source data
dc.type.content	Master Thesis
dc.rights.accessrights	Open Access
dc.subject.keywords	predicting soil properties, USCS, liquid limit, plasticity index, Random Forest, machine learning, remote sensing, open-source data.
dc.subject.courseuu	Geographical Information Management and Applications (GIMA)

Files in this item

Name: Master Thesis Daan.pdf
Size: 10.84Mb
Format: PDF

