Developing a foundation model on high-resolution satellite data of the Netherlands
Summary
Foundation models, trained in a self-supervised manner on large datasets, have proven their potential. They show unprecedented performance on various downstream tasks while requiring less labeled data than models trained specifically for those tasks. As a result, AI development has shifted from training on large human-annotated datasets to self-supervised training that needs only a handful of labeled examples. This research aims to develop a foundation model using 1.2m high-resolution satellite data of the Netherlands. By combining a Convolutional Neural Network and a transformer, the model can capture both low- and high-frequency features in landscape characteristics. Taking temporal data as input lets the model learn from a broader context. This approach enables the model to learn richer features from less data and to generalize better. The foundation model is tested on several downstream tasks, ranging from specific use cases in the Netherlands to global benchmarking datasets. On the vegetation monitoring dataset from Rijkswaterstaat, learning from temporal data instead of a single moment in time yields a noticeable performance improvement. On global benchmarking datasets, the model scores below state-of-the-art models, but it does so with a smaller model size and with less pre-training data, drawn solely from the Netherlands. The approach thus demonstrates versatility and applicability while keeping the parameter count manageable compared to relatively large computer vision models.