The increasing availability of large volumes of spatio-temporal satellite imagery is driving the need for adequate tools to process, analyse and extract actionable information from the data. Machine learning (ML) and, in particular, deep learning methods have become the state-of-the-art in many vision, language and signal processing tasks, due to their ability to extract patterns from complex, high-dimensional input data. Classical ML methods, such as random forests and support vector machines [1, 2], have been used in many Earth Observation (EO) applications to analyse time series of remote sensing images. Convolutional neural networks, on the other hand, have been employed to exploit the spatial correlations between neighbouring observations, although mainly in single-scene applications [3]. In this paper, similarly to the work of [4], we investigate a deep learning architecture capable of simultaneously analysing the spatial and temporal relationships in satellite image series. We demonstrate its application to land cover classification of the Republic of Slovenia, using Sentinel-2 satellite images acquired over the year 2017.
METHOD
We present a Temporal Fully-Convolutional Network (TFCN) that extends fully-convolutional network (FCN) architectures [5], in particular their U-Net encoder-decoder variant, which are currently the state-of-the-art in single-scene semantic segmentation. Similarly to [4], the architecture exploits spatio-temporal correlations to maximise the classification score, with the additional benefit of representing spatial relationships at different scales thanks to the encoding-decoding U-Net structure.
The input time series is a tensor of shape [T, H, W, D], where T is the number of time frames, and H, W and D are the height, width and number of bands (i.e. depth) of the input image tensor. The network performs 3D convolutions over both the spatial and the temporal dimensions. By default, max-pooling is performed in the spatial domain only. As the target land cover labels are not time-dependent (i.e. one label per pixel is available for the entire time series), 1D convolutions along the temporal dimension are performed in the decoding path of the architecture to linearly combine and reduce the temporal features. The schematic of the model architecture is shown in Figure 1. The output of the network is a 2D label map, which is compared to the ground-truth labels. The architecture performs three encoding and decoding steps (i.e. three max-pooling and three deconvolution layers), with a bank of two convolution layers at each encoding and decoding scale. The number of convolutional filters was set to 16 at the original scale and doubled at each deeper level, with a kernel width of 3. The Adam optimiser (learning rate 0.001) was employed. The TFCN model is implemented in TensorFlow. The code to reproduce the entire pipeline will be released at the time of the conference.
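To make this description concrete, the minimal sketch below shows how such a temporal U-Net could be assembled in TensorFlow/Keras. It illustrates the architecture outlined above and in Figure 1 rather than the released implementation; the padding choices, the transposed-convolution upsampling and the way the temporal dimension is finally collapsed are our own assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models


def conv_bank(x, filters):
    """Bank of two 3x3x3 convolutions over the temporal and spatial dimensions."""
    for _ in range(2):
        x = layers.Conv3D(filters, kernel_size=3, padding="same", activation="relu")(x)
    return x


def build_tfcn(t, h, w, d, n_classes, base_filters=16, depth=3):
    """Minimal TFCN sketch: 3D convolutions, spatial-only pooling, temporal reduction."""
    inputs = layers.Input(shape=(t, h, w, d))

    # Encoding path: spatial-only max-pooling, number of filters doubled at each level
    skips, x = [], inputs
    for level in range(depth):
        x = conv_bank(x, base_filters * 2 ** level)
        skips.append(x)
        x = layers.MaxPool3D(pool_size=(1, 2, 2))(x)
    x = conv_bank(x, base_filters * 2 ** depth)

    # Decoding path: spatial deconvolutions and skip connections
    for level in reversed(range(depth)):
        x = layers.Conv3DTranspose(base_filters * 2 ** level, kernel_size=(1, 2, 2),
                                   strides=(1, 2, 2), padding="same")(x)
        x = layers.Concatenate()([x, skips[level]])
        x = conv_bank(x, base_filters * 2 ** level)

    # 1D (temporal) convolution collapsing the time dimension into a single frame,
    # followed by a 1x1 convolution producing the 2D label map
    x = layers.Conv3D(base_filters, kernel_size=(t, 1, 1), padding="valid", activation="relu")(x)
    x = layers.Reshape((h, w, base_filters))(x)
    outputs = layers.Conv2D(n_classes, kernel_size=1, activation="softmax")(x)

    model = models.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model


# e.g. 23 time frames, 256x256 pixel patches, 9 input features, 8 land cover classes
model = build_tfcn(t=23, h=256, w=256, d=9, n_classes=8)
```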
EXPERIMENTS
To evaluate the TFCN, we present an end-to-end workflow to generate a land cover map of Slovenia. We do this using eo-learn, a framework to handle multi-temporal and multi-source satellite data, both in raster and vector format. The framework allows the area of interest (AOI) to be split into smaller patches that can be processed with limited computational resources, and supports automation of the processing pipeline. The framework consists of open-source Python packages and is designed to facilitate the prototyping and building of Earth Observation applications. We use this framework to process the data and to train the ML models that predict land use and land cover classes.
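As a minimal illustration of these building blocks, the hedged sketch below defines a custom eo-learn task that derives a valid-data mask from a cloud mask and attaches it to an EOPatch. The feature names ('CLM', 'IS_VALID') are assumptions made for this sketch, and the exact EOPatch construction may differ between eo-learn versions; in the actual pipeline such tasks are chained into an eo-learn workflow and executed in parallel over the patches rather than called by hand.

```python
import numpy as np
from eolearn.core import EOPatch, EOTask, FeatureType


class AddValidDataMaskTask(EOTask):
    """Illustrative task: flags a pixel as valid when the cloud mask is zero."""

    def execute(self, eopatch):
        cloud_mask = eopatch[FeatureType.MASK]['CLM']          # (T, H, W, 1), 1 = cloud
        eopatch[FeatureType.MASK]['IS_VALID'] = (cloud_mask == 0).astype(np.uint8)
        return eopatch


# Toy usage with random data standing in for a downloaded Sentinel-2 patch;
# depending on the eo-learn version, a bounding box may also be required here.
patch = EOPatch()
patch[FeatureType.MASK]['CLM'] = (np.random.rand(5, 64, 64, 1) > 0.7).astype(np.uint8)
patch = AddValidDataMaskTask().execute(patch)
```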
The framework was used to generate a land cover map of the Republic of Slovenia for the year 2017. The inputs to the framework are a shape-file defining the geometry of the AOI, the Sentinel-2 L1C images for the entire year, and a set of training labels. Avoiding the download and processing of entire tile products (e.g. Sentinel-2 granules) provides flexibility and facilitates the automation of the processing pipelines. A pipeline in eo-learn is defined as a connected acyclic graph of well-specified tasks to be performed on the data. eo-learn supports parallelisation of operations, such that the same workflow (e.g. data preparation for land cover classification) can be run in parallel on the smaller patches constituting the AOI. Logging and reporting make it possible to monitor and debug the execution of the processing pipeline. The automated pipeline for predicting the land cover labels proceeds as follows:
1. The boundary of the AOI was split into 25 × 17 equal parts, which resulted in about 300 patches of roughly 1000 × 1000 pixels at a 10 m resolution. The splitting choice depends on the amount of available resources. The output of this step is a list of bounding boxes covering the AOI (a splitting sketch is given after this list).
2. The bounding boxes were used to create patches, or containers, where the information for the corresponding area is stored. Given the time interval of interest (e.g. from 2017-01-01 to 2017-12-31), eo-learn used the sentinelhub-py package to download the desired band data. The s2cloudless cloud detector was subsequently used to obtain the cloud probabilities and cloud masks. The following Sentinel-2 features were used as input to the model: the bands B02, B03, B04, B08, B11 and B12, their Euclidean norm NORM, and the indices NDVI and NDWI (the feature computation is sketched after this list).
3. Ground-truth information was obtained in vector format, rasterised over the given bounding box and added to the patch (see the rasterisation sketch after this list). The ground truth is openly available through the Slovenian Ministry of Agriculture, Forestry and Food, and is mapped to the following classes: cultivated land, forest, grassland, shrubland, water, wetlands, artificial surface and bare land.
4. A random spatial sampling of the patches was performed to select the time series of pixels used for ML model training. The sampling is uniform throughout the patch and independent of the labels (see the sampling sketch after this list).
5. Due to the non-constant acquisition dates of the satellites and irregular weather conditions, missing data is common in EO. To deal with this, the mask of valid pixels in the time series is used to interpolate the values and fill the gaps caused by the missing data. Linear interpolation was used, after which the time series was resampled at uniform temporal intervals (every 16 days) to unify the dates among all the patches (see the gap-filling sketch after this list). This normalisation allows the model to generalise to different AOIs and time intervals.
6. The data was split into training, validation and test sets (60:20:20) and provided to the deep learning model, which was trained on an Amazon Web Services (AWS) instance.
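For step 1, the splitting of the AOI can be performed with the BBoxSplitter utility of the sentinelhub package, as in the sketch below; the file name and the choice of UTM zone 33N are assumptions made for illustration.

```python
import geopandas as gpd
from sentinelhub import BBoxSplitter, CRS

# Load the AOI geometry (placeholder file name) and project it to a metric CRS
# (UTM zone 33N covers Slovenia) so patches are roughly equal-sized on the ground.
aoi = gpd.read_file('svn_border.geojson').to_crs(epsg=32633)
aoi_shape = aoi.geometry.values[0]

# Split the AOI into a 25 x 17 grid, keeping only the boxes that intersect it.
bbox_splitter = BBoxSplitter([aoi_shape], CRS.UTM_33N, (25, 17))
bbox_list = bbox_splitter.get_bbox_list()
print(f'{len(bbox_list)} bounding boxes cover the AOI')
```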
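For step 2, the derived features can be computed directly from the band stack. The numpy sketch below assumes the bands are stacked in the order [B02, B03, B04, B08, B11, B12] and only serves to make the NORM, NDVI and NDWI definitions explicit.

```python
import numpy as np

def compute_features(bands, eps=1e-6):
    """bands: (T, H, W, 6) array ordered as [B02, B03, B04, B08, B11, B12]."""
    green, red, nir = bands[..., 1], bands[..., 2], bands[..., 3]

    ndvi = (nir - red) / (nir + red + eps)           # vegetation index
    ndwi = (green - nir) / (green + nir + eps)       # water index
    norm = np.sqrt((bands ** 2).sum(axis=-1))        # Euclidean norm of the band vector

    extra = np.stack([norm, ndvi, ndwi], axis=-1)
    return np.concatenate([bands, extra], axis=-1)   # (T, H, W, 9) model input
```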
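For step 3, the rasterisation of the vector ground truth can be expressed with rasterio, as sketched below; the actual pipeline uses the corresponding eo-learn task, and the function and argument names here are illustrative only.

```python
import rasterio.features
from rasterio.transform import from_bounds

def rasterise_labels(geometries, class_values, bbox, height, width):
    """Burn (geometry, class id) pairs into a label raster aligned with a patch.

    bbox is (min_x, min_y, max_x, max_y) in the same CRS as the geometries.
    """
    transform = from_bounds(*bbox, width=width, height=height)
    return rasterio.features.rasterize(
        zip(geometries, class_values),
        out_shape=(height, width),
        transform=transform,
        fill=0,              # 0 marks unlabelled pixels
        dtype='uint8',
    )
```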
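For step 5, the gap filling and 16-day resampling of a single-pixel time series can be written with numpy's piecewise-linear interpolation; the helper below and its toy usage are illustrative (in the actual pipeline the operation is applied to every pixel and band of a patch).

```python
import numpy as np

def interpolate_pixel(values, valid_mask, acquisition_days, resampled_days):
    """Fill gaps in one pixel's time series and resample it on a uniform grid.

    values:           (T,) raw observations, possibly cloudy or missing
    valid_mask:       (T,) boolean mask of valid (cloud-free) observations
    acquisition_days: (T,) acquisition dates, e.g. as day-of-year
    resampled_days:   (N,) uniform output grid, e.g. every 16 days
    """
    # np.interp interpolates linearly between valid observations and holds the
    # first/last valid value constant outside the observed range.
    return np.interp(resampled_days, acquisition_days[valid_mask], values[valid_mask])


# Toy example: one NDVI series resampled onto a uniform 16-day grid over 2017
days = np.array([3, 18, 45, 71, 120, 158, 199, 241, 270, 301, 340])
ndvi = np.array([0.20, 0.22, 0.25, 0.35, 0.60, 0.75, 0.80, 0.70, 0.55, 0.35, 0.25])
valid = np.array([True, False, True, True, True, True, False, True, True, True, True])
ndvi_16day = interpolate_pixel(ndvi, valid, days, np.arange(1, 366, 16))
```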
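Finally, for steps 4 and 6, the sketch below samples pixel time series uniformly over a patch, independently of the labels, and splits the sample indices 60:20:20; the names and the choice of single pixels as sampling unit are assumptions made for illustration.

```python
import numpy as np

def sample_pixels(features, labels, n_samples, rng=None):
    """Uniform random spatial sampling of pixel time series from one patch.

    features: (T, H, W, D) patch features; labels: (H, W) rasterised ground truth.
    Returns (n_samples, T, D) pixel series and their (n_samples,) labels.
    """
    rng = rng or np.random.default_rng(42)
    t, h, w, d = features.shape
    rows = rng.integers(0, h, size=n_samples)
    cols = rng.integers(0, w, size=n_samples)
    return features[:, rows, cols, :].transpose(1, 0, 2), labels[rows, cols]

def train_val_test_split(n_samples, rng=None):
    """Shuffle sample indices and split them 60:20:20 into train/validation/test."""
    rng = rng or np.random.default_rng(42)
    idx = rng.permutation(n_samples)
    n_train, n_val = int(0.6 * n_samples), int(0.2 * n_samples)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```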
RESULTS
The trained model was used to predict the labels on the test set and the results were validated against the ground truth. An overall accuracy of 84.4% and a weighted F1 score of 85.4% were achieved. In general, prediction was poor for under-represented classes such as wetlands and shrubland. These results represent preliminary work on a prototype architecture that was not optimised for the task at hand. Despite this, the results are in line with previously reported work (e.g. Inglada et al. [1]). Optimisation of the architecture (e.g. number of features, depth of the network, number of convolutions) and of the hyper-parameters (e.g. learning rate, number of epochs, class weighting) is required to fully assess the potential of the TFCN.
DISCUSSIONS AND CONCLUSIONS
In this paper we have presented a deep learning approach to land-use and land-cover classification. While the architecture of the presented TFCN model was not fully optimised, it shows promising results, with the overall accuracy and weighted F1 score reaching levels of about 85%. A thorough and extensive evaluation comparing the TFCN with state-of-the-art methods will be presented at the symposium. Insights into good practices for training, evaluating and deploying machine learning and deep learning models for EO applications will also be discussed. The source code to reproduce the framework and the models will be made available at the time of the presentation, to foster research and development in the exploitation of spatio-temporal data in EO applications.
REFERENCES
[1] Inglada J., Vincent A., Arias M., Tardy B., Morin D., and Rodes I., "Operational High-Resolution Land Cover Map Production at the Country Scale Using Satellite Image Time Series", Remote Sensing 9(1), DOI: 10.3390/rs9010095, 2017.
[2] Karakizi C., Karantzalos K., Vakalopoulou M., and Antoniou G., "Detailed Land Cover Mapping from Multitemporal Landsat-8 Data of Different Cloud Cover", Remote Sensing 10(8), DOI: 10.3390/rs10081214, 2018.
[3] Helber P., Bischke B., Dengel A., and Borth D., "EuroSAT: A Novel Dataset and Deep Learning Benchmark for Land Use and Land Cover Classification", arXiv:1709.00029, 2017.
[4] Russwurm M., and Koerner M., "Multi-Temporal Land Cover Classification with Sequential Recurrent Encoders", ISPRS International Journal of Geo-Information 7(4), DOI: 10.3390/ijgi7040129, 2018.
[5] Long J., Shelhamer E., and Darrell T., "Fully Convolutional Networks for Semantic Segmentation", 2015 IEEE Conference on Computer Vision and Pattern Recognition, DOI: 10.1109/CVPR.2015.7298965, 2015.