SUMMARY
Recent implementations of machine learning (ML) tools running on large-scale data have been enabled by vast improvements in computing capability and parallelization, theoretical advances in ML models (in particular in the field of deep learning), and, most importantly, an explosion of available data: the fuel of any ML engine.
Copernicus' unparalleled volume and quality of Earth observation (EO) data make it an ideal candidate for such methods; however, the EO field has not yet seen significant uptake of ML due to several shortcomings, which we will address in the presentation:
•There are only a few high-quality sources of annotated data currently available for training supervised algorithms on satellite imagery. Such data is critical to the advancement of ML research. We will present a tool for crowd-sourced classification and provide a means to integrate with existing open datasets (OpenStreetMap, SpaceNet, etc.), thus creating a large, new corpus of data for researchers to make use of.
•Most available ML frameworks require special approaches to accept EO data. The most challenging aspect is handling time-series data, which is critical for the classification of vegetation. We will present an integration of various tools (eo-learn, sat-utils and label maker) to leverage popular open-source ML libraries such as TensorFlow and MXNet for EO applications.
•Many ML efforts in the EO field are limited to a very specific geographic area. To scale results globally, we rely on AWS infrastructure, including its SageMaker platform, and on Sentinel Hub satellite imagery processing services.
All tools and datasets developed are released under open-source licenses. This opens Copernicus data for further research and ML applications by providing annotated data and software tools. Due to the temporal and spatial resolution of these products, the data is especially well suited for creating novel ML algorithms and EO applications.
MACHINE LEARNING
With the availability of massive volumes of data, ML has become an important tool for analysis, ranging from various random forest algorithms to complex convolutional neural networks. Satellite imagery displays characteristic issues and artefacts: clouds, atmospheric effects and inaccurate geolocation distort the data, missing or cloudy scenes create gaps, etc., making it difficult to exploit well-known frameworks such as TensorFlow, MXNet, and others. Furthermore, the lack of ground truth for training and validation is one of the major challenges preventing efficient use of ML tools.
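To make the gap problem concrete, a common pre-processing step is to interpolate per-pixel time series over cloudy or missing acquisitions before feeding them to an ML model. The following is a minimal NumPy sketch of that idea, not the pipeline used in the project:

```python
import numpy as np

def interpolate_gaps(series, valid):
    """Linearly interpolate a per-pixel time series over invalid
    (e.g. cloudy) acquisitions, given a boolean validity mask."""
    t = np.arange(len(series))
    # np.interp fills values at invalid timestamps from valid neighbours
    return np.interp(t, t[valid], series[valid])

# NDVI-like series with two cloud-contaminated acquisitions (NaN)
ndvi = np.array([0.2, 0.3, np.nan, 0.5, np.nan, 0.7])
valid = ~np.isnan(ndvi)
filled = interpolate_gaps(ndvi, valid)
# filled is now a regular, gap-free series usable by standard ML tools
```

In practice the interpolation is run over every pixel of a multi-temporal stack, with validity derived from a cloud mask rather than from NaNs.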
We try to address these elements within the Query Planet project, running within ESA's Φ-Lab, which is developing the following open-source tools:
•eo-learn, a package acting as a bridge between EO data and existing ML and computer vision tools
•a classification application and enhancements to the label maker tool
•a repository for ground truth data
Importantly, the above-mentioned elements are integrated and demonstrated on a number of use-cases, mainly related to land cover classification and global water monitoring. These use-cases are released as open source, including training sets and trained networks, which makes it convenient for anyone to use, modify, and extend them.
EO-LEARN
eo-learn is a collection of open-source Python packages that have been developed to seamlessly access and process spatio-temporal image sequences acquired by any satellite fleet in a timely and automatic manner. eo-learn is easy to use, its design is modular, and it encourages collaboration: the sharing and reuse of specific tasks in typical EO value-extraction workflows, such as cloud masking, image co-registration, feature extraction, and classification. Everyone is free to use any of the available tasks and is encouraged to improve on them, develop new ones, and share them with the rest of the community.
eo-learn makes extraction of valuable information from satellite imagery as easy as defining a sequence of operations to be performed. Figure 1 illustrates a processing chain that executes automatic classification of land cover in a user-specified region of interest.
The eo-learn library acts as a bridge between the EO/remote sensing field and the Python ecosystem for data science and ML. The library uses NumPy arrays and Shapely polygons to store and handle remote sensing data. Its aim is, on one hand, to lower the entry barrier to remote sensing for non-experts and, on the other, to bring the state-of-the-art computer vision, ML, and deep learning tools of the Python ecosystem to remote sensing experts.
The design of the eo-learn library follows the dataflow programming paradigm and consists of three building blocks:
•EOPatch - the common data object for spatio-temporal EO and non-EO data and their derivatives; it contains multi-temporal remotely sensed data of a single patch (area) of Earth's surface, typically defined by a bounding box in a specific coordinate reference system, in both raster and vector format. EOPatch is completely sensor-agnostic, meaning that imagery from different sensors (satellites) or sensor types (optical, synthetic-aperture radar, etc.) can be added to an EOPatch.
•EOTask - a single, well-defined operation performed on input EOPatch(es), which returns a modified EOPatch. EOTasks are the heart of the eo-learn library. They define how the available satellite imagery can be manipulated in order to extract valuable information from it. Typical users will most often be interested in which tasks are already implemented, but they can also write custom EOTasks if their desired functionality doesn't yet exist.
•EOWorkflow - a collection of EOTasks connected into an EO value-adding processing chain or value-extraction pipeline. The EOWorkflow ensures that the EOTasks are executed in the correct order and with the correct parameters. Under the hood, the EOWorkflow builds a directed acyclic graph; there is no limitation on the number of nodes or on the graph's topology, as long as the graph is acyclic.
There are several existing packages, covering common EO analysis steps:
•eo-learn-core, the main subpackage which implements basic building blocks, commonly used functionalities, and logging/reporting.
•eo-learn-coregistration, dealing with image co-registration to correct geolocation errors.
•eo-learn-features is a collection of utilities for extracting data properties and feature manipulation.
•eo-learn-geometry is used for geometric transformation and conversion between vector and raster data.
•eo-learn-io, the input/output subpackage that deals with obtaining data from various data source services or with saving and loading data locally. It provides seamless access to the global archives of Sentinel-1 GRD, Sentinel-2 (L1C and L2A), Sentinel-3 OLCI, Sentinel-5P, Landsat-8, MODIS, Envisat MERIS, and the ESA archive of Landsat-5 and -7 through the Sentinel Hub services. The open-source sat-utils libraries are used to work with locally stored or remotely accessible GeoTiff files and OpenStreetMap data.
•eo-learn-mask, used for masking of data and calculation of cloud masks.
•eo-learn-ml-tools - set of tools that can be used before or after the ML process.
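As a rough illustration of the kind of operation eo-learn-mask enables, the sketch below applies a per-pixel cloud mask to a multi-temporal stack and builds a cloud-free temporal composite. The array layout (time, height, width, bands) matches how EOPatch stores raster data; everything else here is a simplified assumption, not eo-learn code:

```python
import numpy as np

# Tiny (time, height, width, bands) stack: 2 acquisitions of a 2x2,
# single-band patch, plus a matching boolean cloud mask.
stack = np.arange(2 * 2 * 2 * 1, dtype=float).reshape(2, 2, 2, 1)
cloud_mask = np.zeros((2, 2, 2, 1), dtype=bool)
cloud_mask[0, 0, 1] = True  # one cloudy pixel in the first acquisition

# Replace cloudy observations with NaN so they can be ignored downstream
masked = np.where(cloud_mask, np.nan, stack)

# Temporal mean that skips cloudy observations pixel by pixel
mean_composite = np.nanmean(masked, axis=0)
```

The same masking logic scales unchanged to full-size patches with many timestamps and bands.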
The eo-learn package can be easily integrated with other Python packages, e.g. within an EOTask node. We will demonstrate how it works with TensorFlow and other ML libraries. Jupyter Notebook is used as the main "IDE".
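The usual bridge to ML libraries is a reshaping step: most classifiers expect a flat (samples, features) matrix, so a multi-temporal patch is turned into one feature vector per pixel. A minimal NumPy sketch of that step, with illustrative dimensions:

```python
import numpy as np

# Hypothetical multi-temporal patch: 4 timestamps, 8x8 pixels, 3 bands,
# in the (time, height, width, bands) layout used for raster features.
t, h, w, b = 4, 8, 8, 3
patch = np.random.default_rng(0).random((t, h, w, b))

# Move the pixel axes to the front, then flatten time and bands into a
# single feature vector per pixel: (h*w samples, t*b features).
features = patch.transpose(1, 2, 0, 3).reshape(h * w, t * b)
```

The resulting matrix can be passed directly to scikit-learn-style classifiers or batched for TensorFlow/MXNet models.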
GROUND TRUTH LABELS
We are addressing the lack of ground truth data required for training and analysis in two ways: by identifying openly available regional and global datasets of sufficient quality that can be used as input, and by creating a classification app that experts or crowds can use to collect missing labels. OpenStreetMap, SpaceNet, Corine land-cover, various official register datasets (buildings, roads, farm parcels), and similar sources can be used efficiently to create training data. These sources will be freely available on the Geopedia cloud-based GIS.
The classification app is a web-based tool that requires user authentication in order to associate individual records with a specific user for labelling quality assessment. Users can define their own labelling campaigns and make them public or private. Campaigns define which satellite images will be annotated (e.g. Sentinel-2 or another data source, based on the use-case), the size of the area (e.g. 512x512 px), and the sampling method for the selected areas (e.g. random selection, areas mis-classified by an ML model). Complete labelling of the whole area is required in cases where vaguely defined data must be avoided, while completeness is not enforced in other cases, where only specific elements (e.g. built-up areas) are of interest. Users are able to explore the area around the dedicated tile and visualise various band combinations (e.g. NDVI, false colour, NDWI, custom options). Campaigns can thus be tailored to various use-cases (label options, area limitations, satellite imagery sources, supporting datasets), and the open-source nature of the tool allows further customization. Classified data can be exported using a dedicated API (integrated with eo-learn) or in standard formats (e.g. SHP, GeoTiff, GeoJSON).
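To illustrate what an exported label record might look like in the GeoJSON format mentioned above, here is a sketch of a single labelled tile. The property names (campaign, label, user_id, source) are illustrative assumptions, not the app's actual export schema:

```python
import json

# Hypothetical GeoJSON export of one labelled campaign tile;
# all field names below are assumed for illustration.
label_record = {
    "type": "Feature",
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[14.50, 46.05], [14.51, 46.05],
                         [14.51, 46.06], [14.50, 46.06],
                         [14.50, 46.05]]],
    },
    "properties": {
        "campaign": "land-cover-demo",   # assumed campaign name
        "label": "built-up",
        "user_id": "annotator-42",       # ties the record to a user for QA
        "source": "Sentinel-2 L2A",
    },
}

geojson = json.dumps({"type": "FeatureCollection",
                      "features": [label_record]})
```

A standard format such as this lets exported labels flow directly into eo-learn or any GIS tool without custom parsing.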
CONCLUSION
Volume, availability and quality of open EO data have reached a level where big data methods are not just a meaningful option but a necessity. However, few established options are freely available yet, and we believe that Query Planet addresses these needs. We will demonstrate its usability in two end-to-end use-cases: land cover classification at country scale and global monitoring of water reservoirs.
This announcement, made part way through the project, is a call for cooperation with other researchers in the field, so that, where possible, we can produce results fitting their requirements.