When we began to design our data processing pipeline, one of the first questions we had to answer was, “Which database will best serve the needs of the pipeline?”
To fully appreciate the considerations involved, we have to look at every constituent of the data processing workflow. As a multi-source satellite imagery analytics platform, the first link in our data chain is the satellite imagery itself. PlanetWatchers works with a number of radar and optical imagery providers in order to get as complete a picture as possible of the resources being monitored.
In addition to the satellite imagery, there are two other important sources of data: client data and labeling data. The client sends us datasets that describe the resources they want monitored: first and foremost the locations of the resources, as well as any metadata they want to track and correlate. When our remote sensing researchers — be they humans or algorithmic robots — identify interesting artifacts in the satellite imagery, they mark them off, or label them, with the relevant values and metadata. For instance, a forestry client is interested in tracking growth, harvest progress, and potential damage from pests and disease. All the identified information is tracked in the database as well.
Framing the Database Requirements
After reflecting on our essential interactions with a database in the envisioned pipeline, we came to the conclusion that there are two major criteria the database must fulfill. The first is that the structure should be flexible, since the varied data sources to be tracked in the database are complex, and the relationships between them may change significantly in the future. For instance, most of our clients organize the areas to be monitored in a hierarchy of many plots (or stands) in a project (or sector), but some have other hierarchies. Likewise, the labeling data is highly variable in both scope and structure from client to client, depending on the client’s needs and interests. Since we want a single pipeline that serves all our clients equally well, we need a database that supports a flexible data structure.
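To make the variability concrete, here is a hypothetical sketch of two plot documents with different hierarchies living side by side in one collection. The field names and values are illustrative assumptions, not our actual production schema:

```python
# Hypothetical documents illustrating per-client structural variation.
# Field names and values are illustrative, not the production schema.

forestry_plot = {
    "client": "forestry-co",
    "hierarchy": {"project": "northern-forest", "stand": "A-17"},
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[34.8, 32.1], [34.9, 32.1], [34.9, 32.2], [34.8, 32.1]]],
    },
    "metadata": {"species": "pine", "planted": "2012-04-01"},
}

agri_plot = {
    "client": "agri-co",
    # A different, deeper hierarchy -- no schema migration required.
    "hierarchy": {"region": "south", "farm": "greenfield", "field": "7"},
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[35.0, 31.5], [35.1, 31.5], [35.1, 31.6], [35.0, 31.5]]],
    },
    "metadata": {"crop": "wheat", "irrigated": True},
}

# Both shapes can be stored in the same collection without conflict.
plots = [forestry_plot, agri_plot]
```

In a schema-enforced relational design, the second hierarchy would force either a schema change or an awkward generic mapping table; in a document store both shapes simply coexist.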
Our second big “must” was support for geospatial queries in the database. A geospatial query is one that filters data by geographical coordinates (or sets of coordinates) according to a specific rule, for instance, proximity between points, or the intersection of two shapes. All the data we track — be it imagery, labeling data, or client-supplied data points from the field — have geographical coordinates associated with them. Any database that doesn’t support geospatial queries would leave us with a gaping hole in our data ecosystem.
Support for geospatial queries allows us to query the data for all imagery that contains a particular client plot (or group of plots), and likewise to query for all plots, or labels, contained in the area covered by a particular image. These geospatial filters can be combined with more standard filters, such as date of the image, label value, or any metadata associated with the plot. By combining standard and geospatial filters we can easily isolate well-defined datasets for deeper analysis, quickly validate algorithmic labeling values, and zero in on narrow ranges of values for further research.
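As a minimal sketch of what such a combined filter can look like in MongoDB’s query language, the dictionary below pairs a geospatial predicate with standard field filters. The collection and field names (`footprint`, `captured_at`, `provider`) are assumptions for illustration:

```python
# A MongoDB-style filter combining a geospatial predicate with standard ones.
# Field names ("footprint", "captured_at", "provider") are assumptions.

plot_polygon = {
    "type": "Polygon",
    "coordinates": [[[34.8, 32.1], [34.9, 32.1], [34.9, 32.2], [34.8, 32.1]]],
}

# Find all imagery whose footprint intersects the plot, captured in 2019,
# from one specific provider.
imagery_filter = {
    "footprint": {"$geoIntersects": {"$geometry": plot_polygon}},
    "captured_at": {"$gte": "2019-01-01", "$lt": "2020-01-01"},
    "provider": "sentinel-1",
}

# With a live connection this would run as, e.g.:
#   db.imagery.find(imagery_filter)
```

Because the geo operator is just another key in the filter document, combining it with date or metadata conditions requires no special syntax.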
Comparing the Database Candidates
Once we established the essential requirements for our database, we looked around at the possible options. Elasticsearch was suggested for its schema-less structure, but its geospatial query capabilities fell short of our requirements. Postgres looked like a good candidate, since the PostGIS module can be installed to add geospatial support, and JSON fields can be used to achieve a dynamic, nested data structure. Another potentially good match was MongoDB, which has a schema-less structure with JSON-like database objects and fully supports geo-queries right out of the box.
A database schema is an enforced structure the data must conform to in order to be accepted into the database. All the well-known relational databases, such as MySQL, Postgres, Oracle, and MS SQL Server, are built on enforced schemas, which is what allows relationships between tables to be enforced. Since the era of Big Data, other flavors of database have become popular that provide more efficient ways of moving data in and out, such as the document-based structure MongoDB adopted. Nothing in life is free, of course, not even database performance, so this ease of input/output comes at a price. Part of that price is the lack of a backbone: without an enforceable schema, NoSQL (Not Only SQL) databases are more vulnerable to data corruption, and relationships between different data collections (the equivalent of tables) become trickier.
When weighing the pros and cons of Postgres vs. Mongo, each contender had something going for it that the other didn’t. On the one hand, Postgres’ powerful GIS functions make it very appealing. On the other hand, MongoDB’s ability to easily handle nested objects in the database, with no special query syntax and no other caveats, makes it no less worthy a choice. Since we already do all the GIS heavy lifting in Python and don’t really need it in the database layer, Postgres’ main advantage was less relevant, and MongoDB became our first choice.
Prior to integrating with a database, our researchers were using various manual methods to access and manipulate the data, with team-wide conventions to regulate placement of files and naming of fields and objects. In order to integrate the database as an end-to-end backend acting as a “single source of truth” for our data pipeline, a number of significant changes were necessary. First of all, we created an HTTP API that serves as both a universal access point (when reading from the database) and a gateway to enforce conventions (when sending data to the database).
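Stripped to its essence, the convention-enforcing side of such a gateway is a validation step in front of every write. The sketch below is a simplified, hypothetical version; the required fields and normalization rules are assumptions, not our actual API:

```python
# Simplified sketch of convention enforcement at the write gateway.
# Required fields and normalization rules here are illustrative assumptions.

REQUIRED_FIELDS = {"client", "geometry", "hierarchy"}
ALLOWED_GEOMETRY_TYPES = {"Point", "Polygon", "MultiPolygon"}

def validate_plot(doc: dict) -> dict:
    """Reject documents that violate team conventions before they reach the DB."""
    missing = REQUIRED_FIELDS - doc.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    geom_type = doc["geometry"].get("type")
    if geom_type not in ALLOWED_GEOMETRY_TYPES:
        raise ValueError(f"unsupported geometry type: {geom_type}")
    # Normalize naming conventions, e.g. lowercase client identifiers.
    doc["client"] = doc["client"].strip().lower()
    return doc
```

Putting this behind an HTTP endpoint means no researcher or algorithm can write a malformed document, which recovers some of the safety an enforced schema would otherwise provide.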
The next step was the non-trivial task of refactoring the algorithm code to work with the API instead of hard-coded directory paths and GeoJSON files. Our remote sensing researchers also need access to the database for all their development and research tasks. So we built a console that creates a GeoJSON format file on the fly from parameterized filters, and has an import feature for updating the database based on a modified GeoJSON file.
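The export side of such a console boils down to wrapping query results in a GeoJSON FeatureCollection. Here is a minimal sketch under the assumption that each database record carries a `geometry` field alongside its other attributes:

```python
# Minimal sketch: wrap query results as a GeoJSON FeatureCollection.
# Assumes each record has a "geometry" key; everything else becomes properties.

def to_feature_collection(records):
    """Convert database records into a downloadable GeoJSON structure."""
    features = [
        {
            "type": "Feature",
            "geometry": rec["geometry"],
            # Everything that isn't geometry becomes a feature property.
            "properties": {k: v for k, v in rec.items() if k != "geometry"},
        }
        for rec in records
    ]
    return {"type": "FeatureCollection", "features": features}
```

The import direction is the mirror image: parse the modified FeatureCollection, validate each feature against the conventions, and upsert the documents through the same API gateway.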
From Zero to Atlas in Ten Days
PlanetWatchers is excited to be accepted into the MongoDB Startup Accelerator. We are proud to be a part of the program and optimistic about the many benefits our company and our clients will reap from MongoDB’s rich feature set and Atlas’ powerful infrastructure.