# Preprocess toolkit documentation

This pipeline merges raw ground truth data in raster or vector format with satellite imagery to produce a set of standardized image tiles for training a machine learning model.

This particular implementation assumes a desired output of paired (data, mask) tilesets.

## Data

### Inputs

| Name | Data Type | Description |
| --- | --- | --- |
| Ground Truth | spatial raster file (`.tif`, `.jp2`, etc.) or GeoJSON vector | Contains information about ground state (in this case, snow presence or absence) as measured by another method (e.g. ASO/SnowEX lidar). Can be a binary GeoJSON vector, or a real-valued or binary raster. |
| Date | string or datetime | Date of ground truth data acquisition. Used to determine which imagery to acquire. |
| Date Range | integer | Number of days around the ground truth acquisition date to search for imagery. |

### Outputs

| Name | Data Type | Description |
| --- | --- | --- |
| Image Tiles | cloud storage bucket (S3) | Bucket with either a `{z}/{x}/{y}.tif` structure or `{z}_{x}_{y}.tif` filenames, containing tiles with the same band layout as the original imagery. |
| Mask Tiles | cloud storage bucket (S3, GCS) | Bucket with either a `{z}/{x}/{y}.tif` structure or `{z}_{x}_{y}.tif` filenames, containing binary masks for training. |

The output Image Tiles are cropped to the extent of the ground truth information. The Image Tiles and Mask Tiles cover an identical set of tile indices: every image tile has a corresponding mask tile.

## Toolkit

The primary steps in this data transformation are: 1) ground truth preprocessing, 2) image acquisition and storage, 3) image preprocessing, and 4) image and mask tiling.

### `gt_pre`: Ground Truth Preprocessing

| input parameter | description |
| --- | --- |
| `--gt_file` | ground truth data file, as described above |
| `--threshold` | (optional) threshold for real-valued raster input |
| `--dst_crs` | (optional) EPSG code to reproject the input into; defaults to the original CRS |
| `output_dir` | (required) directory for output |

Output: this stage of the pipeline writes the binary raster produced by this step to `<output_dir>/<gt_file>_binary.tif` for use by later steps. It also produces a GeoJSON file containing the spatial extent of the ground truth, used by the image acquisition step.
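For concreteness, here is a minimal sketch of the thresholding and footprint logic, assuming `rasterio`; the `gt_pre` function name is illustrative, and the polygon-rasterization and `--dst_crs` reprojection paths are omitted:

```python
import json
from pathlib import Path

import rasterio
from rasterio.warp import transform_bounds

def gt_pre(gt_file, output_dir, threshold=None):
    """Binarize a ground truth raster and export its footprint (sketch)."""
    with rasterio.open(gt_file) as src:
        data = src.read(1)
        profile = src.profile
        # Footprint in lon/lat for the imagery search step.
        west, south, east, north = transform_bounds(src.crs, "EPSG:4326", *src.bounds)

    # Threshold real-valued input; an already-binary raster passes through.
    mask = (data >= threshold) if threshold is not None else (data > 0)

    stem = Path(gt_file).stem
    profile.update(dtype="uint8", count=1, nodata=None)
    with rasterio.open(Path(output_dir) / f"{stem}_binary.tif", "w", **profile) as dst:
        dst.write(mask.astype("uint8"), 1)

    footprint = {
        "type": "Feature",
        "properties": {},
        "geometry": {
            "type": "Polygon",
            "coordinates": [[[west, south], [east, south], [east, north],
                             [west, north], [west, south]]],
        },
    }
    (Path(output_dir) / f"{stem}_footprint.geojson").write_text(json.dumps(footprint))
```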

### `tile`: Spherical Mercator Tiling

| input parameter | description |
| --- | --- |
| `--zoom` | zoom level for output tiles |
| `--indexes` | raster band indices to include in tiles |
| `--quant` | (optional) value to divide bands by, if the input data is quantized |
| `--aws_profile` | (optional) AWS profile name for `s3://` destinations |
| `--skip_blanks` | (optional) skip blank tiles |
| `--cover` | (optional) CSV file listing the tiles to produce (default: all) |
| `files` | file or files to tile |
| `output_dir` | destination for the tile directory; can be `s3://` |

Output: A directory of tiles at zoom level `<zoom>` containing GeoTIFF files derived from the original input imagery.
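A sketch of what one pass of the tiler might look like, using `rasterio` and `mercantile`. The `tile_image` name, the 256-pixel tile size, and the default band indexes are assumptions, and the real module would additionally handle `--quant`, `--cover`, and `s3://` output:

```python
import mercantile
import rasterio
from rasterio.transform import from_bounds as transform_from_bounds
from rasterio.vrt import WarpedVRT
from rasterio.warp import transform_bounds
from rasterio.windows import from_bounds as window_from_bounds

def tile_image(src_path, zoom, out_dir, indexes=(1, 2, 3, 4), size=256):
    """Cut one raster into {z}_{x}_{y}.tif Spherical Mercator tiles (sketch)."""
    with rasterio.open(src_path) as src, WarpedVRT(src, crs="EPSG:3857") as vrt:
        # The footprint in lon/lat determines which tiles to produce.
        west, south, east, north = transform_bounds(src.crs, "EPSG:4326", *src.bounds)
        for t in mercantile.tiles(west, south, east, north, [zoom]):
            left, bottom, right, top = mercantile.xy_bounds(t)
            window = window_from_bounds(left, bottom, right, top, transform=vrt.transform)
            data = vrt.read(indexes=list(indexes), window=window, boundless=True,
                            out_shape=(len(indexes), size, size))
            if not data.any():  # crude --skip_blanks behaviour
                continue
            profile = {
                "driver": "GTiff", "width": size, "height": size,
                "count": len(indexes), "dtype": data.dtype.name, "crs": "EPSG:3857",
                "transform": transform_from_bounds(left, bottom, right, top, size, size),
            }
            with rasterio.open(f"{out_dir}/{t.z}_{t.x}_{t.y}.tif", "w", **profile) as dst:
                dst.write(data)
```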


*All details below are out of date but kept for reference.*

### Image Acquisition
| input parameter | description |
| --- | --- |
| `--gt_date` | date of ground truth acquisition |
| `--date_range` | number of days around `gt_date` to search the imagery catalog |
| `--max_images` | (optional) used to constrain the number of images downloaded |

Using the `gt_date` and `date_range` parameters we compute a date window over which to search the imagery catalog, and use the GeoJSON output from step 1 to geographically constrain the search. Eventually this process will be imagery-agnostic, but the current implementation uses the Planet Labs API.
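The date window itself is presumably just symmetric arithmetic around the acquisition date; a sketch, with `search_window` as an illustrative name:

```python
from datetime import date, timedelta

def search_window(gt_date: date, date_range: int):
    """Interval of dates to query the imagery catalog over (sketch)."""
    delta = timedelta(days=date_range)
    return gt_date - delta, gt_date + delta

# search_window(date(2018, 4, 17), 3) -> (date(2018, 4, 14), date(2018, 4, 20))
```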

Open Questions:

- How do we select which images to download? Cloud cover? Sort by date?
- Do we select images that overlap spatially but not in time? (e.g. if the same meter of Earth is covered by several images on different days, do we just select the one closest to `gt_date`?)

Output: A cloud storage bucket containing GeoTIFF files representing raw 4-band images cropped to the extent of the ground truth dataset.

### Image Preprocessing

Not totally sure what goes in here yet, but we will probably want to do something to the imagery before it gets tiled (perhaps a TOA correction or some such thing). Wanted to leave room for it.

### Tiling

Input parameters: TBD.

Four steps here:

1. Tile the binary raster data mask into a cloud storage bucket.
2. Tile all images into a cloud storage bucket.
3. Make sure that every image tile has a paired ground truth tile and that there are no orphan tiles (see the pairing sketch after the output note below).
4. Come up with some sort of standardized directory structure (maybe best to stick with XYZ/OSM tiles here and reorganize later for training?).

Output: A cloud storage bucket containing `/images` and `/masks` directories with some sort of standardized directory structure.
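Step 3's pairing check reduces to a set comparison over tile keys; a sketch, where `check_pairing` and the key format are assumptions:

```python
def check_pairing(image_keys, mask_keys):
    """Verify every image tile has a mask tile and vice versa (sketch).

    Keys are tile identifiers such as '12/654/1583.tif', e.g. obtained by
    listing the /images and /masks prefixes of the bucket.
    """
    images, masks = set(image_keys), set(mask_keys)
    orphan_images = images - masks
    orphan_masks = masks - images
    if orphan_images or orphan_masks:
        raise ValueError(
            f"{len(orphan_images)} image tiles without masks, "
            f"{len(orphan_masks)} mask tiles without images"
        )
```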

## GCP Implementation Design

The major steps in this pipeline will be implemented as containerized Python modules and linked together with the Kubeflow pipeline system. Some of the components may contain Cloud Dataflow (i.e. Apache Beam) workflow elements.

This document outlines the containers which will be connected together to perform the intermediary operations.

### `gt_pre`

Consumes ground truth data as above (`--gt_raw`) and outputs `{gt_raw}_gt_binary_raster` and `{gt_raw}_gt_footprint` into a directory (`--output_dir`). The binary raster is created either by:

- rasterizing a polygon,
- thresholding a real-valued raster via the `--threshold` arg, or
- doing nothing (passing through an input raster that is already binary).

`{gt_raw}_gt_binary_raster.tif` and `{gt_raw}_gt_footprint.geojson` are placed into `/gt_processed`, either in a cloud storage bucket or a local folder (KubeFlow global pipeline variable `output_dir`).

How in particular do we pass around the variables / inputs / outputs?

### `get_images`

Consumes `{gt_raw}_gt_footprint.geojson` (`--footprint`), along with `--date` and `--date_range` arguments, and queries the image search API to identify download candidates. Selects candidates (several options available here, potentially `--max_images`, `--max_cloud`, etc.) and uses the Planet Clips API to download imagery within the bounds of the data footprint. Imagery with ID = `ID` is unzipped and placed into `/images/{ID}` within local storage or a cloud storage bucket (`--output_dir`).
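For reference, a sketch of the kind of quick-search payload the Planet Data API (v1) accepts; the `PSScene4Band` item type, the `build_search` helper, and the commented request are illustrative assumptions, not this module's actual code:

```python
import json

def build_search(footprint_path, start, end, item_type="PSScene4Band"):
    """Build a Planet Data API v1 quick-search payload (sketch)."""
    with open(footprint_path) as f:
        geometry = json.load(f)["geometry"]
    return {
        "item_types": [item_type],
        "filter": {
            "type": "AndFilter",
            "config": [
                {"type": "GeometryFilter", "field_name": "geometry",
                 "config": geometry},
                {"type": "DateRangeFilter", "field_name": "acquired",
                 "config": {"gte": f"{start.isoformat()}T00:00:00Z",
                            "lte": f"{end.isoformat()}T23:59:59Z"}},
            ],
        },
    }

# e.g., with the requests library and an API key:
# requests.post("https://api.planet.com/data/v1/quick-search",
#               auth=(PLANET_API_KEY, ""), json=build_search(...))
```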

### `tile`

Still in progress here: not entirely sure whether it's best to keep each image tiled in its own directory or to merge all images together. Keeping image tiles in their own directories seems to allow for more downstream flexibility.

We'll use this container for two steps in the pipeline: first to tile the binary mask raster, and again to run a distributed tiling operation on the images in `/images/{ID}`.

As a result, this container will contain two related but distinct Python functions: the first tiles a single image, and the second is a Cloud Dataflow operation that tiles a whole directory of images. The pipeline will run these two operations separately, but both derive from this tile container.

Single Image tiler: will take in `--image` and perhaps `--zoom_level` and produce an XYZ/OSM tile structure from the image. One exception: these tiles will likely remain TIFF files, so we can use multiple bands in training, rather than the typical PNG format used for OSM tiles.

Multiple Image tiler: TBD; still not quite sure how to structure the Beam dataflow here. One possible shape is sketched below.
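A guess at the structure, not a settled design, assuming the single-image tiler is exposed as a `tile_image(path, zoom, out_dir)` function like the sketch earlier in this document:

```python
import apache_beam as beam

from tiler import tile_image  # hypothetical module wrapping the single-image tiler

def run_distributed_tiling(image_paths, zoom, out_dir):
    """Fan the per-image tiler out over a collection of images (sketch)."""
    with beam.Pipeline() as p:
        (p
         | "ListImages" >> beam.Create(image_paths)
         | "TileEach" >> beam.Map(tile_image, zoom=zoom, out_dir=out_dir))
```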

### TODO: `split`

This module is responsible for creating a train-validation split of the images.
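A minimal sketch of such a split, assuming tiles are identified by their shared `{z}/{x}/{y}` key so that each image tile and its mask land in the same partition; names and the default fraction are assumptions:

```python
import random

def train_val_split(tile_keys, val_fraction=0.2, seed=0):
    """Randomly partition paired tile keys into train and validation sets.

    Splitting on the shared {z}/{x}/{y} key keeps each image tile and
    its mask tile in the same partition.
    """
    keys = sorted(tile_keys)
    random.Random(seed).shuffle(keys)  # deterministic shuffle for reproducibility
    n_val = int(len(keys) * val_fraction)
    return keys[n_val:], keys[:n_val]
```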