# Preprocess toolkit documentation

This pipeline merges raw ground truth data in raster or vector format with satellite imagery to produce a set of standardized image tiles for training a machine learning model.

This particular implementation assumes a desired output of paired (data, mask) tilesets.

## Data

### Inputs

| Name | Data Type | Description |
| --- | --- | --- |
| Ground Truth | spatial raster file (`.tif`, `.jp2`, etc.) or GeoJSON vector | Contains information about ground state (in this case, snow presence or absence) as measured by another method (e.g. ASO/SnowEX lidar). Can be a binary GeoJSON vector, or a real-valued or binary raster. |
| Date | string or datetime | Date of ground truth data acquisition. Used to determine which imagery to acquire. |
| Date Range | integer | Number of days around the ground truth acquisition date to search for imagery. |

### Outputs

| Name | Data Type | Description |
| --- | --- | --- |
| Image Tiles | cloud storage bucket (S3) | Bucket with either a `{z}/{x}/{y}.tif` structure or `{z}_{x}_{y}.tif` filenames, containing tiles with the same band layout as the original imagery. |
| Mask Tiles | cloud storage bucket (S3, GCS) | Bucket with either a `{z}/{x}/{y}.tif` structure or `{z}_{x}_{y}.tif` filenames, containing binary masks for training. |

The output Image Tiles are cropped to the extent of the ground truth information. The Image Tiles and Mask Tiles cover an identical set of tile indices: every image tile has a corresponding mask tile.

## Toolkit

The primary steps in this data transformation are: 1) ground truth preprocessing, 2) image acquisition and storage, 3) image preprocessing, and 4) image and mask tiling.

### `gt_pre`: Ground Truth Preprocessing

| input parameter | description |
| --- | --- |
| `--gt_file` | ground truth data file, as described above |
| `--threshold` | (optional) threshold for real-valued raster input |
| `--dst_crs` | (optional) EPSG code to reproject the input into; defaults to the original CRS |
| `output_dir` | (required) directory for output |

Output: this stage of the pipeline writes the binary raster produced by this step to `<output_dir>/<gt_file>_binary.tif` for use by later steps. It also produces a GeoJSON file containing the spatial extent of the ground truth, used by the image acquisition step.
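For concreteness, here is a minimal sketch of the thresholding and footprint logic, assuming `rasterio`; the `gt_pre` function name is illustrative, and the polygon-rasterization and `--dst_crs` reprojection paths are omitted:

```python
import json
from pathlib import Path

import rasterio
from rasterio.warp import transform_bounds

def gt_pre(gt_file, output_dir, threshold=None):
    """Binarize a ground truth raster and export its footprint (sketch)."""
    with rasterio.open(gt_file) as src:
        data = src.read(1)
        profile = src.profile
        # Footprint in lon/lat for the imagery search step.
        west, south, east, north = transform_bounds(src.crs, "EPSG:4326", *src.bounds)

    # Threshold real-valued input; an already-binary raster passes through.
    mask = (data >= threshold) if threshold is not None else (data > 0)

    stem = Path(gt_file).stem
    profile.update(dtype="uint8", count=1, nodata=None)
    with rasterio.open(Path(output_dir) / f"{stem}_binary.tif", "w", **profile) as dst:
        dst.write(mask.astype("uint8"), 1)

    footprint = {
        "type": "Feature",
        "properties": {},
        "geometry": {
            "type": "Polygon",
            "coordinates": [[[west, south], [east, south], [east, north],
                             [west, north], [west, south]]],
        },
    }
    (Path(output_dir) / f"{stem}_footprint.geojson").write_text(json.dumps(footprint))
```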

### `tile`: Spherical Mercator Tiling

| input parameter | description |
| --- | --- |
| `--zoom` | zoom level for output tiles |
| `--indexes` | raster band indices to include in tiles |
| `--quant` | (optional) value to divide bands by, if the input data is quantized |
| `--aws_profile` | (optional) AWS profile name for `s3://` destinations |
| `--skip_blanks` | (optional) skip blank tiles |
| `--cover` | (optional) CSV file listing the tiles to produce (default: all) |
| `files` | file or files to tile |
| `output_dir` | destination for the tile directory; can be `s3://` |

Output: A directory of tiles at zoom level `<zoom>` containing GeoTIFF files derived from the original input imagery.
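A sketch of what one pass of the tiler might look like, using `rasterio` and `mercantile`. The `tile_image` name, the 256-pixel tile size, and the default band indexes are assumptions, and the real module would additionally handle `--quant`, `--cover`, and `s3://` output:

```python
import mercantile
import rasterio
from rasterio.transform import from_bounds as transform_from_bounds
from rasterio.vrt import WarpedVRT
from rasterio.warp import transform_bounds
from rasterio.windows import from_bounds as window_from_bounds

def tile_image(src_path, zoom, out_dir, indexes=(1, 2, 3, 4), size=256):
    """Cut one raster into {z}_{x}_{y}.tif Spherical Mercator tiles (sketch)."""
    with rasterio.open(src_path) as src, WarpedVRT(src, crs="EPSG:3857") as vrt:
        # The footprint in lon/lat determines which tiles to produce.
        west, south, east, north = transform_bounds(src.crs, "EPSG:4326", *src.bounds)
        for t in mercantile.tiles(west, south, east, north, [zoom]):
            left, bottom, right, top = mercantile.xy_bounds(t)
            window = window_from_bounds(left, bottom, right, top, transform=vrt.transform)
            data = vrt.read(indexes=list(indexes), window=window, boundless=True,
                            out_shape=(len(indexes), size, size))
            if not data.any():  # crude --skip_blanks behaviour
                continue
            profile = {
                "driver": "GTiff", "width": size, "height": size,
                "count": len(indexes), "dtype": data.dtype.name, "crs": "EPSG:3857",
                "transform": transform_from_bounds(left, bottom, right, top, size, size),
            }
            with rasterio.open(f"{out_dir}/{t.z}_{t.x}_{t.y}.tif", "w", **profile) as dst:
                dst.write(data)
```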


*All details below are out of date but kept for reference.*

### Image Acquisition
| input parameter | description |
| --- | --- |
| `--gt_date` | date of ground truth acquisition |
| `--date_range` | number of days around `gt_date` to search the imagery catalog |
| `--max_images` | (optional) used to constrain the number of images downloaded |

Using the `gt_date` and `date_range` parameters we compute a date window over which to search the imagery catalog, and use the GeoJSON output from step 1 to geographically constrain the search. Eventually this process will be imagery-agnostic, but the current implementation uses the Planet Labs API.
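The date window itself is presumably just symmetric arithmetic around the acquisition date; a sketch, with `search_window` as an illustrative name:

```python
from datetime import date, timedelta

def search_window(gt_date: date, date_range: int):
    """Interval of dates to query the imagery catalog over (sketch)."""
    delta = timedelta(days=date_range)
    return gt_date - delta, gt_date + delta

# search_window(date(2018, 4, 17), 3) -> (date(2018, 4, 14), date(2018, 4, 20))
```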

Open Questions:

- How do we select which images to download? Cloud cover? Sort by date?
- Do we select images that overlap spatially but not in time? (e.g. if the same meter of Earth is covered by several images on different days, do we just select the one closest to `gt_date`?)

Output: A cloud storage bucket containing GeoTIFF files representing raw 4-band images cropped to the extent of the ground truth dataset.

### Image Preprocessing

Not totally sure what goes in here yet, but we will probably want to do something to the imagery before it gets tiled (perhaps a TOA correction or some such thing). Wanted to leave room for it.

### Tiling

Input parameters: TBD.

Four steps here:

1. Tile the binary raster data mask into a cloud storage bucket.
2. Tile all images into a cloud storage bucket.
3. Make sure that every image tile has a paired ground truth tile and that there are no orphan tiles (see the pairing sketch after the output note below).
4. Come up with some sort of standardized directory structure (maybe best to stick with XYZ/OSM tiles here and reorganize later for training?).

Output: A cloud storage bucket containing `/images` and `/masks` directories with some sort of standardized directory structure.
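Step 3's pairing check reduces to a set comparison over tile keys; a sketch, where `check_pairing` and the key format are assumptions:

```python
def check_pairing(image_keys, mask_keys):
    """Verify every image tile has a mask tile and vice versa (sketch).

    Keys are tile identifiers such as '12/654/1583.tif', e.g. obtained by
    listing the /images and /masks prefixes of the bucket.
    """
    images, masks = set(image_keys), set(mask_keys)
    orphan_images = images - masks
    orphan_masks = masks - images
    if orphan_images or orphan_masks:
        raise ValueError(
            f"{len(orphan_images)} image tiles without masks, "
            f"{len(orphan_masks)} mask tiles without images"
        )
```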

## GCP Implementation Design

The major steps in this pipeline will be implemented as containerized Python modules and linked together with the Kubeflow pipeline system. Some of the components may contain Cloud Dataflow (i.e. Apache Beam) workflow elements.

This document outlines the containers which will be connected together to perform the intermediary operations.

### `gt_pre`

Consumes ground truth data as above (`--gt_raw`) and outputs `{gt_raw}_gt_binary_raster` and `{gt_raw}_gt_footprint` into a directory (`--output_dir`). The binary raster is created either by:

- rasterizing a polygon,
- thresholding a real-valued raster via the `--threshold` arg, or
- doing nothing (passing through an input raster that is already binary).

`{gt_raw}_gt_binary_raster.tif` and `{gt_raw}_gt_footprint.geojson` are placed into `/gt_processed`, either in a cloud storage bucket or a local folder (KubeFlow global pipeline variable `output_dir`).

How in particular do we pass around the variables / inputs / outputs?

### `get_images`

Consumes `{gt_raw}_gt_footprint.geojson` (`--footprint`), along with `--date` and `--date_range` arguments, and queries the image search API to identify download candidates. Selects candidates (several options available here, potentially `--max_images`, `--max_cloud`, etc.) and uses the Planet Clips API to download imagery within the bounds of the data footprint. Imagery with ID = `ID` is unzipped and placed into `/images/{ID}` within local storage or a cloud storage bucket (`--output_dir`).
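For reference, a sketch of the kind of quick-search payload the Planet Data API (v1) accepts; the `PSScene4Band` item type, the `build_search` helper, and the commented request are illustrative assumptions, not this module's actual code:

```python
import json

def build_search(footprint_path, start, end, item_type="PSScene4Band"):
    """Build a Planet Data API v1 quick-search payload (sketch)."""
    with open(footprint_path) as f:
        geometry = json.load(f)["geometry"]
    return {
        "item_types": [item_type],
        "filter": {
            "type": "AndFilter",
            "config": [
                {"type": "GeometryFilter", "field_name": "geometry",
                 "config": geometry},
                {"type": "DateRangeFilter", "field_name": "acquired",
                 "config": {"gte": f"{start.isoformat()}T00:00:00Z",
                            "lte": f"{end.isoformat()}T23:59:59Z"}},
            ],
        },
    }

# e.g., with the requests library and an API key:
# requests.post("https://api.planet.com/data/v1/quick-search",
#               auth=(PLANET_API_KEY, ""), json=build_search(...))
```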

### `tile`

Still in progress here: not entirely sure whether it's best to keep each image tiled in its own directory or to merge all images together. Keeping image tiles in their own directories seems to allow for more downstream flexibility.

We'll use this container for two steps in the pipeline: first to tile the binary mask raster, and again to run a distributed tiling operation on the images in `/images/{ID}`.

As a result, this container will contain two related but distinct Python functions: the first tiles a single image, and the second is a Cloud Dataflow operation that tiles a whole directory of images. The pipeline will run these two operations separately, but both derive from this tile container.

Single Image tiler: will take in `--image` and perhaps `--zoom_level` and produce an XYZ/OSM tile structure from the image. One exception: these tiles will likely remain TIFF files, so we can use multiple bands in training, rather than the typical PNG format used for OSM tiles.

Multiple Image tiler: TBD; still not quite sure how to structure the Beam dataflow here. One possible shape is sketched below.
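A guess at the structure, not a settled design, assuming the single-image tiler is exposed as a `tile_image(path, zoom, out_dir)` function like the sketch earlier in this document:

```python
import apache_beam as beam

from tiler import tile_image  # hypothetical module wrapping the single-image tiler

def run_distributed_tiling(image_paths, zoom, out_dir):
    """Fan the per-image tiler out over a collection of images (sketch)."""
    with beam.Pipeline() as p:
        (p
         | "ListImages" >> beam.Create(image_paths)
         | "TileEach" >> beam.Map(tile_image, zoom=zoom, out_dir=out_dir))
```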

### TODO: `split`

This module is responsible for creating a train-validation split of the images.
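A minimal sketch of such a split, assuming tiles are identified by their shared `{z}/{x}/{y}` key so that each image tile and its mask land in the same partition; names and the default fraction are assumptions:

```python
import random

def train_val_split(tile_keys, val_fraction=0.2, seed=0):
    """Randomly partition paired tile keys into train and validation sets.

    Splitting on the shared {z}/{x}/{y} key keeps each image tile and
    its mask tile in the same partition.
    """
    keys = sorted(tile_keys)
    random.Random(seed).shuffle(keys)  # deterministic shuffle for reproducibility
    n_val = int(len(keys) * val_fraction)
    return keys[n_val:], keys[:n_val]
```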