
Datacube #27

Merged: 21 commits, Nov 15, 2023
Changes from 7 commits (of 21)
484caf6
combining data arrays into multi-sensor data cube
lillythomas Nov 9, 2023
6eba841
implement cql2-json filters for S2 and S1
lillythomas Nov 9, 2023
047e955
script to generate merged datacube
lillythomas Nov 10, 2023
7ccc23d
wip function for calculating sentinel 1 scene with max coverage of bbox
lillythomas Nov 10, 2023
f4d769c
formatting
lillythomas Nov 10, 2023
2662c6f
move script and remove notebook
lillythomas Nov 10, 2023
bd5a718
use geom instead of geodataframe for initial aoi
lillythomas Nov 10, 2023
803c7e7
use args in funcs
lillythomas Nov 10, 2023
2ccda53
use CENTROID for geom in cql2 query
lillythomas Nov 10, 2023
87f61f0
Use mosaic method, set singular time dimension based on Sentinel 2
lillythomas Nov 10, 2023
fd23e2c
add configurable args for cloud cover percentage and nodata percentage
lillythomas Nov 14, 2023
89aeef1
use epsg code derived from Sentinel-2 properties, filter by best clou…
lillythomas Nov 14, 2023
81d1b02
remove extra filter
lillythomas Nov 14, 2023
0c66e0f
map s2 for best image using datetime to id, set s2 bands as unique va…
lillythomas Nov 14, 2023
b3b5581
assign S2 time as dataset variable
lillythomas Nov 14, 2023
19d72a7
remove orbit filter
lillythomas Nov 14, 2023
2536b8b
wrap example in main
lillythomas Nov 14, 2023
7a902e5
move script to subdir
lillythomas Nov 15, 2023
74ccc40
use cloud variable
lillythomas Nov 15, 2023
c3fc516
use cloud variable
lillythomas Nov 15, 2023
a14377f
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 15, 2023
382 changes: 382 additions & 0 deletions datacube.py
Contributor:
Do we want to have this file in the top-level directory, or in another folder?

Contributor (author):

No, I think we can have this in a more appropriate subdir. I just put it here since we haven't decided on the structure yet. I'm happy to make the call, and we can modify as we see fit.

Contributor:

Maybe put it in a subdir; I'll leave the naming up to you 🙂

Contributor:

Actually, I think @yellowcap is putting things under scripts/ in #29, so we could follow that?

@@ -0,0 +1,382 @@
"""
STAC Data Processing Script

This Python script processes Sentinel-2, Sentinel-1, and DEM (Digital Elevation Model) data. It utilizes the Planetary Computer API for data retrieval and manipulation.

Constants:
- STAC_API: Planetary Computer API endpoint
- S2_BANDS: Bands used in Sentinel-2 data processing

Functions:
- random_date(start_year, end_year): Generate a random date within a specified range.
- get_week(year, month, day): Get the week range for a given date.
- get_conditions(year1, year2): Get random conditions (date, year, month, day, cloud cover) within a specified year range.
- search_sentinel2(week, aoi_gdf): Search for Sentinel-2 items within a given week and area of interest.
- search_sentinel1(BBOX, catalog, week): Search for Sentinel-1 items within a given bounding box, STAC catalog, and week.
- search_sentinel1_calc_max_area(BBOX, catalog, week): Search for Sentinel-1 items within a given bounding box, STAC catalog, and week, selecting the scene with the maximum overlap.
- search_dem(BBOX, catalog, week, s2_items): Search for DEM items within a given bounding box, STAC catalog, week, and Sentinel-2 items.
- make_dataarrays(s2_items, s1_items, dem_items, BBOX, resolution, epsg): Create xarray DataArrays for Sentinel-2, Sentinel-1, and DEM data.
- merge_datarrays(da_sen2, da_sen1, da_dem): Merge xarray DataArrays for Sentinel-2, Sentinel-1, and DEM.
- process(year1, year2, aoi, resolution, epsg): Process Sentinel-2, Sentinel-1, and DEM data for a specified time range, area of interest, resolution, and EPSG code.
"""

import random
from datetime import datetime, timedelta
import geopandas as gpd
import pystac
import pystac_client
import planetary_computer as pc
import stackstac
import xarray as xr
import numpy as np
from shapely.geometry import box, shape, Point
import leafmap
from rasterio.errors import RasterioIOError
from rasterio.warp import transform_bounds
from getpass import getpass
import os
import json

STAC_API = "https://planetarycomputer.microsoft.com/api/stac/v1"
S2_BANDS = ["B02", "B03", "B04", "B05", "B06", "B07", "B08", "B8A", "B11", "B12", "SCL"]


def random_date(start_year, end_year):
    """
    Generate a random date within the specified range.

    Parameters:
    - start_year (int): The starting year of the date range.
    - end_year (int): The ending year of the date range.

    Returns:
    - datetime: A randomly generated date within the specified range.
    """
    start_date = datetime(start_year, 1, 1)
    end_date = datetime(end_year, 12, 31)
    days_between = (end_date - start_date).days
    random_days = random.randint(0, days_between)
    return start_date + timedelta(days=random_days)


def get_week(year, month, day):
    """
    Get the week range (start_date/end_date) for a given date.

    Parameters:
    - year (int): The year of the date.
    - month (int): The month of the date.
    - day (int): The day of the date.

    Returns:
    - str: A string representing the start and end dates of the week in the format 'start_date/end_date'.
    """
    date = datetime(year, month, day)
    start_of_week = date - timedelta(days=date.weekday())
    end_of_week = start_of_week + timedelta(days=6)
    start_date_str = start_of_week.strftime("%Y-%m-%d")
    end_date_str = end_of_week.strftime("%Y-%m-%d")
    return f"{start_date_str}/{end_date_str}"


def get_conditions(year1, year2):
    """
    Get random conditions (date, year, month, day, cloud cover) within the specified year range.

    Parameters:
    - year1 (int): The starting year of the date range.
    - year2 (int): The ending year of the date range.

    Returns:
    - tuple: A tuple containing date, year, month, day, and a constant cloud cover value.
    """
    date = random_date(year1, year2)
    YEAR = date.year
    MONTH = date.month
    DAY = date.day
    CLOUD = 50
    return date, YEAR, MONTH, DAY, CLOUD


def search_sentinel2(week, aoi):
    """
    Search for Sentinel-2 items within a given week and area of interest (AOI).

    Parameters:
    - week (str): The week in the format 'start_date/end_date'.
    - aoi (shapely.geometry.base.BaseGeometry): Geometry object for an Area of Interest (AOI).

    Returns:
    - tuple: A tuple containing the STAC catalog, Sentinel-2 items, Sentinel-2 GeoDataFrame, and the bounding box (BBOX).
    """
    CENTROID = aoi.centroid
    BBOX = aoi.bounds

    geom_CENTROID = Point(CENTROID.x, CENTROID.y)  # Create point geom object from centroid

    catalog = pystac_client.Client.open(STAC_API, modifier=pc.sign_inplace)

    search: pystac_client.item_search.ItemSearch = catalog.search(
        filter_lang="cql2-json",
        filter={
            "op": "and",
            "args": [
                {
                    "op": "s_intersects",
                    "args": [{"property": "geometry"}, geom_CENTROID.__geo_interface__],
Collaborator:
CENTROID is a point object; maybe we can use that directly instead of geom_CENTROID.

Contributor (author):
Nice! Yes we can. Implemented.
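For context, a shapely geometry's .centroid is itself a Point exposing __geo_interface__, so the extra Point(...) construction is indeed redundant. A minimal sketch with a hypothetical AOI:

```python
from shapely.geometry import Point, box

aoi = box(0.0, 0.0, 2.0, 2.0)  # hypothetical AOI polygon
centroid = aoi.centroid        # already a shapely Point; no Point(...) needed
assert isinstance(centroid, Point)

# Its __geo_interface__ can be passed straight into the cql2 s_intersects filter
print(centroid.__geo_interface__)  # -> {'type': 'Point', 'coordinates': (1.0, 1.0)}
```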

                },
                {"op": "anyinteracts", "args": [{"property": "datetime"}, week]},
                {"op": "=", "args": [{"property": "collection"}, "sentinel-2-l2a"]},
                {"op": "<=", "args": [{"property": "eo:cloud_cover"}, 50]},
                {"op": "<=", "args": [{"property": "s2:nodata_pixel_percentage"}, 20]},
weiji14 marked this conversation as resolved.
            ],
        },
    )

    s2_items = search.get_all_items()
    print(f"Found {len(s2_items)} Sentinel-2 items")

    s2_items_gdf = gpd.GeoDataFrame.from_features(s2_items.to_dict())

    BBOX = s2_items_gdf.iloc[0].geometry.bounds

    return catalog, s2_items, s2_items_gdf, BBOX


def search_sentinel1(BBOX, catalog, week):
    """
    Search for Sentinel-1 items within a given bounding box (BBOX), STAC catalog, and week.

    Parameters:
    - BBOX (tuple): Bounding box coordinates in the format (minx, miny, maxx, maxy).
    - catalog (pystac.Catalog): STAC catalog containing Sentinel-1 items.
    - week (str): The week in the format 'start_date/end_date'.

    Returns:
    - tuple: A tuple containing Sentinel-1 items and Sentinel-1 GeoDataFrame.
    """
    geom_BBOX = box(*BBOX)  # Create poly geom object from the bbox

    search: pystac_client.item_search.ItemSearch = catalog.search(
        filter_lang="cql2-json",
        filter={
            "op": "and",
            "args": [
                {
                    "op": "s_intersects",
                    "args": [{"property": "geometry"}, geom_BBOX.__geo_interface__],
                },
                {"op": "anyinteracts", "args": [{"property": "datetime"}, week]},
                {"op": "<=", "args": [{"property": "sat:orbit_state"}, "descending"]},
weiji14 marked this conversation as resolved.
                {"op": "=", "args": [{"property": "collection"}, "sentinel-1-rtc"]},
            ],
        },
    )
    s1_items = search.item_collection()
    print(f"Found {len(s1_items)} Sentinel-1 items")

    s1_gdf = gpd.GeoDataFrame.from_features(s1_items.to_dict())
    # TODO: Find a way to get minimal number of S1 tiles to cover the entire S2 bbox
    s1_gdf["overlap"] = s1_gdf.intersection(box(*BBOX)).area
    s1_gdf.sort_values(by="overlap", inplace=True)
Collaborator:
Pick tiles from the ascending or descending capture based on their overlap with the S2 bbox:

state = (
    s1_gdf[["sat:orbit_state", "overlap"]]
    .groupby(["sat:orbit_state"])
    .sum()
    .sort_values(by="overlap", ascending=False)
    .index[0]
)

state will be either "ascending" or "descending". Then filter s1_gdf by state: s1_gdf = s1_gdf[s1_gdf["sat:orbit_state"] == state]

Contributor (author):
Implemented
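For reference, the suggested grouping can be sketched in isolation with plain pandas (the orbit states and overlap values below are made up for illustration):

```python
import pandas as pd

# Hypothetical Sentinel-1 footprints: orbit state plus overlap area with the S2 bbox
s1_gdf = pd.DataFrame(
    {
        "sat:orbit_state": ["ascending", "descending", "descending"],
        "overlap": [0.3, 0.5, 0.4],
    }
)

# Pick the orbit state whose scenes jointly cover the most of the bbox
state = (
    s1_gdf[["sat:orbit_state", "overlap"]]
    .groupby(["sat:orbit_state"])
    .sum()
    .sort_values(by="overlap", ascending=False)
    .index[0]
)
print(state)  # -> descending (0.9 total overlap vs 0.3)

# Keep only scenes from that orbit state
s1_gdf = s1_gdf[s1_gdf["sat:orbit_state"] == state]
```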


    return s1_items, s1_gdf


def search_sentinel1_calc_max_area(BBOX, catalog, week):
    """
    Search for Sentinel-1 items within a given bounding box (BBOX), STAC catalog, and week,
    selecting the scene with the maximum overlap.

    Parameters:
    - BBOX (tuple): Bounding box coordinates in the format (minx, miny, maxx, maxy).
    - catalog (pystac.Catalog): STAC catalog containing Sentinel-1 items.
    - week (str): The week in the format 'start_date/end_date'.

    Returns:
    - tuple: A tuple containing the selected Sentinel-1 item and Sentinel-1 GeoDataFrame.
    """
    geom_BBOX = box(*BBOX)  # Create poly geom object from the bbox

    search: pystac_client.item_search.ItemSearch = catalog.search(
        filter_lang="cql2-json",
        filter={
            "op": "and",
            "args": [
                {
                    "op": "s_intersects",
                    "args": [{"property": "geometry"}, geom_BBOX.__geo_interface__],
                },
                {"op": "anyinteracts", "args": [{"property": "datetime"}, week]},
                {"op": "=", "args": [{"property": "collection"}, "sentinel-1-rtc"]},
            ],
        },
    )
    s1_items = search.get_all_items()
    print(f"Found {len(s1_items)} Sentinel-1 items")

    s1_gdf = gpd.GeoDataFrame.from_features(s1_items.to_dict())
    s1_gdf["overlap"] = s1_gdf.intersection(box(*BBOX)).area

    # Choose the scene with the maximum overlap
    selected_scene = s1_gdf.loc[s1_gdf["overlap"].idxmax()]
    print(selected_scene)
    selected_item_id = selected_scene["id"]
    selected_item = catalog.get_item(selected_item_id)

    return selected_item, s1_gdf


def search_dem(BBOX, catalog, week, s2_items):
    """
    Search for Digital Elevation Model (DEM) items within a given bounding box (BBOX), STAC catalog, week, and Sentinel-2 items.

    Parameters:
    - BBOX (tuple): Bounding box coordinates in the format (minx, miny, maxx, maxy).
    - catalog (pystac.Catalog): STAC catalog containing DEM items.
    - week (str): The week in the format 'start_date/end_date'.
    - s2_items (list): List of Sentinel-2 items.

    Returns:
    - tuple: A tuple containing DEM items and DEM GeoDataFrame.
    """
    search = catalog.search(collections=["cop-dem-glo-30"], bbox=BBOX)
    dem_items = search.item_collection()
    print(f"Found {len(dem_items)} items")

    dem_gdf = gpd.GeoDataFrame.from_features(dem_items.to_dict())

    dem_gdf.set_crs(epsg=4326, inplace=True)
    dem_gdf = dem_gdf.to_crs(epsg=s2_items[0].properties["proj:epsg"])
    return dem_items, dem_gdf


def make_dataarrays(s2_items, s1_items, dem_items, BBOX, resolution, epsg):
    """
    Create xarray DataArrays for Sentinel-2, Sentinel-1, and DEM data.

    Parameters:
    - s2_items (list): List of Sentinel-2 items.
    - s1_items (list): List of Sentinel-1 items.
    - dem_items (list): List of DEM items.
    - BBOX (tuple): Bounding box coordinates in the format (minx, miny, maxx, maxy).
    - resolution (int): Spatial resolution.
    - epsg (int): EPSG code for the coordinate reference system.

    Returns:
    - tuple: A tuple containing xarray DataArrays for Sentinel-2, Sentinel-1, and DEM.
    """
    da_sen2: xr.DataArray = stackstac.stack(
        items=s2_items[0],
        epsg=26910,  # UTM Zone 10N
Collaborator:
Can we read the projection from the s2 item collection?

Contributor (author):
I assigned this to the same projection used for the S2 item.

        assets=S2_BANDS,
        bounds_latlon=BBOX,  # W, S, E, N
        resolution=10,  # Spatial resolution of 10 metres
        xy_coords="center",  # pixel centroid coords instead of topleft corner
        dtype=np.float32,
        fill_value=np.nan,
    )

    da_sen1: xr.DataArray = stackstac.stack(
        items=s1_items[1:],  # To only accept the same orbit state and date. Need better way to do this.
        assets=["vh", "vv"],  # SAR polarizations
        epsg=26910,  # UTM Zone 10N
        bounds_latlon=BBOX,  # W, S, E, N
        xy_coords="center",  # pixel centroid coords instead of topleft corner
        dtype=np.float32,
        fill_value=np.nan,
    )

    # To fix TypeError: Invalid value for attr 'spec'
    da_sen1.attrs["spec"] = str(da_sen1.spec)

    # To fix ValueError: unable to infer dtype on variable None
    for key, val in da_sen1.coords.items():
        if val.dtype == "object":
            print("Deleting", key)
            da_sen1 = da_sen1.drop_vars(names=key)

    # Create xarray.Dataset datacube with VH and VV channels from SAR
    da_vh: xr.DataArray = da_sen1.sel(band="vh", drop=True).rename("vh")
    da_vv: xr.DataArray = da_sen1.sel(band="vv", drop=True).rename("vv")
    ds_sen1: xr.Dataset = xr.merge(objects=[da_vh, da_vv], join="override")
Collaborator:
da_sen1 = stackstac.mosaic(da_sen1, dim="time") will create a composite of the S1 tiles overlapping the S2 bbox. We can use that directly in the merge and don't have to worry about a pixel-picking strategy.

Contributor (author):
Word. I like this. Thanks.

Contributor:
So stackstac.mosaic returns an xarray.DataArray, whereas xarray.merge returns an xarray.Dataset. The main difference is that xarray.Dataset allows us to clearly show named data variables (e.g. VV, VH, B1, B2, etc), whereas xarray.DataArray stores these in the 'band' dimension. Do we have a preference for either format?

Collaborator:
Having named data vars helps; stackstac.mosaic takes care of merging the pixels along the given dim (time in our case), which we don't have to handle at our end.
What if we create the data arrays and then concat them together as a dataset with vars attached?

Contributor (author):
We still retain the named data vars, I think. Merged data array using the mosaic method for Sentinel-1:

Merged datarray:  <xarray.Dataset>
Dimensions:                                     (time: 1, band: 14, x: 11365,
                                                 y: 11363)
Coordinates: (12/67)
  * time                                        (time) datetime64[ns] 2018-06...
  * band                                        (band) <U4 'B02' 'B03' ... 'vv'
  * x                                           (x) float64 2.973e+05 ... 4.1...
  * y                                           (y) float64 4.202e+06 ... 4.0...
    id                                          (time) <U54 'S2B_MSIL2A_20180...
    s2:generation_time                          <U24 '2020-10-12T02:49:45.926Z'
    ...                                          ...
    sar:instrument_mode                         <U2 'IW'
    sar:product_type                            <U3 'GRD'
    s1:product_timeliness                       <U8 'Fast-24h'
    s1:resolution                               <U4 'high'
    description                                 (band) object nan ... 'Terrai...
    proj:shape                                  object {3600}
Data variables:
    stackstac-95bc7f7975c3ca621cbffb094f33316e  (time, band, y, x) float32 dask.array<chunksize=(1, 1, 1024, 1024), meta=np.ndarray>
    stackstac-80df347b660c0b771e7ba977f16777bc  (band, y, x) float32 dask.array<chunksize=(13, 1024, 1024), meta=np.ndarray>
    stackstac-602439b1960e897b8044c8056eb3ce96  (band, y, x) float32 dask.array<chunksize=(14, 1024, 1024), meta=np.ndarray>
Attributes:
    spec:        RasterSpec(epsg=32611, bounds=(297340, 4088320, 410990, 4201...
    crs:         epsg:32611
    transform:   | 10.00, 0.00, 297340.00|\n| 0.00,-10.00, 4201950.00|\n| 0.0...
    resolution:  10
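To illustrate the DataArray-vs-Dataset distinction discussed above with synthetic data (no stackstac needed; the dims, shapes, and band names here are made up): a banded DataArray can be promoted to a Dataset with named variables via to_dataset(dim="band"):

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for a mosaicked (band, y, x) DataArray
da = xr.DataArray(
    np.zeros((2, 3, 3), dtype=np.float32),
    dims=("band", "y", "x"),
    coords={"band": ["vh", "vv"]},
)

# Promote the band dimension to named data variables (vh, vv)
ds = da.to_dataset(dim="band")
print(list(ds.data_vars))  # -> ['vh', 'vv']
```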


    # print(ds_sen1)

    # da_sen1 = da_sen1.drop([da_sen1.time[0].values], dim='time')  # Remove first scene which has ascending orbit

    da_dem: xr.DataArray = stackstac.stack(
        items=dem_items,
        epsg=26910,  # UTM Zone 10N
        bounds_latlon=BBOX,  # W, S, E, N
        resolution=10,  # Spatial resolution of 10 metres
        xy_coords="center",  # pixel centroid coords instead of topleft corner
        dtype=np.float32,
        fill_value=np.nan,
    )

    _, index = np.unique(da_dem["time"], return_index=True)  # Remove redundant time
    da_dem = da_dem.isel(time=index)
weiji14 marked this conversation as resolved.

    return da_sen2, da_sen1, da_dem


def merge_datarrays(da_sen2, da_sen1, da_dem):
    """
    Merge xarray DataArrays for Sentinel-2, Sentinel-1, and DEM.

    Parameters:
    - da_sen2 (xr.DataArray): xarray DataArray for Sentinel-2 data.
    - da_sen1 (xr.DataArray): xarray DataArray for Sentinel-1 data.
    - da_dem (xr.DataArray): xarray DataArray for DEM data.

    Returns:
    - xr.DataArray: Merged xarray DataArray.
    """
    da_merge = xr.merge([da_sen2, da_sen1, da_dem], compat="override")
weiji14 marked this conversation as resolved.
    print(da_merge)
    return da_merge


def process(year1, year2, aoi, resolution, epsg):
Contributor:
Not sure about needing to pass in an EPSG code here. For wide areas (in the West-to-East direction) that can span multiple UTM zones, wouldn't this force a single EPSG instead of multiple?

Collaborator:
We are trying to overlap data from S1 & DEMs over S2 tiles. As S2 tiles follow the MGRS format, each tile will fall under a single UTM zone (this is my understanding; correct me if I am wrong).
S1 & DEMs might span multiple UTM zones, but stackstac should reproject all of them together.

Contributor (author):
That makes sense to me. I should switch this to read the projection from the MGRS tile's UTM zone instead of having a user argument for it.

Contributor (author):
Reading the EPSG code directly from the Sentinel-2 item properties now.
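For illustration, that lookup is just a key access on the item's properties mapping (the same pattern already used in search_dem). The dict and values below are hypothetical stand-ins for a real STAC item's item.properties:

```python
# Hypothetical Sentinel-2 STAC item metadata; real items expose the same
# "proj:epsg" key via item.properties
s2_item_properties = {"proj:epsg": 32611, "eo:cloud_cover": 12.3}

# Derive the EPSG code from the item instead of taking it as a user argument
epsg = s2_item_properties["proj:epsg"]
print(f"EPSG:{epsg}")  # -> EPSG:32611
```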

"""
Process Sentinel-2, Sentinel-1, and DEM data for a specified time range, area of interest (AOI),
resolution, and EPSG code.

Parameters:
- year1 (int): The starting year of the date range.
- year2 (int): The ending year of the date range.
- aoi (shapely.geometry.base.BaseGeometry): Geometry object for an Area of Interest (AOI).
- resolution (int): Spatial resolution.
- epsg (int): EPSG code for the coordinate reference system.

Returns:
- xr.DataArray: Merged xarray DataArray containing processed data.
"""

    date, YEAR, MONTH, DAY, CLOUD = get_conditions(year1, year2)
    week = get_week(YEAR, MONTH, DAY)

    catalog, s2_items, s2_items_gdf, BBOX = search_sentinel2(week, aoi)

    s1_items, s1_gdf = search_sentinel1(BBOX, catalog, week)
    # s1_items, s1_gdf = search_sentinel1_calc_max_area(BBOX, catalog, week)  # WIP

    dem_items, dem_gdf = search_dem(BBOX, catalog, week, s2_items)

    da_sen2, da_sen1, da_dem = make_dataarrays(s2_items, s1_items, dem_items, BBOX, resolution, epsg)

    da_merge = merge_datarrays(da_sen2, da_sen1, da_dem)
Contributor:
As a sanity check, could you plot each individual band from this datacube and see if they look ok? I've had issues with striped outputs when running xr.merge on Sentinel-1 and Copernicus DEM before, and just want to double check.

Contributor:
Should be addressed in #27 (comment)
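As a lighter-weight complement to visually plotting each band, one possible sketch (synthetic data; the variable names "B02" and "vv" are just placeholders) is to compute the NaN fraction per data variable of the merged cube, since striping shows up as rows of missing pixels:

```python
import numpy as np
import xarray as xr

# Synthetic merged cube: one clean variable, one with a NaN stripe pattern
clean = xr.DataArray(np.ones((4, 4), dtype=np.float32), dims=("y", "x"))
striped = clean.copy()
striped[::2, :] = np.nan  # every other row missing, as striping would look

cube = xr.Dataset({"B02": clean, "vv": striped})

# NaN fraction per variable; a high value flags a band worth plotting
for name, da in cube.data_vars.items():
    frac = float(np.isnan(da).mean())
    print(name, frac)
# -> B02 0.0
# -> vv 0.5
```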

    return da_merge

# EXAMPLE
california_tile = gpd.read_file("ca.geojson")
sample = california_tile.sample(1)
aoi = sample.iloc[0].geometry
process(2017, 2023, aoi, 10, 26910)