chore: Release 0.4.0 (#34)

* feat(algorithm_inputs): Added 'doi' to algorithm inputs which users can specify
* chore(algorithm_inputs): Remove defaults, simplify process
* test: remove outdated doi conftest
* docs: Updated for DOI algorithm input
* feat: DOI algorithm input
* chore: Pull Request changes
* chore: minor PR changes
* docs: Updated for optional query input parameter
* test: Updated for optional query input parameter
* feat: Query input parameter now optional
* docs: Clarifying inputs
* feat: select columns from 2D datasets (#16)
  Fixes: #5, #7
* fix: skip granule files that cannot be opened (#18)
  Granule files that cannot be successfully read are skipped, rather than
  causing job failure. Offending files are retained to facilitate analysis.
  Fixes #17
* feat: lat lon algorithm inputs (#20)
* docs: lat/lon algorithm inputs additions
* test: lat/lon algorithm inputs additions
* feat: lat/lon algorithm inputs additions
* feat: support L1B and L2B collections (#21)
  Fixes #19
* Prepare for 0.3.0 release
* Prepare for further development
* feat: user input to filter beams (#26)
* docs: User-supplied beams specification
* test: Testing various beams input
* feat: User-specified beams
* docs: updated beams documentation
* test: simple beams fail test
* test: check_beams_option
* docs: additional docstring changes
* test: additional tests ...
* fix: n_expected algorithm inputs (#27)
* fix: n_expected algorithm inputs (#33)
* chore: Release 0.4.0 (#28)
* chore: Release 0.4.0
* chore: minor changes for next release
* docs: changes for 0.4.0 release
* chore: Release 0.4.0

Co-authored-by: Chuck Daniels <[email protected]>
MAAP-Project · Nov 14, 2022 · 33138d6
1 parent 7ba904e

Showing 10 changed files with 277 additions and 59 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -5,6 +5,12 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog], and this project adheres to
 [Semantic Versioning].
 
+## [0.4.0] - 2022-11-14
+
+### Added
+- [#6](https://github.com/MAAP-Project/gedi-subsetter/issues/6): Allow user to
+  specify which BEAMs to subset
+
 ## [0.3.0] - 2022-10-31
 
 ### Fixed

diff --git a/README.md b/README.md
@@ -49,6 +49,10 @@ must be supplied for every input):
 
 - `lon`: Name of the dataset used for longitude.
 
+- `beams`: Which beams to include in the subset. Must be `all`, `coverage`,
+  `power`, _OR_ a comma-separated list of beam names, with or without the `BEAM`
+  prefix (e.g., `BEAM0000,BEAM0001` or `0000,0001`)
+
 - `columns`: Comma-separated list of column names to include in the output file.
   These names correspond to the variables (layers) within the data files, and
   vary from collection to collection.  Consult the documentation for a list of
@@ -200,7 +204,7 @@ Here are some sample input values per DOI:
 - **doi:** `L4A`, `l4a`, or a specific DOI name
 - **lat**: `lat_lowestmode`
 - **lon**: `lon_lowestmode`
-- **columns:** `agbd, agbd_se, sensitivity, sensitivity_a2`
+- **columns:** `agbd, agbd_se, sensitivity, geolocation/sensitivity_a2`
 - **query:** ``l2_quality_flag == 1 & l4_quality_flag == 1 & sensitivity > 0.95 & `geolocation/sensitivity_a2` > 0.95``
 
 ## Running a GEDI Subsetting DPS Job
@@ -227,6 +231,7 @@ inputs = dict(
    doi="<DOI>",
    lat="<LATITUDE>",
    lon="<LONGITUDE>",
+   beams="<BEAMS>",
    columns="<COLUMNS>",
    query="<QUERY>",
    limit = 10_000

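The `beams` input described in the README hunk above accepts either a keyword (`all`, `coverage`, `power`) or a comma-separated list of beam names with or without the `BEAM` prefix. A minimal sketch of how such a value could be normalized follows. The helper name `parse_beams` and its behavior (case-insensitive keywords, always emitting prefixed names, rejecting empty entries) are assumptions made for illustration and are not taken from the repository.

```python
from typing import List, Union

# Keywords named in the README; everything else is treated as a list of names.
KEYWORDS = {"all", "coverage", "power"}


def parse_beams(value: str) -> Union[str, List[str]]:
    """Normalize a `beams` input value (hypothetical helper, for illustration)."""
    text = value.strip()

    if text.lower() in KEYWORDS:
        return text.lower()

    names = [name.strip().upper() for name in text.split(",")]

    if not all(names):
        raise ValueError(f"empty beam name in {value!r}")

    # Accept names with or without the BEAM prefix; always emit the prefixed form.
    return [name if name.startswith("BEAM") else f"BEAM{name}" for name in names]


print(parse_beams("coverage"))
print(parse_beams("BEAM0000,0001"))
```

Normalizing to a single prefixed form is one possible design choice; it means later comparisons do not need to handle both spellings.
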
diff --git a/algorithm_config.yaml b/algorithm_config.yaml
@@ -1,6 +1,6 @@
 description: Subset GEDI L1B, L2A, L2B, or L4A granules within an area of interest (AOI)
 algo_name: gedi-subset
-version: 0.3.0
+version: 0.4.0
 environment: ubuntu
 repository_url: https://repo.ops.maap-project.org/data-team/gedi-subsetter.git
 docker_url: mas.dit.maap-project.org/root/maap-workspaces/base_images/r:dit
@@ -17,6 +17,8 @@ inputs:
     download: False
   - name: lon
     download: False
+  - name: beams
+    download: False
   - name: columns
     download: False
   - name: query

diff --git a/notebooks/GEDI_L4A_Subsetting.ipynb b/notebooks/GEDI_L4A_Subsetting.ipynb
@@ -161,25 +161,7 @@
    "source": [
     "## Submit a Job\n",
     "\n",
-    "When supplying input values for a GEDI subsetting job, to use the default value\n",
-    "for a field (where indicated), use a dash (`\"-\"`) as the input value.\n",
-    "\n",
-    "- `aoi` (required): URL to a GeoJSON file representing your area of interest,\n",
-    "  as explained above.\n",
-    "\n",
-    "- `columns`: Comma-separated list of column names to include in the output file.\n",
-    "  (Default: `\"agbd, agbd_se, l2_quality_flag, l4_quality_flag, sensitivity, sensitivity_a2\"`)\n",
-    "\n",
-    "- `query`: Query expression for subsetting the rows in the output file.\n",
-    "  (Default: `\"l2_quality_flag == 1 and l4_quality_flag == 1 and sensitivity > 0.95 and sensitivity_a2 > 0.95\"`)\n",
-    "\n",
-    "  **IMPORTANT**: The `columns` input must contain at least all of the columns\n",
-    "  that appear in this `query` expression, otherwise an error will occur.\n",
-    "\n",
-    "- `limit`: Maximum number of GEDI granule data files to download (among those\n",
-    "  that intersect the specified AOI). (Default: 10000)\n",
-    "\n",
-    "It is recommended to use `maap-dps-worker-16gb` or `maap-dps-worker-32gb` queues when submitting a job with a large aoi."
+    "See README.md for documentation regarding the inputs"
    ]
   },
   {
@@ -191,22 +173,26 @@
    "source": [
     "inputs = dict(\n",
     "    aoi=aoi,\n",
-    "    columns=\"-\",\n",
-    "    query=\"-\",\n",
-    "    limit=\"-\",\n",
+    "    doi=\"L4A\",\n",
+    "    lat=\"lat_lowestmode\",\n",
+    "    lon=\"lon_lowestmode\",\n",
+    "    beams=\"coverage\",\n",
+    "    columns=\"agbd, agbd_se, sensitivity, geolocation/sensitivity_a2\",\n",
+    "    query=\"l2_quality_flag == 1 and l4_quality_flag == 1 and sensitivity > 0.95 and `geolocation/sensitivity_a2` > 0.95\",\n",
+    "    limit=10_000,\n",
     ")\n",
     "\n",
     "result = maap.submitJob(\n",
     "    identifier=\"gedi-subset\",\n",
     "    algo_id=\"gedi-subset_ubuntu\",\n",
-    "    version=\"gedi-subset-0.2.7\",\n",
-    "    queue=\"maap-dps-worker-8gb\",\n",
+    "    version=\"0.4.0\",\n",
+    "    queue=\"maap-dps-worker-32gb\",\n",
     "    username=username,\n",
     "    **inputs,\n",
     ")\n",
     "\n",
     "job_id = result[\"job_id\"]\n",
-    "job_id"
+    "job_id or result"
    ]
   },
   {
@@ -342,6 +328,7 @@
     "    )\n",
     "else:\n",
     "    gedi_gdf = gpd.read_file(output_file)\n",
+    "    print(gedi_gdf.head())\n",
     "    agbd_colors = plt.cm.get_cmap(\"viridis_r\")\n",
     "    gedi_gdf.plot(column=\"agbd\", cmap=agbd_colors)"
    ]

diff --git a/src/gedi_subset/gedi_utils.py b/src/gedi_subset/gedi_utils.py
@@ -3,7 +3,7 @@
 import os
 import os.path
 import warnings
-from typing import Any, Mapping, Optional, Sequence, Union
+from typing import Any, Callable, Mapping, Optional, Sequence, Union
 
 import h5py
 import numpy as np
@@ -15,6 +15,7 @@
 from shapely.geometry import Polygon
 from shapely.geometry.base import BaseGeometry
 
+import gedi_subset.fp as fp
 from gedi_subset.h5frame import H5DataFrame
 
 # Suppress UserWarning: The Shapely GEOS version (3.10.2-CAPI-1.16.0) is incompatible
@@ -124,11 +125,28 @@ def spatial_filter(beam, aoi):
     return indices
 
 
+def is_coverage_beam(beam: h5py.Group) -> bool:
+    return "COVERAGE" in beam.attrs.get("description", "").upper()
+
+
+def is_power_beam(beam: h5py.Group) -> bool:
+    return "POWER" in beam.attrs.get("description", "").upper()
+
+
+def beam_filter_from_names(names: Sequence[str]):
+    def is_named_beam(beam: h5py.Group) -> bool:
+        return any(name.upper() in beam.name.upper() for name in names)
+
+    return is_named_beam
+
+
 def subset_hdf5(
     hdf5: h5py.Group,
+    *,
     aoi: gpd.GeoDataFrame,
     lat: str,
     lon: str,
+    beam_filter: Callable[[h5py.Group], bool] = fp.always(True),
     columns: Sequence[str],
     query: Optional[str],
 ) -> gpd.GeoDataFrame:
@@ -139,7 +157,8 @@ def subset_hdf5(
     that fall within the specified area of interest (AOI) and also satisfy the specified
     query criteria.  The resulting ``geopandas.GeoDataFrame`` is further reduced to
     include only the specified columns, which must be names of datasets within the
-    HDF5 group (specifically, datasets within subgroups named with the prefix `"BEAM"`).
+    HDF5 group (specifically, datasets within subgroups named with the prefix `"BEAM"`
+    for which invocation of the specified ``beam_filter`` callable returns ``True``).
 
     To illustrate, assume an HDF5 file (`hdf5`) structured like so (values are for
     illustration purposes only):
@@ -181,9 +200,10 @@ def subset_hdf5(
     Assumptions:
 
     - The HDF5 group/file contains subgroups that are named with the prefix `"BEAM"`.
-    - Every `"BEAM*"` subgroup contains datasets named `lat_lowestmode` and
-      `lon_lowestmode`, representing the latitude and longitude, respectively, which are
-      used for the geometry of the resulting ``GeoDataFrame``.
+    - Every `"BEAM*"` subgroup contains degree unit datasets with names given by the
+      specified ``lat`` and ``lon`` parameters, representing the latitude and longitude,
+      respectively, used to create the ``geometry`` column of the resulting
+      ``GeoDataFrame``.
     - For every column name in `columns` and every column name appearing in the `query`
       expression, every `"BEAM*"` subgroup contains a dataset of the same name.
 
@@ -198,8 +218,21 @@ def subset_hdf5(
         HDF5 group to subset (typically an ``h5py.File`` instance).
     aoi : gpd.GeoDataFrame
         Area of Interest.  The subset is limited to data points that fall within this
-        area of interest, as determined by the `lat_lowestmode` and `lon_lowestmode`
-        datasets of each `"BEAM*"` group within the HDF5 file.
+        area of interest, as determined by the latitude and longitude datasets of each
+        `"BEAM*"` group within the HDF5 file.
+    lat: str
+        Name of the latitude dataset used for the resulting ``GeoDataFrame`` geometry.
+    lon: str
+        Name of the longitude dataset used for the resulting ``GeoDataFrame`` geometry.
+    beam_filter: Callable[[h5py.Group], bool] = fp.always(True)
+        Callable used to determine whether or not a top-level BEAM subgroup within the
+        specified ``hdf5`` group should be included in the subset. This callable is
+        called once for each subgroup that has a name prefixed with `"BEAM"`. If not
+        supplied, the default callable always returns ``True``, such that every
+        ``"BEAM*"`` subgroup is included. For convenience, the predicate functions
+        :py:func:`is_coverage_beam` and :py:func:`is_power_beam` may be used. Further,
+        the function returned by calling :py:func:`beam_filter_from_names` with a
+        specific list of BEAM names may be used.
     columns : Sequence[str]
         Column names to be included in the subset.  The specified column names must
         match dataset names within the `"BEAM*"` groups of the HDF5 file.  Although the
@@ -219,7 +252,10 @@ def subset_hdf5(
         GeoDataFrame containing the subset of the data from the HDF5 group/file that
         fall within the specified area of interest and satisfy the specified query.
         Columns are limited to the specified sequence of column names, along with
-        `filename` (str) and `BEAM` (str) columns.
+        `filename` (str) and `BEAM` (str) columns. Further, the query is applied to, and
+        the columns are selected from, only the top-level subgroups that have names
+        prefixed with ``"BEAM"`` and for which the ``beam_filter`` function returns
+        ``True``.
 
     Examples
     --------
@@ -236,12 +272,14 @@ def subset_hdf5(
     ...     group.create_dataset("lat_lowestmode", data=[-1.82556, -9.82514, -1.82471])
     ...     group.create_dataset("lon_lowestmode", data=[12.06648, 12.06678, 12.06707])
     ...     group.create_dataset("sensitivity", data=[0.9, 0.97, 0.99])
-    ...     group = hdf5.create_group("BEAM0001")
+    ...     group.attrs.create("description", "Coverage beam")
+    ...     group = hdf5.create_group("BEAM1011")
     ...     group.create_dataset("agbd", data=[1.1715966, 1.630395, 3.5265787])
     ...     group.create_dataset("l2_quality_flag", data=[0, 1, 1], dtype="i1")
     ...     group.create_dataset("lat_lowestmode", data=[-1.82557, -9.82515, -1.82472])
     ...     group.create_dataset("lon_lowestmode", data=[12.06649, 12.06679, 12.06708])
     ...     group.create_dataset("sensitivity", data=[0.93, 0.96, 0.98])
+    ...     group.attrs.create("description", "Full power beam")
     <HDF5 dataset "agbd": ...>
     <HDF5 dataset "l2_quality_flag": ...>
     <HDF5 dataset "lat_lowestmode": ...>
@@ -292,26 +330,32 @@ def subset_hdf5(
     ... }}])
 
     We can now subset the data in the HDF5 file to points that fall within the AOI,
-    selecting only the desired columns (i.e., named datasets within the HDF5 file), and
-    selecting only the rows that satisfy the specified query:
+    selecting only the desired columns (i.e., named datasets within the HDF5 file),
+    selecting only the coverage beams, and selecting only the rows that satisfy
+    the specified query:
 
     >>> with h5py.File(bio) as hdf5:
     ...     gdf = subset_hdf5(
-    ...         hdf5, aoi, ["agbd", "sensitivity"],
-    ...         "l2_quality_flag == 1 and sensitivity > 0.95"
+    ...         hdf5,
+    ...         aoi=aoi,
+    ...         lat="lat_lowestmode",
+    ...         lon="lon_lowestmode",
+    ...         beam_filter=is_coverage_beam,
+    ...         columns=["agbd", "sensitivity"],
+    ...         query="l2_quality_flag == 1 and sensitivity > 0.95"
     ...     )
     ...     # Since the source of our HDF5 file is an ``io.BytesIO``, we'll drop the
     ...     # `filename` column (which refers to the memory location of the
     ...     # ``io.BytesIO``, not a filename).
     ...     gdf.drop(columns=["filename"])
        BEAM      agbd  sensitivity                   geometry
     0  0000  1.116093         0.99  POINT (12.06707 -1.82471)
-    1  0001  3.526579         0.98  POINT (12.06708 -1.82472)
 
     Note that the resulting ``geopandas.GeoDataFrame`` contains only the specified
-    columns (`agbd` and `sensitivity`), and only the rows (only 1 from each "beam" in
-    this example) that have a geometry that falls within the AOI and also satisfy the
-    query (i.e., `l2_quality_flag == 1` and `sensitivity > 0.95`).
+    coverage `BEAM`s, specified columns (`agbd` and `sensitivity`), and only the
+    rows (only 1 from each "beam" in this example) that have a geometry that falls
+    within the AOI and also satisfy the query
+    (i.e., `l2_quality_flag == 1` and `sensitivity > 0.95`).
 
     Note also that although the `l2_quality_flag` was specified in the query, it does
     not appear in the result because it was not specified in the sequence of column
@@ -336,7 +380,11 @@ def subset_beam(beam: h5py.Group) -> gpd.GeoDataFrame:
         # Clip subset to the area of interest
         return gpd.clip(gdf, aoi.set_crs(epsg=4326))
 
-    beams = (group for name, group in hdf5.items() if name.startswith("BEAM"))
+    beams = (
+        group
+        for name, group in hdf5.items()
+        if name.startswith("BEAM") and beam_filter(group)
+    )
     beams_gdf = pd.concat(map(subset_beam, beams), ignore_index=True, copy=False)
     beams_gdf.insert(0, "filename", os.path.basename(hdf5.file.filename))
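
The beam predicates added in this diff can be exercised without an HDF5 file. The sketch below copies their logic and applies it the way the new `beams` generator does. `FakeBeam` and `select_beams` are hypothetical stand-ins introduced only for this illustration: `FakeBeam` models just the `name` and `attrs` members of an `h5py.Group`, and the default filter stands in for `fp.always(True)`.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Mapping, Sequence


@dataclass
class FakeBeam:
    """Hypothetical stand-in for h5py.Group: only `name` and `attrs` are modeled."""

    name: str
    attrs: Dict[str, str] = field(default_factory=dict)


BeamFilter = Callable[[FakeBeam], bool]


def is_coverage_beam(beam: FakeBeam) -> bool:
    return "COVERAGE" in beam.attrs.get("description", "").upper()


def is_power_beam(beam: FakeBeam) -> bool:
    return "POWER" in beam.attrs.get("description", "").upper()


def beam_filter_from_names(names: Sequence[str]) -> BeamFilter:
    def is_named_beam(beam: FakeBeam) -> bool:
        # Substring match, so "1011" and "BEAM1011" both select "/BEAM1011".
        return any(name.upper() in beam.name.upper() for name in names)

    return is_named_beam


def select_beams(
    groups: Mapping[str, FakeBeam],
    beam_filter: BeamFilter = lambda _beam: True,  # stands in for fp.always(True)
) -> List[str]:
    """Return names of top-level BEAM groups accepted by the filter."""
    return [
        name
        for name, group in groups.items()
        if name.startswith("BEAM") and beam_filter(group)
    ]


groups = {
    "BEAM0000": FakeBeam("/BEAM0000", {"description": "Coverage beam"}),
    "BEAM1011": FakeBeam("/BEAM1011", {"description": "Full power beam"}),
    "METADATA": FakeBeam("/METADATA"),
}

print(select_beams(groups))                                    # ['BEAM0000', 'BEAM1011']
print(select_beams(groups, is_coverage_beam))                  # ['BEAM0000']
print(select_beams(groups, is_power_beam))                     # ['BEAM1011']
print(select_beams(groups, beam_filter_from_names(["1011"])))  # ['BEAM1011']
```

Because the name filter uses substring matching, a short fragment such as `"00"` would match several beams, so callers would likely want to pass full beam names.
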