Minor update (#3)

* feat: Add UC Merced support
* feat: Add EuroSAT-MS support
* internal: move to devenv
* docs: add citation information
kai-tub · Jul 8, 2024 · 66a2e98
1 parent 9896e38
commit 66a2e98

Showing 18 changed files with 1,372 additions and 197 deletions.
diff --git a/.envrc b/.envrc
@@ -1 +1 @@
-use flake .#
+use flake .# --impure
diff --git a/.github/workflows/nix.yml b/.github/workflows/nix.yml
@@ -19,7 +19,9 @@ jobs:
       - name: Run `nix fmt`
         run: nix fmt -- --check *
       - name: Run `flake checks`
-        run: nix flake check -L
+        # impure is required as the devshell is tested as well
+        # and the devenv devshell requires the `impure` flag
+        run: nix flake check --impure -L
       - name: Create AppImage
         run: nix build .#rico-hdl-AppImage
       - name: Test appimage

diff --git a/.gitignore b/.gitignore
@@ -1,6 +1,7 @@
 /target
 .direnv
 .ipynb_checkpoints
+.devenv
 /result
 .virtual_documents/integration_tests
 *.AppImage

diff --git a/README.md b/README.md
@@ -46,6 +46,8 @@ Currently, `rico-hdl` supports:
 - [BigEarthNet-S2 v2.0][ben]
 - [BigEarthNet-MM v2.0][ben]
 - [HySpecNet-11k][hyspecnet]
+- [UC Merced Land Use][ucmerced]
+- [EuroSAT][euro]
 
 Additional datasets will be added in the near future.
 
@@ -164,7 +166,8 @@ rico-hdl hyspecnet-11k --dataset-dir <HYSPECNET_ROOT_DIR> --dataset-dir Encoded-
 
 In [HySpecNet-11k][hyspecnet], each patch contains 224 bands.
 The encoder will convert each patch into a [safetensors][s]
-dictionary, where the band index prefixed with `B` is the key (for example, `B1`, `B201`).
+dictionary, where the band index prefixed with `B` is the key (for example, `B1`, `B201`)
+of the safetensor dictionary.
 
 <details>
   <summary>Example Input</summary>
@@ -265,6 +268,181 @@ tensor = np.stack([safetensor_dict[f"B{k}"] for k in hyspecnet_bands if k not in
 assert tensor.shape == (202, 128, 128)
 ```
 
+### [UC Merced Land Use][ucmerced] Example
+
+First, [download the rico-hdl](#Download) binary and install
+the Python [lmdb][pyl] and [safetensors][pys] packages.
+Then, to convert the patches from the [UC Merced Land Use][ucmerced]
+dataset into the optimized format, call the application with:
+
+```bash
+rico-hdl uc-merced --dataset-dir <UC_MERCED_LAND_USE_ROOT_DIR> --dataset-dir Encoded-UC-Merced
+```
+
+In [UC Merced][ucmerced], each patch contains 3 bands (RGB).
+The encoder will convert each patch into a [safetensors][s]
+dictionary whose keys are the bands' color interpretations (`Red`, `Green`, `Blue`).
+
+<details>
+  <summary>Example Input</summary>
+
+```
+integration_tests/tiffs/UCMerced_LandUse
+└── Images
+   ├── airplane
+   │  ├── airplane00.tif
+   │  └── airplane42.tif
+   └── forest
+      ├── forest10.tif
+      └── forest99.tif
+```
+</details>
+
+<details>
+  <summary>LMDB Result</summary>
+
+```
+'airplane00':
+  {
+    'Red':   <256x256 uint8 safetensors image data>
+    'Green': <256x256 uint8 safetensors image data>
+    'Blue':  <256x256 uint8 safetensors image data>
+  },
+'airplane42':
+  {
+    'Red':   <256x256 uint8 safetensors image data>
+    'Green': <256x256 uint8 safetensors image data>
+    'Blue':  <256x256 uint8 safetensors image data>
+  },
+'forest10':
+  {
+    'Red':   <256x256 uint8 safetensors image data>
+    'Green': <256x256 uint8 safetensors image data>
+    'Blue':  <256x256 uint8 safetensors image data>
+  },
+'forest99':
+  {
+    'Red':   <256x256 uint8 safetensors image data>
+    'Green': <256x256 uint8 safetensors image data>
+    'Blue':  <256x256 uint8 safetensors image data>
+  }
+```
+
+</details>
+
+```python
+import lmdb
+import numpy as np
+# import desired deep-learning library:
+# numpy, torch, tensorflow, paddle, flax, mlx
+from safetensors.numpy import load
+from pathlib import Path
+
+encoded_path = Path("Encoded-UC-Merced")
+
+# Make sure to only open the environment once
+# and not every time an item is accessed.
+env = lmdb.open(str(encoded_path), readonly=True)
+
+with env.begin() as txn:
+  # string encoding is required to map the string to an LMDB key
+  safetensor_dict = load(txn.get("airplane00".encode()))
+
+tensor = np.stack([safetensor_dict[key] for key in ["Red", "Green", "Blue"]])
+assert tensor.shape == (3, 256, 256)
+```
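
The example listings above suggest that each LMDB key is the file name without its extension and that the class name is the key without its trailing digits. The following is a minimal sketch under that assumption; the helper name `ucmerced_key_and_label` is hypothetical and not part of `rico-hdl`.

```python
import re
from pathlib import Path


def ucmerced_key_and_label(tif_path):
    """Return (lmdb_key, class_label) for a UC Merced file path.

    Assumption: the LMDB key equals the file stem and the class label is
    the stem without its trailing digits, as in the example listings.
    """
    key = Path(tif_path).stem
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", key)
    if match is None:
        raise ValueError(f"unexpected UC Merced file name: {tif_path!r}")
    return key, match.group(1)


key, label = ucmerced_key_and_label("UCMerced_LandUse/Images/airplane/airplane00.tif")
print(key, label)  # prints "airplane00 airplane"
```

The returned key can be encoded and passed to `txn.get` as in the snippet above, while the label can serve as the classification target.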
+
+### [EuroSAT][euro] Example
+
+First, [download the rico-hdl](#Download) binary and install
+the Python [lmdb][pyl] and [safetensors][pys] packages.
+Then, to convert the patches from the [EuroSAT][euro] multi-spectral
+dataset into the optimized format, call the application with:
+
+```bash
+rico-hdl eurosat-multi-spectral --dataset-dir <EURO_SAT_MS_ROOT_DIR> --dataset-dir Encoded-EuroSAT-MS
+```
+
+In [EuroSAT][euro], each patch contains 13 bands from a Sentinel-2 L1C tile.
+The encoder will convert each patch into a [safetensors][s]
+dictionary, where the band name is the key
+(`B01`, `B02`, ..., `B10`, `B11`, `B12`, `B08A`).
+
+<details>
+  <summary>Example Input</summary>
+
+```
+integration_tests/tiffs/EuroSAT_MS
+├── AnnualCrop
+│  └── AnnualCrop_1.tif
+├── Pasture
+│  └── Pasture_300.tif
+└── SeaLake
+   └── SeaLake_3000.tif
+```
+</details>
+
+<details>
+  <summary>LMDB Result</summary>
+
+```
+'AnnualCrop_1':
+  {
+    'B01':   <64x64 uint16 safetensors image data>,
+    'B02':   <64x64 uint16 safetensors image data>,
+    'B03':   <64x64 uint16 safetensors image data>,
+    'B04':   <64x64 uint16 safetensors image data>,
+    'B05':   <64x64 uint16 safetensors image data>,
+    'B06':   <64x64 uint16 safetensors image data>,
+    'B07':   <64x64 uint16 safetensors image data>,
+    'B08':   <64x64 uint16 safetensors image data>,
+    'B09':   <64x64 uint16 safetensors image data>,
+    'B10':   <64x64 uint16 safetensors image data>,
+    'B11':   <64x64 uint16 safetensors image data>,
+    'B12':   <64x64 uint16 safetensors image data>,
+    'B08A':  <64x64 uint16 safetensors image data>,
+  },
+```
+
+</details>
+
+```python
+import lmdb
+import numpy as np
+# import desired deep-learning library:
+# numpy, torch, tensorflow, paddle, flax, mlx
+from safetensors.numpy import load
+from pathlib import Path
+
+encoded_path = Path("Encoded-EuroSAT-MS")
+
+# Make sure to only open the environment once
+# and not every time an item is accessed.
+env = lmdb.open(str(encoded_path), readonly=True)
+
+with env.begin() as txn:
+  # string encoding is required to map the string to an LMDB key
+  safetensor_dict = load(txn.get("AnnualCrop_1".encode()))
+
+tensor = np.stack([safetensor_dict[key] for key in [
+  "B01",
+  "B02",
+  "B03",
+  "B04",
+  "B05",
+  "B06",
+  "B07",
+  "B08",
+  "B09",
+  "B10",
+  "B11",
+  "B12",
+  "B08A"
+]])
+assert tensor.shape == (13, 64, 64)
+```
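
The 13 band keys do not have to be spelled out by hand. The sketch below builds them in the order used in the LMDB result listing above and derives a class label from a key. It assumes keys of the form `<ClassName>_<index>`, as in the example input; `EUROSAT_MS_BANDS` and `eurosat_label` are hypothetical names and not part of `rico-hdl`.

```python
# Band keys in the order shown in the LMDB result listing (assumed order).
EUROSAT_MS_BANDS = [f"B{i:02d}" for i in range(1, 13)] + ["B08A"]


def eurosat_label(key):
    """Return the class name for a key such as 'AnnualCrop_1'.

    Assumption: keys follow the '<ClassName>_<index>' pattern from the
    example input listing.
    """
    label, _, index = key.rpartition("_")
    if not label or not index.isdigit():
        raise ValueError(f"unexpected EuroSAT key: {key!r}")
    return label


print(len(EUROSAT_MS_BANDS))  # prints 13
print(eurosat_label("AnnualCrop_1"))  # prints "AnnualCrop"
```

`EUROSAT_MS_BANDS` can replace the literal list passed to `np.stack` in the snippet above.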
+
 
 ## Design
 
@@ -304,10 +482,27 @@ These characteristics make array-structured data formats less suitable for deep-
 
 </details>
 
+## Citation
+
+If you use this work, please cite:
+
+```bibtex
+@article{clasen2024refinedbigearthnet,
+  title={reBEN: Refined BigEarthNet Dataset for Remote Sensing Image Analysis},
+  author={Clasen, Kai Norman and Hackel, Leonard and Burgert, Tom and Sumbul, Gencer and Demir, Beg{\"u}m and Markl, Volker},
+  year={2024},
+  eprint={2407.03653},
+  archivePrefix={arXiv},
+  primaryClass={cs.CV},
+  url={https://arxiv.org/abs/2407.03653},
+}
+```
 
 [ben]: https://bigearth.net
 [LMDB]: https://www.symas.com/lmdb
 [s]: https://huggingface.co/docs/safetensors/index
 [hyspecnet]: https://hyspecnet.rsim.berlin/
 [pyl]: https://lmdb.readthedocs.io/en/release/
 [pys]: https://github.com/huggingface/safetensors
+[ucmerced]: http://weegee.vision.ucmerced.edu/datasets/landuse.html
+[euro]: https://zenodo.org/records/7711810