Integration of auxiliary data (metadata) in models (#109)
* Fixup bg percent check in no-bg cases

* Fixup non-bg/bg sample percent check

* Fixup sample script utils import for consistency

* Update sample class count check to remove extra data loop

* Update sample dataset resize func w/ generic version

* Add missing debug/scale to main test config

* Add missing loss/optim/ignoreidx vals to main test cfg

* Move sample metadata fields to parallel hdf5 datasets

The previous implementation overwrote the metadata attributes each time a new
raster was parsed; this version lets multiple versions exist in parallel. The
metadata itself is tied to each sample using an index that corresponds to the
position of the metadata string in the separate dataset. This implementation
also stores the entire raster YAML metadata dict as a single string that may
be eval'd and re-instantiated as needed at runtime.
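A minimal sketch of that parallel-dataset layout using h5py. The dataset
names ('metadata', 'meta_idx') and helper functions are illustrative, not the
actual GDL schema, and `ast.literal_eval` stands in for the eval step:

```python
import ast
import h5py
import numpy as np

str_dt = h5py.special_dtype(vlen=str)  # variable-length strings

with h5py.File("samples_trn.hdf5", "w") as f:
    meta_ds = f.create_dataset("metadata", shape=(0,), maxshape=(None,), dtype=str_dt)
    idx_ds = f.create_dataset("meta_idx", shape=(0,), maxshape=(None,), dtype=np.int64)

    def append_raster_metadata(meta_dict):
        # Store the whole YAML metadata dict as one eval-able string; each new
        # raster appends a new entry instead of overwriting shared attributes.
        meta_ds.resize((len(meta_ds) + 1,))
        meta_ds[-1] = repr(meta_dict)
        return len(meta_ds) - 1

    def tag_sample(raster_meta_index):
        # Tie each written sample to its raster's metadata via the index.
        idx_ds.resize((len(idx_ds) + 1,))
        idx_ds[-1] = raster_meta_index

    tag_sample(append_raster_metadata({"properties": {"eo:gsd": 0.4}}))
    # At runtime, re-instantiate the dict tied to sample 0:
    sample_meta = ast.literal_eval(meta_ds[int(idx_ds[0])])
```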

* Remove top-level trn/val/tst split config

* Remove useless class weight vector from test config

* Update segmentation dataset interface to load metadata

* Add metadata unpacking in segm dataset interface

* Fix parameter check to support zero-based values

The previous implementation used a truthiness check, so valid zero values
could never override the non-null defaults of some hyperparameters. For
example, when 'ignore_index' was set to '0' (which is perfectly valid), the
value would be skipped and the default of '-100' would remain.
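The gist of the fix, as a hedged sketch (the function name is assumed):
compare against None explicitly so that falsy-but-valid values survive.

```python
def set_hyperparameter(params, key, default):
    # Hypothetical helper illustrating the fix. The old check was effectively
    # `if params.get(key):`, which skipped 0 and left the default in place.
    value = params.get(key)
    return default if value is None else value

assert set_hyperparameter({"ignore_index": 0}, "ignore_index", -100) == 0      # fixed
assert set_hyperparameter({"ignore_index": None}, "ignore_index", -100) == -100
```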

* Update hdf5 label map dtypes to int16

* Add coordconv layers & utils in new module
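For context, the CoordConv idea (Liu et al., 2018) appends normalized x/y
coordinate channels to the input before a regular convolution. A rough
PyTorch sketch; names are illustrative rather than the new module's actual API:

```python
import torch
import torch.nn as nn

class CoordConv2d(nn.Module):
    """Conv2d preceded by concatenation of normalized coordinate channels."""
    def __init__(self, in_channels, out_channels, kernel_size, **kwargs):
        super().__init__()
        self.conv = nn.Conv2d(in_channels + 2, out_channels, kernel_size, **kwargs)

    def forward(self, x):
        n, _, h, w = x.shape
        # Coordinates scaled to [-1, 1]; a 'coordconv_scale' factor would rescale these.
        yy = torch.linspace(-1, 1, h, device=x.device).view(1, 1, h, 1).expand(n, 1, h, w)
        xx = torch.linspace(-1, 1, w, device=x.device).view(1, 1, 1, w).expand(n, 1, h, w)
        return self.conv(torch.cat([x, yy, xx], dim=1))

out = CoordConv2d(3, 16, kernel_size=3, padding=1)(torch.randn(2, 3, 64, 64))
```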

* Add metadata-enabled segm dataset parsing interface

* Add util function for fetching vals in a dictionary
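Judging from the meta_map keys used further down (e.g. "properties/eo:gsd"),
the utility presumably walks a nested dict along a '/'-delimited path; a
hedged reconstruction:

```python
def get_key_recursive(key, config):
    # Hypothetical sketch: "a/b/c" -> config["a"]["b"]["c"]
    parts = key.split("/") if isinstance(key, str) else list(key)
    value = config[parts[0]]
    return get_key_recursive(parts[1:], value) if len(parts) > 1 else value

assert get_key_recursive("properties/eo:gsd", {"properties": {"eo:gsd": 0.4}}) == 0.4
```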

* Update model_choice to allow enabling coordconv via config

* Cleanup dataset creation util func w/ subset loop

* Refactor image reader & vector rasterizer utilities

These utility functions are now more generic than before. The rasterization
utility (vector_to_raster) now lives in the 'utils' package and supports burning
vectors into separate layers as well as into a single layer (the original
behavior). The new multi-layer behavior is used by the updated
'image_reader_as_array' utility to (optionally) append new layers to the raw imagery.

The refactoring also allowed a cleanup of the 'assert_band_number' utility function
and a simplification of the code in both the inference script ('inference.py') and
the dataset preparation script ('images_to_samples.py').
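A hedged sketch of the single-layer vs. multi-layer burning behavior, built
on rasterio.features.rasterize; the real vector_to_raster signature may differ:

```python
import numpy as np
from rasterio import features

def vector_to_raster(geoms_by_value, out_shape, transform, separate=False):
    if not separate:
        # Original behavior: burn every geometry into one layer, pixel = class value.
        shapes = [(g, val) for val, geoms in geoms_by_value.items() for g in geoms]
        return features.rasterize(shapes, out_shape=out_shape, transform=transform,
                                  fill=0, dtype="int16")
    # New behavior: one binary layer per value, e.g. to append to raw imagery.
    layers = [features.rasterize([(g, 1) for g in geoms], out_shape=out_shape,
                                 transform=transform, fill=0, dtype="int16")
              for _, geoms in sorted(geoms_by_value.items())]
    return np.stack(layers, axis=-1)
```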

* Update meta-segm dataset parser to make map optional

* Cleanup SegmDataset so that only the zero-dontcare case is handled differently

* Refactor 'create_dataloader' function in training script

The function now inspects the parameter dictionary to check whether a 'meta_map'
is provided. If so, the segmentation dataset parser is replaced by its upgraded
version, which can append extra (metadata) layers onto loaded tensors based on
that predefined mapping.

The refactoring also folds the 'get_num_samples' call directly into the
'create_dataloader' function.
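Schematically, the dispatch looks like this (the dataset classes below are
minimal stand-ins for the real hdf5-backed parsers, not their actual code):

```python
from torch.utils.data import DataLoader, Dataset

class SegmentationDataset(Dataset):
    # Minimal stand-in for the real parser.
    def __init__(self, tensors): self.tensors = tensors
    def __len__(self): return len(self.tensors)
    def __getitem__(self, i): return self.tensors[i]

class MetaSegmentationDataset(SegmentationDataset):
    # Upgraded parser that would append meta layers onto loaded tensors.
    def __init__(self, tensors, meta_map):
        super().__init__(tensors)
        self.meta_map = meta_map

def create_dataloader(tensors, params, batch_size=8):
    meta_map = params["global"].get("meta_map")
    dataset = (MetaSegmentationDataset(tensors, meta_map) if meta_map
               else SegmentationDataset(tensors))
    num_samples = len(dataset)  # the get_num_samples step now lives in here
    return DataLoader(dataset, batch_size=batch_size), num_samples

loader, n = create_dataloader(list(range(32)), {"global": {"meta_map": None}})
```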

* Update create_dataloader util to force-fix dontcare val

* Update read_csv to parse metadata config file with raster path

A metadata (YAML) file can now be associated with each raster file that will be
split into samples. The previous version only allowed a single global metadata
file to be parsed.
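The CSV diffs below show the new optional second column; a hedged sketch of
the per-row parsing (the dict keys are assumed, not the script's exact names):

```python
import csv

def read_csv(csv_file_name):
    # Columns: raster, optional per-raster metadata YAML, vector file,
    # attribute to burn, subset (trn/val/tst).
    list_values = []
    with open(csv_file_name, "r") as f:
        for row in csv.reader(f):
            list_values.append({"tif": row[0],
                                "meta": row[1] if row[1] else None,
                                "gpkg": row[2],
                                "attribute_name": row[3],
                                "dataset": row[4]})
    return list_values
```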

* Cleanup package imports & add missing import to utils

* Refactor meta-segm-dataset parser to expose meta layer append util
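A guess at what the exposed append utility does, assuming a '/'-delimited key
lookup and [0,1] scaling bounds: each mapped metadata value becomes a constant
extra channel on the sample.

```python
from functools import reduce
import numpy as np

def append_meta_layers(sample_hwc, meta, meta_map):
    # meta_map example: {"properties/eo:gsd": "scaled_channel"}
    layers = [sample_hwc]
    for key, mode in meta_map.items():
        value = float(reduce(lambda d, k: d[k], key.split("/"), meta))
        if mode == "scaled_channel":
            value = min(max(value, 0.0), 1.0)  # assumed clamp into [0, 1]
        layers.append(np.full(sample_hwc.shape[:2] + (1,), value,
                              dtype=sample_hwc.dtype))
    return np.concatenate(layers, axis=-1)

x = append_meta_layers(np.zeros((256, 256, 4), dtype=np.float32),
                       {"properties": {"eo:gsd": 0.4}},
                       {"properties/eo:gsd": "scaled_channel"})
assert x.shape == (256, 256, 5)
```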

* Move meta_map param from training cfg to global cfg

* Add meta-layer support to inference.py

* Move meta-layer concat to proper location in inference script

* Update meta-enabled config for unet tests

* Move meta-segm cfg to proper dir & add coordconv cfg

* Update csv column count check to allow extras

* Update i2s and inf band count checks to account for meta layers

* Fixup missing meta field in csv parsing output dicts

* Fixup band count in coordconv ex config

* Fixup image reader util to avoid double copies

* Cleanup vector rasterization utils & recursive key getter

* Update aux distmap computing to make target ids optional & add log maps
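A plausible sketch of the distance-map step (details assumed, with scipy as a
stand-in): Euclidean distance to the nearest burned aux-vector pixel, with the
optional log variant compressing the dynamic range.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def aux_distance_map(burned_mask, log_map=True):
    # Distance is 0 on the burned features (e.g. roads) and grows away from
    # them; log1p keeps far-away pixels from dominating the channel.
    dist = distance_transform_edt(burned_mask == 0)
    return np.log1p(dist) if log_map else dist

dm = aux_distance_map(np.eye(8, dtype=np.uint8))
```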

* Add canvec aux test config and cleanup aux params elsewhere

* Add download links for external (non-private) files

* Re-add previously deleted files from default gdl data dir

* Update i2s/train/inf scripts to support single class segm

* Fixup gpu stats display when gpu is disabled

* Add missing empty metadata fields in test CSVs

* Fixup improper device upload in classif inference

* Update travis to use recent pkgs in conda-forge
fmigneault authored and mpelchat04 committed Nov 6, 2019
1 parent 278fd9a commit 2d04470
Showing 17 changed files with 837 additions and 222 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -0,0 +1,3 @@
.idea/
.vscode/
__pycache__
8 changes: 5 additions & 3 deletions .travis.yml
@@ -6,11 +6,13 @@ install:
   - bash miniconda.sh -b -p $HOME/miniconda
   - export PATH="$HOME/miniconda/bin:$PATH"
   - hash -r
-  - conda config --set always_yes yes --set changeps1 no
+  - conda config --set always_yes yes
+  - conda config --set changeps1 no
+  - conda config --prepend channels conda-forge
+  - conda config --prepend channels pytorch
   - conda update -q conda
   - conda info -a
 
-  - conda create -q -n ci_env python=3.6 pytorch-cpu torchvision-cpu torchvision ruamel_yaml h5py scikit-image scikit-learn fiona rasterio tqdm -c pytorch
+  - conda create -q -n ci_env python=3.6 pytorch-cpu torchvision-cpu torchvision ruamel_yaml h5py>=2.10 scikit-image scikit-learn fiona rasterio tqdm
   - source activate ci_env
 before_script:
   - unzip ./data/massachusetts_buildings.zip -d ./data
69 changes: 69 additions & 0 deletions conf/config.canvecaux.yaml
@@ -0,0 +1,69 @@
# Deep learning configuration file ------------------------------------------------
# Five sections :
#   1) Global parameters; those are re-used amongst the next three operations (sampling, training and inference)
#   2) Sampling parameters
#   3) Training parameters
#   4) Inference parameters
#   5) Model parameters

# Global parameters

global:
  samples_size: 256
  num_classes: 5
  data_path: ./data/kingston_wv2_40cm/images
  number_of_bands: 4
  model_name: unet # One of unet, unetsmall, checkpointed_unet or ternausnet
  bucket_name: # name of the S3 bucket where data is stored. Leave blank if using local files
  task: segmentation # Task to perform. Either segmentation or classification
  num_gpus: 1
  aux_vector_file: ./data/canvec_191031_127357_roads.gpkg # https://drive.google.com/file/d/1PCxn2197NiOVKOxGgQIA__w69jAJmjXp
  aux_vector_dist_maps: true
  meta_map:
  scale_data: [0,1]
  debug_mode: True
  coordconv_convert:
  coordconv_scale:

# Sample parameters; used in images_to_samples.py -------------------

sample:
  prep_csv_file: ./data/trn_val_tst_kingston.csv # https://drive.google.com/file/d/1uNizOAToa-R_sik0DvBqDUVwjqYdOALJ
  samples_dist: 200
  min_annotated_percent: 10 # Min % of non background pixels in stored samples. Default: 0
  mask_reference: False

# Training parameters; used in train_model.py ----------------------

training:
  state_dict_path:
  output_path: ./data/output
  num_trn_samples:
  num_val_samples:
  num_tst_samples:
  batch_size: 8
  num_epochs: 100
  loss_fn: Lovasz # One of CrossEntropy, Lovasz, Focal, OhemCrossEntropy (*Lovasz for segmentation tasks only)
  optimizer: adam # One of adam, sgd or adabound
  learning_rate: 0.0001
  weight_decay: 0
  step_size: 4
  gamma: 0.9
  class_weights:
  batch_metrics: # (int) Metrics computed every (int) batches. If left blank, will not perform metrics. If (int)=1, metrics computed on all batches.
  ignore_index: 0 # Specifies a target value that is ignored and does not contribute to the input gradient. Default: None
  augmentation:
    rotate_limit: 45
    rotate_prob: 0.5
    hflip_prob: 0.5
  dropout:
  dropout_prob:

# Inference parameters; used in inference.py --------

inference:
  img_dir_or_csv_file: ./data/trn_val_tst_kingston.csv # https://drive.google.com/file/d/1uNizOAToa-R_sik0DvBqDUVwjqYdOALJ
  working_folder: ./data/output
  state_dict_path: ./data/output/checkpoint.pth.tar
  chunk_size: 256 # (int) Size (height and width) of each prediction patch. Default: 512
  overlap: 10 # (int) Percentage of overlap between 2 chunks. Default: 10
69 changes: 69 additions & 0 deletions conf/config.coordconv.yaml
@@ -0,0 +1,69 @@
# Deep learning configuration file ------------------------------------------------
# Five sections :
#   1) Global parameters; those are re-used amongst the next three operations (sampling, training and inference)
#   2) Sampling parameters
#   3) Training parameters
#   4) Inference parameters
#   5) Model parameters

# Global parameters

global:
  samples_size: 256
  num_classes: 5
  data_path: ./data/kingston_wv2_40cm/images
  number_of_bands: 3
  model_name: unet # One of unet, unetsmall, checkpointed_unet or ternausnet
  bucket_name: # name of the S3 bucket where data is stored. Leave blank if using local files
  task: segmentation # Task to perform. Either segmentation or classification
  num_gpus: 1
  aux_vector_file:
  aux_vector_dist_maps:
  meta_map:
  scale_data: [0,1]
  debug_mode: True
  coordconv_convert: true
  coordconv_scale: 0.4

# Sample parameters; used in images_to_samples.py -------------------

sample:
  prep_csv_file: ./data/trn_val_tst_kingston.csv # https://drive.google.com/file/d/1uNizOAToa-R_sik0DvBqDUVwjqYdOALJ
  samples_dist: 200
  min_annotated_percent: 10 # Min % of non background pixels in stored samples. Default: 0
  mask_reference: False

# Training parameters; used in train_model.py ----------------------

training:
  state_dict_path:
  output_path: ./data/output
  num_trn_samples:
  num_val_samples:
  num_tst_samples:
  batch_size: 8
  num_epochs: 100
  loss_fn: Lovasz # One of CrossEntropy, Lovasz, Focal, OhemCrossEntropy (*Lovasz for segmentation tasks only)
  optimizer: adam # One of adam, sgd or adabound
  learning_rate: 0.0001
  weight_decay: 0
  step_size: 4
  gamma: 0.9
  class_weights:
  batch_metrics: # (int) Metrics computed every (int) batches. If left blank, will not perform metrics. If (int)=1, metrics computed on all batches.
  ignore_index: 0 # Specifies a target value that is ignored and does not contribute to the input gradient. Default: None
  augmentation:
    rotate_limit: 45
    rotate_prob: 0.5
    hflip_prob: 0.5
  dropout:
  dropout_prob:

# Inference parameters; used in inference.py --------

inference:
  img_dir_or_csv_file: ./data/trn_val_tst_kingston.csv # https://drive.google.com/file/d/1uNizOAToa-R_sik0DvBqDUVwjqYdOALJ
  working_folder: ./data/output
  state_dict_path: ./data/output/checkpoint.pth.tar
  chunk_size: 256 # (int) Size (height and width) of each prediction patch. Default: 512
  overlap: 10 # (int) Percentage of overlap between 2 chunks. Default: 10
70 changes: 70 additions & 0 deletions conf/config.metasegm.yaml
@@ -0,0 +1,70 @@
# Deep learning configuration file ------------------------------------------------
# Five sections :
#   1) Global parameters; those are re-used amongst the next three operations (sampling, training and inference)
#   2) Sampling parameters
#   3) Training parameters
#   4) Inference parameters
#   5) Model parameters

# Global parameters

global:
  samples_size: 256
  num_classes: 5
  data_path: ./data/kingston_wv2_40cm/images
  number_of_bands: 5
  model_name: unet # One of unet, unetsmall, checkpointed_unet or ternausnet
  bucket_name: # name of the S3 bucket where data is stored. Leave blank if using local files
  task: segmentation # Task to perform. Either segmentation or classification
  num_gpus: 1
  aux_vector_file:
  aux_vector_dist_maps:
  meta_map:
    "properties/eo:gsd": "scaled_channel"
  scale_data: [0,1]
  debug_mode: True
  coordconv_convert:
  coordconv_scale:

# Sample parameters; used in images_to_samples.py -------------------

sample:
  prep_csv_file: ./data/trn_val_tst_kingston.csv # https://drive.google.com/file/d/1uNizOAToa-R_sik0DvBqDUVwjqYdOALJ
  samples_dist: 200
  min_annotated_percent: 10 # Min % of non background pixels in stored samples. Default: 0
  mask_reference: False

# Training parameters; used in train_model.py ----------------------

training:
  state_dict_path:
  output_path: ./data/output
  num_trn_samples:
  num_val_samples:
  num_tst_samples:
  batch_size: 8
  num_epochs: 100
  loss_fn: Lovasz # One of CrossEntropy, Lovasz, Focal, OhemCrossEntropy (*Lovasz for segmentation tasks only)
  optimizer: adam # One of adam, sgd or adabound
  learning_rate: 0.0001
  weight_decay: 0
  step_size: 4
  gamma: 0.9
  class_weights:
  batch_metrics: # (int) Metrics computed every (int) batches. If left blank, will not perform metrics. If (int)=1, metrics computed on all batches.
  ignore_index: 0 # Specifies a target value that is ignored and does not contribute to the input gradient. Default: None
  augmentation:
    rotate_limit: 45
    rotate_prob: 0.5
    hflip_prob: 0.5
  dropout:
  dropout_prob:

# Inference parameters; used in inference.py --------

inference:
  img_dir_or_csv_file: ./data/trn_val_tst_kingston.csv # https://drive.google.com/file/d/1uNizOAToa-R_sik0DvBqDUVwjqYdOALJ
  working_folder: ./data/output
  state_dict_path: ./data/output/checkpoint.pth.tar
  chunk_size: 256 # (int) Size (height and width) of each prediction patch. Default: 512
  overlap: 10 # (int) Percentage of overlap between 2 chunks. Default: 10
4 changes: 2 additions & 2 deletions conf/config_ci_segmentation_local.yaml
@@ -10,7 +10,7 @@
 
 global:
   samples_size: 256
-  num_classes: 2
+  num_classes: 1 # will automatically create a 'background' class
   data_path: ./data
   number_of_bands: 3
   model_name: checkpointed_unet # One of unet, unetsmall, checkpointed_unet, ternausnet, fcn_resnet101, deeplabv3_resnet101
@@ -47,7 +47,7 @@ training:
   dropout_prob: False # (float) Set dropout probability, e.g. 0.5
   class_weights: [1.0, 2.0]
   batch_metrics: 1
-  ignore_index: 0 # Specifies a target value that is ignored and does not contribute to the input gradient
+  ignore_index: # Specifies a target value that is ignored and does not contribute to the input gradient
   augmentation:
     rotate_limit: 45
     rotate_prob: 0.5
8 changes: 4 additions & 4 deletions data/images_to_samples_ci_csv.csv
@@ -1,4 +1,4 @@
-./data/22978945_15.tif,./data/massachusetts_buildings.gpkg,class,trn
-./data/23429155_15.tif,./data/massachusetts_buildings.gpkg,class,val
-./data/23429155_15.tif,./data/massachusetts_buildings.gpkg,class,val
-./data/23429155_15.tif,./data/massachusetts_buildings.gpkg,class,tst
+./data/22978945_15.tif,,./data/massachusetts_buildings.gpkg,properties/class,trn
+./data/23429155_15.tif,,./data/massachusetts_buildings.gpkg,properties/class,val
+./data/23429155_15.tif,,./data/massachusetts_buildings.gpkg,properties/class,val
+./data/23429155_15.tif,,./data/massachusetts_buildings.gpkg,properties/class,tst
6 changes: 3 additions & 3 deletions data/inference_classif_ci_csv.csv
@@ -1,3 +1,3 @@
-./data/classification/135.tif
-./data/classification/408.tif
-./data/classification/2533.tif
+./data/classification/135.tif,
+./data/classification/408.tif,
+./data/classification/2533.tif,
4 changes: 2 additions & 2 deletions data/inference_sem_seg_ci_csv.csv
@@ -1,2 +1,2 @@
-./data/22978945_15.tif
-./data/23429155_15.tif
+./data/22978945_15.tif,
+./data/23429155_15.tif,