KIT-HYD/HER

HER: an information theoretic alternative for geostatistics

The HER method is a stochastic geostatistical estimator that combines information theory with probability aggregation methods to minimize predictive uncertainty and to predict distributions directly from empirical probability. Histogram via entropy reduction (HER) relaxes parametrizations, which avoids the risk of adding information not present in the data (or losing available information). It provides a framework for uncertainty estimation that accounts for both spatial configuration and data values, while allowing continuous or discontinuous characteristics of the field to be inferred (or introduced). We investigate the utility of the framework using synthetic datasets generated from Gaussian processes with different sample sizes and data properties (different spatial correlation distances and the addition of noise). The HER method brings a new perspective on spatial interpolation and uncertainty analysis to geostatistics and statistical learning through the lens of information theory.

The code and datasets accompany the study by Thiesen, Vieira, Mälicke, Loritz, Wellmann and Ehret (2020):

Thiesen, S.; Vieira, D.; Mälicke, M.; Loritz, R.; Wellmann, J. F.; Ehret, U.: Histogram via entropy reduction (HER): an information-theoretic alternative for geostatistics, Hydrol. Earth Syst. Sci., 24(9), 4523–4540, https://doi.org/10.5194/hess-24-4523-2020, 2020.

License agreement

The HER method comes with ABSOLUTELY NO WARRANTY. You are welcome to modify and redistribute it under the terms of the license. The HER method is published under the Creative Commons "CC-BY-4.0" license together with a ready-to-use sample dataset. To view the full license agreement, please visit CC-BY-4.0.

Requisites

  • MATLAB (tested on R2018b).

Usage

See HER.m

File structure

  • HER script ... .m
  • functions/ ... .m
  • datasets/ ... .mat

HER

The script is divided into 8 sections:

1. Load dataset. Loads the dataset.

2. Define infogram and Geo3 properties. Defines the infogram properties, the aggregation method and, optionally, a z threshold.

3. Geo1: Spatial characterization. Extracts spatial correlation patterns (f_her_infogram.m).

4. Geo2: Weight optimization. Optimizes the weights of the aggregation method based on entropy minimization (f_her_weight.m).

5. Geo3: z PMF prediction. Applies the spatial characterization and the optimal weights to predict PMFs (f_her_predict.m).

6. Extract PMF statistics. Obtains the mean, median, mode and, optionally, the probability of a z threshold from the predicted z PMFs, and plots the results.

7. Calculate performance metrics. Calculates the Root Mean Square Error (RMSE), Mean Error (ME), Mean Absolute Error (MAE), Nash-Sutcliffe model efficiency and scoring rule (DKL) for the validation set.

8. Clear. Clears intermediate variables.
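Steps 6 and 7 can be illustrated with a minimal sketch. It is written in Python with the standard library only and does not reproduce the repository's MATLAB code: the function names `pmf_statistics` and `deterministic_metrics`, the use of bin centers, the "probability at or below the threshold" reading of the z threshold, and the sign convention of ME (predicted minus observed) are all assumptions made for illustration. The DKL scoring rule is left out because its exact definition is not given here.

```python
import math


def pmf_statistics(bin_centers, pmf, z_threshold=None):
    """Summary statistics of one discrete PMF defined on bin centers."""
    mean = sum(c * p for c, p in zip(bin_centers, pmf))

    # Median: first bin center at which the cumulative probability reaches 0.5.
    cumulative, median = 0.0, bin_centers[-1]
    for c, p in zip(bin_centers, pmf):
        cumulative += p
        if cumulative >= 0.5:
            median = c
            break

    # Mode: bin center with the highest probability (first one on ties).
    mode = bin_centers[max(range(len(pmf)), key=pmf.__getitem__)]

    stats = {"mean": mean, "median": median, "mode": mode}
    if z_threshold is not None:
        # Assumed reading: probability that z is at or below the threshold.
        stats["p_below_threshold"] = sum(
            p for c, p in zip(bin_centers, pmf) if c <= z_threshold
        )
    return stats


def deterministic_metrics(observed, predicted):
    """RMSE, ME, MAE and Nash-Sutcliffe efficiency of point predictions."""
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]  # assumed sign convention
    obs_mean = sum(observed) / n
    sse = sum(e * e for e in errors)
    return {
        "RMSE": math.sqrt(sse / n),
        "ME": sum(errors) / n,
        "MAE": sum(abs(e) for e in errors) / n,
        "NSE": 1.0 - sse / sum((o - obs_mean) ** 2 for o in observed),
    }


# Toy inputs, not taken from the study.
centers = [0.0, 1.0, 2.0, 3.0]
pmf = [0.1, 0.2, 0.4, 0.3]
stats = pmf_statistics(centers, pmf, z_threshold=1.0)
metrics = deterministic_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

In this sketch any of the three point statistics (mean, median or mode) could be passed as the prediction to `deterministic_metrics`; which one the study evaluates is described in the paper, not here.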

Functions

Each function is documented within its own source file. Examples of how to use them are available in the HER.m script.

f_DKL_w_AND.m
f_DKL_w_OR.m
f_diff.m
f_entropy.m
f_euclidean_dist.m
f_linear_aggregation.m
f_loglinear_aggregation.m
f_performance_det.m
f_performance_prob.m
f_her_infogram.m
f_her_weight.m
f_her_predict.m
f_extract_pmf_statistics.m
f_plot_infogram.m
f_plot_weights.m
f_plot_prediction.m
f_plot_probabilitymap.m

The f_plot_* functions were built specifically for the dataset of the study.
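The concepts behind the entropy, DKL and aggregation functions listed above can be sketched generically. The example below is in Python (standard library, Python 3.8+ for `math.prod`) and is not a port of the MATLAB files: the function names, the use of base-2 logarithms, and the reading of linear aggregation as a weighted sum ("OR") and log-linear aggregation as a normalized weighted product ("AND") are assumptions inferred from the file names.

```python
import math


def entropy(pmf):
    """Shannon entropy in bits; zero-probability bins contribute nothing."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)


def kl_divergence(p, q):
    """D_KL(p || q) in bits; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def linear_aggregation(pmfs, weights):
    """Weighted sum of PMFs, renormalized by the total weight."""
    total = sum(weights)
    n_bins = len(pmfs[0])
    return [
        sum(w * pmf[k] for w, pmf in zip(weights, pmfs)) / total
        for k in range(n_bins)
    ]


def loglinear_aggregation(pmfs, weights):
    """Weighted product of PMFs, renormalized to sum to one.

    Fails with a division by zero if the PMFs share no bin with
    non-zero probability; a real implementation must handle that case.
    """
    n_bins = len(pmfs[0])
    raw = [
        math.prod(pmf[k] ** w for w, pmf in zip(weights, pmfs))
        for k in range(n_bins)
    ]
    norm = sum(raw)
    return [r / norm for r in raw]


# Two toy PMFs that overlap only in the middle bin.
p1 = [0.5, 0.5, 0.0]
p2 = [0.0, 0.5, 0.5]
p_or = linear_aggregation([p1, p2], [1.0, 1.0])      # [0.25, 0.5, 0.25]
p_and = loglinear_aggregation([p1, p2], [1.0, 1.0])  # [0.0, 1.0, 0.0]
h_or = entropy(p_or)    # 1.5 bits
h_and = entropy(p_and)  # 0 bits
```

The toy case shows why the choice of aggregation matters for uncertainty: the weighted sum keeps every bin that any neighbor supports, while the weighted product keeps only the bins all neighbors agree on, giving a sharper (lower-entropy) result.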

Dataset of the study

The folder contains the synthetic observations used in the case study of the paper. Four synthetic 2D spatial datasets with a grid size of 100x100 were generated from known Gaussian processes. We used a rational quadratic kernel as the covariance function, with correlation lengths of 6 and 18 units. For both the short- and long-range fields, a noisy variant was also produced by adding Gaussian white noise with mean 0 and standard deviation 0.5.
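As a rough illustration of the covariance function, the sketch below evaluates a rational quadratic kernel in the form k(d) = (1 + d²/(2αl²))^(-α) and adds Gaussian noise to a list of values. It is plain Python, not the code that produced the fields: the function names are hypothetical, treating the correlation length as the kernel length scale l is an assumption, and the value of α used in the study is not stated here, so `alpha=1.0` is only a placeholder.

```python
import random


def rational_quadratic(distance, length_scale, alpha=1.0):
    """Rational quadratic kernel value for a given separation distance."""
    return (1.0 + distance**2 / (2.0 * alpha * length_scale**2)) ** (-alpha)


def add_white_noise(values, sigma=0.5, seed=0):
    """Return a noisy copy of values with zero-mean Gaussian noise added."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, sigma) for v in values]


# Correlation between two points 6 units apart, for both field types.
k_short = rational_quadratic(6.0, length_scale=6.0)   # short-range field
k_long = rational_quadratic(6.0, length_scale=18.0)   # long-range field

noisy = add_white_noise([0.0, 1.0, 2.0])
```

With these placeholder settings the long-range kernel keeps a higher correlation at the same distance, which is the property that distinguishes the two field types.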

The generated sets comprise:

  • SR0: short-range field without noise
  • SR1: short-range field with noise
  • LR0: long-range field without noise
  • LR1: long-range field with noise

We randomly shuffled the data and then divided it into three mutually exclusive sets: one to generate the calibration subsets (sizes of 200, 400, 600, 800, 1000, 1500, and 2000), one for validation (containing 2000 data points), and one for testing (another 2000 data points).
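The split described above can be sketched as follows. This is a Python illustration only: the function name `split_indices`, the seed, the order in which the three sets are cut from the shuffled indices, and the choice of nested calibration subsets (each size taking the first n entries of the calibration pool) are assumptions, not details confirmed by the study.

```python
import random


def split_indices(n_points, n_val, n_test, seed=0):
    """Shuffle all indices and cut them into three disjoint sets."""
    indices = list(range(n_points))
    random.Random(seed).shuffle(indices)
    idx_val = indices[:n_val]
    idx_test = indices[n_val:n_val + n_test]
    cal_pool = indices[n_val + n_test:]  # remainder feeds the calibration subsets
    return cal_pool, idx_val, idx_test


sample_sizes = [200, 400, 600, 800, 1000, 1500, 2000]
cal_pool, idx_val, idx_test = split_indices(100 * 100, n_val=2000, n_test=2000)

# Assumption: calibration subsets are nested prefixes of the pool.
cal_subsets = {n: cal_pool[:n] for n in sample_sizes}
```

Because all three sets come from a single shuffle, no grid point can appear in more than one of them.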

Each dataset file contains:

  • idx_rand_full: index of the randomly shuffled data (same for all files)
  • sample_size: all calibration set sizes available for the dataset (same for all files)
  • data: matrix with z values of the full generated dataset
  • txt: dataset type (SR0, SR1, LR0, LR1)
  • idx_cal: index of the calibration set
  • idx_val: index of the validation set
  • idx_test: index of the test set
  • x: matrix with x coordinates of the full dataset
  • x_cal: vector with x coordinates of the calibration set (x_cal=x(idx_cal))
  • x_val: vector with x coordinates of the validation set (x_val=x(idx_val))
  • x_test: vector with x coordinates of the test set (x_test=x(idx_test))
  • y: matrix with y coordinates of the full dataset
  • y_cal: vector with y coordinates of the calibration set (y_cal=y(idx_cal))
  • y_val: vector with y coordinates of the validation set (y_val=y(idx_val))
  • y_test: vector with y coordinates of the test set (y_test=y(idx_test))
  • z: matrix with z values of the full generated dataset (z=data)
  • z_cal: vector with z values of the calibration dataset (z_cal=z(idx_cal))
  • z_val: vector with z values of the validation dataset (z_val=z(idx_val))
  • z_test: vector with z values of the test dataset (z_test=z(idx_test))
  • dim_cal: size of the calibration set (dim_cal=length(idx_cal))
  • dim_val: size of the validation set (dim_val=length(idx_val))
  • dim_test: size of the test set (dim_test=length(idx_test))
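Relations such as x_cal=x(idx_cal) use MATLAB indexing into a matrix. When reading the files outside MATLAB, it may help to remember that MATLAB linear indices are 1-based and run down the columns first. The Python sketch below shows that conversion on a toy matrix; it assumes the idx_* fields hold linear indices (rather than logical masks), which has not been checked against the files, and the helper name is hypothetical.

```python
def matlab_linear_index(matrix, idx):
    """Return matrix(idx) as MATLAB would: 1-based, column-major order."""
    n_rows = len(matrix)
    row = (idx - 1) % n_rows
    col = (idx - 1) // n_rows
    return matrix[row][col]


# Toy 2x3 matrix stored as a list of rows; in MATLAB, x(:)' is [1 4 2 5 3 6].
x = [[1, 2, 3],
     [4, 5, 6]]
idx_cal = [1, 2, 6]
x_cal = [matlab_linear_index(x, i) for i in idx_cal]
```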

The synthetic field generator, using Gaussian processes, is available in scikit-learn (Pedregosa et al., 2011), while the code producing the fields can be found at https://github.com/mmaelicke/random_fields.

Contact

Stephanie Thiesen | [email protected]
Uwe Ehret | [email protected]
