
Basin and timestep information for error messages and warnings #723

Open
SnowHydrology opened this issue Feb 7, 2024 · 11 comments
Labels
enhancement New feature or request

Comments

@SnowHydrology
Contributor

The hydrologic models running in NextGen may fail for many reasons. Some of them have error messages that print to the console, which is helpful for diagnosing model-specific issues, but not for determining where and when the model failed.

Current behavior

Error messages and warnings are written to the console without location or time information.

Expected behavior

Prepend basin and timestep information to warning and error messages. Something like:

cat-<id>, timestep <model-timestep>: <error-message>
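A prefix in that shape could come from a small helper; a minimal Python sketch (`format_model_message` is a hypothetical name, not existing ngen code):

```python
def format_model_message(catchment_id: str, timestep: int, message: str) -> str:
    """Prefix a module's error/warning text with location and time context."""
    return f"cat-{catchment_id}, timestep {timestep}: {message}"

# format_model_message("67", 42, "soil moisture out of range")
# -> "cat-67, timestep 42: soil moisture out of range"
```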

@SnowHydrology added the enhancement label on Feb 7, 2024
@PhilMiller
Contributor

I strongly agree that having that information would improve our ability to isolate, diagnose, and fix issues within realization modules.

As a matter of implementation, prefixing the error message uniformly might be pretty hard without close cooperation with all the formulation implementers. For C++ and Python formulations, if they throw an exception, we can catch it and handle it accordingly. For all languages, if they return BMI_FAILURE from the call to update() or update_until(), we can print the basin and timestep after whatever the module itself may have printed.

In theory, we could add verbose debug printing to emit basin, timestep, and module name before each call we make, but that would only be affordable for runs where the user knows they need it, because printing is expensive (relatively speaking).
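The two mechanisms described above (reacting to a returned BMI_FAILURE or a thrown exception, plus an optional pre-call trace) could look roughly like the following Python sketch. `advance_model`, the status constants, and the fake model interface are all illustrative assumptions, not actual ngen code:

```python
BMI_SUCCESS, BMI_FAILURE = 0, 1  # status codes following the BMI C convention

def advance_model(model, catchment_id: str, timestep: int, verbose: bool = False) -> int:
    """Call a module's update() and attach basin/timestep context to any failure.

    `model` is assumed to expose an update() that either raises or returns a
    BMI status code; this is a sketch, not the actual ngen driver.
    """
    if verbose:
        # Optional pre-call trace, only for debugging runs (printing is costly).
        print(f"cat-{catchment_id}, timestep {timestep}: calling update()")
    try:
        status = model.update()
    except Exception as exc:
        print(f"cat-{catchment_id}, timestep {timestep}: update() raised: {exc}")
        return BMI_FAILURE
    if status == BMI_FAILURE:
        # The module may have already printed its own message; add context after it.
        print(f"cat-{catchment_id}, timestep {timestep}: update() returned BMI_FAILURE")
    return status
```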

@hellkite500
Member

I've been looking at possible mechanisms for intercepting exit calls from the loaded library modules that may allow us to unwind the stack and provide additional context -- this would let us shut down more gracefully when a loaded module calls exit() or, in Fortran, stop or abort.

@PhilMiller
Contributor

It would be pretty easy to at least set some variables recording the relevant debugging context, and add an atexit() handler to print them if exit() is called unexpectedly. I think that's a pretty good idea.

I would be really hesitant to intercept exit() or the like.

I did a slight bit of looking, and it may be workable to throw an exception in an atexit() handler, to skip directly exiting. That feels super sketchy to me, but ...

@SnowHydrology
Contributor Author

@PhilMiller For the legacy models that we've updated and implemented with BMI (e.g. Noah-OWP-Modular), we have a lot of flexibility in modernizing code. If there's anything specific we should do to make the models more framework-friendly, please let us know.

@ajkhattak
Contributor

@hellkite500 I was thinking of a simpler workflow, along the lines of what Keith and Phil described -- verbosity at the ngen level, with or without getting any information from the models.

For each timestep, before calling update or update_until, ngen should print metadata about that timestep, such as: catchment_id, nexus_id, timestep, model, inputs_to_model, etc. Then, when ngen calls a model to advance its timestep and the model crashes for some reason, we will already know the location (catchment, nexus) and the model that caused the crash. Once we know that, we can run the model in standalone mode for further debugging. Since printing is expensive, this should be optional, controlled through verbosity options in the realization file.

Other options could be based on BMI_FAILURE (as Phil mentioned), or each model could implement its own ValidTimeStep() function to check whether the model has taken a successful timestep. I think this will require a little bit of work, but it's doable...
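The ValidTimeStep() idea could live entirely inside the model, as in this hypothetical Python sketch (SketchModel and its fields are invented for illustration; a real model would check its actual state variables at the end of each update):

```python
import math

BMI_SUCCESS, BMI_FAILURE = 0, 1  # status codes following the BMI C convention

class SketchModel:
    """Toy model whose update() validates its own state before returning."""

    def __init__(self):
        self.storage_mm = 10.0
        self.timestep = 0

    def _valid_timestep(self) -> bool:
        # Model-internal check, not a BMI function: state must stay finite
        # and physically plausible after each advance.
        return math.isfinite(self.storage_mm) and self.storage_mm >= 0.0

    def update(self) -> int:
        self.timestep += 1
        self.storage_mm -= 1.0  # stand-in for the real physics
        return BMI_SUCCESS if self._valid_timestep() else BMI_FAILURE
```

With this pattern, ngen only ever sees the standard BMI status code, so no new function is added to the BMI interface itself.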

@PhilMiller
Contributor

I think the friendliest thing to do on the model side would be to make sure that the call returns BMI_FAILURE, possibly with a preceding print of the nature of the error.

For languages that can throw an exception, that's perfectly fine too, since those are cheap and easy to catch and can be handled the same way as a returned BMI_FAILURE.

I don't think we can do a ValidTimeStep() sort of thing, since that would be a function added to BMI, breaking all sorts of compatibility constraints.

@SnowHydrology
Contributor Author

@PhilMiller, thanks for that. Below are my assumptions for implementing this in Noah-OWP-Modular for the error @yuqiong77 saw (linked here). Please correct anything that's wrong:

  • In this case, we would remove the STOP from the Noah-OWP-Modular code
  • Save the error message and set an error flag
  • Add code to bmi::update that checks the error flag, writes the message, and returns BMI_FAILURE

Or something like that.
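Noah-OWP-Modular is Fortran, but the flag-instead-of-STOP pattern in those steps is language-neutral; here is a hypothetical Python sketch of it (all names and the example error are invented for illustration):

```python
BMI_SUCCESS, BMI_FAILURE = 0, 1  # status codes following the BMI C convention

class ModelState:
    """Stand-in for internal model state, with a flag replacing a hard STOP."""

    def __init__(self):
        self.error_flag = False
        self.error_message = ""

    def run_physics(self):
        # Where the legacy code would have called STOP, record the error instead.
        check_passed = False  # pretend some internal consistency check failed
        if not check_passed:
            self.error_flag = True
            self.error_message = "snow layer count out of bounds"

class BmiWrapper:
    """Sketch of the BMI layer that surfaces the flag as BMI_FAILURE."""

    def __init__(self):
        self.state = ModelState()

    def update(self) -> int:
        self.state.run_physics()
        if self.state.error_flag:
            print(self.state.error_message)  # framework can add basin/timestep context
            return BMI_FAILURE
        return BMI_SUCCESS
```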

@ajkhattak
Contributor

@SnowHydrology if I am understanding this correctly, won't removing STOP impact the functionality in standalone mode?

@PhilMiller sorry if I was not clear enough: ValidTimeStep() would be part of the model, not of BMI. It would be called at the end of update or update_until to set the status, and BMI_FAILURE would be returned to ngen... does that make sense?

@SnowHydrology
Contributor Author

@ajkhattak It shouldn't matter in the case of Noah-OM because we run the main program using BMI commands. So if we implement BMI_FAILURE correctly, it should exit the update routine.

But I haven't tested it, so the "shoulds" might not correspond to reality.

@PhilMiller
Contributor

That all sounds good to me.

@SnowHydrology
Contributor Author

@GreyEvenson-NOAA Can you open an issue on Noah-OWP-Modular to test this new implementation of error handling?
