Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Why is the /diagnostics_toplevel_state ERROR when one of the diagnostics is STALE #297

Open
Rayman opened this issue Apr 18, 2023 · 7 comments
Labels
ros1 PR tackling a ROS1 branch ros2 PR tackling a ROS2 branch

Comments

@Rayman
Copy link
Contributor

Rayman commented Apr 18, 2023

In the current code the toplevel state is only STALE when ALL the diagnostics are STALE. Example:

group_analyzer

What I think is more logical is that the state is STALE when one of the diagnostics is STALE and none of them ERROR:
group_analyzer2

What do you think about this logic? I've implemented this in my fork, but it is a breaking change

@g-gemignani
Copy link
Collaborator

Hi @Rayman and thank you for your comment. I agree with you that this makes more sense but, as you stated, this is a breaking change so for Noetic I do not believe it can be merged as is. I am not sure how the ros2 maintainers would see this change for the ros2 version...

@ct2034
Copy link
Collaborator

ct2034 commented Apr 24, 2023

Hi @Rayman. Thanks for your suggestion. Just to be clear: The stale state is set by the generic analyzer if no message was received within a given timeout:

stale = (clock_->now() - item->getLastUpdateTime()).seconds() > timeout_;

Which can be useful information on the actuality of a state.

If I get it correctly, your suggestion is to treat it in aggregation like the other levels and aggregate it in the group. There, I would honestly have a hard time to rate it in severity between the other levels. Currently, it is level 3 which reads as the highest priority. This means you would only see stale on the highest level, even if another item in that group is in the error state. This can not be the intended behavior. Changing these levels would be a SERIOUS breaking change.

What is your take on that?

@ct2034 ct2034 added ros2 PR tackling a ROS2 branch ros1 PR tackling a ROS1 branch labels Apr 24, 2023
@Rayman
Copy link
Contributor Author

Rayman commented Apr 24, 2023

The toplevel state is not just the maximum of all the levels. Its calculated with the following algorithm

if maximum_level > ERROR and minimum_level <= ERROR
	# one or more STALE, but not all of them
	level = ERROR
else:
	level = maximum_level

I would propose to change this to the following, because I think it's more logical.

if maximum_level == STALE and maximum_level_without_stale < ERROR
	# one or more STALE, but no errors
	level = STALE
else:
	level = maximum_level_without_stale

This will be the difference between the two algorithms:

diagnostic1 diagnostic2 current proposed
stale ok error stale
stale warn error stale
stale error error error
stale stale stale stale

@asymingt
Copy link
Contributor

asymingt commented Sep 20, 2023

What I find counter-intuitive about the current behavior is that if you have three leaf diagnostics rolled up into a group, the discard_stale doesn't seem to have an impact on the parent status. For example, if bar and baz in the example below go stale, but foo is OK, I would intuitively think that the part group should also be OK. However, what I'm seeing is that foo currently gets marked as ERROR.

diagnostics_aggregator:
  ros__parameters:
    pub_rate: 1.0
    path: 'robot'
    analyzers:
      part:
        type: 'diagnostic_aggregator/AnalyzerGroup'
        path: 'part'
        foo:
          type: 'diagnostic_aggregator/GenericAnalyzer'
          path: 'foo'
          find_and_remove_prefix: ['/foo:']
          num_items: 1
        bar:
          type: 'diagnostic_aggregator/GenericAnalyzer'
          path: 'bar'
          find_and_remove_prefix: ['/foo:']
          discard_stale: true
        baz:
          type: 'diagnostic_aggregator/GenericAnalyzer'
          path: 'baz'
          find_and_remove_prefix: ['/baz:']
          discard_stale: true

@Rayman
Copy link
Contributor Author

Rayman commented Sep 21, 2023

I did not want to propose to merge a breaking change in noetic, so I've implemented my proposed change in our fork: https://github.com/nobleo/diagnostics. Feel free to use it

I've implemented the change for the toplevel diagnostics and for the AnalyserGroup.

@asymingt
Copy link
Contributor

asymingt commented Sep 21, 2023

I also added my implementation here: #315

@Timple
Copy link
Contributor

Timple commented Jan 19, 2024

Since this was a breaking change for Noetic, it probably also is for Humble and Iron?

Does it make sense to put some effort into this before ROS Jazzy?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ros1 PR tackling a ROS1 branch ros2 PR tackling a ROS2 branch
Projects
None yet
Development

No branches or pull requests

5 participants