DEDUP not Working for HOST #146

Open
sdouce opened this issue May 5, 2023 · 4 comments

sdouce commented May 5, 2023

Hello,

I have a strange issue when activating DEDUP on the Stream Connector on Centreon 22.10 and 21.04.
It works perfectly when dedup is activated for SERVICES, but not for HOSTS.

Nothing is sent.

I noticed this behavior when polling:
[screenshot: check results from the poller]
First check gives 1/2 (S)
Second check gives 2/2 (H)
Third check gives 1/2 (H)

Logs:
[screenshot: logs]

When I set my host back to green (by changing the IP), it should be a state change!
Logs:
[screenshot: logs]

And when I disable dedup for hosts, I do see the stream connector sending HOST events.

Do not hesitate to contact me or ask questions.

sdouce commented May 5, 2023

I simulated an OK return for the HOST:
First occurrence:
[1683316810] [14499] HOST: xxxxxx, ATTEMPT=1/2, CHECK TYPE=ACTIVE, STATE TYPE=HARD, OLD STATE=1, NEW STATE=0
[1683316810] [14499] Host was DOWN/UNREACHABLE.
[1683316810] [14499] Host experienced a HARD recovery (it's now UP).

Second occurrence:
[1683316875] [14499] HOST: nanul11.eus.cloud.cheops.fr, ATTEMPT=1/2, CHECK TYPE=ACTIVE, STATE TYPE=HARD, OLD STATE=0, NEW STATE=0
[1683316875] [14499] Host was UP.
[1683316875] [14499] Host is still UP.

The stream connector log always says the same thing:
[sc_event:is_host_status_event_duplicated]: host_id: 1764 is sending a duplicated event. Dedup option (enable_host_status_dedup) is set to: 1

sdouce commented May 9, 2023

OK, I may have found a solution for host dedup. I would like confirmation that it looks good to you.

In sc_event.lua, in the function ScEvent:is_host_status_event_duplicated(), on line 1102, I replaced:

if self.event.last_hard_state_change == self.event.last_check or self.event.last_hard_state_change == self.event.last_update then

with:

if self.event.last_hard_state_change == self.event.last_time_down or self.event.last_hard_state_change == self.event.last_time_up or self.event.last_hard_state_change == self.event.last_time_unreachable then

I tested UP and DOWN with dedup and it works. (I also added the unreachable timestamp with an or, just in case.)
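
For readability, here is a minimal sketch of how that revised condition could sit inside ScEvent:is_host_status_event_duplicated(). The surrounding structure (the self.params.enable_host_status_dedup guard and the final return) is an assumption for illustration; only the condition itself comes from the change described above.

-- sketch only: the condition is the proposed change, everything around it is illustrative
function ScEvent:is_host_status_event_duplicated()
  -- when dedup is disabled, never treat the event as a duplicate
  if self.params.enable_host_status_dedup ~= 1 then
    return false
  end

  -- a genuine hard state change: last_hard_state_change matches one of the
  -- per-state timestamps the engine updates
  if self.event.last_hard_state_change == self.event.last_time_down
    or self.event.last_hard_state_change == self.event.last_time_up
    or self.event.last_hard_state_change == self.event.last_time_unreachable then
    return false
  end

  -- otherwise the event repeats the last known hard state: treat it as duplicated
  return true
end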

sdouce commented May 10, 2023

And for services, I changed it to:

if self.event.current_state ~= 0 then
  -- non-OK hard state: only a real change if last_hard_state_change matches
  -- one of the non-OK per-state timestamps
  if self.event.last_hard_state_change == self.event.last_time_critical or self.event.last_hard_state_change == self.event.last_time_warning or self.event.last_hard_state_change == self.event.last_time_unknown then
    --self.sc_logger:warning("going through the KO branch:")
    return false
  end
else
  -- OK state: only a real change if last_hard_state_change matches last_time_ok
  --self.sc_logger:warning("going through the OK branch:")
  if self.event.last_hard_state_change == self.event.last_time_ok then
    return false
  end
end

tanguyvda commented May 11, 2023

Here are some tests I have done without changing the code:

for a service:

[number] last_time_critical: 0
[number] last_time_ok: 0
[number] last_time_unknown: 1682516859
[number] last_time_warning: 1683818636
[number] last_hard_state_change: 1683818636
[number] last_check: 1683818636

last_check is equal to last_hard_state_change, so it is a new event and not a duplicated one. Based on your tests, we can also notice that we could have a much more specific condition based on last_time_[warning|unknown|ok|critical].

for a host:

[number] last_time_unreachable: 0
[number] last_time_up: 1683818385
[number] last_time_down: 1683818425
[number] last_hard_state_change: 1683818425
[number] last_check: 1683818423

And here lies the issue: I can assure you that my host went from another state to DOWN, but last_check is not equal to last_hard_state_change. The event is therefore considered a duplicate, which it is not.

In that case, the only solution is to check each last_time_[up|down|unreachable] timestamp and see if it matches last_hard_state_change, which is what you've done.
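
For illustration, one possible shape of such a per-state timestamp check for hosts. The helper name and the lookup table are hypothetical; the current_state field name is borrowed from the service snippet above, and the last_time_* fields come from the event dumps in this thread.

-- hypothetical helper: map each hard host state to the last_time_* field the
-- engine updates when that state is entered (0 = UP, 1 = DOWN, 2 = UNREACHABLE)
local host_state_timestamp = {
  [0] = "last_time_up",
  [1] = "last_time_down",
  [2] = "last_time_unreachable"
}

-- the event is a real state change if last_hard_state_change matches the
-- timestamp of the state the host is currently in
local function is_new_host_state(event)
  local field = host_state_timestamp[event.current_state]
  return field ~= nil and event.last_hard_state_change == event[field]
end

A similar table for services (last_time_ok, last_time_warning, last_time_critical, last_time_unknown) would cover the service case the same way.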

conclusion

While it was easy to reproduce for hosts, it didn't happen for services. That doesn't mean it can't happen, so I'm going to switch to a more reliable method.

Thanks a lot for your feedback. I'll keep you updated.
