Intermittent replication errors when running ipa-healthcheck #283

Open
Kivernitas opened this issue Jan 3, 2023 · 9 comments

@Kivernitas

Issue

Intermittent replication errors when running ipa-healthcheck.
Running ipa-healthcheck every x minutes gives unreliable ReplicationCheck results.
From what I've read at https://access.redhat.com/solutions/359683, getting a "replica is busy" status is considered normal.
This makes it difficult to monitor for actual replication errors.

Actual behaviour

  {
    "source": "ipahealthcheck.ds.replication",
    "check": "ReplicationCheck",
    "result": "ERROR",
    "uuid": "94548c4b-ca49-4f8a-bd2e-1953fba9f767",
    "when": "20230103141508Z",
    "duration": "0.304435",
    "kw": {
      "key": "DSREPLLE0003",
      "items": [
        "Replication",
        "Agreement"
      ],
      "msg": "The replication agreement (ipa-2.test.io-to-ipa-3.test.io) under \"dc=test,dc=io\" is not in synchronization.\nStatus message: error (1) can't acquire busy replica (unable to acquire replica: the replica is currently being updated by another supplier.)"
    }
  }

Errors similar to the one above occur intermittently on every FreeIPA server in a 3-node cluster.
There aren't any replication errors most of the time.

Expected behavior

It should not report an error.
A warning would be more suitable.

Version/Release/Distribution

Rocky Linux 8.6
Source : ipa-healthcheck-0.7-14.module+el8.7.0+1075+05db0c1d.src.rpm (latest available)
FreeIPA: 4.9

@rcritten
Collaborator

rcritten commented Jan 5, 2023

This check is provided by 389 itself. I suppose we could consider reducing the severity to WARNING but I'd leave that as a call to them. @mreynolds389 what do you think?

@mreynolds389
Contributor

> This check is provided by 389 itself. I suppose we could consider reducing the severity to WARNING but I'd leave that as a call to them. @mreynolds389 what do you think?

Well it is a transient error. Replication is just busy at that time. If you run it again in a few seconds it will probably pass. For us we already set it to a "medium" severity.
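Since the error usually clears on a re-run, one way to act on that is to re-check before reporting. The following is a minimal sketch, not part of ipa-healthcheck or lib389: `confirmed_failures`, `is_busy_replica`, and the `run_check` callable are hypothetical names, and the result shape is assumed to match the JSON object shown in the issue.

```python
import time
from typing import Callable, Dict, List

# Substring taken from the status message quoted in the issue.
BUSY_MARKER = "can't acquire busy replica"


def is_busy_replica(result: Dict) -> bool:
    """Return True for an ERROR result whose message carries the busy-replica text."""
    msg = result.get("kw", {}).get("msg", "")
    return result.get("result") == "ERROR" and BUSY_MARKER in msg


def confirmed_failures(
    run_check: Callable[[], List[Dict]],
    retries: int = 2,
    delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Dict]:
    """Re-run the check while it reports a busy replica, then return non-SUCCESS results.

    run_check is a hypothetical callable returning a list of result dicts;
    how it obtains them (for example by invoking ipa-healthcheck) is left open.
    """
    results = run_check()
    for _ in range(retries):
        if not any(is_busy_replica(r) for r in results):
            break  # nothing transient to wait out
        sleep(delay)
        results = run_check()
    return [r for r in results if r.get("result") != "SUCCESS"]
```

With the defaults, a busy replica has to be reported on three consecutive runs spread over about ten seconds before it is returned as a failure. Any other error is returned after the first run.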

@Kivernitas
Author

Thanks both for replying!

Yes it's a transient error. We run ipahealthcheck_exporter which basically scrapes ipa-healthcheck logs every 5 minutes. Can you suggest an alternative way of verifying replication health?

@mreynolds389 you mentioned you set it to "medium" severity, could I ask how?

@mreynolds389
Contributor

> Thanks both for replying!
>
> Yes it's a transient error. We run ipahealthcheck_exporter which basically scrapes ipa-healthcheck logs every 5 minutes. Can you suggest an alternative way of verifying replication health?
>
> @mreynolds389 you mentioned you set it to "medium" severity, could I ask how?

Well IPA is using DS's lib389 library for the DS healthchecks. IPA does not use DS's healthcheck severity level - it is ignored because there are basically two tools that were merged.

@rexberg

rexberg commented May 15, 2024

@rcritten Since IPA does not use DS's healthcheck severity level, could this check's severity level be lowered to WARNING in IPA?

@rcritten
Collaborator

healthcheck doesn't ignore the DS severity. It converts it. See #283 (comment)

"medium" from DS is converted into an ipa-healthcheck ERROR severity.

@rexberg

rexberg commented May 15, 2024

> healthcheck doesn't ignore the DS severity. It converts it. See #283 (comment)
>
> "medium" from DS is converted into an ipa-healthcheck ERROR severity.

Thanks for clarifying. Do we want to set this specific check's severity to WARNING, bypassing the conversion? As mentioned, it is a transient error but it is still triggering an ERROR severity.

@rcritten
Collaborator

I suppose it's possible but it would be an ugly one-off. healthcheck has a rather thin wrapper to call the 389 checks and then re-format the return value. It's very generic code. It would be invasive to put in a test for a specific check.
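To illustrate the trade-off, here is a rough sketch of a generic severity conversion with an optional per-key override. It is not the actual ipa-healthcheck wrapper code: `convert_severity`, `DS_TO_IPA`, and `OVERRIDES` are hypothetical names, and only the "medium" to ERROR pairing comes from this thread. The other table entries are assumptions.

```python
from typing import Dict, Optional

# Only "medium" -> "ERROR" is stated in this thread.
# The "high" and "low" entries are assumptions for illustration.
DS_TO_IPA: Dict[str, str] = {
    "high": "CRITICAL",
    "medium": "ERROR",
    "low": "WARNING",
}

# Hypothetical one-off: force a specific DS result key to a fixed severity.
OVERRIDES: Dict[str, str] = {"DSREPLLE0003": "WARNING"}


def convert_severity(
    ds_severity: str,
    key: Optional[str] = None,
    overrides: Optional[Dict[str, str]] = None,
) -> str:
    """Map a DS severity to an ipa-healthcheck-style severity.

    The generic path is a plain table lookup. The override path is the
    check-specific special case that the generic code would have to carry.
    """
    if overrides and key in overrides:
        return overrides[key]
    # Unknown DS severities fall back to ERROR (an assumption of this sketch).
    return DS_TO_IPA.get(ds_severity.lower(), "ERROR")
```

Note that an override keyed on `DSREPLLE0003` alone would presumably downgrade every "not in synchronization" report for an agreement, not only the busy-replica case, since the key in the issue's output is attached to the agreement status rather than to the specific status message.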

@rexberg

rexberg commented May 16, 2024

I looked at the code and would assume as much and I tend to agree. Currently we exclude this specific check since we can't really "trust" the ERROR trigger.
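An alternative to excluding the whole check is to post-process the report and downgrade only the busy-replica case, so other replication errors still surface as ERROR. This is a sketch under assumptions: the report is taken to be a JSON list of objects shaped like the one in the issue, and `downgrade_busy_replica` is a hypothetical helper, not an ipa-healthcheck feature.

```python
import json
from typing import Dict, List

# Substring taken from the status message quoted in the issue.
BUSY_MARKER = "can't acquire busy replica"


def downgrade_busy_replica(results: List[Dict]) -> List[Dict]:
    """Return a copy of the results with busy-replica ERRORs rewritten as WARNING."""
    out = []
    for r in results:
        kw = r.get("kw", {})
        if (
            r.get("source") == "ipahealthcheck.ds.replication"
            and r.get("check") == "ReplicationCheck"
            and r.get("result") == "ERROR"
            and kw.get("key") == "DSREPLLE0003"
            and BUSY_MARKER in kw.get("msg", "")
        ):
            r = {**r, "result": "WARNING"}  # shallow copy; the input is left untouched
        out.append(r)
    return out


# Usage with a trimmed-down report based on the output in the issue.
report = json.loads("""
[{"source": "ipahealthcheck.ds.replication",
  "check": "ReplicationCheck",
  "result": "ERROR",
  "kw": {"key": "DSREPLLE0003",
         "msg": "Status message: error (1) can't acquire busy replica"}}]
""")
filtered = downgrade_busy_replica(report)
```

Matching on the message text is brittle if the wording changes between 389 DS versions, so the marker string would need to be checked against the installed version.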
