Fix database purging of stale activewatchers #2519

Open
wdoekes wants to merge 2 commits into master from fix-database-purging-of-stale-activewatchers
Conversation

wdoekes (Contributor) commented May 11, 2021

This changeset contains two commits.

(1) Add IS NULL support to the DB layer (it's ugly).

(2) Ensure that activewatcher records whose sharing_tag IS NULL get pruned; this has been broken since 929ab4d (or 6022968 for 2.4.x).

Without this changeset, people using clustering but no sharing tags accumulate stale records in the database.

See the commit messages for details.

@wdoekes wdoekes force-pushed the fix-database-purging-of-stale-activewatchers branch from acd309a to addbbc9 Compare May 11, 2021 11:31
wdoekes added a commit to wdoekes/opensips that referenced this pull request May 11, 2021
…ck2db)

When clustering sharing_tags were added to presence, they were added to
the fallback2db "on" case only:

There are a couple of dimensions with differing behaviours:

                +------------+------------+
                | fallback2- | fallback2- |
                | -db = on   | -db = off  |
  +-clustering:-+------------+------------+
  | - no        | OK         | OK         |
  | - tagless   | PR-2519    | PR-2519    |
  | - active    | OK         | this       |
  +-------------+------------+------------+

The non-OK behaviour above refers to the activewatcher table getting
filled up with stale/expired items.

fallback2db on or off:
  ```
  modparam("presence", "fallback2db", 0)  # or 1=on
  ```

The no-clustering case:
  ```
  handle_subscribe();
  ```

The tagless case:
  ```
  modparam("presence", "cluster_id", 1)
  modparam("clusterer", "my_node_id", 2)

  handle_subscribe();
  ```

The active case:
  ```
  modparam("presence", "cluster_id", 1)
  modparam("clusterer", "my_node_id", 2)
  modparam("clusterer", "sharing_tag", "node2/1=active")

  handle_subscribe("0", "node2");
  ```

Where PR OpenSIPS#2519 fixes the tagless case, this PR fixes the fallback2db=0
case by writing the sharing_tag to the database so the records can get
found and cleaned up.

(Sidenote: subscriptions which ended with a timeout or 481 *would* get
cleaned up. This makes sense in all cases: if they have an error before
their expiry, it makes sense to purge them from the DB immediately. And
it's not a problem if the periodic cleanup had already removed those
records.)
wdoekes added a commit to wdoekes/opensips that referenced this pull request May 11, 2021
wdoekes added a commit to wdoekes/opensips that referenced this pull request May 11, 2021
wdoekes (Contributor, Author) commented May 12, 2021

(This one might be for you @bogdan-iancu 😉 . See also PR #2520.)

@bogdan-iancu bogdan-iancu self-assigned this Jul 6, 2021
stale bot commented Jul 21, 2021

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

@stale stale bot added the stale label Jul 21, 2021
If you try to prepare a MySQL statement containing "... column IS ?",
you get an error.

For MySQL, you could fix this by using the "... column <=> ?" special
operator ('IS NOT DISTINCT FROM'), but that's not standard SQL.

Instead, you can now do:

    update_cols[0] = &some_column;
    update_ops[0] = OP_IS_NULL;
    update_vals[0].nul = 1;

You still need to set .nul=1 because prepared statements have to consume
a single value.
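The underlying issue is standard SQL three-valued logic: comparing anything to NULL with `=` yields NULL, never TRUE, so binding a NULL value into `column = ?` can never match the tagless rows. A minimal sketch of this (using Python's sqlite3 with a made-up table; illustrative only, not OpenSIPS code or the module's actual schema):

```python
# Illustration of SQL three-valued logic: why "column = ?" with a NULL
# bind value can never match NULL rows, so a dedicated IS NULL operator
# is needed for the purge query.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE active_watchers (id INTEGER, sharing_tag TEXT)")
conn.executemany("INSERT INTO active_watchers VALUES (?, ?)",
                 [(1, "node2"), (2, None), (3, None)])

# Binding NULL into "= ?" matches nothing: NULL = NULL evaluates to NULL.
rows = conn.execute(
    "SELECT id FROM active_watchers WHERE sharing_tag = ?",
    (None,)).fetchall()
print(rows)  # []

# An explicit IS NULL predicate does match the tagless rows.
rows = conn.execute(
    "SELECT id FROM active_watchers WHERE sharing_tag IS NULL "
    "ORDER BY id").fetchall()
print(rows)  # [(2,), (3,)]
```

The same holds in MySQL, which is why a plain equality operator cannot express this condition in a prepared statement.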
If you're using clustering without tags (or with tags, but without
fallback2db), you end up with a lot of active watchers in the database
that never get cleaned up.

Before 929ab4d:

- all stale activewatcher records in the db got purged.

After:

- if you're not clustering, all stale activewatcher records got purged;

- if you are, only those with the sharing_tag set.

However, the sharing tag is not necessarily set:

- it is an optional argument to handle_subscribe;

- and setting it in handle_subscribe does not write to the database
  because of missing code in the fallback2db==0 bit;

- and even if it were set, adding the argument to handle_subscribe
  does not "activate" the sharing tag.

(Also interesting to know: a 408 or 481 in response to the
this-subscription-is-expired NOTIFY _would_ cause the individual record
to be deleted. But any other response, including 200, leaves the record
to be handled by the periodic purge.)

This changeset reverts parts of the aforementioned commit by always
purging stale records whose sharing_tag is NULL, restoring the
pre-3.1.0 and pre-2.4.8 behaviour.
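The effect of the fix can be sketched in plain SQL (an illustration with a made-up schema and sample rows, not the module's actual query): stale tagless rows are caught by an explicit IS NULL branch alongside the tag match.

```python
# Sketch of the purge behaviour before and after the fix (illustrative
# schema and data only): previously only rows with a matching
# sharing_tag were deleted; with the IS NULL branch, stale tagless rows
# are purged as well.
import sqlite3

now = 1000
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE active_watchers"
    " (id INTEGER, expires INTEGER, sharing_tag TEXT)")
conn.executemany("INSERT INTO active_watchers VALUES (?, ?, ?)", [
    (1, 900,  "node2"),  # stale, tagged
    (2, 900,  None),     # stale, tagless -- previously never purged
    (3, 2000, None),     # still valid
])

# Old purge condition: only tagged stale rows match.
old = conn.execute(
    "SELECT id FROM active_watchers WHERE expires < ? AND sharing_tag = ?",
    (now, "node2")).fetchall()
print(old)  # [(1,)]

# Fixed purge: also delete stale rows whose sharing_tag IS NULL.
conn.execute(
    "DELETE FROM active_watchers "
    "WHERE expires < ? AND (sharing_tag = ? OR sharing_tag IS NULL)",
    (now, "node2"))
left = conn.execute("SELECT id FROM active_watchers ORDER BY id").fetchall()
print(left)  # [(3,)]
```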
@wdoekes wdoekes force-pushed the fix-database-purging-of-stale-activewatchers branch from addbbc9 to a6a22c9 Compare July 26, 2021 12:43
wdoekes (Contributor, Author) commented Jul 26, 2021

Rebased so the conflicts with 87f2416 are gone 🎉

wdoekes added a commit to wdoekes/opensips that referenced this pull request Jul 26, 2021
hafkensite (Contributor) commented:
Any reason not to merge this request?

I'm asking because our cluster is experiencing this issue, resulting in a lot of database stress and maintenance to prevent further issues.

stale bot commented Jan 9, 2022

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

@stale stale bot added the stale label Jan 9, 2022
bogdan-iancu (Member) commented:
OK, as a start, we need to evaluate this "is NULL" support ...or to see if there is no way around it... Maybe changing the DB schema, rather than complicating the code...

@stale stale bot removed the stale label Jan 10, 2022
stale bot commented Apr 16, 2022

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

@stale stale bot added the stale label Apr 16, 2022
3 participants