Decommission operation timeout is too small for DB cluster with enabled tablets feature #8855

vponomaryov · 2024-09-26T19:26:40Z

Issue description

The disrupt_decommission_streaming_err nemesis times out in a test with tablets enabled:

2024-09-21 07:32:46.414: (ClusterHealthValidatorEvent Severity.CRITICAL) period_type=one-time \
    event_id=2d16b283-0ee6-4651-9862-d60efd70c2d3: type=NodeStatus \
    node=longevity-large-partitions-200k-pks-db-node-b866c292-0-2 \
    error=Current node Node longevity-large-partitions-200k-pks-db-node-b866c292-0-2 [34.148.211.120 | 10.142.0.128]. \
    Node Node longevity-large-partitions-200k-pks-db-node-b866c292-0-5 [35.185.64.144 | 10.142.0.139] \
    (DecommissionStreamingErr nemesis target node) status is UL

Argus:

Judging by the cluster state, the decommission was progressing normally the whole time, right up until the timeout was reached.

So I increased the timeout and ran another test here: scylla-staging/valerii/vp-longevity-large-partition-200k-pks-4days-gce-test#4

After the timeout increase, the nemesis passed:

Note that in the passing run, the add-node operation that follows the decommission started at 11:39.
So the decommission under heavy write load took about 2h15m, whereas the current timeout is 1h20m.
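
To make the gap concrete, here is a small sketch of the arithmetic using only the two durations quoted above (2h15m observed, 1h20m configured). The variable names are illustrative and not taken from SCT.

```python
from datetime import timedelta

# Values quoted in this issue; the names are illustrative, not SCT identifiers.
current_timeout = timedelta(hours=1, minutes=20)
observed_duration = timedelta(hours=2, minutes=15)

# How much longer the decommission ran than the timeout allows.
shortfall = observed_duration - current_timeout

# Dividing two timedeltas yields a plain float ratio.
ratio = observed_duration / current_timeout

print(f"shortfall: {shortfall}")
print(f"observed / timeout: {ratio:.2f}")
```

In other words, the observed run needed roughly 55 minutes more than the current limit, so any new timeout would have to be at least about 1.7 times the present one to cover this case.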

Steps to Reproduce

Set up a DB cluster with tablets enabled
Run a heavy write load with large partitions
Run the disrupt_decommission_streaming_err nemesis

Expected behavior: SCT waits long enough for the decommission to finish

Actual behavior: SCT raises a timeout error too early
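
One possible shape for the expected behavior is a timeout that depends on whether tablets are enabled. The sketch below is purely hypothetical: `decommission_timeout`, `BASE_DECOMMISSION_TIMEOUT`, and `TABLETS_TIMEOUT_MULTIPLIER` are made-up names, and the multiplier of 2.0 is an illustrative guess rather than an SCT setting. Only the 1h20m base and the 2h15m observation come from this issue.

```python
from datetime import timedelta

# Hypothetical constants: the base is the 1h20m timeout quoted in this issue;
# the multiplier is an assumption chosen only to cover the observed 2h15m run.
BASE_DECOMMISSION_TIMEOUT = timedelta(hours=1, minutes=20)
TABLETS_TIMEOUT_MULTIPLIER = 2.0


def decommission_timeout(tablets_enabled: bool) -> timedelta:
    """Return a longer decommission timeout when tablets are enabled."""
    if tablets_enabled:
        return BASE_DECOMMISSION_TIMEOUT * TABLETS_TIMEOUT_MULTIPLIER
    return BASE_DECOMMISSION_TIMEOUT


print(decommission_timeout(tablets_enabled=False))
print(decommission_timeout(tablets_enabled=True))
```

With these assumed values the tablets case gets 2h40m, which would leave some headroom over the 2h15m seen in the passing run. Whether a fixed multiplier is the right approach, as opposed to scaling with data size, is an open question.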

Impact

False negative

How frequently does it reproduce?

100%

Installation details

SCT Version: master
Scylla version (or git commit hash): master/6.3

Logs

test_id: b866c292-32ca-4145-8ec5-910a55c1d92f
job log: scylla-master/tier1/longevity-large-partition-200k-pks-4days-gce-test#36

fruch · 2024-09-29T14:43:32Z

@aleksbykov

Isn't that something you already identified in this case? Has there been any work on fixing it?

github-actions bot assigned vponomaryov Sep 26, 2024

vponomaryov assigned fruch, roydahan and soyacz Sep 26, 2024
