Fix desync after errCatchupTooManyRetries #5939
Draft
+81
−3
Looked into the issue reported in Antithesis of a non-monotonic sequence / pub ack sequence moving back due to stream desync.

When looking at these logs:

In `monitorStream` we call into `applyStreamEntries`, which calls into `processSnapshot` when we know about a snapshot. If the catchup remains stalled after reaching `maxRetries`, the RAFT data is deleted, which means `n.pindex = 0`. Then, when a leader election comes around, this server with missing RAFT data would allow an outdated server that is missing data to become leader. This is reproduced in the test.
When `mset.resetClusteredState(errCatchupTooManyRetries)` sets `n.pindex = 0` on the reset server, that server grants leadership to the outdated server, resulting in desync.

This PR proposes not to fully delete the RAFT state when we are unable to reach the leader during the processing of a snapshot. This ensures the outdated server does NOT get selected as leader, and the reset server is correctly caught up with the data it missed.
Signed-off-by: Maurice van Veen [email protected]