etcdserver: fix panic when checking IsLearner of removed member #18606
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: jscissr. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files.
Hi @jscissr. Thanks for your PR. I'm waiting for an etcd-io member to verify that this patch is reasonable to test. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Codecov Report

... and 17 files with indirect coverage changes

```
@@            Coverage Diff             @@
##             main   #18606      +/-  ##
==========================================
- Coverage   68.81%   68.79%   -0.02%
==========================================
  Files         420      420
  Lines       35519    35516       -3
==========================================
- Hits        24441    24434       -7
  Misses       9648     9648
- Partials     1430     1434       +4
```

Continue to review the full report in Codecov by Sentry.
Did you see a case where the panic happened, or can you create an e2e or integration test case to make it happen?
Yes I can; here are integration tests which demonstrate the panic. I need to add an artificial delay in IsMemberExist to reliably show the panic.

```diff
diff --git a/server/etcdserver/api/membership/cluster.go b/server/etcdserver/api/membership/cluster.go
index 6becdfd62..4b6dbda64 100644
--- a/server/etcdserver/api/membership/cluster.go
+++ b/server/etcdserver/api/membership/cluster.go
@@ -816,6 +816,7 @@ func (c *RaftCluster) SetDowngradeInfo(d *serverversion.DowngradeInfo, shouldApp
 // IsMemberExist returns if the member with the given id exists in cluster.
 func (c *RaftCluster) IsMemberExist(id types.ID) bool {
+	defer time.Sleep(time.Second)
 	c.Lock()
 	defer c.Unlock()
 	_, ok := c.members[id]
diff --git a/tests/integration/cluster_test.go b/tests/integration/cluster_test.go
index 29f8ae8dd..852a11e85 100644
--- a/tests/integration/cluster_test.go
+++ b/tests/integration/cluster_test.go
@@ -201,6 +201,56 @@ func TestAddMemberAfterClusterFullRotation(t *testing.T) {
 	clusterMustProgress(t, c.Members)
 }
+
+func TestConcurrentRemoveMember(t *testing.T) {
+	integration.BeforeTest(t)
+	c := integration.NewCluster(t, &integration.ClusterConfig{Size: 2})
+	defer c.Terminate(t)
+
+	time.Sleep(time.Second)
+	removeID := uint64(c.Members[1].Server.MemberID())
+	go func() {
+		time.Sleep(time.Second / 2)
+		c.Members[0].Client.MemberRemove(context.Background(), removeID)
+	}()
+	if _, err := c.Members[0].Client.MemberRemove(context.Background(), removeID); err != nil {
+		t.Fatal(err)
+	}
+	time.Sleep(time.Second)
+}
+
+func TestConcurrentMoveLeader(t *testing.T) {
+	integration.BeforeTest(t)
+	c := integration.NewCluster(t, &integration.ClusterConfig{Size: 2})
+	defer c.Terminate(t)
+
+	time.Sleep(time.Second)
+	removeID := uint64(c.Members[1].Server.MemberID())
+	go func() {
+		time.Sleep(time.Second / 2)
+		c.Members[0].Client.MoveLeader(context.Background(), removeID)
+	}()
+	if _, err := c.Members[0].Client.MemberRemove(context.Background(), removeID); err != nil {
+		t.Fatal(err)
+	}
+	time.Sleep(time.Second)
+}
+
+func TestConcurrentUnary(t *testing.T) {
+	integration.BeforeTest(t)
+	c := integration.NewCluster(t, &integration.ClusterConfig{Size: 2})
+	defer c.Terminate(t)
+
+	time.Sleep(2 * time.Second)
+	go func() {
+		time.Sleep(time.Second + time.Second/2)
+		c.Members[0].Client.Get(context.Background(), "key")
+	}()
+	if _, err := c.Members[0].Client.MemberRemove(context.Background(), uint64(c.Members[0].Server.MemberID())); err != nil {
+		t.Fatal(err)
+	}
+	time.Sleep(time.Second)
+}
+
 // TestIssue2681 ensures we can remove a member then add a new one back immediately.
 func TestIssue2681(t *testing.T) {
 	integration.BeforeTest(t)
```

Here are the stack traces when running these tests:
Thanks for the integration test cases. I agree that it may happen theoretically. Did you ever see the panic in production or your test environment with the official etcd releases? I believe not, so overall these are minor issues to me. The change to `tests/integration/v3_lease_test.go` (line 1086 in 59cfd7a)

For the change to
Previously, calling s.IsLearner() when the local node is no longer a member panics. There was an attempt to fix this by first checking IsMemberExist(), but this is not a correct fix, because the member could be removed between the two calls. Instead of panicking when the member was removed, IsLearner() should return false; a node which is not a member is also not a learner.

There was a similar concurrency bug when accessing the IsLearner property of a member, which will panic with a nil pointer access error if the member is removed between the IsMemberExist() and Member() calls.

Signed-off-by: Jan Schär <[email protected]>
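The fix described in the commit message is a classic check-then-act repair: perform the existence check and the property read under a single lock acquisition, and treat a removed member as "not a learner". A minimal, self-contained sketch with simplified stand-in types (not etcd's actual RaftCluster code):

```go
package main

import (
	"fmt"
	"sync"
)

// Simplified stand-ins for etcd's Member and RaftCluster types.
type Member struct{ IsLearner bool }

type Cluster struct {
	mu      sync.Mutex
	members map[uint64]*Member
}

// Racy pattern: a member can be removed between these two locked calls,
// so Member() may return nil even though IsMemberExist() just returned true.
func (c *Cluster) IsMemberExist(id uint64) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	_, ok := c.members[id]
	return ok
}

func (c *Cluster) Member(id uint64) *Member {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.members[id] // nil if the member was removed in the meantime
}

// Fixed pattern: one lookup under one lock acquisition; a removed member
// is simply reported as "not a learner" instead of panicking.
func (c *Cluster) IsLearner(id uint64) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	m, ok := c.members[id]
	return ok && m.IsLearner
}

func main() {
	c := &Cluster{members: map[uint64]*Member{1: {IsLearner: true}}}
	fmt.Println(c.IsLearner(1)) // true
	delete(c.members, 1)        // simulate member removal
	fmt.Println(c.IsLearner(1)) // false: no panic, no nil dereference
}
```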
Force-pushed from d61be85 to 605abca.
I found the bug by reading the code, and have indeed not observed it happen without the added delay. I have added the sleepAfterIsMemberExist failpoint as suggested.
```go
c.lg.Panic(
	"failed to find local ID in cluster members",
	zap.String("cluster-id", c.cid.String()),
	zap.String("local-member-id", c.localID.String()),
)
```
Suggest not to change this. I don't think it will happen in production or a test environment; if it happens, it means something critical has occurred.

Please read #18606 (comment), and also the comments below.

Also, as mentioned previously, when the local member is removed from the cluster, it will eventually stop automatically. A panic right before stopping might not be too serious. So I suggest not to change it.
```go
// gofail: var sleepAfterIsMemberExist struct{}
// defer time.Sleep(time.Second)
```
You can add a failpoint right above line 821 (`return ok`):

`// gofail: var sleepAfterIsMemberExist struct{}`

and inject `sleep("1s")` into it during the test.
I did not add a unit test because it's basically impossible to test for such concurrency bugs.
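While a deterministic unit test is indeed hard, the race window itself is easy to demonstrate outside etcd with an artificial delay standing in for the failpoint. A self-contained sketch with simplified stand-in types (not etcd code), where a concurrent removal lands between the existence check and the member lookup:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type Member struct{ IsLearner bool }

type Cluster struct {
	mu      sync.Mutex
	members map[uint64]*Member
}

func (c *Cluster) IsMemberExist(id uint64) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	_, ok := c.members[id]
	return ok
}

func (c *Cluster) Member(id uint64) *Member {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.members[id]
}

func (c *Cluster) Remove(id uint64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.members, id)
}

func main() {
	c := &Cluster{members: map[uint64]*Member{1: {}}}

	// Concurrent removal, like MemberRemove racing the check.
	go func() {
		time.Sleep(50 * time.Millisecond)
		c.Remove(1)
	}()

	if c.IsMemberExist(1) {
		time.Sleep(100 * time.Millisecond) // stands in for the injected sleep
		m := c.Member(1)                   // nil: removed inside the window
		if m == nil {
			fmt.Println("member removed between the two calls")
			return
		}
		fmt.Println(m.IsLearner) // without the nil check, this line panics
	}
}
```

This is exactly the window the patch closes by folding the existence check and the IsLearner read into a single locked lookup.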