
etcdserver: fix panic when checking IsLearner of removed member #18606

Open · wants to merge 1 commit into main
Conversation

@jscissr commented Sep 19, 2024

Previously, calling s.IsLearner() when the local node is no longer a member panics. There was an attempt to fix this by first checking IsMemberExist(), but this is not a correct fix because the member could be removed between the two calls. Instead of panicking when the member was removed, IsLearner() should return false. A node which is not a member is also not a learner.
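For illustration, the shape of the fix in IsLocalMemberLearner is roughly the following (a minimal sketch based on the upstream cluster.go; the field names members and localID are assumptions, not the verbatim patch):

// Sketch only: a removed local member is reported as "not a learner".
func (c *RaftCluster) IsLocalMemberLearner() bool {
	c.Lock()
	defer c.Unlock()
	localMember, ok := c.members[c.localID]
	if !ok {
		// Previously this branch panicked via c.lg.Panic(...).
		// A node that is no longer a member is also not a learner.
		return false
	}
	return localMember.IsLearner
}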

There was a similar concurrency bug when accessing the IsLearner property of a member, which will panic with a nil pointer access error if the member is removed between the IsMemberExist() and Member() calls.
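The racy pattern and the fetch-once fix look roughly like this (a sketch; the identifiers approximate the call site in server/etcdserver/server.go):

// Racy: the member can be removed between the two cluster reads,
// so Member(id) may return nil and the field access panics.
if s.cluster.IsMemberExist(id) && s.cluster.Member(id).IsLearner {
	// ...
}

// Fixed: read the member once and nil-check the result.
if m := s.cluster.Member(id); m != nil && m.IsLearner {
	// ...
}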

I did not add a unit test because it's basically impossible to test for such concurrency bugs.

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jscissr
Once this PR has been reviewed and has the lgtm label, please assign ahrtr for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

Hi @jscissr. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@codecov-commenter

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 71.42857% with 2 lines in your changes missing coverage. Please review.

Project coverage is 68.79%. Comparing base (ce07474) to head (448ef94).
Report is 2 commits behind head on main.

Current head 448ef94 differs from pull request most recent head d61be85

Please upload reports for the commit d61be85 to get more accurate results.

Files with missing lines Patch % Lines
server/etcdserver/api/membership/cluster.go 0.00% 1 Missing ⚠️
server/etcdserver/server.go 75.00% 0 Missing and 1 partial ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
Files with missing lines Coverage Δ
server/etcdserver/api/v3rpc/interceptor.go 77.60% <100.00%> (+3.12%) ⬆️
server/etcdserver/api/membership/cluster.go 88.27% <0.00%> (-0.26%) ⬇️
server/etcdserver/server.go 81.28% <75.00%> (-0.36%) ⬇️

... and 17 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #18606      +/-   ##
==========================================
- Coverage   68.81%   68.79%   -0.02%     
==========================================
  Files         420      420              
  Lines       35519    35516       -3     
==========================================
- Hits        24441    24434       -7     
  Misses       9648     9648              
- Partials     1430     1434       +4     

Continue to review full report in Codecov by Sentry.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ce07474...d61be85. Read the comment docs.

@ahrtr (Member) commented Sep 22, 2024

Previously, calling s.IsLearner() when the local node is no longer a member panics.

Did you see a case where panic happened or can you create an e2e or integration test case to make it happen?

@jscissr (Author) commented Sep 23, 2024

Did you see a case where panic happened or can you create an e2e or integration test case to make it happen?

Yes I can, here are integration tests which demonstrate the panic. I need to add an artificial delay in IsMemberExist to reliably reproduce it.

diff --git a/server/etcdserver/api/membership/cluster.go b/server/etcdserver/api/membership/cluster.go
index 6becdfd62..4b6dbda64 100644
--- a/server/etcdserver/api/membership/cluster.go
+++ b/server/etcdserver/api/membership/cluster.go
@@ -816,6 +816,7 @@ func (c *RaftCluster) SetDowngradeInfo(d *serverversion.DowngradeInfo, shouldApp
 
 // IsMemberExist returns if the member with the given id exists in cluster.
 func (c *RaftCluster) IsMemberExist(id types.ID) bool {
+	defer time.Sleep(time.Second)
 	c.Lock()
 	defer c.Unlock()
 	_, ok := c.members[id]
diff --git a/tests/integration/cluster_test.go b/tests/integration/cluster_test.go
index 29f8ae8dd..852a11e85 100644
--- a/tests/integration/cluster_test.go
+++ b/tests/integration/cluster_test.go
@@ -201,6 +201,56 @@ func TestAddMemberAfterClusterFullRotation(t *testing.T) {
 	clusterMustProgress(t, c.Members)
 }
 
+func TestConcurrentRemoveMember(t *testing.T) {
+	integration.BeforeTest(t)
+	c := integration.NewCluster(t, &integration.ClusterConfig{Size: 2})
+	defer c.Terminate(t)
+
+	time.Sleep(time.Second)
+	removeID := uint64(c.Members[1].Server.MemberID())
+	go func() {
+		time.Sleep(time.Second / 2)
+		c.Members[0].Client.MemberRemove(context.Background(), removeID)
+	}()
+	if _, err := c.Members[0].Client.MemberRemove(context.Background(), removeID); err != nil {
+		t.Fatal(err)
+	}
+	time.Sleep(time.Second)
+}
+
+func TestConcurrentMoveLeader(t *testing.T) {
+	integration.BeforeTest(t)
+	c := integration.NewCluster(t, &integration.ClusterConfig{Size: 2})
+	defer c.Terminate(t)
+
+	time.Sleep(time.Second)
+	removeID := uint64(c.Members[1].Server.MemberID())
+	go func() {
+		time.Sleep(time.Second / 2)
+		c.Members[0].Client.MoveLeader(context.Background(), removeID)
+	}()
+	if _, err := c.Members[0].Client.MemberRemove(context.Background(), removeID); err != nil {
+		t.Fatal(err)
+	}
+	time.Sleep(time.Second)
+}
+
+func TestConcurrentUnary(t *testing.T) {
+	integration.BeforeTest(t)
+	c := integration.NewCluster(t, &integration.ClusterConfig{Size: 2})
+	defer c.Terminate(t)
+
+	time.Sleep(2 * time.Second)
+	go func() {
+		time.Sleep(time.Second + time.Second/2)
+		c.Members[0].Client.Get(context.Background(), "key")
+	}()
+	if _, err := c.Members[0].Client.MemberRemove(context.Background(), uint64(c.Members[0].Server.MemberID())); err != nil {
+		t.Fatal(err)
+	}
+	time.Sleep(time.Second)
+}
+
 // TestIssue2681 ensures we can remove a member then add a new one back immediately.
 func TestIssue2681(t *testing.T) {
 	integration.BeforeTest(t)

Here are the stack traces when running these tests:

% (cd tests && 'env' 'ETCD_VERIFY=all' 'go' 'test' './integration/...' '-timeout=15m' '--race' '-run=TestConcurrentRemoveMember' '-p=2')
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x145df4d]

goroutine 218 [running]:
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).mayRemoveMember(0xc0003d3808, 0xa5a138a8b8d2b107)
        /home/jan/Documents/git-thirdparty/etcd/server/etcdserver/server.go:1596 +0xed
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).RemoveMember(0xc0003d3808, {0x1cf91e0, 0xc001dbe540}, 0xa5a138a8b8d2b107)
        /home/jan/Documents/git-thirdparty/etcd/server/etcdserver/server.go:1428 +0x85
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*ClusterServer).MemberRemove(0xc000012ae0, {0x1cf91e0, 0xc001dbe540}, 0xc001dbe570)
        /home/jan/Documents/git-thirdparty/etcd/server/etcdserver/api/v3rpc/member.go:71 +0xad


% (cd tests && 'env' 'ETCD_VERIFY=all' 'go' 'test' './integration/...' '-timeout=15m' '--race' '-run=TestConcurrentMoveLeader' '-p=2')
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1459bed]

goroutine 207 [running]:
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).MoveLeader(0xc0003fb508, {0x1cf91e0, 0xc001e33bc0}, 0x984e59084a1f2c8b, 0xb3a11ebbc585e63a)
        /home/jan/Documents/git-thirdparty/etcd/server/etcdserver/server.go:1227 +0xcd
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*maintenanceServer).MoveLeader(0xc00041f790, {0x1cf91e0, 0xc001e33bc0}, 0xc001e33bf0)
        /home/jan/Documents/git-thirdparty/etcd/server/etcdserver/api/v3rpc/maintenance.go:282 +0x116
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*authMaintenanceServer).MoveLeader(0xc000600ef0, {0x1cf91e0, 0xc001e33bc0}, 0xc001e33bf0)
        /home/jan/Documents/git-thirdparty/etcd/server/etcdserver/api/v3rpc/maintenance.go:347 +0xb1


% (cd tests && 'env' 'ETCD_VERIFY=all' 'go' 'test' './integration/...' '-timeout=15m' '--race' '-run=TestConcurrentUnary' '-p=2')
panic: failed to find local ID in cluster members

goroutine 365 [running]:
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x2, 0xc000fa95f0, {0xc000229a00?, 0x2?, 0x2?})
        /home/jan/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:196 +0x9f
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000fa95f0, {0xc000229980, 0x2, 0x2})
        /home/jan/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:262 +0x3ad
go.uber.org/zap.(*Logger).Panic(0xc000084a00, {0x1aa86bb, 0x2a}, {0xc000229980, 0x2, 0x2})
        /home/jan/go/pkg/mod/go.uber.org/[email protected]/logger.go:285 +0x68
go.etcd.io/etcd/server/v3/etcdserver/api/membership.(*RaftCluster).IsLocalMemberLearner(0xc0003a04d0)
        /home/jan/Documents/git-thirdparty/etcd/server/etcdserver/api/membership/cluster.go:786 +0x35b
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).IsLearner(0xc000295508)
        /home/jan/Documents/git-thirdparty/etcd/server/etcdserver/server.go:2471 +0x45
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.Server.newUnaryInterceptor.func4({0x1cf91e0, 0xc001ed27b0}, {0x1a3ec60, 0xc000002e10}, 0xc001ea4320?, 0xc001eab540)
        /home/jan/Documents/git-thirdparty/etcd/server/etcdserver/api/v3rpc/interceptor.go:52 +0x9a

@ahrtr (Member) commented Sep 24, 2024

Thanks for the integration test cases.

I agree that it may happen theoretically. Did you ever see the panic in production or in your test environment with official etcd releases? I believe not. So overall this is a minor issue to me.

The change to server/etcdserver/server.go is straightforward and safe; please also add the test cases TestConcurrentMoveLeader and TestConcurrentRemoveMember to this PR. You can use a gofail failpoint to trigger the sleep. See an example below:

require.NoError(t, gofail.Enable(fpName, `sleep("3s")`))
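A sketch of how a test could enable and clean up such a failpoint (gofail.Enable and gofail.Disable are from go.etcd.io/gofail/runtime; the failpoint name sleepAfterIsMemberExist follows the review suggestion further below, and failpoints only take effect in gofail-instrumented builds):

import (
	"testing"

	"github.com/stretchr/testify/require"
	gofail "go.etcd.io/gofail/runtime"
)

// enableSleepFailpoint injects a 1s sleep at the failpoint and
// disables it again when the test finishes.
func enableSleepFailpoint(t *testing.T) {
	require.NoError(t, gofail.Enable("sleepAfterIsMemberExist", `sleep("1s")`))
	t.Cleanup(func() {
		require.NoError(t, gofail.Disable("sleepAfterIsMemberExist"))
	})
}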

For the change to server/etcdserver/api/v3rpc/interceptor.go, I prefer to keep it unchanged:

  • It changes the behaviour/protocol between etcdserver and client.
  • Also, it makes more sense to return an error like "member not found" instead of ErrGRPCNotSupportedForLearner if the member is removed in between;
  • When the local member is removed from the cluster, it will eventually stop automatically. A panic right before stopping might not be too serious.

@jscissr (Author) commented Sep 25, 2024

I found the bug by reading the code, and have indeed not observed it happen without the added delay.

I have added the TestConcurrentMoveLeader and TestConcurrentRemoveMember tests to the PR, with failpoint and reliability improvements. That said, I'm not sure that adding these tests to the codebase has much value; the only way they could catch a bug is if someone adds back calls to IsMemberExist in these functions.

For server/etcdserver/api/v3rpc/interceptor.go, I don't understand your concern. The existing behavior is left unchanged by my changes; they only reduce the set of possible behaviors by removing the possibility of panicking. I changed IsLocalMemberLearner to return false when the local node is not a member. This means that s.IsMemberExist(s.MemberID()) && s.IsLearner() is equivalent to s.IsLearner().
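In code terms, the interceptor change is roughly the following (a sketch; the guard lives in newUnaryInterceptor and the exact method names are assumptions):

// Before (racy): two separate cluster reads.
if s.IsMemberExist(s.MemberID()) && s.IsLearner() && !isRPCSupportedForLearner(req) {
	return nil, rpctypes.ErrGRPCNotSupportedForLearner
}

// After: IsLearner() now returns false for a removed member, so the
// existence check is redundant and only the panic path goes away.
if s.IsLearner() && !isRPCSupportedForLearner(req) {
	return nil, rpctypes.ErrGRPCNotSupportedForLearner
}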

Comment on lines -786 to -790
c.lg.Panic(
"failed to find local ID in cluster members",
zap.String("cluster-id", c.cid.String()),
zap.String("local-member-id", c.localID.String()),
)

Suggest not to change this. I don't think it will happen in a production or test environment. If it does happen, it means something critical has occurred.

@ahrtr (Member) commented Sep 26, 2024

For server/etcdserver/api/v3rpc/interceptor.go, I don't understand your concern. The existing behavior is left unchanged by my changes, they only reduce the set of possible behaviors by removing the possibility of panicking.

Please read #18606 (comment), and also the comments below:

  • Previously etcdserver panicked in such a case, but now it returns a rpctypes.ErrGRPCNotSupportedForLearner error. So it changes the behaviour, and it may mislead users.
  • Also, an error like "member not found" would be more appropriate in such a case.

Also as mentioned previously, when the local member is removed from the cluster, it will eventually stop automatically. A panic right before stopping might not be too serious.

So I suggest not to change

  • server/etcdserver/api/v3rpc/interceptor.go
  • server/etcdserver/api/membership/cluster.go

Comment on lines +815 to +817
// gofail: var sleepAfterIsMemberExist struct{}
// defer time.Sleep(time.Second)


You can add a failpoint right above line 821 ("return ok"):

// gofail: var sleepAfterIsMemberExist struct{}

and inject sleep("1s") into it during the test.
