
etcdserver: fix panic when checking IsLearner of removed member #18606

Open · wants to merge 1 commit into main
Conversation

@jscissr commented Sep 19, 2024

Previously, calling s.IsLearner() when the local node is no longer a member panics. There was an attempt to fix this by first checking IsMemberExist(), but this is not a correct fix because the member could be removed between the two calls. Instead of panicking when the member was removed, IsLearner() should return false. A node which is not a member is also not a learner.
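For illustration, the shape of the fix in IsLocalMemberLearner is roughly the following (a minimal sketch based on the upstream cluster.go; the field names members and localID are assumptions, not the verbatim patch):

// Sketch only: a removed local member is reported as "not a learner".
func (c *RaftCluster) IsLocalMemberLearner() bool {
	c.Lock()
	defer c.Unlock()
	localMember, ok := c.members[c.localID]
	if !ok {
		// Previously this branch panicked via c.lg.Panic(...).
		// A node that is no longer a member is also not a learner.
		return false
	}
	return localMember.IsLearner
}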

There was a similar concurrency bug when accessing the IsLearner property of a member, which will panic with a nil pointer access error if the member is removed between the IsMemberExist() and Member() calls.
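The racy pattern and the fetch-once fix look roughly like this (a sketch; the identifiers approximate the call site in server/etcdserver/server.go):

// Racy: the member can be removed between the two cluster reads,
// so Member(id) may return nil and the field access panics.
if s.cluster.IsMemberExist(id) && s.cluster.Member(id).IsLearner {
	// ...
}

// Fixed: read the member once and nil-check the result.
if m := s.cluster.Member(id); m != nil && m.IsLearner {
	// ...
}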

I did not add a unit test because it's basically impossible to test for such concurrency bugs.

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jscissr
Once this PR has been reviewed and has the lgtm label, please assign ahrtr for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

Hi @jscissr. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@codecov-commenter

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 71.42857% with 2 lines in your changes missing coverage. Please review.

Project coverage is 68.79%. Comparing base (ce07474) to head (448ef94).
Report is 2 commits behind head on main.

Current head 448ef94 differs from pull request most recent head d61be85

Please upload reports for the commit d61be85 to get more accurate results.

Files with missing lines Patch % Lines
server/etcdserver/api/membership/cluster.go 0.00% 1 Missing ⚠️
server/etcdserver/server.go 75.00% 0 Missing and 1 partial ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
Files with missing lines Coverage Δ
server/etcdserver/api/v3rpc/interceptor.go 77.60% <100.00%> (+3.12%) ⬆️
server/etcdserver/api/membership/cluster.go 88.27% <0.00%> (-0.26%) ⬇️
server/etcdserver/server.go 81.28% <75.00%> (-0.36%) ⬇️

... and 17 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #18606      +/-   ##
==========================================
- Coverage   68.81%   68.79%   -0.02%     
==========================================
  Files         420      420              
  Lines       35519    35516       -3     
==========================================
- Hits        24441    24434       -7     
  Misses       9648     9648              
- Partials     1430     1434       +4     

Continue to review full report in Codecov by Sentry.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ce07474...d61be85. Read the comment docs.

@ahrtr (Member) commented Sep 22, 2024

Previously, calling s.IsLearner() when the local node is no longer a member panics.

Did you see a case where panic happened or can you create an e2e or integration test case to make it happen?

@jscissr (Author) commented Sep 23, 2024

Did you see a case where panic happened or can you create an e2e or integration test case to make it happen?

Yes I can, here are integration tests which demonstrate the panic. I need to add an artificial delay in IsMemberExist to reliably reproduce it.

diff --git a/server/etcdserver/api/membership/cluster.go b/server/etcdserver/api/membership/cluster.go
index 6becdfd62..4b6dbda64 100644
--- a/server/etcdserver/api/membership/cluster.go
+++ b/server/etcdserver/api/membership/cluster.go
@@ -816,6 +816,7 @@ func (c *RaftCluster) SetDowngradeInfo(d *serverversion.DowngradeInfo, shouldApp
 
 // IsMemberExist returns if the member with the given id exists in cluster.
 func (c *RaftCluster) IsMemberExist(id types.ID) bool {
+	defer time.Sleep(time.Second)
 	c.Lock()
 	defer c.Unlock()
 	_, ok := c.members[id]
diff --git a/tests/integration/cluster_test.go b/tests/integration/cluster_test.go
index 29f8ae8dd..852a11e85 100644
--- a/tests/integration/cluster_test.go
+++ b/tests/integration/cluster_test.go
@@ -201,6 +201,56 @@ func TestAddMemberAfterClusterFullRotation(t *testing.T) {
 	clusterMustProgress(t, c.Members)
 }
 
+func TestConcurrentRemoveMember(t *testing.T) {
+	integration.BeforeTest(t)
+	c := integration.NewCluster(t, &integration.ClusterConfig{Size: 2})
+	defer c.Terminate(t)
+
+	time.Sleep(time.Second)
+	removeID := uint64(c.Members[1].Server.MemberID())
+	go func() {
+		time.Sleep(time.Second / 2)
+		c.Members[0].Client.MemberRemove(context.Background(), removeID)
+	}()
+	if _, err := c.Members[0].Client.MemberRemove(context.Background(), removeID); err != nil {
+		t.Fatal(err)
+	}
+	time.Sleep(time.Second)
+}
+
+func TestConcurrentMoveLeader(t *testing.T) {
+	integration.BeforeTest(t)
+	c := integration.NewCluster(t, &integration.ClusterConfig{Size: 2})
+	defer c.Terminate(t)
+
+	time.Sleep(time.Second)
+	removeID := uint64(c.Members[1].Server.MemberID())
+	go func() {
+		time.Sleep(time.Second / 2)
+		c.Members[0].Client.MoveLeader(context.Background(), removeID)
+	}()
+	if _, err := c.Members[0].Client.MemberRemove(context.Background(), removeID); err != nil {
+		t.Fatal(err)
+	}
+	time.Sleep(time.Second)
+}
+
+func TestConcurrentUnary(t *testing.T) {
+	integration.BeforeTest(t)
+	c := integration.NewCluster(t, &integration.ClusterConfig{Size: 2})
+	defer c.Terminate(t)
+
+	time.Sleep(2 * time.Second)
+	go func() {
+		time.Sleep(time.Second + time.Second/2)
+		c.Members[0].Client.Get(context.Background(), "key")
+	}()
+	if _, err := c.Members[0].Client.MemberRemove(context.Background(), uint64(c.Members[0].Server.MemberID())); err != nil {
+		t.Fatal(err)
+	}
+	time.Sleep(time.Second)
+}
+
 // TestIssue2681 ensures we can remove a member then add a new one back immediately.
 func TestIssue2681(t *testing.T) {
 	integration.BeforeTest(t)

Here are the stack traces when running these tests:

% (cd tests && 'env' 'ETCD_VERIFY=all' 'go' 'test' './integration/...' '-timeout=15m' '--race' '-run=TestConcurrentRemoveMember' '-p=2')
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x145df4d]

goroutine 218 [running]:
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).mayRemoveMember(0xc0003d3808, 0xa5a138a8b8d2b107)
        /home/jan/Documents/git-thirdparty/etcd/server/etcdserver/server.go:1596 +0xed
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).RemoveMember(0xc0003d3808, {0x1cf91e0, 0xc001dbe540}, 0xa5a138a8b8d2b107)
        /home/jan/Documents/git-thirdparty/etcd/server/etcdserver/server.go:1428 +0x85
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*ClusterServer).MemberRemove(0xc000012ae0, {0x1cf91e0, 0xc001dbe540}, 0xc001dbe570)
        /home/jan/Documents/git-thirdparty/etcd/server/etcdserver/api/v3rpc/member.go:71 +0xad


% (cd tests && 'env' 'ETCD_VERIFY=all' 'go' 'test' './integration/...' '-timeout=15m' '--race' '-run=TestConcurrentMoveLeader' '-p=2')
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1459bed]

goroutine 207 [running]:
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).MoveLeader(0xc0003fb508, {0x1cf91e0, 0xc001e33bc0}, 0x984e59084a1f2c8b, 0xb3a11ebbc585e63a)
        /home/jan/Documents/git-thirdparty/etcd/server/etcdserver/server.go:1227 +0xcd
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*maintenanceServer).MoveLeader(0xc00041f790, {0x1cf91e0, 0xc001e33bc0}, 0xc001e33bf0)
        /home/jan/Documents/git-thirdparty/etcd/server/etcdserver/api/v3rpc/maintenance.go:282 +0x116
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*authMaintenanceServer).MoveLeader(0xc000600ef0, {0x1cf91e0, 0xc001e33bc0}, 0xc001e33bf0)
        /home/jan/Documents/git-thirdparty/etcd/server/etcdserver/api/v3rpc/maintenance.go:347 +0xb1


% (cd tests && 'env' 'ETCD_VERIFY=all' 'go' 'test' './integration/...' '-timeout=15m' '--race' '-run=TestConcurrentUnary' '-p=2')
panic: failed to find local ID in cluster members

goroutine 365 [running]:
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x2, 0xc000fa95f0, {0xc000229a00?, 0x2?, 0x2?})
        /home/jan/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:196 +0x9f
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000fa95f0, {0xc000229980, 0x2, 0x2})
        /home/jan/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:262 +0x3ad
go.uber.org/zap.(*Logger).Panic(0xc000084a00, {0x1aa86bb, 0x2a}, {0xc000229980, 0x2, 0x2})
        /home/jan/go/pkg/mod/go.uber.org/[email protected]/logger.go:285 +0x68
go.etcd.io/etcd/server/v3/etcdserver/api/membership.(*RaftCluster).IsLocalMemberLearner(0xc0003a04d0)
        /home/jan/Documents/git-thirdparty/etcd/server/etcdserver/api/membership/cluster.go:786 +0x35b
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).IsLearner(0xc000295508)
        /home/jan/Documents/git-thirdparty/etcd/server/etcdserver/server.go:2471 +0x45
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.Server.newUnaryInterceptor.func4({0x1cf91e0, 0xc001ed27b0}, {0x1a3ec60, 0xc000002e10}, 0xc001ea4320?, 0xc001eab540)
        /home/jan/Documents/git-thirdparty/etcd/server/etcdserver/api/v3rpc/interceptor.go:52 +0x9a

@ahrtr (Member) commented Sep 24, 2024

Thanks for the integration test cases.

I agree that it may happen theoretically. Did you ever see the panic in production or in your test environment with official etcd releases? I believe not. So overall this is a minor issue to me.

The change to server/etcdserver/server.go is straightforward and safe; please also add the test cases TestConcurrentMoveLeader and TestConcurrentRemoveMember to this PR. You can use a gofail failpoint to trigger the sleep. See an example below:

require.NoError(t, gofail.Enable(fpName, `sleep("3s")`))
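A sketch of how a test could enable and clean up such a failpoint (gofail.Enable and gofail.Disable are from go.etcd.io/gofail/runtime; the failpoint name sleepAfterIsMemberExist follows the review suggestion further below, and failpoints only take effect in gofail-instrumented builds):

import (
	"testing"

	"github.com/stretchr/testify/require"
	gofail "go.etcd.io/gofail/runtime"
)

// enableSleepFailpoint injects a 1s sleep at the failpoint and
// disables it again when the test finishes.
func enableSleepFailpoint(t *testing.T) {
	require.NoError(t, gofail.Enable("sleepAfterIsMemberExist", `sleep("1s")`))
	t.Cleanup(func() {
		require.NoError(t, gofail.Disable("sleepAfterIsMemberExist"))
	})
}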

For the change to server/etcdserver/api/v3rpc/interceptor.go, I prefer to keep it unchanged:

  • It changes the behaviour/protocol between etcdserver and client.
  • Also, it makes more sense to return an error like "member not found" instead of ErrGRPCNotSupportedForLearner if the member is removed in between;
  • When the local member is removed from the cluster, it will eventually stop automatically. A panic right before stopping might not be too serious.

@jscissr (Author) commented Sep 25, 2024

I found the bug by reading the code, and have indeed not observed it happen without the added delay.

I have added the TestConcurrentMoveLeader and TestConcurrentRemoveMember tests to the PR, with failpoint and reliability improvements. That said, I'm not sure that adding these tests to the codebase has much value; the only way they could catch a bug is if someone adds back calls to IsMemberExist in these functions.

For server/etcdserver/api/v3rpc/interceptor.go, I don't understand your concern. The existing behavior is left unchanged by my changes; they only reduce the set of possible behaviors by removing the possibility of panicking. I changed IsLocalMemberLearner to return false when the local node is not a member. This means that s.IsMemberExist(s.MemberID()) && s.IsLearner() is equivalent to s.IsLearner().
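In code terms, the interceptor change is roughly the following (a sketch; the guard lives in newUnaryInterceptor and the exact method names are assumptions):

// Before (racy): two separate cluster reads.
if s.IsMemberExist(s.MemberID()) && s.IsLearner() && !isRPCSupportedForLearner(req) {
	return nil, rpctypes.ErrGRPCNotSupportedForLearner
}

// After: IsLearner() now returns false for a removed member, so the
// existence check is redundant and only the panic path goes away.
if s.IsLearner() && !isRPCSupportedForLearner(req) {
	return nil, rpctypes.ErrGRPCNotSupportedForLearner
}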

Comment on lines -786 to -790
c.lg.Panic(
"failed to find local ID in cluster members",
zap.String("cluster-id", c.cid.String()),
zap.String("local-member-id", c.localID.String()),
)

Suggest not to change this. I don't think it will happen in a production or test environment. If it does happen, it means something critical has occurred.

@ahrtr (Member) commented Sep 26, 2024

For server/etcdserver/api/v3rpc/interceptor.go, I don't understand your concern. The existing behavior is left unchanged by my changes, they only reduce the set of possible behaviors by removing the possibility of panicking.

Please read #18606 (comment), and also the comments below:

  • Previously etcdserver panicked in such a case, but now it returns a rpctypes.ErrGRPCNotSupportedForLearner error. So it changes the behaviour, and it may mislead users.
  • Also, an error like "member not found" would be more appropriate in such a case.

Also as mentioned previously, when the local member is removed from the cluster, it will eventually stop automatically. A panic right before stopping might not be too serious.

So I suggest not to change

  • server/etcdserver/api/v3rpc/interceptor.go
  • server/etcdserver/api/membership/cluster.go

Comment on lines +815 to +817
// gofail: var sleepAfterIsMemberExist struct{}
// defer time.Sleep(time.Second)


You can add a failpoint right above line 821 ("return ok"):

// gofail: var sleepAfterIsMemberExist struct{}

and inject sleep("1s") into it during the test.
