Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix auto reconnect for RPC WebSocket #53

Merged
merged 2 commits into from
Dec 1, 2023

Conversation

nguyer
Copy link
Contributor

@nguyer nguyer commented Nov 30, 2023

In the case of a reconnect the resChl was nil so sending nil to it ended up blocking, causing the receiver to not receive any further messages.

@peterbroadhurst
Copy link
Contributor

peterbroadhurst commented Dec 1, 2023

Digging into this... it needs some thinking to understand why it's nil - and I've not spotted that in code comments or the PR comment.

@nguyer
Copy link
Contributor Author

nguyer commented Dec 1, 2023

Digging into this... it needs some thinking to understand why it's nil - and I've not spotted that in code comments or the PR comment.

Agreed. I'm not sure why it's nil here either. Though the if statement I added seemed to be consistent with the existing comment in the code if someone is waiting...

@peterbroadhurst
Copy link
Contributor

So I see here that we set it into nil, but that's when we've called handleSubscriptionConfirm - so that scenario would be a double dispatch:

inflightSub.newSubResponse = nil // we only dispatch once (it's only new once, on reconnect it's old and there's nobody to tell if we fail)

@nguyer - I think to review this, I need a bit more information.
Do you have thread stacks or other things that lead to you an explanation of the reconnect scenario, and how this could be nil - and further more how that leads to a hang?

@peterbroadhurst
Copy link
Contributor

peterbroadhurst commented Dec 1, 2023

More likely from just code inspection to me would seem like:

  • Subscribe() gos into case rpcErr := <-newSubResponse:
  • The context closed
  • Subscribe() fails
  • The hang happens, because there wan't a "slot" created on the make() of the channel

... if that were the bug, then the fix would be on this line:

newSubResponse: make(chan *RPCError),

to change it to make(chan *RPCError, 1),

@nguyer
Copy link
Contributor Author

nguyer commented Dec 1, 2023

newSubResponse: make(chan *RPCError),

to change it to make(chan *RPCError, 1),

I just did a quick test of this and it still resulted in inflightSub.newSubResponse being nil when it came into handleSubscriptionConfirm.

I will keep digging.

@nguyer
Copy link
Contributor Author

nguyer commented Dec 1, 2023

After digging into this more, it looks like this is the expected behavior that we only send the notification when the subscription is new, and not on reconnects:

// need to return newSubResponse because it's a use-once thing (not on reconnect)
// and will be nilled out (under lock) when the first creation response comes in
return s, s.newSubResponse

Copy link
Contributor

@peterbroadhurst peterbroadhurst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nguyer, and for pointing me at:

// need to return newSubResponse because it's a use-once thing (not on reconnect)
// and will be nilled out (under lock) when the first creation response comes in
return s, s.newSubResponse

I understand now the nil case I think, but I actually think we also need the , 1 change for another potential edge case I mention around the context.

So let's include both changes.

@nguyer nguyer merged commit ec63452 into hyperledger:main Dec 1, 2023
2 checks passed
@nguyer nguyer deleted the fix-rpcws-reconnect branch December 1, 2023 15:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants