Fix random test failures caused by mainnet connections #722

Closed
tegefaulkes opened this issue May 15, 2024 · 7 comments · Fixed by #725
Assignees
Labels
development (Standard development)
r&d:polykey:core activity 4 (End to End Networking behind Consumer NAT Devices)

Comments

@tegefaulkes
Contributor

tegefaulkes commented May 15, 2024

Specification

While working on #720 I was re-enabling disabled tests. One of them, "can pull a cloned vault" at tests/vaults/VaultManager.test.ts:932, was failing intermittently. After some digging I found that it was caused by external connections and RPC requests being made to the test NodeConnectionManager used within the test.


There are two problems with this.

  1. NodeConnectionManagers within tests should not be starting or receiving any connections during the test. This is to maintain test isolation: test failures should ideally relate directly to what the test is checking. This would depend on Allow NodeManager to start lazily without network entry procedure #461 being completed and then updating all relevant tests.
  2. The error causing the test to fail is an RPC error for a missing handler. This shouldn't be failing the test, since the error should only be thrown back through the RPC call being made. I'll have to find out why it's failing the test, since it's very likely a bug. Unless there's some weird interaction where our test NodeConnectionManager is attempting the RPC call to an RPCServer that doesn't have the handler? Needs more digging.

Additional context

Tasks

  1. Ensure all test PolykeyAgents and NodeConnectionManagers don't attempt to interact with the wider mainnet while testing, by starting with network entry disabled (see the sketch after this list).
  2. Dig into why the RPC error is causing the test to fail and fix any bug that may be revealed.
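
As a rough sketch of what task 1 is aiming for (all names here are illustrative stand-ins, not Polykey's actual API; the real option would presumably come from whatever #461 introduces), a test would create its agent with no seed nodes and with network entry disabled:

```ts
import { beforeAll, afterAll } from '@jest/globals';

// Hypothetical stand-in for the real Polykey test helper; the point is only
// that the agent has no seed nodes and performs no network entry, so it
// never dials out to mainnet during the test.
declare function createAgentForTest(options: {
  seedNodes: Record<string, unknown>;
  networkEntry: boolean;
}): Promise<{ stop(): Promise<void> }>;

let agent: { stop(): Promise<void> };

beforeAll(async () => {
  agent = await createAgentForTest({
    seedNodes: {},       // nothing to dial on startup
    networkEntry: false, // hypothetical flag; roughly what #461 would provide
  });
});

afterAll(async () => {
  await agent.stop();
});
```
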
@tegefaulkes tegefaulkes added the development Standard development label May 15, 2024
@tegefaulkes tegefaulkes self-assigned this May 15, 2024

linear bot commented May 15, 2024

@tegefaulkes
Contributor Author

Just a note: this is a source of intermittent test failures for any test that creates node-to-node connections. As such it's pretty high priority and I'll work on it first thing after completing the vaults review PR #720.

@tegefaulkes
Contributor Author

So the main problem here is that in some tests we have a nodeConnectionManager that was created with empty or truncated handler support. While there is a connection to this nodeConnectionManager, any calls to unsupported handlers will fail with ErrorRPCHandlerFailed.

This in itself shouldn't be a problem, but this error is bubbling up and causing certain tests to fail randomly. Given the nature of how these errors are created and thrown, it's really hard to pin down their origin and their return path. I'm still looking into it.
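
To make the expected behaviour concrete, here's a minimal sketch (the client shape, method name, and error class are stand-ins, not the actual js-rpc API): a call that hits a missing handler should reject that specific call, and the caller should be able to catch it right there.

```ts
// Stand-in error class and client shape for illustration only.
class ErrorRPCHandlerFailed extends Error {}

type RpcClientLike = {
  methods: { vaultsScan(params: object): Promise<unknown> };
};

// The failure should come back through this specific call, not bubble up
// and fail an unrelated test somewhere else.
async function expectMissingHandler(rpcClient: RpcClientLike): Promise<boolean> {
  try {
    await rpcClient.methods.vaultsScan({});
    return false;
  } catch (e) {
    return e instanceof ErrorRPCHandlerFailed;
  }
}
```
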

@tegefaulkes
Contributor Author

I think the problem is in how the QUIC library handles errors within streams. At the time we went with the design that if an error relating to the stream happened within its handlers, then we'd call controller.error(e) on it AND throw e.

It seems that this leads to readablePull throwing the ErrorRPCHandlerFailed coming from codeToReason, which is then thrown outside of the usual stack for handling these errors. So rather than coming out of the stream to be handled, it ends up at the top level and causes the problem.

I'll dig into it more, but at this moment the fix seems to be to only call controller.error(e); within the streams and not throw e;.
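
Roughly, the two patterns look like this (a sketch using the WHATWG streams API, not the actual js-quic source; readNextChunk is a hypothetical stand-in for the underlying QUIC read):

```ts
// Hypothetical read from the underlying QUIC stream.
declare function readNextChunk(): Promise<Uint8Array>;

// Pattern causing the problem: the stream error is (re)thrown out of pull(),
// so it escapes the stream's own error machinery and, as described above,
// ends up at the top level of the test.
const throwingStream = new ReadableStream<Uint8Array>({
  async pull(controller) {
    try {
      controller.enqueue(await readNextChunk());
    } catch (e) {
      controller.error(e);
      throw e; // the problematic extra throw
    }
  },
});

// Proposed fix: only error the controller, so the failure is delivered to
// whoever is consuming the stream and nowhere else.
const fixedStream = new ReadableStream<Uint8Array>({
  async pull(controller) {
    try {
      controller.enqueue(await readNextChunk());
    } catch (e) {
      controller.error(e); // no rethrow
    }
  },
});
```
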

@tegefaulkes
Contributor Author

Yep, confirmed that this is the problem. It's really weird though, since this should be causing many more problems given we do it in a bunch of places in the streams.

That said, this is the only place where we throw the error but don't call controller.error(e) just before it. Still, it's a very weird interaction. I'll have to fix this in js-quic and release a patch.

I'm going to put this on hold for now while I finish off the discovery PR for @amydevs.

@CMCDragonkai
Member

Remember, sometimes we do both because we consider it to be 2 kinds of errors simultaneously. It's an architectural decision.

@tegefaulkes
Contributor Author

Not actually blocked by #461, so I'm removing that as a criterion and moving this back to the backlog.

@tegefaulkes tegefaulkes reopened this May 23, 2024
@CMCDragonkai CMCDragonkai added the r&d:polykey:core activity 4 End to End Networking behind Consumer NAT Devices label Aug 15, 2024