Fix random test failures caused by mainnet connections #722

Closed
tegefaulkes opened this issue May 15, 2024 · 7 comments · Fixed by #725
Assignees
Labels
development (Standard development)
r&d:polykey:core activity 4 (End to End Networking behind Consumer NAT Devices)

Comments

@tegefaulkes
Contributor

tegefaulkes commented May 15, 2024

Specification

While working on #720 I was re-enabling disabled tests. One of them, "can pull a cloned vault" at tests/vaults/VaultManager.test.ts:932, was failing intermittently. After some digging I found that it was caused by external connections and RPC requests being made to the test NodeConnectionManager used within the test.


There are two problems with this.

  1. NodeConnectionManagers within tests should not be starting or receiving any connections during the test. This is to maintain test isolation: test failures should ideally relate directly to what the test is checking. This would depend on Allow NodeManager to start lazily without network entry procedure #461 being completed and then updating all relevant tests.
  2. The error causing the test to fail is an RPC error for a missing handler. This shouldn't be failing the test, since the error should only be thrown back through the RPC call being made. I'll have to find out why it's failing the test, since it's very likely a bug. Unless there's some weird interaction where our test NodeConnectionManager is attempting the RPC call to an RPCServer that doesn't have the handler? Needs more digging.

Additional context

Tasks

  1. Ensure all test PolykeyAgents and NodeConnectionManagers don't attempt to interact with the wider mainnet while testing, by starting with network entry disabled (see the sketch after this list).
  2. Dig into why the RPC error is causing the test to fail and fix any bug that may be revealed.
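
As a rough sketch of what task 1 is aiming for (all names here are illustrative stand-ins, not Polykey's actual API; the real option would presumably come from whatever #461 introduces), a test would create its agent with no seed nodes and with network entry disabled:

```ts
import { beforeAll, afterAll } from '@jest/globals';

// Hypothetical stand-in for the real Polykey test helper; the point is only
// that the agent has no seed nodes and performs no network entry, so it
// never dials out to mainnet during the test.
declare function createAgentForTest(options: {
  seedNodes: Record<string, unknown>;
  networkEntry: boolean;
}): Promise<{ stop(): Promise<void> }>;

let agent: { stop(): Promise<void> };

beforeAll(async () => {
  agent = await createAgentForTest({
    seedNodes: {},       // nothing to dial on startup
    networkEntry: false, // hypothetical flag; roughly what #461 would provide
  });
});

afterAll(async () => {
  await agent.stop();
});
```
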
@tegefaulkes tegefaulkes added the development Standard development label May 15, 2024
@tegefaulkes tegefaulkes self-assigned this May 15, 2024

linear bot commented May 15, 2024

@tegefaulkes
Contributor Author

Just a note: this is a source of intermittent test failures for any test that creates node-to-node connections. As such it's pretty high priority and I'll work on it first thing after completing the vaults review PR #720.

@tegefaulkes
Contributor Author

So the main problem here is that in some tests we have a nodeConnectionManager that was created with empty or truncated handler support. While there is a connection to this nodeConnectionManager, any calls to unsupported handlers will fail with ErrorRPCHandlerFailed.

This in itself shouldn't be a problem, but this error is bubbling up and causing certain tests to fail randomly. Given the nature of how these errors are created and thrown, it's really hard to pin down their origin and their return path. I'm still looking into it.
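
To make the expected behaviour concrete, here's a minimal sketch (the client shape, method name, and error class are stand-ins, not the actual js-rpc API): a call that hits a missing handler should reject that specific call, and the caller should be able to catch it right there.

```ts
// Stand-in error class and client shape for illustration only.
class ErrorRPCHandlerFailed extends Error {}

type RpcClientLike = {
  methods: { vaultsScan(params: object): Promise<unknown> };
};

// The failure should come back through this specific call, not bubble up
// and fail an unrelated test somewhere else.
async function expectMissingHandler(rpcClient: RpcClientLike): Promise<boolean> {
  try {
    await rpcClient.methods.vaultsScan({});
    return false;
  } catch (e) {
    return e instanceof ErrorRPCHandlerFailed;
  }
}
```
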

@tegefaulkes
Contributor Author

I think the problem is in how the QUIC library handles errors within streams. At the time we went with the design that if an error relating to the stream happened within its handlers, then we'd call controller.error(e) on it AND throw e.

It seems that this leads to readablePull throwing the ErrorRPCHandlerFailed coming from codeToReason, which is then thrown outside of the usual stack for handling these errors. So rather than coming out of the stream to be handled, it ends up at the top level and causes the problem.

I'll dig into it more, but at this moment the fix seems to be to only call controller.error(e); within the streams and not throw e;.
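
Roughly, the two patterns look like this (a sketch using the WHATWG streams API, not the actual js-quic source; readNextChunk is a hypothetical stand-in for the underlying QUIC read):

```ts
// Hypothetical read from the underlying QUIC stream.
declare function readNextChunk(): Promise<Uint8Array>;

// Pattern causing the problem: the stream error is (re)thrown out of pull(),
// so it escapes the stream's own error machinery and, as described above,
// ends up at the top level of the test.
const throwingStream = new ReadableStream<Uint8Array>({
  async pull(controller) {
    try {
      controller.enqueue(await readNextChunk());
    } catch (e) {
      controller.error(e);
      throw e; // the problematic extra throw
    }
  },
});

// Proposed fix: only error the controller, so the failure is delivered to
// whoever is consuming the stream and nowhere else.
const fixedStream = new ReadableStream<Uint8Array>({
  async pull(controller) {
    try {
      controller.enqueue(await readNextChunk());
    } catch (e) {
      controller.error(e); // no rethrow
    }
  },
});
```
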

@tegefaulkes
Contributor Author

Yep, confirmed that this is the problem. It's really weird though, since this should be causing many more problems given we do it in a bunch of places in the streams.

That said, this is the only place where we throw the error but don't call controller.error(e) just before it. Still, it's a very weird interaction. I'll have to fix this in js-quic and release a patch.

I'm going to put this on hold for now while I finish off the discovery PR for @amydevs.

@CMCDragonkai
Member

Remember, sometimes we do both because we consider it to be 2 kinds of errors simultaneously. It's an architectural decision.

@tegefaulkes
Contributor Author

Not actually blocked by #461, so I'm removing that as a criterion and moving this back to the backlog.

@tegefaulkes tegefaulkes reopened this May 23, 2024
@CMCDragonkai CMCDragonkai added the r&d:polykey:core activity 4 End to End Networking behind Consumer NAT Devices label Aug 15, 2024