Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Not a valid WebSocket upgrade request crashes #149

Closed
pvthuyen opened this issue May 19, 2023 · 16 comments
Closed

Not a valid WebSocket upgrade request crashes #149

pvthuyen opened this issue May 19, 2023 · 16 comments

Comments

@pvthuyen
Copy link

Apologize if this is probably not the best place to raise my question.

We’ve recently experimented on using Bandit instead of Cowboy for our Websocket service built on Elixir. With just a small percentage of our traffic, we saw a lot of crashes saying “Not a valid WebSocket upgrade request”. We followed the instructions for migration in https://hexdocs.pm/bandit/1.0.0-pre.1/Bandit.html.

Are there any other configuration which we should do? Is this possible that this issue has always been happening and Cowboy does not raise it but Bandit does?

@mtrudel
Copy link
Owner

mtrudel commented May 19, 2023

Interesting! If you're seeing that message, then you're ending up in a situation where your Plug has indicated an upgrade should be performed, but the underlying HTTP request isn't a valid WebSocket upgrade (due to how all the pieces of Plug fit together we can't identify this situation until it's too late for your Plug call to do anything about it, hence the crash).

Offhand, I don't know if Cowboy may have always silently been failing in a similar manner. As a general observation, Cowboy tends to err in the direction of being silent, where Bandit has an explicit policy of being noisy in the face of unexpected situations. If this is the case, you certainly wouldn't be the first person to discover that you'd always had a broken setup and Cowboy was just being quiet about it.

Some questions to try and diagnose this:

  1. What is your Plug? If this is Phoenix, do you have any non-standard socket configurations in your endpoint?
  2. Is there anything in the logs adjacent to these messages that may provide some clues?
  3. What are your clients? Are they connecting to Bandit directly, or through a reverse proxy or similar?
  4. If it's possible to share logs (or even better, if this is an open source code base), any other clues you have would possibly help immensely.

My guesses here are (in no particular order):

  1. You have a busted client that's trying to upgrade but is somehow failing (again, this may have been happening forever under Cowboy)
  2. You have a client that's trying to hit a websocket upgrade endpoint as if it were a regular HTTP endpoint
  3. There's some oddball combination of protocols here (connecting HTTP to an HTTPS endpoint, etc)
  4. There's a bug in Bandit's upgrade handling (frankly I would be surprised by this; Autobahn is a very comprehensive test suite and we run it on every CI run, but it's certainly possible)

@pvthuyen
Copy link
Author

pvthuyen commented May 22, 2023

Hey @mtrudel, thanks a lot for your prompt response:

To answer your questions:

  1. We're using Phoenix and we do not have any non-standard socket configuration
  2. This is the log I found which I believe they're related to the crahses:
07:44:32.443 [error] GenServer #PID<0.3352.0> terminating
** (stop) "Not a valid WebSocket upgrade request"
Last message: {:continue, :handle_connection}
  1. We're using the phoenix-js npm package for our frontend apps. All connections are from users' browsers, the connection goes through a Cloudflare Load Balancer, then to our Google Cloud Load Balancer, and finally to our VMs. Only the traffic between our GC Load Balancer and our VMs is served on HTTP
  2. We also saw these errors on Javascript frontend around the same time if it helps:
SecurityError
Failed to construct 'WebSocket': Access to address of 'websocket.example.com' is not allowed from 'public' address space.
SecurityError
Failed to construct 'WebSocket': Access to this address is not allowed.

It seems like we only saw this happening on a few users as I cannot reproduce the error on my side.

Thanks a lot for helping us out and please let me know if you need any other information.

@mtrudel
Copy link
Owner

mtrudel commented May 23, 2023

I've added improved logging for the upgrade failure reason in 1.0.0-pre.4 (just released). Give it a try @pvthuyen and see what comes out in the logs (the improved logging will appear in place where ** (stop) "Not a valid WebSocket upgrade request" appeared previously).

@pvthuyen
Copy link
Author

Thanks @mtrudel. All of the crashes now gives ** (exit) "WebSocket upgrade failed: error in upgrade_header check".

@pvthuyen
Copy link
Author

It looks like Cowboy returns 426 responses for requests with wrong Upgrade header and does not crash.
https://github.com/ninenines/cowboy/blob/22adc4de517bf54661795f6e18b6b2e62aec4140/src/cowboy_websocket.erl#L147

@mtrudel
Copy link
Owner

mtrudel commented May 24, 2023

Interesting! 426 is certainly the wrong choice there (the semantics of 426 refer to the current protocol, not the desired protocol outlined in an Upgrade header); 400 continues to be the most reasonable choice, so in terms of response codes we're doing it right (at least IMHO).

In terms of crashing, there's a tricky balance here. If the client is indeed attempting to upgrade to a WebSocket connection then it's not a great idea to attempt to reuse the underlying TCP connection via keepalives, so closing the underlying connection is likely a good idea. We also want to notify via logging that an error occurred so we can debug issues (like this one!). That being said we can do a better job of doing this without crashing; I'll work up a better approach for the next release.

Regardless, it would seem that your upstream client is doing something wrong here (that, or we have a bug!). To diagnose this, you'll want to find out the header values that the broken client(s) are sending when attempting to upgrade, which is probably going to be difficult for you given that you've mentioned it's not reproducible. I'll make sure that the improvements I land in the next version capture this logging as well as possible. Stay tuned.

@pvthuyen
Copy link
Author

Thanks a lot. I think this issues have always been happening for us since I recalled seeing a lot of 426 responses in our services. We're using the phoenix-js npm package on our frontends with no other special configuration so I'm gonna guess it could be because of something on the user's end, e.g. extensions, configuration.

I'm going to ignore this crashes for now to reduce noise on our side first.

@woylie
Copy link

woylie commented Jun 13, 2023

We're seeing a whole bunch of ** (exit) "WebSocket upgrade failed: error in upgrade_header check" errors as well (Bandit 1.0.0-pre.6). I don't know what to do about them, but they do clog up our error logs. If anybody knows how to deal with this, please let me know.

@mtrudel
Copy link
Owner

mtrudel commented Jun 14, 2023

I've pushed improved logging for this to main and have released it as part of 1.0.0-pre.7. Give it a shot and let me know what the improved error messages are; this should help narrow down the culprit!

@woylie
Copy link

woylie commented Jun 15, 2023

Now we're getting this:

** (exit) "WebSocket upgrade failed: error in upgrade_header check: \"Did not find upgrade in websocket\""

@mtrudel
Copy link
Owner

mtrudel commented Jun 15, 2023

Ugh. I switched the needle and the haystack values in the logging output; those logs aren't helpful. Let me cut a new version. Sorry for the trouble.

@mtrudel
Copy link
Owner

mtrudel commented Jun 15, 2023

1.0.0-pre.8 just released with updated logging (again!). Give that a shot.

@woylie
Copy link

woylie commented Jun 16, 2023

Now we get this:

** (exit) "WebSocket upgrade failed: error in upgrade_header check: \"Did not find 'websocket' in ''\"")

🤔

@mtrudel
Copy link
Owner

mtrudel commented Jun 16, 2023

There you go. Your clients aren't sending a proper websocket upgrade (per RFC6455§4.2.1 clause 3, the Upgrade header MUST contain websocket, and your clients are not doing so).

My hunch here is that you've either got you last hop load balancers misconfigured, or (more likely), your clients are hitting a websocket upgrade path as a regular HTTP request. Anything in your logs to suggest either?

@woylie
Copy link

woylie commented Jun 16, 2023

This is a Phoenix/LiveView application. Websockets work fine, so I don't think it's a misconfiguration.

Some clients seem to do GET requests on /live/websocket indeed. Among the user agents are some bots like bingbot or semrush bot, but also mobile safari or LINE. I don't know why the latter would make those requests. The app works fine in mobile safari as well, we only see these errors occasionally.

It seems like this is something outside of our control, so I don't think we need to handle this with exceptions that end up in Sentry. Do you think it would be better if bandit handles this silently?

@mtrudel
Copy link
Owner

mtrudel commented Jun 16, 2023

Some clients seem to do GET requests on /live/websocket indeed. Among the user agents are some bots like bingbot or semrush bot, but also mobile safari or LINE. I don't know why the latter would make those requests. The app works fine in mobile safari as well, we only see these errors occasionally.

This is very much expected for normal WebSocket upgrade calls (after all, all WebSockets start as GETs). The problem is what happens when a client connects expecting a regular HTTP response, but the server mistakenly tries to upgrade the connection to a WebSocket. This is what you're seeing here.

This is happening because Phoenix/WebsockAdapter is pretty loosey-goosey with what it tries to upgrade (basically, it tries to upgrade any GET request to a designated socket endpoint, as codified here). What it should be doing is actually validating the clauses in the RFC6455§4.2.1 handshake before signalling an upgrade to the underlying server, because once such an upgrade is signalled there's no way for the user to express any control over what happens in the case of a failed upgrade.

In light of this Bandit chooses to fail upgrade requests loudly (as you're seeing), in keeping with our stated goal of not codifying policy. Beyond this particular issue, this goal is proving to be a bit of a pain as it's exposing places where the Phoenix / Plug / Cowboy interface was leaking and nobody noticed since Cowboy's usual behaviour is to silently fail (examples #144, #106, #101). Thanks for bearing with the pain of helping to plug (ha!) all these leaky abstractions.

I'm going to close this issue and start on a PR to add the relevant improvements described above to WebSockAdapter. I'll keep this issue linked to those changes.

Thanks again for your patience and diagnostic work on this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants