Warp busy-loops on EMFILE #928
The behaviour of warp before the change from #830 was to leave new clients hanging when `EMFILE` was hit. This was bad because those clients would hang without getting any error. From this perspective, the change from #830 made sense: it allows Warp to accept new clients, resolving their hangs as soon as FDs become free. So I'm in general in favour of that change from #830; it's just that the implementation isn't great (busy-looping without sleep).
This is made worse by the fact that Warp does not detect client disconnects (#196). It means that the FD of a client that has given up is not freed.

For example, consider the common scenario that you have a Warp+Postgres web app. For uptime checking, you use e.g. AWS Route53 health checks, Consul health checks, or some uptime checking service such as Pingdom. These HTTP checks usually have some HTTP timeout set, e.g. 5 seconds. Let's say you implement a health-check route whose handler queries the DB.

Assume that when the DB goes down, the Postgres connection from the Warp request handler blocks. Then the health check will query that route, time out after 5 seconds, and disconnect. But Warp does not notice the disconnect. It leaks the FD, forever (until the default TCP timeout of 3600 seconds). (The slowloris timeout does not help with this, as the client has already fully sent the request, and it's now on the server to produce a response.) If your health checks come in once per second, you'll exceed 1024 leaked FDs in less than half an hour.

Other web servers do not have this problem.
@nh2 Thank you for analyzing.
@nh2 gentle ping
@kazu-yamamoto Unfortunately I haven't had the time yet to look into the implementation. I'll post here in case I start it!
I'm still getting this on warp-3.3.31 and it's pretty devastating to see the server busy-loop and take down production.
@kazu-yamamoto I had time to look at this again. I think that simply reverting that commit, as done in PR #933, will not improve the situation: it will just re-introduce the issues mentioned in #928 (comment) (which describes the issues without that commit). I suspect we really have to take the approach from "My recommendations" in #928 (comment).
This is a follow-up to #830. Adding in subscribers from there: @snoyberg @domenkozar @kazu-yamamoto @ProofOfKeags.
The retry logic added in https://github.com/yesodweb/wai/pull/831/files does not make sense to me.
It is present in `warp >= 3.3.19`.

As @snoyberg wrote in #830 (comment), that code will busy-loop.
The busy-loop is:
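As a sketch (a simulation with a fake `accept()` that raises an `EMFILE`-style error; this is illustrative and not Warp's actual source), the spinning retry pattern looks like this:

```haskell
import Control.Exception (try)
import Data.IORef
import System.IO.Error (fullErrorType, isFullError, mkIOError)

-- Simulated accept(): fails with an EMFILE-style "resource exhausted"
-- IOError a number of times, then succeeds, so this demo terminates.
simulatedAccept :: IORef Int -> IO Int
simulatedAccept remainingFailures = do
  n <- readIORef remainingFailures
  if n > 0
    then do
      writeIORef remainingFailures (n - 1)
      ioError (mkIOError fullErrorType "accept" Nothing Nothing)
    else pure 42  -- stands in for a successfully accepted connection

-- The problematic pattern: retry immediately on EMFILE, with no sleep.
acceptRetryLoop :: IORef Int -> IORef Int -> IO Int
acceptRetryLoop remainingFailures spins = do
  r <- try (simulatedAccept remainingFailures)
  case r of
    Right conn -> pure conn
    Left e
      | isFullError e -> do
          modifyIORef' spins (+ 1)  -- each retry burns CPU
          acceptRetryLoop remainingFailures spins
      | otherwise -> ioError e

main :: IO ()
main = do
  remainingFailures <- newIORef (100000 :: Int)
  spins <- newIORef (0 :: Int)
  _conn <- acceptRetryLoop remainingFailures spins
  n <- readIORef spins
  putStrLn ("retried " ++ show n ++ " times without sleeping")
```

Because `EMFILE` fails instantly, every retry happens back-to-back: with a real socket the loop spins at 100% CPU until an FD becomes free.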
Here is a minimal example that demonstrates it.
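A reconstruction of what such a `Main.hs` could look like (my assumption based on the setup described below, using the `warp`, `wai`, and `http-types` packages; the issue's actual file may differ):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Concurrent (threadDelay)
import Control.Monad (replicateM_)
import Network.HTTP.Types (status200)
import Network.Wai (responseLBS)
import Network.Wai.Handler.Warp (run)
import System.Environment (getArgs)
import System.IO (IOMode (ReadMode), openFile)

main :: IO ()
main = do
  -- How many FDs to leave free below the 1024 `open files` limit;
  -- `stack Main.hs 0` leaves none, `stack Main.hs 1` leaves one.
  [nFree] <- map read <$> getArgs
  -- Open files until (almost) all of the 1024 FDs are in use. The
  -- constant 10 is a rough allowance for stdin/stdout/stderr and
  -- GHC RTS descriptors (an assumption, not exact).
  replicateM_ (1024 - 10 - nFree) (openFile "/dev/null" ReadMode)
  -- The request handler holds its connection FD for 1 second.
  run 1234 $ \_req respond -> do
    threadDelay 1000000
    respond (responseLBS status200 [] "hello")
```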
We set the `open files` limit to 1024, and then open a couple of files until we have 1024 open files. Then we start Warp. (`lts-20.18` has `warp-3.3.23`, so it has the code from #831 in, which was released as part of `warp-3.3.19`.)

Observing the run of the above with `stack Main.hs 0`, and attaching with `sudo strace -fyp "$(pidof Main)"`, we can observe the busy-loop.

This is bad:
If you run it in `ghci` with `:main` and press Ctrl+C there, nothing happens; another Ctrl+C will bring you back to the ghci prompt, but the `accept4()` busy loop will go on in the background.

Further, the issue also happens when there are still FDs free when warp starts.
Run `stack Main.hs 1` instead of `stack Main.hs 0` to leave one FD free. Now warp starts without error. But when you run `curl 'http://localhost:1234'`, during the time the request handler runs (1 second as per my `threadDelay`), warp busy-loops around `EMFILE`.

This means the issue can also occur any time during production usage: as soon as any request touches the open files limit, busy looping starts, and clients like `curl` hang until warp gets a free FD.

Further bad is:
in this state, slowloris protection (`setTimeout`) is somewhat defunct, and a `curl` may hang for much longer than the value you give to `setTimeout`.
But you cannot `close()` a socket that you haven't `accept()`ed; anything that sits in the `listen(..., int backlog)` queue to my knowledge cannot be ejected, so we probably cannot do anything against that. (One could set `backlog = 1` to reject further connections, but from a quick search it isn't guaranteed that the OS accepts that hint, and some places document that setting a small backlog can result in less web serving performance -- probably in the case that the userspace program cannot `accept()` fast enough.)

`curl` hanging is problematic when building reliable systems. It is better when web servers can tell that they are too busy, e.g. by returning HTTP `503 Service Unavailable`, which is a way to indicate that a server is overloaded. Many web servers have a setting for the maximum number of connections they will accept, independent of (but necessarily lower than) the max number of open files. For example, `nginx` has `worker_connections`, so that it can return `503 Service Unavailable` when those are reached, before it gets into the problem of hitting `EMFILE`. Ideally warp should have a setting for this, too, and count how many request handlers are currently active. That setting could be defaulted to be reasonably less than `getResourceLimit ResourceOpenFiles`. Of course it is not perfect, since `warp`-the-library may run alongside other Haskell threads that open files, or the request handler might open more files, so hitting a hang in `curl` due to `EMFILE` cannot be entirely avoided (unless one patches the entire system to count every FD opened, to add this number to the threshold).

But such a setting could still be very effective in practice: if the `rlimit` is 10k open files, Warp could set this `maxConnections` setting to e.g. 5k, so there's still a lot of headroom before hitting a hang. This would adopt nginx's philosophy.

My recommendations
1. Upon `EMFILE` from `accept()`, there should be some configurable `retryAcceptDelay` between `accept()` retries.
2. `accept()` should be retried once the `retryAcceptDelay` has passed, or another warp request handler has finished (which will free up an FD).
3. Warp should get a `maxConnections` setting; exceeding it should return `503 Service Unavailable`.
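The proposed retry delay and `maxConnections` admission check can be sketched in base-only Haskell (`retryAcceptDelay` and `maxConnections` are the hypothetical settings proposed above, not existing Warp API; `fakeAccept` simulates `accept()` hitting `EMFILE`):

```haskell
import Control.Concurrent (threadDelay)
import Control.Exception (try)
import Data.IORef
import System.IO.Error (fullErrorType, isFullError, mkIOError)

-- On EMFILE, sleep for retryAcceptDelay (in microseconds) before
-- retrying accept(), instead of busy-looping.
acceptWithDelay :: Int -> IO a -> IO a
acceptWithDelay retryAcceptDelayUs acceptAction = loop
  where
    loop = do
      r <- try acceptAction
      case r of
        Right conn -> pure conn
        Left e
          | isFullError e -> threadDelay retryAcceptDelayUs >> loop
          | otherwise     -> ioError e

-- A maxConnections-style admission counter; when tryAcquire returns
-- False, the server would answer 503 instead of taking on more work.
data Limiter = Limiter { limCount :: IORef Int, limMax :: Int }

newLimiter :: Int -> IO Limiter
newLimiter mx = (`Limiter` mx) <$> newIORef 0

tryAcquire :: Limiter -> IO Bool
tryAcquire (Limiter ref mx) =
  atomicModifyIORef' ref (\n -> if n < mx then (n + 1, True) else (n, False))

release :: Limiter -> IO ()
release (Limiter ref _) = atomicModifyIORef' ref (\n -> (max 0 (n - 1), ()))

main :: IO ()
main = do
  -- A fake accept() that fails with EMFILE twice, then succeeds.
  failures <- newIORef (2 :: Int)
  let fakeAccept = do
        n <- readIORef failures
        if n > 0
          then writeIORef failures (n - 1)
                 >> ioError (mkIOError fullErrorType "accept" Nothing Nothing)
          else pure ()
  acceptWithDelay 1000 fakeAccept  -- sleeps 1 ms between the two retries

  lim <- newLimiter 2
  rs <- sequence [tryAcquire lim, tryAcquire lim, tryAcquire lim]
  release lim
  r4 <- tryAcquire lim
  print (rs ++ [r4])
```

The third `tryAcquire` exceeds the limit of 2 and is refused (which would map to `503 Service Unavailable`); after one `release`, admission succeeds again.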