This repository has been archived by the owner on Sep 20, 2018. It is now read-only.

Forcing a connection close() #45

Open
alfredodeza opened this issue Aug 14, 2013 · 10 comments

@alfredodeza
Contributor

We are having some issues when doing remote subprocess calls (actually using subprocess.check_call): as soon as we hit the close() method on the connection object, it blocks forever.

The (partial) reason for this is that the message streams acquire a lock, and since there are no timeouts on closing, it prevents all threads from closing the connection and deadlocks.

This is extremely severe for us, as we can't do anything but Ctrl-C the command to exit.

On the actual machines being called, we see a bunch of these processes lying around after a while:

root     36535  0.1  0.0 325188  8360 pts/4    Sl   12:03   0:00 python -u -c exec reduce(lambda a,b: a+b, map(chr,...

All the remote subprocess command does is start a service; that service in turn backgrounds a script call (something like bash script.sh 2> /dev/null &).

That script kicks off a long-running process which might never complete, but the user does not care: it should be fire-and-forget, after which we close the connection.

When starting that service manually on the remote machine, everything works as expected. There are no errors, tracebacks, or exceptions; it just deadlocks.

The actual place where the deadlock occurs is in this file:

pushy-0.5.3-py2.6.egg/pushy/protocol/baseconnection.py

In the MessageStream class, in the close() method when trying to do this:

self.__lock.acquire()

Is there any way I could force the connection to close? I am no longer sure what else to try.

@axw
Member

axw commented Aug 15, 2013

@alfredodeza subprocesses are perilous when it comes to Pushy; you need to make sure that the subprocess does not inherit anything that will cause it to interact with the proxying. That includes I/O, as there is a background thread on the target that reads the redirected I/O and forwards it back to the client.
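To illustrate the failure mode with a minimal local sketch (this is not Pushy's code; sleep stands in for any long-lived subprocess): a child that inherits the write end of a pipe keeps the pipe open, so a reader on the other end never sees EOF.

```python
import os
import select
import subprocess

# A child that inherits the write end of a pipe keeps the pipe open,
# so the parent reading the other end never sees EOF.
r, w = os.pipe()
os.set_inheritable(w, True)  # Python 3 fds are non-inheritable by default

# close_fds=False lets the child inherit our inheritable descriptors
child = subprocess.Popen(["sleep", "5"], close_fds=False)
os.close(w)  # the parent drops its copy of the write end...

# ...but the read end still does not report EOF: the child's inherited
# copy of `w` is holding the pipe open (empty list = nothing to read).
readable, _, _ = select.select([r], [], [], 0.2)
print("EOF seen:", bool(readable))  # -> EOF seen: False

child.terminate()
child.wait()
os.close(r)
```

The same mechanism applies to the RPC channel's descriptors: as long as the forked service holds a copy, the reader on the other side blocks.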

If you can either point me at the offending code, or better yet, provide a self contained minimal reproducer, then I will hopefully be able to provide some concrete suggestions.

@alfredodeza
Contributor Author

@axw we have a few helpers around pushy to make it easier to run remote functions, so the workflow can be a bit hard to follow.

What we are essentially forced to do (and I can confirm is what causes all of this) is assign StringIO() objects to connection.modules.sys.stdout and connection.modules.sys.stderr, because stderr and stdout on the server block whenever you attempt readlines() (or loop over readline()) until there is no more content.

Using StringIO() is not a problem; it actually works really well. You can see how we do this in this context manager:
https://github.com/ceph/ceph-deploy/blob/master/ceph_deploy/util/context.py#L36-37
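For reference, a local single-process analogue of that swap (a simplified sketch of the linked helper's idea, not the ceph-deploy code itself) looks like this:

```python
import sys
from contextlib import contextmanager
from io import StringIO

# Simplified local analogue of the linked helper: temporarily replace
# a stream with a StringIO so output is captured instead of written.
@contextmanager
def captured_stdout():
    old = sys.stdout
    sys.stdout = StringIO()
    try:
        yield sys.stdout
    finally:
        sys.stdout = old  # always restore the original stream

with captured_stdout() as out:
    print("hello from the remote side")

print(out.getvalue().strip())  # the captured text, read after restore
```

The ceph-deploy version does the same thing, except the streams being swapped live on the remote side (connection.modules.sys.stdout/stderr).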

However, the particular scenario where this blocks forever, triggering the problem described in this ticket, is when that init.d service starts and forks this call, and the forked call doesn't complete.

That forked call is also writing to stderr, but we are redirecting it to /dev/null (as you can see in the example in the description).

If that script completes normally, the whole process is able to finish and I can close the connection without blocking. If it doesn't, pushy sits waiting for it to complete.

Capturing the remote stderr and stdout is imperative for our processes, but for this specific scenario, we know we have completed everything and we want to close all connections to the remote ends.

Is there really no way to do this?

@axw
Member

axw commented Aug 15, 2013

@alfredodeza Can you please try creating remote StringIOs and assigning to stdout/stderr? That way the subprocess shouldn't ever be interacting with a proxied object. i.e.
self.client.modules.sys.stdout = self.client.modules.StringIO.StringIO()
self.client.modules.sys.stderr = self.client.modules.StringIO.StringIO()

@alfredodeza
Contributor Author

Yes, I did try that, with the same result: the connection hangs and I can't close it :(

@axw
Member

axw commented Aug 16, 2013

Sorry, I'll need a reproducing test case then to look into it further. If you could provide something minimal that'd be great; otherwise I will attempt it myself when I have some spare time.

@axw
Member

axw commented Aug 17, 2013

@alfredodeza Can you please try passing close_fds=True to subprocess.Popen? I created a program that does something similar to what you describe, and the connection is locking up because the forked process is inheriting and keeping the RPC channel open. Passing close_fds fixed it for my case.
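For comparison, a minimal local sketch (not Pushy's code) of what close_fds=True changes: when the child gets no stray copy of a pipe's write end, the read end reports EOF as soon as the parent closes its copy.

```python
import os
import select
import subprocess

# With close_fds=True the child receives no copy of the write end,
# so EOF arrives the moment the parent closes its own copy.
r, w = os.pipe()
os.set_inheritable(w, True)

child = subprocess.Popen(["sleep", "5"], close_fds=True)
os.close(w)  # now the last write end is gone

# The read end becomes readable immediately, signalling EOF.
readable, _, _ = select.select([r], [], [], 0.2)
print("EOF seen:", bool(readable))  # -> EOF seen: True

child.terminate()
child.wait()
os.close(r)
```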

@alfredodeza
Contributor Author

The only way I can get this to work (other than not capturing stdout/stderr) is by not doing a check_call or call but a plain subprocess.Popen.

As soon as .wait() is involved (as it is in check_call and call) it blocks.

I tried using close_fds=True in all my attempts and they still failed.
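A sketch of that fire-and-forget pattern (modern Python 3 shown; script.sh is a stand-in for the init script from the description):

```python
import subprocess

# Fire-and-forget: start the service and never wait() on it.
# "script.sh" stands in for the init script from the description.
proc = subprocess.Popen(
    ["bash", "script.sh"],
    stdin=subprocess.DEVNULL,
    stdout=subprocess.DEVNULL,  # nothing to proxy back over the connection
    stderr=subprocess.DEVNULL,
    close_fds=True,             # drop every other inherited descriptor
    start_new_session=True,     # detach from our process group
)
# Deliberately no proc.wait() here (and no check_call/call, which wait
# internally): waiting is exactly what blocks.
```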

@axw
Member

axw commented Aug 20, 2013

@alfredodeza It seems I cannot reproduce this problem based on your descriptions alone. I've managed to get something similar (again), but in my case it blocks even with Popen; it blocks because the subprocess inherits the I/O redirector pipe, which stops the server from exiting until the subprocess exits. However, it's not blocking on MessageStream.close.

I will need a reproducing test case to continue analysing.

@alfredodeza
Contributor Author

That sounds similar. I mentioned MessageStream.close because, when stepping through the code to .close() the connection, that is the exact place where it blocks indefinitely.

Maybe the fact that I am stepping through the code has nothing to do with the actual problem. I would be very interested to know if your fix for this could solve my problem.

I have attempted numerous times over the past few days to create a test case for you, but it has been really hard and I have not been able to.

The only way I have of reproducing this is by using ceph-deploy directly on a remote server that calls the init script, which is absolutely not reproducible outside of that environment.

@axw
Member

axw commented Aug 20, 2013

Maybe the fact that I am stepping through the code has nothing to do with the actual problem. I would be very interested to know if your fix for this could solve my problem.

I don't have a "fix" for this, if it truly is the same issue. If you spawn a subprocess that inherits the stdout/stderr file descriptors, this issue will present itself. If you want to capture the output without blocking the connection, you have to create your own pipes and threads to read the output into the StringIO objects.
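A minimal sketch of that pipes-and-threads approach (my own illustration in modern Python, not a Pushy API):

```python
import subprocess
import threading
from io import StringIO

# Capture a subprocess's output via our own pipes and reader threads,
# so the RPC connection never owns the descriptors the child writes to.
def run_captured(cmd):
    out_buf, err_buf = StringIO(), StringIO()
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        close_fds=True,
        text=True,
    )

    def drain(pipe, buf):
        # read until the writer side closes, then let the thread exit
        for line in pipe:
            buf.write(line)
        pipe.close()

    threads = [
        threading.Thread(target=drain, args=(proc.stdout, out_buf)),
        threading.Thread(target=drain, args=(proc.stderr, err_buf)),
    ]
    for t in threads:
        t.start()
    rc = proc.wait()
    for t in threads:
        t.join()
    return rc, out_buf.getvalue(), err_buf.getvalue()

rc, out, err = run_captured(["echo", "done"])
print(rc, out.strip())  # -> 0 done
```

The reader threads exit on their own when the child closes its end of each pipe, so nothing here depends on the connection's own I/O redirection.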
