Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

shell: Command hangs when --conda is used and the runtime has not been set up #363

Open
victorlin opened this issue Mar 19, 2024 · 13 comments
Labels
bug Something isn't working

Comments

@victorlin
Copy link
Member

victorlin commented Mar 19, 2024

Current Behavior

I thought I had the Conda runtime set up on my computer so I used nextstrain shell --conda. It resulted in a process that hung indefinitely using 100% CPU even upon closing the terminal. I was able to terminate the process through kill -9 <PID> (SIGTERM) and Activity Monitor's "Force Quit" option.

Console output:

$ nextstrain check-setup
…
# conda is not supported
✔ yes: operating system is supported
✔ yes: Rosetta 2 is enabled
✔ yes: runtime data dir doesn't have spaces
✘ no: runtime appears set up

  The Conda runtime appears supported but not yet set up.
  Try running `nextstrain setup conda` first.
…


$ nextstrain shell --conda .
Entering the Nextstrain runtime (conda)

Run the command "exit" to leave the runtime.

^C^C^C^C^C^C

Closed that terminal and opened another. Here you can see it's been running for 4 minutes.

$ ps -ax | grep bash
37226 ??         4:14.93 bash --rcfile /var/folders/ld/41vgpmnj5t7bgmj5b4cn8l9h0000gn/T/tmp6w1u7o58bashrc

Expected behavior

This is partly user error since --conda is useless without the runtime being set up. But the command should at least be easily terminable, or even better exit with a useful error saying that the Conda runtime has not been set up yet.

How to reproduce

Steps to reproduce the current behavior:

  1. Run nextstrain shell --conda . in an environment without the Conda runtime available.
  2. Try to quit.

Your environment: if running Nextstrain locally

Hardware: macOS 14.3.1 Apple M2 chip

$ nextstrain --version
nextstrain.cli 8.2.0
@victorlin victorlin added the bug Something isn't working label Mar 19, 2024
@joverlee521
Copy link
Contributor

I'm curious why it just hangs 🤔

Tangentialy, it seems like the error message for nextstrain build is not super informative when the runtime hasn't been set up either:

$ nextstrain build --conda .
Error executing into ['snakemake', '--printshellcmds', '--cores=all']: [Errno 2] No such file or directory: b'/usr/bin/snakemake'

even better exit with a useful error saying that the Conda runtime has not been set up yet.

I think this we can accomplish this by checking runtime setup before trying to run anything.

diff --git a/nextstrain/cli/runner/__init__.py b/nextstrain/cli/runner/__init__.py
index 0aaefd0..6bc1388 100644
--- a/nextstrain/cli/runner/__init__.py
+++ b/nextstrain/cli/runner/__init__.py
@@ -14,7 +14,7 @@ from .. import config, env, hostenv
 from ..argparse import DirectoryPath, SKIP_AUTO_DEFAULT_IN_HELP
 from ..errors import UserError
 from ..types import EllipsisType, Env, Options, RunnerModule
-from ..util import prose_list, runner_name, runner_module, runner_help, warn
+from ..util import prose_list, runner_name, runner_module, runner_help, warn, runner_tests_ok, print_runner_tests
 from ..volume import NamedVolume
 
 
@@ -277,6 +277,13 @@ def run(opts: Options, working_volume: NamedVolume = None, extra_env: Env = {},
             **dict(hostenv.forwarded_values()),
             **extra_env }
 
+    tests = opts.__runner__.test_setup()
+    if not runner_tests_ok(tests):
+        print(f"The {runner_name(opts.__runner__)} is not supported.")
+        print()
+        print_runner_tests(tests)
+        return 1
+
     return opts.__runner__.run(opts, argv, working_volume = working_volume, extra_env = extra_env, cpus = cpus, memory = memory)

@tsibley
Copy link
Member

tsibley commented Apr 18, 2024

I think this we can accomplish this by checking runtime setup before trying to run anything.

Sure, but probably not with that patch as-is? Runner tests sometimes still produce a bunch of junk output to stdout (should fix that!) and they're also sometimes slow (seconds instead of milliseconds). I'd suggest a more cursory check for "is it good?" instead as the guard, or perhaps running the full tests only on failure to actually run, as part of better error reporting.

@tsibley
Copy link
Member

tsibley commented Apr 18, 2024

The hanging is interesting. Perhaps something about macOS or something about your specific system?

If I try I get a shell, but of course no runtime is available (bad, but better than hang):

$ NEXTSTRAIN_HOME=/tmp/bogus nextstrain shell --conda /var/empty
Entering the Nextstrain runtime (conda)

Run the command "exit" to leave the runtime.

$ echo $PATH
/tmp/bogus/runtimes/conda/env/bin:/bin:/usr/bin
$ stat /tmp/bogus/runtimes/conda/env/bin
stat: cannot stat '/tmp/bogus/runtimes/conda/env/bin': No such file or directory

@victorlin
Copy link
Member Author

I'm on a different computer now (macOS 13.6 Apple M1 chip) and can't reproduce. I'll try again on my home computer which is where I observed this initially. Is there any extra logging that would be helpful here, maybe something around these lines?

try:
subprocess.run(argv, env = env, check = True)
except (OSError, subprocess.CalledProcessError) as err:
raise InternalError(f"Error running {argv!r}") from err

@tsibley
Copy link
Member

tsibley commented Apr 19, 2024

Well, that snippet of code is in nextstrain.cli.runner.conda.micromamba(), not .run(), but I think you probably meant something like:

return exec_or_return(argv, {**extra_env, **EXEC_ENV})

cli/nextstrain/cli/util.py

Lines 322 to 328 in 25075fa

# Use a POSIX exec(3) for file descriptor and signal handling…
if os.name == "posix":
try:
os.execvpe(argv[0], argv, env)
except OSError as error:
warn("Error executing into %s: %s" % (argv, error))
return 1

If the hang is somewhere in Python before the exec, then logging around it could be helpful. But I suspect it's after the exec and we're no longer in Python. What output did you see before it hung? Anything? That would be a useful clue.

Would also be useful to enable existing debug logging (which is minimal, but a little relevant here), e.g.

NEXTSTRAIN_DEBUG=1 nextstrain shell --conda .

Beyond that, what I'd reach for first is process inspection (from the outside). i.e. looking at the process tree, the output of lsof on the process(es), strace on the process(es) (well, dtruss on macOS I guess), etc.

@victorlin
Copy link
Member Author

Thanks, I'll poke around on my home computer sometime and report back.

Output is in the issue description – I had it collapsed but just edited so it's more visible.

@tsibley
Copy link
Member

tsibley commented Apr 19, 2024

Ah! Thanks. I clearly missed it! 🙃

So seeing bash --rcfile /var/folders/ld/41vgpm… in the process list means it exec'd out of Python and into bash. I wonder what the process was up to.

@victorlin
Copy link
Member Author

I was able to reproduce on my home computer, but only by removing ~/.nextstrain/runtimes/conda/. Setting NEXTSTRAIN_HOME=/tmp/bogus fails to reproduce. It's what I tried on my work computer, so I'm hopeful that I can go into work later and reproduce by removing the runtime directory, then we can investigate further.

Here's the debug output:

$ NEXTSTRAIN_DEBUG=1 nextstrain shell --conda /var/empty 
DEBUG: Resolved /var/empty to /private/var/empty
DEBUG: Looking for nextstrain-pathogen.yaml as pathogen root dir marker
DEBUG: /private/var/empty/nextstrain-pathogen.yaml: does not exist
DEBUG: /private/var/nextstrain-pathogen.yaml: does not exist
DEBUG: /private/nextstrain-pathogen.yaml: does not exist
DEBUG: /nextstrain-pathogen.yaml: does not exist
DEBUG: Using /private/var/empty as build volume
DEBUG: Using /private/var/empty as working (build) volume
Entering the Nextstrain runtime (conda)

Run the command "exit" to leave the runtime.

DEBUG: exec_or_return(['bash', '--rcfile', '/var/folders/ld/41vgpmnj5t7bgmj5b4cn8l9h0000gn/T/tmp7ma5b0iabashrc'], {'NEXTSTRAIN_PS1': '\\[\\e[1;97m\\]\\[\\e[48;2;67;119;205m\\] \\[\\e[48;2;67;119;205m\\]N\\[\\e[48;2;80;151;186m\\]e\\[\\e[48;2;99;172;154m\\]x\\[\\e[48;2;124;184;121m\\]t\\[\\e[48;2;154;190;92m\\]s\\[\\e[48;2;185;188;74m\\]t\\[\\e[48;2;212;177;63m\\]r\\[\\e[48;2;228;153;56m\\]a\\[\\e[48;2;230;112;48m\\]i\\[\\e[48;2;222;60;38m\\]n\\[\\e[48;2;222;60;38m\\] \\[\\e[0m\\] \\w\\[\\e[38;2;255;0;255m\\] \\$ \\[\\e[0m\\]', 'NEXTSTRAIN_HISTFILE': '/Users/victor/.nextstrain/shell-history', 'PATH': '/Users/victor/.nextstrain/runtimes/conda/env/bin:/bin:/usr/bin', 'PYTHONHOME': None, 'PYTHONPATH': None, 'PYTHONUSERBASE': '/Users/victor/.nextstrain/runtimes/conda/python-user-base', 'PYTHONNOUSERSITE': '1'})

and the same with NEXTSTRAIN_HOME=/tmp/bogus which is successful:

$ NEXTSTRAIN_DEBUG=1 NEXTSTRAIN_HOME=/tmp/bogus nextstrain shell --conda /var/empty
DEBUG: Resolved /var/empty to /private/var/empty
DEBUG: Looking for nextstrain-pathogen.yaml as pathogen root dir marker
DEBUG: /private/var/empty/nextstrain-pathogen.yaml: does not exist
DEBUG: /private/var/nextstrain-pathogen.yaml: does not exist
DEBUG: /private/nextstrain-pathogen.yaml: does not exist
DEBUG: /nextstrain-pathogen.yaml: does not exist
DEBUG: Using /private/var/empty as build volume
DEBUG: Using /private/var/empty as working (build) volume
Entering the Nextstrain runtime (conda)

Run the command "exit" to leave the runtime.

DEBUG: exec_or_return(['bash', '--rcfile', '/var/folders/ld/41vgpmnj5t7bgmj5b4cn8l9h0000gn/T/tmpudxmics7bashrc'], {'NEXTSTRAIN_PS1': '\\[\\e[1;97m\\]\\[\\e[48;2;67;119;205m\\] \\[\\e[48;2;67;119;205m\\]N\\[\\e[48;2;80;151;186m\\]e\\[\\e[48;2;99;172;154m\\]x\\[\\e[48;2;124;184;121m\\]t\\[\\e[48;2;154;190;92m\\]s\\[\\e[48;2;185;188;74m\\]t\\[\\e[48;2;212;177;63m\\]r\\[\\e[48;2;228;153;56m\\]a\\[\\e[48;2;230;112;48m\\]i\\[\\e[48;2;222;60;38m\\]n\\[\\e[48;2;222;60;38m\\] \\[\\e[0m\\] \\w\\[\\e[38;2;255;0;255m\\] \\$ \\[\\e[0m\\]', 'NEXTSTRAIN_HISTFILE': '/private/tmp/bogus/shell-history', 'PATH': '/tmp/bogus/runtimes/conda/env/bin:/bin:/usr/bin', 'PYTHONHOME': None, 'PYTHONPATH': None, 'PYTHONUSERBASE': '/tmp/bogus/runtimes/conda/python-user-base', 'PYTHONNOUSERSITE': '1'})

The default interactive shell is now zsh.
To update your account to use zsh, please run `chsh -s /bin/zsh`.
For more details, please visit https://support.apple.com/kb/HT208050.
 Nextstrain  /private/var/empty $ 

@victorlin
Copy link
Member Author

victorlin commented Apr 23, 2024

Can reproduce on my work computer and the blab Mac Mini, but not a GitHub Codespace.

  • Work computer uses bash 5.2.15 installed via Homebrew uses bash 3.2.57 pre-installed on macOS 13.6.
  • blab Mac Mini uses bash 3.2.57 pre-installed on macOS 12.6.8.
  • GitHub Codespace uses bash 5.0.17 pre-installed on Ubuntu 20.04.

@victorlin
Copy link
Member Author

We had a debugging session this morning. Here's a reproducible example:

cat >bad-histfile <<~~
#anything here
ls
~~

cat >bad-rcfile <<~~
HISTSIZE=-1
~~

HISTFILE=bad-histfile /bin/bash --rcfile bad-rcfile

@victorlin
Copy link
Member Author

victorlin commented Apr 23, 2024

Scratch the comment that my work computer uses bash 5.2.15. It does in my normal startup shell (zsh) that adds /opt/homebrew/bin to PATH in .zshrc. In the exec, PATH is set to /Users/victor/.nextstrain/runtimes/conda/env/bin:/bin:/usr/bin which uses /bin/bash, version 3.2.57.

A negative value for HISTSIZE is only supported in bash 4.3 and above, though the behavior of bash 3.2 hanging at 100% CPU when given a negative value is still noteworthy and mentioned elsewhere.

@tsibley
Copy link
Member

tsibley commented Apr 23, 2024

Ah, really good to know that 5.2.15 isn't actually affected. I can repro the issue on Linux with Bash 3.2.57. This lets me strace it, which shows that immediately after reading the history file it enters an infinite loop of segfaults, each of which invokes its registered SIGSEGV handler.

We could adjust HISTSIZE to not trigger this on Bash 3.2.57, but our Conda runtime will have a newer Bash where it's not an issue, so the real fix here is still just that we should error out ourselves when the runtime isn't available.

@victorlin
Copy link
Member Author

The flip side to this where it "works" on newer versions of bash can also be considered a bug - the shell appears to enter a conda runtime that doesn't exist. This is also solved by erroring when the runtime isn't available.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants