ipython error when trying to submit jobs on AWS cluster: ERROR | Controller start failed #158
Thomas;
This is failing for some reason, but unfortunately ipython swallows the error message. Hopefully doing that will give you useful output that helps us identify the issue. Thanks much.
Thanks a ton for your instantaneous reply. I will submit the script and report back!
When I submitted the job via
it was clear that the SLURM script requested more memory than is available. It includes the following line:
but each c3.large instance only provides
Manually editing the script or switching to larger EC2 instances fixes the issue, thanks a lot for pointing out how to troubleshoot! One more question: are the system requirements (e.g. RAM) for the different pipelines or tools documented somewhere? Or perhaps you have recommendations as to which instance type(s) to use for real datasets?
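As a sketch of the troubleshooting step above, the memory requested in the generated batch script can be compared against the node's RAM before submitting. The script name, the 7168 MB request, and the 3750 MB figure for a c3.large are illustrative assumptions, not values taken from this thread:

```shell
#!/bin/sh
# Create a stand-in for the generated batch script (contents are an assumption).
cat > submit_example.sh <<'EOF'
#!/bin/bash
#SBATCH --mem=7168
EOF

# Extract the memory request (MB) and compare it with the node's usable RAM.
requested=$(awk -F= '/^#SBATCH --mem=/ {print $2}' submit_example.sh)
available=3750   # rough usable memory on a c3.large, in MB (assumption)

if [ "$requested" -gt "$available" ]; then
    echo "over-request: ${requested} MB asked, ${available} MB available"
fi
```

If the check fires, either lower the `--mem=` line by hand or pick an instance type with more RAM, as described above.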
Thomas; Regarding resource usage, this is hard to give a ballpark on without more details about what you're trying to run. We typically do not run clusters on AWS for smaller numbers of samples, since you can get pretty high-scale machines with balanced CPU/memory using the m4 series (m4.4xlarge = 16 cores, m4.10xlarge = 40 cores, m4.16xlarge = 64 cores). This saves the overhead of dealing with SLURM and a shared filesystem, and also lets you stop/start as needed and use spot instances more easily. This is not as automated, but we have documentation and ansible scripts to help set it up: https://github.com/chapmanb/bcbio-nextgen/tree/master/scripts/ansible Happy to help more with that if that seems like a more cost-effective approach for your work.
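For the single-machine setup suggested here, the invocation drops the scheduler flags entirely and just passes a local core count. A sketch, with a hypothetical project config path and a core count matching an m4.4xlarge (both assumptions):

```shell
#!/bin/sh
# Hypothetical single-machine bcbio run: no SLURM/ipython flags, just local cores.
# The config path and the core count are illustrative assumptions.
cmd="bcbio_nextgen.py ../config/project.yaml -n 16"
echo "$cmd"
```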
Thanks a lot, especially for the pointer to the ansible script! I hadn't been aware of this simplified approach yet.
I am following the instructions on Mac OS X, but I am running into an error (see below).
P.S.: setting AWS_DEFAULT_REGION doesn't help.
Thomas; Please let us know if you run into any other problems. The ansible scripts are still a work in progress, so happy for feedback on pain points and issues.
Great! Thanks for updating the ansible playbook so quickly. I think the
With this modification, the instance is started and the volume is added.
I checked the
Thanks again for looking into this! Please let me know what would be useful for you, e.g. whether I can test things out for you.
- Alternative approach to avoiding ssh manual checking by setting ansible ssh options on launched hosts
- Adds required region argument to ec2_vol mount
Fixes bcbio/bcbio-nextgen-vm#158
Thomas;
I have successfully followed the instructions and created a cluster (1 head node + 2 compute nodes, each a c3.large instance) in the us-west-1 region on AWS.
I can successfully log into the head node and start a small RNA-seq workflow on the node itself:
But when I try to submit the same workflow to the worker nodes, it seems that the ipython controller fails. I can see the submission job and a
bcbio-c
job in the queue, but the latter fails immediately. Here is the content of the
SLURM_controller7ebeddb7-8cb1-489a-9a9b-a84358a7ed35
file in the work directory:
The slurm log file contains
Finally, the
log/ipython/log/ipcluster-18cc7e8c-3b67-476d-906e-06098a524f2c-4351.log
file contains an
ERROR | Controller start failed
error message:
The
bcbio_submit.sh
file contains:
Here is the list of packages available to conda on the head node, in case that is helpful:
Any idea what might be going on?
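Since ipython tends to swallow the underlying failure, grepping the cluster logs under the work directory is a quick way to surface it. The directory layout below mirrors the paths quoted in this report; the sample log line is fabricated for illustration:

```shell
#!/bin/sh
# Recreate the log layout quoted above with a fabricated failure line,
# then grep for it the way you would in a real work directory.
mkdir -p log/ipython/log
echo "2016-06-01 12:00:00 ERROR | Controller start failed" \
    > log/ipython/log/ipcluster-example.log

# Search the whole log tree; the lines around the match usually name the
# real cause (out of memory, bad hostname, permissions, ...).
matches=$(grep -R "Controller start failed" log/ipython/log | wc -l | tr -d ' ')
echo "found ${matches} matching line(s)"
```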