slurm bcbio_submit.sh job pending #159

Open
tomsing1 opened this issue Dec 18, 2016 · 3 comments

@tomsing1

After realizing that the AWS instances I had specified for the workers were too small in issue #158 ,
I switched to using c3.xlarge instances for the workers.

The cluster is built without problems:

bcbio_vm.py aws info
Available clusters: bcbio

Configuration for cluster 'bcbio':
 Frontend: c3.large with 200Gb NFS storage
 Cluster: 2 c3.xlarge machines

AWS setup:
 OK: expected IAM user 'bcbio' exists.
 OK: expected security group 'bcbio_cluster_sg' exists.
 OK: VPC 'bcbio' exists.

Instances in VPC 'bcbio':
	bcbio-frontend001 (c3.large, running) at 54.153.93.241 in us-west-1a
	bcbio-compute002 (c3.xlarge, running) at 52.53.175.55 in us-west-1a
	bcbio-compute001 (c3.xlarge, running) at 54.153.79.122 in us-west-1a

But when I log into the head node and try to submit a job to the queue, the job is never executed and remains pending.

bcbio_vm.py ipythonprep s3://my-bucket@us-west-1/test/project1.yaml slurm cloud -n 8
sbatch /encrypted/project1/work/bcbio_submit.sh

sacct
       JobID    JobName  Partition   NNodes  AllocCPUS      State        NodeList
------------ ---------- ---------- -------- ---------- ---------- ---------------
6            bcbio_sub+      cloud        2          0    PENDING   None assigned

It seems that no suitable resource is available to execute the bcbio_submit.sh script.

squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 6     cloud bcbio_su   ubuntu PD       0:00      1 (Resources)

I am new to SLURM, so I don't really know how the resources are defined. Any hints?
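
For reference, a quick way to see what SLURM thinks each node offers is the standard client tooling (a minimal sketch, assuming sinfo and scontrol are on the head node's PATH):

# Per-node state, CPU count and memory as registered with SLURM
sinfo -N -l
# Full detail (CPUTot, RealMemory, State, Reason) for every node
scontrol show node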

For completeness, here is some more information about the scripts and the cluster environment:

cat /encrypted/project1/work/bcbio_submit.sh
#!/bin/bash
#SBATCH --cpus-per-task=1
#SBATCH --mem=1000
#SBATCH -p cloud
#SBATCH -t 0
bcbio_vm.py ipython /encrypted/project1/work/config/project1.yaml slurm cloud --numcores 8 -r timelimit=0 --timeout 15
cat ~/install/bcbio-vm/data/galaxy/bcbio_system.yaml
resources:
  default:
    cores: 4
    jvm_opts:
    - -Xms750m
    - -Xmx1750m
    memory: 1750M
  dexseq:
    memory: 10g
  express:
    memory: 1750m
  gatk:
    jvm_opts:
    - -Xms500m
    - -Xmx1750m
  macs2:
    memory: 1750m
  qualimap:
    memory: 1750m
  seqcluster:
    memory: 1750m
  snpeff:
    jvm_opts:
    - -Xms750m
    - -Xmx1750m
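
For digging further, two quick checks on the head node might help (a minimal sketch; job ID 6 is taken from the sacct output above, and the srun line is just a smoke test of the 'cloud' partition):

# Full record for the pending job, including the Reason field and the
# resources it actually requested
scontrol show job 6
# Trivial allocation on the same partition; if this also hangs, the
# problem is the partition/node setup rather than the bcbio script
srun -p cloud -n 1 hostname
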
@chapmanb
Member

Thomas;
I'm not sure what is going on here. Everything looks good and I don't see any reason why it wouldn't allocate this job. It seems to just be saying that there are no machines available to run on, which shouldn't be true (unless other people are running things on the machines that we can't see here). My guess is that something went wrong during the SLURM setup and it's not correctly recognizing the resources available for running jobs. We have a "kick the system and try again" style command with:

bcbio_vm.py aws cluster bootstrap

that you could try running to see if that unsticks it. Sorry about the issue and hope that does it for you.
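
In case it helps, a minimal sketch of how to check whether any compute node ended up down or drained after the bootstrap (standard SLURM commands, assumed to be available on the head node):

# Nodes that are down/drained/failing, with the recorded reason
sinfo -R
# Per-node state overview
sinfo -N -l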

@tomsing1
Author

Thanks for your feedback. I have been able to reproduce this issue with multiple
independent cluster instances now. It is not rectified by running
bcbio_vm.py aws cluster bootstrap. The clusters are not used by anybody else, either.

Interestingly, the issue might be related to the type of AWS EC2 instance provisioned for
the compute/worker nodes.

c3.large worker nodes

First, I created a cluster with two c3.large worker nodes:

bcbio_vm.py aws info
Available clusters: bcbio

Configuration for cluster 'bcbio':
 Frontend: c3.large with 200Gb NFS storage
 Cluster: 2 c3.large machines

AWS setup:
 OK: expected IAM user 'bcbio' exists.
 OK: expected security group 'bcbio_cluster_sg' exists.
 OK: VPC 'bcbio' exists.

Instances in VPC 'bcbio':
	bcbio-compute002 (c3.large, running) at 54.153.6.226 in us-west-1a
	bcbio-compute001 (c3.large, running) at 54.153.79.255 in us-west-1a
	bcbio-frontend001 (c3.large, running) at 52.53.214.218 in us-west-1a

Next, I created and submitted my batch job:

bcbio_vm.py aws cluster ssh
mkdir /encrypted/project1
cd !$ && mkdir work && cd work
bcbio_vm.py ipythonprep s3://my-bucket@us-west-1/test/project1.yaml slurm cloud -n 4
sbatch bcbio_submit.sh

This job is added to the SLURM queue and starts executing. (The downstream job fails
because the c3.large nodes don't have the 4000 MB of memory requested via --mem=4000
by the SLURM_controller* job script.)
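
A minimal way to confirm that kind of memory mismatch (assuming scontrol is available on the head node) is to compare the RealMemory SLURM has registered for each node against the job's --mem request; a c3.large has only about 3.75 GiB of RAM, which is below the 4000 MB being asked for:

# Memory SLURM has registered for each node vs. the --mem=4000 request
scontrol show node | grep -E 'NodeName|RealMemory'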

c3.xlarge worker nodes

Next, I destroyed the above cluster and changed the ~/.bcbio/elasticluster/config file
to request two c3.xlarge instead of c3.large worker nodes.

The new cluster starts without any problems:

bcbio_vm.py aws info
Available clusters: bcbio

Configuration for cluster 'bcbio':
 Frontend: c3.large with 200Gb NFS storage
 Cluster: 2 c3.xlarge machines

AWS setup:
 OK: expected IAM user 'bcbio' exists.
 OK: expected security group 'bcbio_cluster_sg' exists.
 OK: VPC 'bcbio' exists.

Instances in VPC 'bcbio':
	bcbio-frontend001 (c3.large, running) at 54.183.63.74 in us-west-1a
	bcbio-compute001 (c3.xlarge, running) at 54.67.43.71 in us-west-1a
	bcbio-compute002 (c3.xlarge, running) at 52.53.238.167 in us-west-1a

Once again, I submit my batch job:

bcbio_vm.py aws cluster ssh
mkdir /encrypted/project1
cd !$ && mkdir work && cd work
bcbio_vm.py ipythonprep s3://my-bucket@us-west-1/test/project1.yaml slurm cloud -n 4
sbatch bcbio_submit.sh

This time the job is added to the queue but remains pending due to insufficient
resources. Perhaps the resources of the c3.xlarge instances are not recognized
correctly?

squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 2     cloud bcbio_su   ubuntu PD       0:00      1 (Resources)
sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
2            bcbio_sub+      cloud        gc3          0    PENDING      0:0
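
One way to test that theory would be to compare what slurmd detects on a worker with the node definitions SLURM was configured with (a sketch; the slurm.conf path is an assumption about this image and may differ):

# On a compute node: the hardware slurmd actually detects
slurmd -C
# On the head node: the configured node definitions
# (config path is a guess for an Ubuntu-based image)
grep -i '^NodeName' /etc/slurm-llnl/slurm.conf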

m4.xlarge worker nodes

Finally, I destroyed the cluster and requested two m4.xlarge worker nodes instead:

bcbio_vm.py aws info
Available clusters: bcbio

Configuration for cluster 'bcbio':
 Frontend: c3.large with 200Gb NFS storage
 Cluster: 2 m4.xlarge machines

AWS setup:
 OK: expected IAM user 'bcbio' exists.
 OK: expected security group 'bcbio_cluster_sg' exists.
 OK: VPC 'bcbio' exists.

Instances in VPC 'bcbio':
	bcbio-compute001 (m4.xlarge, running) at 54.183.100.49 in us-west-1a
	bcbio-compute002 (m4.xlarge, running) at 54.153.31.43 in us-west-1a
	bcbio-frontend001 (c3.large, running) at 54.183.98.2 in us-west-1a

Once again, I submit my batch job:

bcbio_vm.py aws cluster ssh
mkdir /encrypted/project1
cd !$ && mkdir work && cd work
bcbio_vm.py ipythonprep s3://my-bucket@us-west-1/test/project1.yaml slurm cloud -n 4
sbatch bcbio_submit.sh

and again, the job is placed into the queue but remains pending.

It seems that bcbio_vm has correctly identified that each worker node has 4 cores and
16 GB of memory (so 4 GB per core):

cat ~/install/bcbio-vm/data/galaxy/bcbio_system.yaml
resources:
  default:
    cores: 4
    jvm_opts:
    - -Xms750m
    - -Xmx4000m
    memory: 4000M
  dexseq:
    memory: 10g
  express:
    memory: 4000m
  gatk:
    jvm_opts:
    - -Xms500m
    - -Xmx4000m
  macs2:
    memory: 4000m
  qualimap:
    memory: 4000m
  seqcluster:
    memory: 4000m
  snpeff:
    jvm_opts:
    - -Xms750m
    - -Xmx4000m
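
As a quick sanity check on those numbers (plain arithmetic, nothing bcbio-specific): an m4.xlarge has 4 vCPUs and 16 GiB of RAM, so the per-core share works out to roughly 4 GiB, consistent with the 4000M entries above:

# 16 GiB expressed in MiB, split across 4 cores
echo $((16 * 1024 / 4))   # -> 4096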

In summary, I have only managed to launch jobs when the worker nodes were of instance
type c3.large. Weird, isn't it?

Any ideas?
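
One more thing that might be worth checking (a sketch, assuming standard SLURM tooling on the head node): whether the 'cloud' partition actually contains the compute nodes and whether they are reported as idle and allocatable:

# How the 'cloud' partition is defined and which nodes it contains
scontrol show partition cloud
# State of the nodes in that partition
sinfo -p cloud -N -l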

@chapmanb
Member

Thomas;
Apologies, I'm not positive what is going on here. SLURM should not depend on the instance type like this, but it is clearly getting confused somehow. I'll spin up some instances, try to reproduce, dig into what is going on, and report back once I have some clue. Sorry not to have an immediate fix, and thanks again for testing this.
