master HTCondor workflow breaks submission of seemingly random jobs #183

Open
HerrHorizontal opened this issue Jul 17, 2024 · 7 comments

@HerrHorizontal
Contributor

HerrHorizontal commented Jul 17, 2024

Bug description

With the latest commit on the master branch, the HTCondorWorkflow execution breaks for seemingly random jobs, with output like:

running htcondor_wrapper_2701241706.sh for job number 211
empty htcondor job arguments for LAW_HTCONDOR_JOB_NUMBER 211

I noticed that the reported LAW_HTCONDOR_JOB_NUMBER 211 does not match the job number I would expect from the Error, Log, Output, and stdall files, which for this particular job all share the suffix _863To864.txt. This might be related.

@HerrHorizontal
Contributor Author

Checking out an earlier commit that doesn't include the HTCondor group submission introduced in PR #176, e.g. commit 1848c57, seems to run fine. I suspect the bug was introduced in PR #176.

@riga
Owner

riga commented Jul 17, 2024

Hi @HerrHorizontal,

Odd indeed. Are you sure your workflow didn't pick up a submission json file that was generated with the previous submission mode?

Btw, for the time being, you can also set job::htcondor_job_grouping_submit to False without the need to switch to an older version.

@HerrHorizontal
Contributor Author

I removed the submission json before running the test, so I am pretty sure that it didn't.

I will try this out. Where do I set the job::htcondor_job_grouping_submit configuration for a certain workflow?

@riga
Owner

riga commented Jul 19, 2024

You can set this value globally in the config, or you can put this into your htcondor workflow:

def htcondor_create_job_manager(self, **kwargs):
    job_manager = super().htcondor_create_job_manager(**kwargs)
    job_manager.job_grouping_submit = True
    job_manager.chunk_size_submit = 0  # all in one
    return job_manager
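For the global route, here is a minimal sketch of what the config entry might look like. It assumes that the `job::htcondor_job_grouping_submit` notation refers to the option `htcondor_job_grouping_submit` in a `[job]` section of an INI-style config file. The fragment is parsed with Python's standard `configparser` purely for illustration; law's own config loading is not shown here.

```python
import configparser

# Hypothetical config fragment. Assumption: "job::htcondor_job_grouping_submit"
# means option "htcondor_job_grouping_submit" in section "[job]".
LAW_CFG = """
[job]
htcondor_job_grouping_submit: False
"""

parser = configparser.ConfigParser()
parser.read_string(LAW_CFG)

# Interpret the option as a boolean, as a consumer of the config would.
grouping = parser.getboolean("job", "htcondor_job_grouping_submit")
print("group submission enabled:", grouping)
```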

Regarding the issue you're seeing, I could not spot anything obviously wrong. Could you send me the content of the submission directory, including the main job files? This would help. Thank you!

@harrypuuter
Contributor

Hi all,

I am currently observing the same issue as reported by @HerrHorizontal. When I look at the submission jdl and the arguments in the htcondor_wrapper_xxxx.sh file, the first submission looks fine. Problems arise as soon as a single job fails and jobs have to be resubmitted.

I have not looked into the implementation in more detail, but I would suspect something like:

  • running jobs fail and are resubmitted
  • an updated arguments file is created for the resubmission, which no longer contains all jobs but only the ones to be resubmitted
  • scheduled jobs that have not started yet are now able to start, but since there is only a single arguments file (which was modified by the resubmission), the error mentioned above can occur, resulting in a cascade of failing jobs

Could this be the reason for the errors?
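The suspected sequence can be sketched as a small simulation. This is not law's implementation: the JSON file, the function names, and the argument strings are hypothetical stand-ins for the single shared arguments file, used only to show how a late-starting job could end up with empty arguments.

```python
import json
import tempfile
from pathlib import Path


def write_arguments(path, job_args):
    """Overwrite the shared arguments file with a {job number: arguments} map."""
    path.write_text(json.dumps(job_args))


def lookup_arguments(path, job_number):
    """What a starting job would do: look up its own arguments in the shared file."""
    job_args = json.loads(path.read_text())
    return job_args.get(str(job_number), "")


with tempfile.TemporaryDirectory() as tmp:
    args_file = Path(tmp) / "arguments.json"

    # 1. Initial submission: the file holds arguments for all three jobs.
    write_arguments(args_file, {"1": "--branch 0", "2": "--branch 1", "3": "--branch 2"})

    # 2. Job 1 fails; the resubmission rewrites the file with only that job.
    write_arguments(args_file, {"1": "--branch 0"})

    # 3. Job 3 was still queued and starts only now.
    late_args = lookup_arguments(args_file, 3)
    resubmitted_args = lookup_arguments(args_file, 1)

print("job 3 arguments:", repr(late_args))
print("job 1 arguments:", repr(resubmitted_args))
```

In this sketch the resubmitted job finds its arguments, while the job that was still queued from the first submission finds none, which would match the "empty htcondor job arguments" message if the real mechanism behaves similarly.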

@riga
Owner

riga commented Aug 26, 2024

@harrypuuter

Thanks for confirming and the suggestion! I think you're onto something. I'm going to create a reproducer this week to debug this further.

@HerrHorizontal
Contributor Author

@harrypuuter and I have independently encountered another issue with the new HTCondor group submission. When turning the group submission off as suggested previously, with

def htcondor_create_job_manager(self, **kwargs):
    job_manager = super().htcondor_create_job_manager(**kwargs)
    job_manager.job_grouping_submit = True
    job_manager.chunk_size_submit = 0  # all in one
    return job_manager

the rendering of values enclosed in double curly braces in the job script, for instance {{law_job_base}}, fails. Also, the 'XToY' suffix does not seem to be attached to the files consistently. As a result, the jobs already fail at the setup stage with file-or-directory-not-found errors.
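To illustrate why a failed rendering would surface as missing files, here is a minimal sketch of double-curly-brace substitution. The `render` helper, the script line, and the directory path are hypothetical and do not reflect how law renders job files; only the `{{law_job_base}}` placeholder is taken from the report above.

```python
import re


def render(template, values):
    """Replace each {{key}} with values[key]; leave unknown keys untouched."""

    def substitute(match):
        key = match.group(1)
        # Fall back to the original placeholder text when no value is known.
        return str(values.get(key, match.group(0)))

    return re.sub(r"\{\{(\w+)\}\}", substitute, template)


# Hypothetical line from a job script.
line = 'source "{{law_job_base}}/setup.sh"'

# With a value for the placeholder, the path is usable.
rendered = render(line, {"law_job_base": "/scratch/job_863To864"})

# Without one, the literal placeholder stays in the path.
unrendered = render(line, {})

print(rendered)
print(unrendered)
```

If the placeholder survives into the executed script, the shell would try to open a path that literally contains `{{law_job_base}}`, which cannot exist on the worker node.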
