Pull timeout #2418 (Draft)
wants to merge 4 commits into main

Conversation

taylordowns2000 (Member)

this is a concept. @stuartc , we're trying to:

  1. raise the timeout (make it ENV-configurable) so that that 11s query doesn't lead to lost runs (@josephjclark on it; see the config sketch after this list.)
  2. make that query more efficient (and maybe look for other things that might be slowing down this part of the pipeline under heavy load - you and @rorymckinley on it)
  3. come up with some sort of failsafe that sweeps up false claims (you sir must FORFEIT YOUR CLAIM ON THAT RUN!) and releases them for someone else to claim... but as you can see in the test placeholder, i really don't like this.
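For item 1, a minimal sketch of what an ENV-configurable timeout could look like in `config/runtime.exs`. The names (`WORKER_CLAIM_TIMEOUT_MS`, `:worker_claim_timeout_ms`) and the 15s default are placeholders for illustration, not Lightning's actual config keys:

```elixir
# config/runtime.exs
import Config

# Read the claim timeout from the environment, falling back to a
# hard-coded default (15_000 ms here purely for illustration).
claim_timeout_ms =
  "WORKER_CLAIM_TIMEOUT_MS"
  |> System.get_env("15000")
  |> String.to_integer()

config :lightning, :worker_claim_timeout_ms, claim_timeout_ms
```

Wherever the timeout is currently hard-coded, the code would then read it with `Application.fetch_env!(:lightning, :worker_claim_timeout_ms)`.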

taylordowns2000 marked this pull request as draft on August 22, 2024 at 17:07
dbg("eish, i don't love this. what if there was some other reason for them getting lost?")
dbg("like, what if they did start, and did the work, and all that, but we never heard back from them because of network issues?")
dbg("i wouldn't want to re-do the work.")
dbg("i wish there was some way to only mark them as claimed once we know that the worker actually got the run.")
Collaborator


@taylordowns2000 If I may chip in (uninvited):

As you mentioned yesterday, the janitor approach is a band-aid until a better fix is available. If I were applying the band-aid, I would not tie it to the timeout value - I would make it a large multiple of the timeout. E.g. if your timeout is 20 sec, maybe wait 10 min (after all, you wait 2 South African hours to determine if something is lost, right?). And notify Sentry whenever this happens, so that we have an opportunity to take a look at what is going on and tune things. W.r.t. the concern of potentially redoing work, you could make this something that is user-controllable (i.e. if a Run is 'LostAfterClaimed', would you like us to automagically resubmit it - or would you rather do that manually?).
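To make that concrete, a rough sketch of the sweep. The module, schema, and field names (`Lightning.Janitor.ClaimSweeper`, `Lightning.Run`, `:claimed_at`, the "claimed"/"lost" states) are my guesses, not the real schema:

```elixir
defmodule Lightning.Janitor.ClaimSweeper do
  import Ecto.Query

  # Wait a large multiple of the claim timeout before declaring a run lost,
  # e.g. 30 x 20s = 10 minutes.
  @grace_multiple 30

  def sweep(timeout_ms) do
    grace_seconds = div(timeout_ms * @grace_multiple, 1000)
    cutoff = DateTime.add(DateTime.utc_now(), -grace_seconds, :second)

    {count, _} =
      from(r in Lightning.Run,
        where: r.state == "claimed" and r.claimed_at < ^cutoff
      )
      |> Lightning.Repo.update_all(set: [state: "lost"])

    # Surface every sweep so we can investigate and tune things,
    # rather than silently hiding the symptom.
    if count > 0 do
      Sentry.capture_message("Janitor marked #{count} claimed run(s) as lost")
    end

    {:ok, count}
  end
end
```

Whether lost runs are then resubmitted automatically or left for a human could sit behind a per-project setting, as suggested above.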

A longer-term fix (which I can only assume will mean a fair amount of work) would be to allow a run to be claimed by multiple worker processes but only started by one (whoever gets there first). This obviously creates inefficiency elsewhere. Personally, I would do the band-aid if it helps us satisfy the immediate requirements, and then identify the underlying issue - e.g. maybe it is as simple as scaling up web pods alongside worker pods?
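For the longer-term idea, the "only started by one" part could be a single atomic state transition: whichever worker flips the row first wins. Again, the module and state names are assumptions, not Lightning's actual code:

```elixir
defmodule Lightning.Runs.StartOnce do
  import Ecto.Query

  # Any number of workers may hold a claim, but only the first one to flip
  # the row from "claimed" to "started" gets to run it.
  def try_start(run_id) do
    {count, _} =
      from(r in Lightning.Run,
        where: r.id == ^run_id and r.state == "claimed"
      )
      |> Lightning.Repo.update_all(set: [state: "started"])

    # update_all issues one atomic UPDATE, so at most one caller sees count == 1.
    case count do
      1 -> :ok
      0 -> :already_started
    end
  end
end
```

Any worker that gets `:already_started` simply drops the work; those wasted claims are the inefficiency mentioned above.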

taylordowns2000 (Member, Author)

@rorymckinley , @stuartc , do you know if this code is still needed? would love to shut down this PR if it's not helpful (or has been implemented elsewhere in response to #2465 or #2416)

rorymckinley (Collaborator)

> @rorymckinley , @stuartc , do you know if this code is still needed? would love to shut down this PR if it's not helpful (or has been implemented elsewhere in response to #2465 or #2416)

@taylordowns2000 I have not implemented anything that addresses this - did changing the timeout values make the original problem go away?
