Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition but how to debug? #1179

Open
2called-chaos opened this issue Nov 14, 2022 · 1 comment
Open

Race condition but how to debug? #1179

2called-chaos opened this issue Nov 14, 2022 · 1 comment

Comments

@2called-chaos
Copy link

I'm not quite sure what exactly is at fault here but let me explain.

We use delayed_job(_active_record) for almost a decade without any real issues. We have DJs on multiple servers started with --monitor --prefix tomatic --number_of_workers 5. Everything runs on Ubuntu (currently 20.04) and MariaDB (via mysql2).

We upgraded servers once and one in thousand restarts an issue occurs:

One DJ master seemingly bugs out and all its workers (and all jobs) fail like this

I, [2022-11-02T14:43:25.302511 #2521391]  INFO -- : 2022-11-02T14:43:25+0100: [Worker(delayed_job.4 host:www6 pid:2521391)] Job Order.save_ipinfo (id=25491141) RUNNING
E, [2022-11-02T14:43:25.323137 #2521391] ERROR -- : 2022-11-02T14:43:25+0100: [Worker(delayed_job.4 host:www6 pid:2521391)] Job Order.save_ipinfo (id=25491141) FAILED (0 prior attempts) with ActiveRecord::StatementInvalid: Mysql2::Error::ConnectionError: Got packets out of order

This kills the worker but due to --monitor they eventually restart (still rogue). The job record however ends up like this

2022-11-14 14 11 16

It is locked but does not appear to have failed (last_error, failed_at). So this rogue DJ will just lock tasks when it gets the chance which will effectively delay these tasks for max_runtime at a minimum.

We barely every noticed this as it only happened so rarely and we deployed rather often.


A few years go down and we started to use coverband to get some more insights on our old codebase. We reported a bug which resulted in log spam and as soon as it was fixed DJ went rogue a lot of the time. So we are currently version pinned.

Also consider this:

When we include a logger in the delayed_job startup script the job succeeds. When we do not include a logger in the delayed_job startup script we get the error (full stack attached):
Mysql2::Error: Got packets out of order

From: https://groups.google.com/g/sequel-talk/c/domXZyPmvLw/m/NWCJRrbaAAAJ


Now we tried upgrading to Ruby 3.x and suddenly we started to run into this issue again.
Different server, virtually the same server environment (same os, memcache, mariadb although not replicated) I cannot reproduce this race condition on our staging server/env except for one time.

Somewhere is a race condition and I don't know what to do. I'm not sure if it has something to do with daemons or mysql or if we do something abhorrently wrong (which worked for like 10 years) but it seems to happen before forking as workers keep getting respawned in a rogue state until we actually restart DJ. Any ideas?

@albus522
Copy link
Member

That looks like some kind of weird connection sharing incident that isn't suppose to happen. DJ automatically disconnects before fork and reconnects after. Maybe you have a thread sneaking in hitting the connection after we disconnect but before we fork.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants