Race condition but how to debug? #1179

2called-chaos · 2022-11-14T13:38:26Z

I'm not quite sure what exactly is at fault here but let me explain.

We use delayed_job(_active_record) for almost a decade without any real issues. We have DJs on multiple servers started with --monitor --prefix tomatic --number_of_workers 5. Everything runs on Ubuntu (currently 20.04) and MariaDB (via mysql2).

We upgraded servers once and one in thousand restarts an issue occurs:

One DJ master seemingly bugs out and all its workers (and all jobs) fail like this

I, [2022-11-02T14:43:25.302511 #2521391]  INFO -- : 2022-11-02T14:43:25+0100: [Worker(delayed_job.4 host:www6 pid:2521391)] Job Order.save_ipinfo (id=25491141) RUNNING
E, [2022-11-02T14:43:25.323137 #2521391] ERROR -- : 2022-11-02T14:43:25+0100: [Worker(delayed_job.4 host:www6 pid:2521391)] Job Order.save_ipinfo (id=25491141) FAILED (0 prior attempts) with ActiveRecord::StatementInvalid: Mysql2::Error::ConnectionError: Got packets out of order

This kills the worker but due to --monitor they eventually restart (still rogue). The job record however ends up like this

It is locked but does not appear to have failed (last_error, failed_at). So this rogue DJ will just lock tasks when it gets the chance which will effectively delay these tasks for max_runtime at a minimum.

We barely every noticed this as it only happened so rarely and we deployed rather often.

A few years go down and we started to use coverband to get some more insights on our old codebase. We reported a bug which resulted in log spam and as soon as it was fixed DJ went rogue a lot of the time. So we are currently version pinned.

Also consider this:

When we include a logger in the delayed_job startup script the job succeeds. When we do not include a logger in the delayed_job startup script we get the error (full stack attached):
Mysql2::Error: Got packets out of order

From: https://groups.google.com/g/sequel-talk/c/domXZyPmvLw/m/NWCJRrbaAAAJ

Now we tried upgrading to Ruby 3.x and suddenly we started to run into this issue again.
Different server, virtually the same server environment (same os, memcache, mariadb although not replicated) I cannot reproduce this race condition on our staging server/env except for one time.

Somewhere is a race condition and I don't know what to do. I'm not sure if it has something to do with daemons or mysql or if we do something abhorrently wrong (which worked for like 10 years) but it seems to happen before forking as workers keep getting respawned in a rogue state until we actually restart DJ. Any ideas?

The text was updated successfully, but these errors were encountered:

albus522 · 2024-07-26T17:59:01Z

That looks like some kind of weird connection sharing incident that isn't suppose to happen. DJ automatically disconnects before fork and reconnects after. Maybe you have a thread sneaking in hitting the connection after we disconnect but before we fork.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Race condition but how to debug? #1179

Race condition but how to debug? #1179

2called-chaos commented Nov 14, 2022

albus522 commented Jul 26, 2024

Race condition but how to debug? #1179

Race condition but how to debug? #1179

Comments

2called-chaos commented Nov 14, 2022

albus522 commented Jul 26, 2024