Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use monotonic time (clock_gettime) for cups_enum_dests #1083

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

AdamWill
Copy link

Using gettimeofday here is not safe against clock jumps (e.g. a manual setting of the clock). Use clock_gettime instead.

Using gettimeofday here is not safe against clock jumps (e.g. a
manual setting of the clock). Use clock_gettime instead.

Signed-off-by: Adam Williamson <[email protected]>
@AdamWill
Copy link
Author

For context, see https://bugzilla.redhat.com/show_bug.cgi?id=2316066 . In Fedora 41 validation testing we found a problem where live images sometimes weren't working properly for the first minute or so after boot. We worked out that this is associated with a system time jump that often happens during boot of live images on systems with the hardware clock set to UTC. Our live images assume the hardware clock is set to localtime and the system timezone is US Eastern, so they set the system clock accordingly; when chrony kicks in during boot and finds the real current time, this can cause a system clock jump of several hours. If we adjust things so no clock jump occurs, the bug doesn't happen.

Steve tracked it down to the org.gnome.SettingsDaemon.PrintNotifications.service user service timing out on startup, which blocked some other things from starting up properly too, especially xdg-desktop-portal-gnome.service , which is likely the source of the observed problems. I then noticed the backtrace of that service (after systemd kills it for exceeding its startup timeout) shows it apparently waiting for a response from cupsGetDests2, which calls cups_enum_dests, so I went to look at that codepath and found this, which looks a lot like it could cause the problem we're observing.

I built a live image with a scratch build of our cups package with this patch applied (backported to 2.4.11), and in some initial testing by Steve it seems to have resolved the problem.

I do note there's a lot of other calls to gettimeofday in CUPS, some of which do seem to be used for 'how much time has elapsed' comparisons like this one, so there may be other bugs like this lurking. But I didn't want to spread this change out too far initially.

Copy link
Member

@zdohnal zdohnal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would propose using '/ 1000000' instead of dividing /1000 twice, but otherwise the changes look good to me.

I've checked the other usages of gettimeofday in libcups which could cause similar issue and IMO cups_dnssd_resolve() and cups_dnssd_resolve_cb() should be using the monotonic clock as well, but since we can't reproduce it at the moment, we migrate it later.

I've checked the CI errors as well - MacOS fails due unrelated reason, however Windows does not know the function or macro - probably we have to use a different way for Windows.

gettimeofday(&nt, NULL);

msecs = (int)(1000 * (nt.tv_sec - t->tv_sec) + (nt.tv_usec - t->tv_usec) / 1000);
msecs = (int)(1000 * (nt.tv_sec - t->tv_sec) + (nt.tv_nsec - t->tv_nsec) / 1000 / 1000);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably we can use '/ 1000000' instead dividing by 1000 twice.

@@ -3058,15 +3058,14 @@ cups_dnssd_unquote(char *dst, // I - Destination buffer
//

static int // O - Elapsed time in milliseconds
cups_elapsed(struct timeval *t) // IO - Previous time
cups_elapsed(struct timespec *t) // IO - Previous time
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indentation issue, but it is not a blocker.

@michaelrsweet
Copy link
Member

Yes, this needs cleanup and we need to determine the proper functions to use on Windows for the same purpose (if the current code isn't suitable). I also want to check on how long this has been supported as some of these POSIX functions are "standard" but with loads of optional bits - I have a vague memory of Solaris not supporting the monotonic clock, among others, so we'll need to research this a bit more before we "pull the trigger"...

@michaelrsweet michaelrsweet self-assigned this Oct 21, 2024
@michaelrsweet michaelrsweet added enhancement New feature or request priority-low platform issue Issue is specific to an OS or desktop labels Oct 21, 2024
@michaelrsweet michaelrsweet added this to the v2.5 milestone Oct 21, 2024
@AdamWill
Copy link
Author

AdamWill commented Oct 21, 2024

Roger, those all sound like excellent improvements. I'll try and get back and work on this once we're through Fedora 41 final freeze (which is where this came from). FWIW though I'd say this is a bug fix, not an enhancement. To be clear, I found this as the cause of a fairly visible bug in Fedora - https://bugzilla.redhat.com/show_bug.cgi?id=2316066 . It was causing GNOME's print notifications daemon to get blocked on startup until it timed out and systemd killed it; during that time, various other things didn't work, because some other necessary stuff doesn't get started up till the print notifications daemon is running...

@AdamWill
Copy link
Author

From fish-shell/fish-shell#6440 it seems like macOS is probably OK since 10.12, and that's a long way out of support lifetime now.

https://learn.microsoft.com/en-us/windows/win32/api/sysinfoapi/nf-sysinfoapi-gettickcount64 seems like possibly the best option on Windows.

Googling around suggests CLOCK_MONOTONIC does exist on Solaris (as an alias to CLOCK_HIGHRES - see vim/vim#10647 ), but does not exist on HP-UX - vim/vim#10873 .

nmap/nmap#180 has even more icky details, including some differences in CLOCK_MONOTONIC behaviour across platforms.

Various people who've run into this mess before suggest things like falling back to gettimeofday or setitimer/getitimer. There's a library at https://github.com/ThomasHabets/monotonic_clock which tries to handle various platforms but only has tested implementations for Linux, OpenBSD and Solaris, plus apparently untested mach and Windows support.

In summary I don't know what to do! Maybe get/setitimer would be viable here? I dunno.

@michaelrsweet
Copy link
Member

@AdamWill What I've pushed thus far uses GetTickCount64 on Windows, clock_gettime(CLOCK_MONOTONIC, ...) when supported, with gettimeofday as a backup. Since modern platforms do support CLOCK_MONOTONIC and we don't have testing/integration resources for legacy Unix platforms, using interval timers or other platform-specific code would be problematic.

The new proposed API is a simple "elapsed time in seconds" API with a defined resolution of at least 10ms:

double cupsGetClock(void);

The use of "double" allows for 54 bits of precision and effectively unlimited run time support. The implementation bases the return values on the first call to cupsGetClock, so the first call returns 0.0, calling 42 seconds later would return approximately 42.0, etc.

[master 3395443] Add cupsGetClock API.

With the new API, it should be fairly straight-forward to refactor the dest and other timing-specific code and eliminate a bunch of the platform-specific code that gets sprinkled throughout.

@michaelrsweet
Copy link
Member

I will also note, if this gets back-ported to 2.4.x we will name it _cupsGetClock to avoid introducing a new public API in a stable release.

@AdamWill
Copy link
Author

Looks good to me!

Some of the references I found suggested CLOCK_MONOTONIC_RAW might be preferred over CLOCK_MONOTONIC for this purpose if it's present, but not 100% sure.

@mcatanzaro
Copy link

Some of the references I found suggested CLOCK_MONOTONIC_RAW might be preferred over CLOCK_MONOTONIC for this purpose if it's present, but not 100% sure.

It looks like the difference is CLOCK_MONOTONIC_RAW ignores frequency adjustments, so one second is always actually one second, even if an NTP daemon has decided to make seconds shorter or longer in order to slowly adjust the system clock to the correct time. So I guess CLOCK_MONOTONIC_RAW is probably almost always what you really want. But it's only available on Linux.

I had never heard of it until last week, when I noticed dbus-broker was using it. GLib uses vanilla CLOCK_MONOTONIC though, and that affects basically the entire desktop. Maybe GLib should switch, but it seems like a very small problem, so whatever.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request platform issue Issue is specific to an OS or desktop priority-low
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants