Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix netns leak on container creation and exit code 1 on SIGTERM. #24082

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

Luap99
Copy link
Member

@Luap99 Luap99 commented Sep 26, 2024

libpod: ensure we are not killed during netns creation

When we are killed during netns setup it will leak the netns path as it
was not commited in the db. This is rather common if you run systemctl
stop on a podman systemd unit. Of course we cannot protect against
SIGKILL but in systemd case we get SIGTERM and we really should not exit
in a critical section like this.

Fixes #24044


libpod: rework shutdown handler flow

Currently podman run -d can exit 0 if we send SIGTERM during startup
even though the contianer was never started. That just doesn't make any
sense is horribly confusing for a external job manager like systemd.

The original motivation was to exit 0 for the podman.service in commit
ca7376b. That does make sense but it should only do so for the
service and only if the server did indeed gracefully shutdown.

So we rework how the exit logic works, do not let the handler perform
the exit. Instead the shutdown package does the exit after all handlers
are run, this solves the issue of ordering. Then we default to exit code
1 like we did before and allow the service exit handler to overwrite the
exit code 0 in case of a graceful shutdown.


libpod: remove shutdown.Unregister()

It is never used and needed so let's just remove some dead code.

Does this PR introduce a user-facing change?

Podman no longer exits 0 on SIGTERM by default.
Fixed a race that could cause podman to leak netns files when it was interrupted during the netns creation.

When we are killed during netns setup it will leak the netns path as it
was not commited in the db. This is rather common if you run systemctl
stop on a podman systemd unit. Of course we cannot protect against
SIGKILL but in systemd case we get SIGTERM and we really should not exit
in a critical section like this.

Fixes containers#24044

Signed-off-by: Paul Holzinger <[email protected]>
Currently podman run -d can exit 0 if we send SIGTERM during startup
even though the contianer was never started. That just doesn't make any
sense is horribly confusing for a external job manager like systemd.

The original motivation was to exit 0 for the podman.service in commit
ca7376b. That does make sense but it should only do so for the
service and only if the server did indeed gracefully shutdown.

So we rework how the exit logic works, do not let the handler perform
the exit. Instead the shutdown package does the exit after all handlers
are run, this solves the issue of ordering. Then we default to exit code
1 like we did before and allow the service exit handler to overwrite the
exit code 0 in case of a graceful shutdown.

Signed-off-by: Paul Holzinger <[email protected]>
It is never used and needed so let's just remove some dead code.

Signed-off-by: Paul Holzinger <[email protected]>
@github-actions github-actions bot added the kind/api-change Change to remote API; merits scrutiny label Sep 26, 2024
@Luap99
Copy link
Member Author

Luap99 commented Sep 26, 2024

@mheon @edsantiago PTAL

Copy link
Contributor

openshift-ci bot commented Sep 26, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 26, 2024
Copy link

Ephemeral COPR build failed. @containers/packit-build please check.

@Luap99 Luap99 added the No New Tests Allow PR to proceed without adding regression tests label Sep 26, 2024
@edsantiago edsantiago changed the title Fix netns leak on contianer creation and exit code 0 on SIGTERM. Fix netns leak on container creation and exit code 0 on SIGTERM. Sep 26, 2024
@edsantiago
Copy link
Member

LGTM, and I can no longer reproduce the netns leak

@@ -75,6 +83,7 @@ func Start() error {
}
handlerLock.Unlock()
shutdownInhibit.Unlock()
os.Exit(exitCode)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this prevent a lot of defer functions from running? I know I avoided it deliberately when I wrote this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean yes but if we exit what do we want to defer here? And most importantly the current handler also just exited so there should not be any functional difference I would say

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC we have a bunch of things running in defer that do things like removing files to clean up after ourselves

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure but this PR doesn't change the behavior here. Critical sections where we cannot leak need to use shutdown.Inhibit() which is how I fixed the issue with the netns leak.

@Luap99 Luap99 changed the title Fix netns leak on container creation and exit code 0 on SIGTERM. Fix netns leak on container creation and exit code 1 on SIGTERM. Sep 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/api-change Change to remote API; merits scrutiny No New Tests Allow PR to proceed without adding regression tests release-note
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CI: system tests: netns leak
3 participants