
Bug 2247748: A storage-client CronJob creates too many jobs and pods, causing the maxPod limit to be reached #39

Merged
1 commit merged on Nov 20, 2023

Conversation

bernerhat
Contributor

Fixes the issue where consecutive CronJob iterations create pods unnecessarily, causing the maxPod limit to be reached and leaving future pod scheduling stuck in a Pending state.

  • Added a ConcurrencyPolicy field to the CronJob created by reconcileClientStatusReporterJob in StorageClientReconciler
  • The ConcurrencyPolicy chosen in this case is Forbid

The Forbid option waits for the current job to finish before starting a new one, while the Replace option starts a new job regardless of the current one's status (which can create a new pod while the previous job's pod is still terminating, unnecessarily increasing the pod count).
Since the pod created for this CronJob's purpose is a heartbeat to the provider server, I don't see the point in replacing the current job with another, so Forbid made more sense.
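For reference, a minimal sketch (under assumptions, not the repo's actual code) of what a CronJob spec with this policy looks like; the function name, schedule, and image below are placeholders, and the constants come from k8s.io/api/batch/v1:

```go
package example

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// heartbeatCronJob is a hypothetical builder showing the shape of the change:
// ForbidConcurrent makes the controller skip the next run while the previous
// heartbeat Job is still active, instead of stacking extra pods.
func heartbeatCronJob(namespace string) *batchv1.CronJob {
	return &batchv1.CronJob{
		ObjectMeta: metav1.ObjectMeta{Name: "status-reporter", Namespace: namespace},
		Spec: batchv1.CronJobSpec{
			Schedule:          "* * * * *",              // placeholder heartbeat schedule
			ConcurrencyPolicy: batchv1.ForbidConcurrent, // wait for the running Job to finish
			JobTemplate: batchv1.JobTemplateSpec{
				Spec: batchv1.JobSpec{
					Template: corev1.PodTemplateSpec{
						Spec: corev1.PodSpec{
							RestartPolicy: corev1.RestartPolicyOnFailure,
							Containers: []corev1.Container{{
								Name:  "heartbeat",
								Image: "example.com/status-reporter:latest", // placeholder image
							}},
						},
					},
				},
			},
		},
	}
}
```

The rejected alternative would be batchv1.ReplaceConcurrent, which cancels the running Job and starts a new one in its place.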


openshift-ci bot commented Nov 9, 2023

@bernerhat: This pull request references Bugzilla bug 2247748, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

2 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

Requesting review from QA contact:
/cc @nehaberry

In response to this:

Bug 2247748: A storage-client CronJob creates too many jobs and pods, causing the maxPod limit to be reached

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


openshift-ci bot commented Nov 9, 2023

@openshift-ci[bot]: GitHub didn't allow me to request PR reviews from the following users: nehaberry.

Note that only red-hat-storage members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @nehaberry

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@leelavg
Contributor

leelavg commented Nov 13, 2023

@bernerhat could you please add controllers as the prefix to the commit message to pass the GitHub Actions check?

@@ -470,6 +470,7 @@ func (s *StorageClientReconciler) reconcileClientStatusReporterJob(instance *v1a
 			},
 		},
 	},
+	ConcurrencyPolicy: "Forbid",
Contributor

  1. Move the fields that directly belong to a struct to the top, i.e., move this line to just after line 439 (after Schedule), which keeps the diff easy to view.
  2. Even though the string Forbid is technically correct, this is an enum-ish field, so you should look for a predefined constant to use as its value, i.e., batchv1.ForbidConcurrent.
  3. Please comment on how you simulated the scenario as well.
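For illustration, a hedged sketch of what points 1 and 2 amount to (the helper name and parameters are made up for the example):

```go
package example

import batchv1 "k8s.io/api/batch/v1"

// applyReporterSpec is a hypothetical helper: the concurrency policy sits
// right after Schedule in the struct literal, and the value is the predefined
// batch/v1 constant rather than the raw string "Forbid".
func applyReporterSpec(cronJob *batchv1.CronJob, schedule string, jobTemplate batchv1.JobTemplateSpec) {
	cronJob.Spec = batchv1.CronJobSpec{
		Schedule:          schedule,
		ConcurrencyPolicy: batchv1.ForbidConcurrent,
		JobTemplate:       jobTemplate,
	}
}
```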

Contributor Author

Ack. All addressed.

Comment on lines 436 to 439
var jobDeadLineSeconds int64 = 155
var podDeadLineSeconds int64 = 120
var keepJobResourceSeconds int32 = 600
var reducedKeptSuccecsful int32 = 1
Contributor Author

Had to create addressable variables to pass to the pointer-typed fields required by the new spec values; looking into other options as well.
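For context, a sketch of why those locals are needed: the relevant batch/v1 and core/v1 fields are pointer-typed, so literal values must come from an addressable variable (or a helper such as k8s.io/utils/ptr.To, if that dependency is acceptable). The field paths below are my reading of where the PR's values land, not a quote of the repo's code:

```go
package example

import batchv1 "k8s.io/api/batch/v1"

// setReporterLimits is a hypothetical helper wiring the addressable locals
// into the pointer-typed spec fields added in this PR.
func setReporterLimits(cronJob *batchv1.CronJob) {
	var jobDeadLineSeconds int64 = 155     // Job-level activeDeadlineSeconds
	var podDeadLineSeconds int64 = 120     // Pod-level activeDeadlineSeconds
	var keepJobResourceSeconds int32 = 600 // ttlSecondsAfterFinished (10 minutes)
	var reducedKeptSuccecsful int32 = 1    // successfulJobsHistoryLimit (default is 3)

	cronJob.Spec.SuccessfulJobsHistoryLimit = &reducedKeptSuccecsful
	cronJob.Spec.JobTemplate.Spec.ActiveDeadlineSeconds = &jobDeadLineSeconds
	cronJob.Spec.JobTemplate.Spec.TTLSecondsAfterFinished = &keepJobResourceSeconds
	cronJob.Spec.JobTemplate.Spec.Template.Spec.ActiveDeadlineSeconds = &podDeadLineSeconds
}
```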

@bernerhat
Contributor Author

First, addressing the issue at hand: I simulated and tested both options, Forbid and Replace, for the currently running Job locally using minikube and Kubernetes 1.27.4.

Second, the concerns Ohad raised with me were:

  1. Setting a proper deadline for the pod to finish its operation, to prevent an infinite-loop scenario.
  2. Keeping the last run's objects so that such a scenario can be investigated.

To address these concerns, I've added a few new spec fields to the crafted CR:

  • TTLSecondsAfterFinished - keeps each iteration's objects (Job and Pod) after completion. Set the TTL to 10 minutes; can be changed.
  • ActiveDeadlineSeconds - deadline before the objects are killed. Set to 2 minutes for the Pod (after consultation) and 2 minutes 35 seconds for the Job.
  • SuccessfulJobsHistoryLimit - set to 1. The default is 3; reduced to prevent unnecessary object clutter.

ActiveDeadlineSeconds at the Job level was set to 2 minutes 35 seconds in order to keep the pod. For some reason the pod was not retained once the deadline was reached; after a lot of testing I found that the pod is only kept after it moves into Error status, about 30 seconds after the deadline is reached and the kill order is issued (plus an extra 5 seconds to be sure).
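A rough timing sketch of how those numbers relate, per the explanation above (the constant names are illustrative, not from the repo):

```go
package example

// The Job deadline exceeds the Pod deadline by the ~30s it takes the killed
// Pod to reach Error status, plus a 5s safety buffer, so the failed Pod is
// retained for inspection rather than deleted together with its Job.
const (
	podDeadlineSeconds   = 120 // Pod-level activeDeadlineSeconds
	errorStatusDelay     = 30  // observed delay before the Pod shows Error
	safetyMargin         = 5   // extra "to be sure"
	jobDeadlineSeconds   = podDeadlineSeconds + errorStatusDelay + safetyMargin // = 155
	ttlAfterFinished     = 600 // keep Job and Pod objects for 10 minutes
	successfulJobHistory = 1   // reduced from the default of 3
)
```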

@leelavg
Contributor

leelavg commented Nov 20, 2023

@bernerhat could you please add controllers as the prefix to the commit message to pass the GitHub Actions check?

  • Missed; the GH action is still failing.

var podDeadLineSeconds int64 = 120
var keepJobResourceSeconds int32 = 600
var reducedKeptSuccecsful int32 = 1


_, err := controllerutil.CreateOrUpdate(s.ctx, s.Client, cronJob, func() error {
cronJob.Spec = batchv1.CronJobSpec{
Contributor

Use startingDeadlineSeconds as well, just to safeguard against the 100-missed-jobs failure case with Forbid. It can be between 30 and 60 seconds.
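For reference, a minimal sketch of that optional safeguard (the helper name and the chosen value are assumptions): without startingDeadlineSeconds, the CronJob controller refuses to start a Job once it counts more than 100 missed schedules, and setting the field bounds how far back it counts.

```go
package example

import batchv1 "k8s.io/api/batch/v1"

// addStartingDeadline is a hypothetical helper for the suggested safeguard:
// missed schedules are only counted within this window, so the ">100 missed
// schedules" stop condition cannot accumulate while runs are forbidden.
func addStartingDeadline(cronJob *batchv1.CronJob) {
	var startingDeadlineSeconds int64 = 60 // assumed value at the top of the suggested 30-60s range
	cronJob.Spec.StartingDeadlineSeconds = &startingDeadlineSeconds
}
```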

Contributor Author

I've verified locally that a new job starts right after the previous one's deadline is reached, so I don't believe this is necessary.

Contributor

No, I'm referring to a very edge-case scenario; search for 100 in this and another post.

Contributor Author

It's only referring to job scheduling failures. With the deadline (to finish) in place, there won't be more than two skipped schedules per iteration at most; once the 155-second deadline is reached, a new iteration spawns and the counter of missed jobs is reset. There is still the possibility that the CronJob controller is down for a long time (more than 100 minutes), which would cause the issue you're referring to, but wouldn't that be considered a cluster issue at that point? I can add the field, but I'm unsure whether it would interfere with the other fields I've added, and it would need additional testing.

Contributor

It's not mandatory, the reasoning is good enough.

@bernerhat
Contributor Author

@bernerhat could you please add controllers as the prefix to the commit message to pass the GitHub Actions check?

  • Missed; the GH action is still failing.

I did not miss it; it's still failing for the first commit. Is there a way to fix that, or should I re-create the PR?

added the capability to keep resources on failed job execution with a timeout

Signed-off-by: Amit Berner <[email protected]>
@nb-ohad
Contributor

nb-ohad commented Nov 20, 2023

/approve


openshift-ci bot commented Nov 20, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bernerhat, leelavg, nb-ohad

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

nb-ohad merged commit 3a8f949 into red-hat-storage:fusion-hci-4.14 on Nov 20, 2023
13 of 14 checks passed

openshift-ci bot commented Nov 20, 2023

@bernerhat: All pull requests linked via external trackers have merged:

Bugzilla bug 2247748 has been moved to the MODIFIED state.

In response to this:

Bug 2247748: A storage-client CronJob creates too many jobs and pods, causing the maxPod limit to be reached

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
