Bug 2247748: A storage-client CronJob creates too many jobs and pods, causing the maxPod limit to be reached #39

Merged · 1 commit · Nov 20, 2023
10 changes: 10 additions & 0 deletions controllers/storageclient_controller.go
@@ -433,14 +433,24 @@ func (s *StorageClientReconciler) reconcileClientStatusReporterJob(instance *v1a
	cronJob.Namespace = s.OperatorNamespace
	addLabel(cronJob, storageClientNameLabel, instance.Name)
	addLabel(cronJob, storageClientNamespaceLabel, instance.Namespace)
	var podDeadLineSeconds int64 = 120
	jobDeadLineSeconds := podDeadLineSeconds + 35
	var keepJobResourceSeconds int32 = 600
	var reducedKeptSuccecsful int32 = 1

	_, err := controllerutil.CreateOrUpdate(s.ctx, s.Client, cronJob, func() error {
		cronJob.Spec = batchv1.CronJobSpec{
Contributor

Use startingDeadlineSeconds as well, just to safeguard against the 100-missed-schedules failure with Forbid. It can be between 30 and 60s.

Contributor Author

I've verified locally that a new job starts right after the deadline is reached for the previous one, so I don't believe this is necessary.

Contributor

No, I'm referring to a rare edge case; search for 100 in this and another post.

Contributor Author

That only refers to job scheduling failure. With the deadline (to finish) in place, scheduling won't be skipped more than twice per iteration (at most). Once the 155-second deadline is reached, a new iteration spawns and the counter of missed jobs is reset. There is still the possibility that the CronJob controller is down for a long time (more than 100 minutes), which would cause the issue you are referring to, but wouldn't that be considered a cluster issue at that point? I can add the field, but I'm unsure whether it will conflict with the other fields I've added, and it would need additional testing.

Contributor

It's not mandatory; the reasoning is good enough.

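For context, a minimal sketch of the suggested safeguard (illustrative only: the field was not added in this change, and the 60s value is an assumption inside the reviewer's 30-60s range):

// Hypothetical sketch of the reviewer's suggestion, not part of the merged diff below.
var startingDeadlineSeconds int64 = 60 // assumed value; anywhere between 30 and 60 seconds
// ... set on the same CronJobSpec:
StartingDeadlineSeconds: &startingDeadlineSeconds, // only schedules missed within this window count toward the 100-missed-schedules cutoff
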
			Schedule:                   "* * * * *",
			ConcurrencyPolicy:          batchv1.ForbidConcurrent,
			SuccessfulJobsHistoryLimit: &reducedKeptSuccecsful,
			JobTemplate: batchv1.JobTemplateSpec{
				Spec: batchv1.JobSpec{
					ActiveDeadlineSeconds:   &jobDeadLineSeconds,
					TTLSecondsAfterFinished: &keepJobResourceSeconds,
					Template: corev1.PodTemplateSpec{
						Spec: corev1.PodSpec{
							ActiveDeadlineSeconds: &podDeadLineSeconds,
							Containers: []corev1.Container{
								{
									Name: "heartbeat",
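Taken together, the limits introduced by this change read roughly as follows (an illustrative recap of values already shown in the diff, not additional merged code):

// One heartbeat run is scheduled every minute ("* * * * *"), and Forbid prevents overlapping runs.
var podDeadLineSeconds int64 = 120            // the pod is killed 120s after it starts
jobDeadLineSeconds := podDeadLineSeconds + 35 // the job is marked failed after 155s
var keepJobResourceSeconds int32 = 600        // finished jobs are garbage-collected after 10 minutes
var reducedKeptSuccecsful int32 = 1           // only the most recent successful job is retained
// Together these prevent jobs and pods from accumulating until the node's maxPods limit is reached.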