Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

configure-scheduler.sh script breaks Kubernetes config #126

Open
eero-t opened this issue Feb 2, 2023 · 6 comments
Open

configure-scheduler.sh script breaks Kubernetes config #126

eero-t opened this issue Feb 2, 2023 · 6 comments

Comments

@eero-t
Copy link
Contributor

eero-t commented Feb 2, 2023

Updating Kubernetes scheduler config using the script & (GAS) config in this repo, breaks Kubernetes 1.26 config:
./telemetry-aware-scheduling/deploy/extender-configuration/configure-scheduler.sh -f ./gpu-aware-scheduling/deploy/extender-configuration/scheduler-config-tas+gas.yaml

It messed up the last volume specification in the config so that there were couple of extra lines for the last volume specification.

Besides fixing that in the script, I think script should take a backup of the config file, and show user a diff of the changes it did.

Maybe also ask user whether changes should be accepted (unless something like "--yes" is specified), like e.g. Debian configuration packages script updates do.

You might also consider using kustomize to do the file updates, as that is actually designed for semantic updating of k8s YAML files.

(Which still leaves the issue of upgrade overwriting the changes, as mentioned in #86.)

@togashidm
Copy link
Contributor

Hi @eero-t

Thank you for flag this and for your suggestions. We are looking into them.

@madalazar
Copy link
Contributor

Hi @eero-t
would you be able to share both the output of the configure-scheduler.sh run (the output of the command you pasted above) and resulting kube-scheduler.yaml output? I'm expecting the script changed the file in a way that the scheduler would not start anymore. If that's not available, could you share the current kube-scheduler manifest?
I would like to understand what was the problem in the first place and without it it's pretty much impossible.

As for the other suggestions, I have added them to our backlog. Will let you know when they will be picked up

@eero-t
Copy link
Contributor Author

eero-t commented Feb 7, 2023

As script did not provide a backup (and I was not the person updating that cluster to k8s 1.26), I do now know what the starting point was.

If I run the command now (with fixed config), it outputs following:

# ./telemetry-aware-scheduling/deploy/extender-configuration/configure-scheduler.sh -f ./gpu-aware-scheduling/deploy/extender-configuration/scheduler-config-tas+gas.yaml
Manifest file is located at: /etc/kubernetes/manifests/kube-scheduler.yaml
Scheduler config file is located at: ./gpu-aware-scheduling/deploy/extender-configuration/scheduler-config-tas+gas.yaml
Scheduler config destination will be: /etc/kubernetes
Version of the image used in the kube scheduler is: 26
Version of the KubeScheduler API: v1
Will proceed to copy the scheduler configuration to its destination path: /etc/kubernetes.
Determined that the appropriate nodeSelector key would be: control-plan

And does not change kube-scheduler.yaml any more.

Which looks now like this:

...
  hostNetwork: true
  priorityClassName: system-node-critical
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  volumes:
  - hostPath:
      path: /etc/kubernetes/scheduler-config-tas+gas.yaml
    name: schedulerconfig
  - hostPath:
      path: /etc/certs
    name: certdir
  - hostPath:
      path: /etc/kubernetes/scheduler.conf
      type: FileOrCreate
    name: kubeconfig
status: {}

After the initial run of the script, there were additional path: & type: lines for the last (kubeconfig) volume, in addition to path: & type: lines you see above.

@madalazar
Copy link
Contributor

madalazar commented Feb 9, 2023

Hi,
I tried to use the same command as you did and this was my output:

Command

./telemetry-aware-scheduling/deploy/extender-configuration/configure-scheduler.sh -f ./gpu-aware-scheduling/deploy/extender-configuration/scheduler-config-tas+gas.yaml

Initial manifest file

   - mountPath: /etc/kubernetes/scheduler-config.yaml
     name: schedulerconfig
     readOnly: true
   - mountPath: /host/certs
     name: certdir
   - mountPath: /etc/kubernetes/scheduler.conf
     name: kubeconfig
     readOnly: true
 hostNetwork: true
 priorityClassName: system-node-critical
 securityContext:
   seccompProfile:
     type: RuntimeDefault
 volumes:
 - hostPath:
     path: /etc/kubernetes/scheduler-config.yaml
   name: schedulerconfig
 - hostPath:
     path: /etc/certs
   name: certdir
 - hostPath:
     path: /etc/kubernetes/scheduler.conf
     type: FileOrCreate
   name: kubeconfig

Output

  volumeMounts:
    - mountPath: /etc/kubernetes/scheduler-config-tas+gas.yaml
      name: schedulerconfig
      readOnly: true
    - mountPath: /host/certs
      name: certdir
    - mountPath: /etc/kubernetes/scheduler-config.yaml
    - mountPath: /etc/kubernetes/scheduler.conf
      name: kubeconfig
      readOnly: true
  hostNetwork: true
  priorityClassName: system-node-critical
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  volumes:
  - hostPath:
      path: /etc/kubernetes/scheduler-config-tas+gas.yaml
    name: schedulerconfig
  - hostPath:
      path: /etc/certs
    name: certdir
  - hostPath:
      path: /etc/kubernetes/scheduler-config.yaml
      path: /etc/kubernetes/scheduler.conf
      type: FileOrCreate
    name: kubeconfig

The problem I found is here :

-- hostPath:
path: /etc/kubernetes/scheduler-config.yaml
path: /etc/kubernetes/scheduler.conf
type: FileOrCreate
name: kubeconfig

From what I could tell the root cause is this line https://github.com/intel-innersource/libraries.orchestrators.resourcemanagement.platform-aware-scheduling/blob/master/telemetry-aware-scheduling/deploy/extender-configuration/configure-scheduler.sh#L162

sed -i "/$scheduler_config_file/d" "$MANIFEST_FILE"

I started with the default manifest file (deploy/extender/scheduler-config.yaml), but when I specified a new one -f ./gpu-aware-scheduling/deploy/extender-configuration/scheduler-config-tas+gas.yaml, the $MANIFEST_FILE would be scheduler-config-tas+gas.yaml not scheduler-config.yaml so sed can't find and won't remove it.

Not sure if this is exactly the same problem as in my case, the scheduler was still working (the scheduler pod started successfully, and then I was able to schedule the demo pod according to the TAS health-demo policy).
I created a follow-up item for us to work on to address this particular problem.

In terms of priorities, I think we will be starting with 2 items first: fixing the bug above and having the file create backups as I think these are more urgent. I'll be picking these up in the next sprint (following week).
I will update this issue once they will be released.

@eero-t
Copy link
Contributor Author

eero-t commented Feb 27, 2023

IMHO YAML modifications should be done with something that actually understands YAML (e.g. Python script using pyyaml module), otherwise it's too easy for the script to break things.

@madalazar
Copy link
Contributor

Hi,

Apologies for the late reply. I logged an issue in our backlog about this problem (it was also referenced here #86), but we weren't able to pick-up the work so far.
I'm working on a new release for TAS, but unfortunately this change did not make it.

My plan is to look this week/Monday at what changes we can work on in March and then what we plan for the next quarter. I'll be able to give a more specific date then.

Just to keep things clear, I will continue to update this issue regarding the changes required for the configuration script and just update issue#86 with regards to your last request.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants