configure-scheduler.sh script breaks Kubernetes config #126

eero-t · 2023-02-02T17:42:06Z

Updating Kubernetes scheduler config using the script & (GAS) config in this repo, breaks Kubernetes 1.26 config:
./telemetry-aware-scheduling/deploy/extender-configuration/configure-scheduler.sh -f ./gpu-aware-scheduling/deploy/extender-configuration/scheduler-config-tas+gas.yaml

It messed up the last volume specification in the config so that there were couple of extra lines for the last volume specification.

Besides fixing that in the script, I think script should take a backup of the config file, and show user a diff of the changes it did.

Maybe also ask user whether changes should be accepted (unless something like "--yes" is specified), like e.g. Debian configuration packages script updates do.

You might also consider using kustomize to do the file updates, as that is actually designed for semantic updating of k8s YAML files.

(Which still leaves the issue of upgrade overwriting the changes, as mentioned in #86.)

The text was updated successfully, but these errors were encountered:

togashidm · 2023-02-03T09:23:07Z

Hi @eero-t

Thank you for flag this and for your suggestions. We are looking into them.

madalazar · 2023-02-07T18:00:01Z

Hi @eero-t
would you be able to share both the output of the configure-scheduler.sh run (the output of the command you pasted above) and resulting kube-scheduler.yaml output? I'm expecting the script changed the file in a way that the scheduler would not start anymore. If that's not available, could you share the current kube-scheduler manifest?
I would like to understand what was the problem in the first place and without it it's pretty much impossible.

As for the other suggestions, I have added them to our backlog. Will let you know when they will be picked up

eero-t · 2023-02-07T18:16:22Z

As script did not provide a backup (and I was not the person updating that cluster to k8s 1.26), I do now know what the starting point was.

If I run the command now (with fixed config), it outputs following:

# ./telemetry-aware-scheduling/deploy/extender-configuration/configure-scheduler.sh -f ./gpu-aware-scheduling/deploy/extender-configuration/scheduler-config-tas+gas.yaml
Manifest file is located at: /etc/kubernetes/manifests/kube-scheduler.yaml
Scheduler config file is located at: ./gpu-aware-scheduling/deploy/extender-configuration/scheduler-config-tas+gas.yaml
Scheduler config destination will be: /etc/kubernetes
Version of the image used in the kube scheduler is: 26
Version of the KubeScheduler API: v1
Will proceed to copy the scheduler configuration to its destination path: /etc/kubernetes.
Determined that the appropriate nodeSelector key would be: control-plan

And does not change kube-scheduler.yaml any more.

Which looks now like this:

...
  hostNetwork: true
  priorityClassName: system-node-critical
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  volumes:
  - hostPath:
      path: /etc/kubernetes/scheduler-config-tas+gas.yaml
    name: schedulerconfig
  - hostPath:
      path: /etc/certs
    name: certdir
  - hostPath:
      path: /etc/kubernetes/scheduler.conf
      type: FileOrCreate
    name: kubeconfig
status: {}

After the initial run of the script, there were additional path: & type: lines for the last (kubeconfig) volume, in addition to path: & type: lines you see above.

madalazar · 2023-02-09T11:09:21Z

Hi,
I tried to use the same command as you did and this was my output:

Command

./telemetry-aware-scheduling/deploy/extender-configuration/configure-scheduler.sh -f ./gpu-aware-scheduling/deploy/extender-configuration/scheduler-config-tas+gas.yaml

Initial manifest file

   - mountPath: /etc/kubernetes/scheduler-config.yaml
     name: schedulerconfig
     readOnly: true
   - mountPath: /host/certs
     name: certdir
   - mountPath: /etc/kubernetes/scheduler.conf
     name: kubeconfig
     readOnly: true
 hostNetwork: true
 priorityClassName: system-node-critical
 securityContext:
   seccompProfile:
     type: RuntimeDefault
 volumes:
 - hostPath:
     path: /etc/kubernetes/scheduler-config.yaml
   name: schedulerconfig
 - hostPath:
     path: /etc/certs
   name: certdir
 - hostPath:
     path: /etc/kubernetes/scheduler.conf
     type: FileOrCreate
   name: kubeconfig

Output

  volumeMounts:
    - mountPath: /etc/kubernetes/scheduler-config-tas+gas.yaml
      name: schedulerconfig
      readOnly: true
    - mountPath: /host/certs
      name: certdir
    - mountPath: /etc/kubernetes/scheduler-config.yaml
    - mountPath: /etc/kubernetes/scheduler.conf
      name: kubeconfig
      readOnly: true
  hostNetwork: true
  priorityClassName: system-node-critical
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  volumes:
  - hostPath:
      path: /etc/kubernetes/scheduler-config-tas+gas.yaml
    name: schedulerconfig
  - hostPath:
      path: /etc/certs
    name: certdir
  - hostPath:
      path: /etc/kubernetes/scheduler-config.yaml
      path: /etc/kubernetes/scheduler.conf
      type: FileOrCreate
    name: kubeconfig

The problem I found is here :

-- hostPath:
path: /etc/kubernetes/scheduler-config.yaml
path: /etc/kubernetes/scheduler.conf
type: FileOrCreate
name: kubeconfig

From what I could tell the root cause is this line https://github.com/intel-innersource/libraries.orchestrators.resourcemanagement.platform-aware-scheduling/blob/master/telemetry-aware-scheduling/deploy/extender-configuration/configure-scheduler.sh#L162

sed -i "/$scheduler_config_file/d" "$MANIFEST_FILE"

I started with the default manifest file (deploy/extender/scheduler-config.yaml), but when I specified a new one -f ./gpu-aware-scheduling/deploy/extender-configuration/scheduler-config-tas+gas.yaml, the $MANIFEST_FILE would be scheduler-config-tas+gas.yaml not scheduler-config.yaml so sed can't find and won't remove it.

Not sure if this is exactly the same problem as in my case, the scheduler was still working (the scheduler pod started successfully, and then I was able to schedule the demo pod according to the TAS health-demo policy).
I created a follow-up item for us to work on to address this particular problem.

In terms of priorities, I think we will be starting with 2 items first: fixing the bug above and having the file create backups as I think these are more urgent. I'll be picking these up in the next sprint (following week).
I will update this issue once they will be released.

eero-t · 2023-02-27T10:27:03Z

IMHO YAML modifications should be done with something that actually understands YAML (e.g. Python script using pyyaml module), otherwise it's too easy for the script to break things.

madalazar · 2023-03-09T15:04:51Z

Hi,

Apologies for the late reply. I logged an issue in our backlog about this problem (it was also referenced here #86), but we weren't able to pick-up the work so far.
I'm working on a new release for TAS, but unfortunately this change did not make it.

My plan is to look this week/Monday at what changes we can work on in March and then what we plan for the next quarter. I'll be able to give a more specific date then.

Just to keep things clear, I will continue to update this issue regarding the changes required for the configuration script and just update issue#86 with regards to your last request.

eero-t mentioned this issue Apr 12, 2023

Claiming i915 resources in a single container has 2 cards accessible for Intel GPU Flex 140 intel/intel-device-plugins-for-kubernetes#1377

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

configure-scheduler.sh script breaks Kubernetes config #126

configure-scheduler.sh script breaks Kubernetes config #126

eero-t commented Feb 2, 2023

togashidm commented Feb 3, 2023

madalazar commented Feb 7, 2023

eero-t commented Feb 7, 2023 •

edited

Loading

madalazar commented Feb 9, 2023 •

edited

Loading

eero-t commented Feb 27, 2023

madalazar commented Mar 9, 2023

configure-scheduler.sh script breaks Kubernetes config #126

configure-scheduler.sh script breaks Kubernetes config #126

Comments

eero-t commented Feb 2, 2023

togashidm commented Feb 3, 2023

madalazar commented Feb 7, 2023

eero-t commented Feb 7, 2023 • edited Loading

madalazar commented Feb 9, 2023 • edited Loading

Command

Initial manifest file

Output

eero-t commented Feb 27, 2023

madalazar commented Mar 9, 2023

eero-t commented Feb 7, 2023 •

edited

Loading

madalazar commented Feb 9, 2023 •

edited

Loading