Add arguments to pass Ray cluster head and worker templates #570

astefanutti · 2024-06-21T15:28:51Z

Issue link

Fixes #413.

What changes have been made

This PR adds parameters to the Ray cluster configuration API, so users can provide Pod templates for the head and worker nodes, e.g.:

from kubernetes.client import V1PodTemplateSpec, V1PodSpec, V1Toleration

cluster = Cluster(ClusterConfiguration(
    worker_template=V1PodTemplateSpec(
        spec=V1PodSpec(
            tolerations=[V1Toleration(
                key="nvidia.com/gpu",
                operator="Exists",
                effect="NoSchedule",
            )],
            node_selector={
                "nvidia.com/gpu.present": "true",
            },
        )
    ),
    head_template=V1PodTemplateSpec(...),
))

Verification steps

Checks

I've made sure the tests are passing.
Testing Strategy
- Unit tests
- Manual tests
- Testing is not required for this change

openshift-ci · 2024-06-21T15:28:57Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from astefanutti. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

KPostOffice

I have one comment, which pertains to the scope of what the user can provide. I think it might require more in-depth discussion.

KPostOffice · 2024-06-21T21:33:32Z

src/codeflare_sdk/utils/generate_yaml.py

+
+def apply_worker_template(cluster_yaml: dict, worker_template: client.V1PodTemplateSpec):
+    worker = cluster_yaml.get("spec").get("workerGroupSpecs")[0]
+    merge(worker["template"], worker_template.to_dict(), strategy=Strategy.ADDITIVE)


How should a user edit the container spec? Specifying a partial container spec will only add a new partial entry to the containers list. This becomes an issue when specifying something like extra volume mounts.

Right, ideally we'd want to leverage strategic merge patch for that: https://kubernetes.io/docs/tasks/manage-kubernetes-objects/update-api-object-kubectl-patch/#notes-on-the-strategic-merge-patch. It's easy to do in Go by using https://github.com/kubernetes/apimachinery/blob/master/pkg/util/strategicpatch/patch.go, but I don't know if that's possible in Python.

With mergedeep and the ADDITIVE strategy, lists are not replaced; their elements are appended instead. The problem is that something like:

cluster = Cluster(ClusterConfiguration(
    head_template=V1PodTemplateSpec(
        spec=V1PodSpec(
            containers=[
                V1Container(name="ray-head", volume_mounts=[
                    V1VolumeMount(name="config", mount_path="path"),
                ])
            ],
            volumes=[
                V1Volume(name="config", config_map=V1ConfigMapVolumeSource(name="config")),
            ],
        )
    ),
))

appends an entire new container, while it should only append the extra volume mount to the existing container.
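The duplicate-container behaviour can be reproduced without any dependencies. The following is only a minimal sketch: `additive_merge` is a hypothetical, simplified stand-in for an additive merge (it is not mergedeep's implementation), and the dictionaries and image name are made up for illustration.

```python
def additive_merge(dest: dict, src: dict) -> dict:
    """Recursively merge src into dest; lists are concatenated, never matched by key."""
    for key, value in src.items():
        if isinstance(value, dict) and isinstance(dest.get(key), dict):
            additive_merge(dest[key], value)
        elif isinstance(value, list) and isinstance(dest.get(key), list):
            dest[key].extend(value)  # append, even if an item with the same name exists
        else:
            dest[key] = value
    return dest


# Hypothetical base template, as it might come from the generated cluster YAML.
base = {"spec": {"containers": [{"name": "ray-head", "image": "example/ray:latest"}]}}

# User-provided partial container that only wants to add a volume mount.
patch = {"spec": {"containers": [{
    "name": "ray-head",
    "volumeMounts": [{"name": "config", "mountPath": "path"}],
}]}}

merged = additive_merge(base, patch)
names = [c["name"] for c in merged["spec"]["containers"]]
print(names)  # ['ray-head', 'ray-head']
```

The result contains two entries named ray-head, the second one being the partial container, which is the issue raised in this thread.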

Strategic merge patch solves this, by relying on information like patchMergeKey in Go structs (x-kubernetes-patch-merge-key in CRDs) so it knows how to match items by key.
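To illustrate the idea of matching list items by key, here is a rough sketch in plain Python. It is not the Kubernetes strategic merge patch algorithm: `merge_by_key` and the `MERGE_KEYS` table are hypothetical, and the table is hand-written for this example rather than derived from the API types.

```python
import copy

# Hand-written for this sketch. Kubernetes declares the equivalent information
# through patchMergeKey / x-kubernetes-patch-merge-key.
MERGE_KEYS = {"containers": "name", "volumeMounts": "mountPath", "volumes": "name"}


def merge_by_key(dest: dict, src: dict) -> dict:
    """Merge src into dest, matching list items on their merge key when one is known."""
    for field, value in src.items():
        current = dest.get(field)
        if isinstance(value, dict) and isinstance(current, dict):
            merge_by_key(current, value)
        elif isinstance(value, list) and isinstance(current, list) and field in MERGE_KEYS:
            id_key = MERGE_KEYS[field]
            by_id = {item.get(id_key): item for item in current}
            for item in value:
                match = by_id.get(item.get(id_key))
                if match is None:
                    current.append(item)        # no item with that key yet: append
                else:
                    merge_by_key(match, item)   # same key: merge into the existing item
        else:
            dest[field] = value                 # scalars and unkeyed lists are overwritten
    return dest


base = {"spec": {"containers": [{
    "name": "ray-head",
    "image": "example/ray:latest",
    "volumeMounts": [{"name": "logs", "mountPath": "/tmp/ray"}],
}]}}
patch = {"spec": {
    "containers": [{"name": "ray-head", "volumeMounts": [{"name": "config", "mountPath": "path"}]}],
    "volumes": [{"name": "config", "configMap": {"name": "config"}}],
}}

merged = merge_by_key(copy.deepcopy(base), patch)
container = merged["spec"]["containers"][0]
print(len(merged["spec"]["containers"]), [m["mountPath"] for m in container["volumeMounts"]])
# 1 ['/tmp/ray', 'path']
```

With the key table in place, the partial ray-head container is merged into the existing one and only the extra volume mount is appended.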

I couldn't seem to find anything for merging V1PodTemplate specs, only namespaced pods by sending API requests. There's this package https://pypi.org/project/jsonmerge/ which we could use, it would require maintaining a bit of redundant config however 😒.

Yes, I actually stumbled upon jsonmerge this morning as well. It seems it'd be possible to provide a merge strategy by key for lists. Let's give it a try? I agree we'd have to maintain some config, unless we figure out a way to leverage the Kubernetes JSON schema https://github.com/yannh/kubernetes-json-schema?tab=readme-ov-file, which does contain x-kubernetes-patch-merge-key information.
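As a rough idea of what leveraging the schema could look like, the sketch below walks a JSON schema and collects the x-kubernetes-patch-merge-key annotations it finds. `collect_merge_keys` is a hypothetical helper, the schema fragment is hand-written for illustration, and the sketch assumes a fully inlined schema (it does not resolve `$ref`).

```python
def collect_merge_keys(schema: dict, path: str = "") -> dict:
    """Return {property path: merge key} for every annotated list property."""
    found = {}
    for name, prop in schema.get("properties", {}).items():
        prop_path = f"{path}/{name}"
        merge_key = prop.get("x-kubernetes-patch-merge-key")
        if merge_key is not None:
            found[prop_path] = merge_key
        # Recurse into nested objects and into the item schema of arrays.
        found.update(collect_merge_keys(prop, prop_path))
        found.update(collect_merge_keys(prop.get("items", {}), prop_path))
    return found


# Hand-written fragment, loosely modelled on a Pod spec schema.
schema = {
    "properties": {
        "containers": {
            "type": ["array", "null"],
            "x-kubernetes-patch-merge-key": "name",
            "x-kubernetes-patch-strategy": "merge",
            "items": {"properties": {
                "volumeMounts": {
                    "type": ["array", "null"],
                    "x-kubernetes-patch-merge-key": "mountPath",
                    "x-kubernetes-patch-strategy": "merge",
                    "items": {"properties": {"mountPath": {"type": "string"}}},
                },
            }},
        },
        "tolerations": {"type": ["array", "null"], "items": {"properties": {}}},
    }
}

merge_keys = collect_merge_keys(schema)
print(merge_keys)
# {'/containers': 'name', '/containers/volumeMounts': 'mountPath'}
```

A table like this could then drive whichever merge implementation is chosen, instead of being maintained by hand.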

@varshaprasad96 maybe you would have some ideas?

One step better than just additive merging could be calculating the diff and then merging it (deepdiff and then deepmerge, probably with custom merging strategies if needed). But this would still not solve the problem with conflicts, at the very least. It looks like if we want to leverage the JSON schema, we either need to use a live client or load the config to guide the merging process.

Here's a script that we could use as a starting point for generating a jsonmerge schema from a k8s schema. I haven't gotten around to testing it yet. The output looks right based on my understanding of the jsonmerge package documentation, not based on testing done with actual objects. Sorry for the messiness.

def merge_schema_from_k8_schema(k8_schema):
    to_return = {}
    k8_properties = {}
    if "array" in k8_schema.get("type", []) and k8_schema.get("x-kubernetes-list-type") != "atomic":
        to_return["items"] = {}
        to_return["items"]["properties"] = {}
        toret_properties = to_return["items"]["properties"]
        k8_properties = k8_schema.get("items", {}).get("properties", {})
    elif "object" in k8_schema.get("type", []):
        to_return["properties"] = {}
        toret_properties = to_return["properties"]
        k8_properties = k8_schema.get("properties", {})

    for key, value in k8_properties.items():
        if "object" in value.get("type", []) and "properties" in value:
            toret_properties[key] = merge_schema_from_k8_schema(value)
        elif "array" in value.get("type", []) and "items" in value:
            toret_properties[key] = merge_schema_from_k8_schema(value)
        else:
            toret_properties[key] = {"mergeStrategy": "overwrite"}

    if "array" in k8_schema.get("type", []):
        if k8_schema.get("x-kubernetes-list-type") == "set":
            to_return["mergeStrategy"] = "arrayMergeById"
            to_return["idRef"] = "/"
        elif k8_schema.get("x-kubernetes-list-type") == "atomic":
            to_return["mergeStrategy"] = "overwrite"
        elif k8_schema.get("x-kubernetes-list-type") == "map":
            if k8_schema.get("x-kubernetes-patch-merge-key"):
                to_return["mergeStrategy"] = "arrayMergeById"
                to_return["id"] = k8_schema["x-kubernetes-patch-merge-key"]
            else:
                to_return["mergeStrategy"] = "overwrite"
    elif "object" in k8_schema.get("type", []):
        to_return["mergeStrategy"] = "objectMerge"
    else:
        to_return["mergeStrategy"] = "overwrite"
    return to_return

openshift-merge-robot · 2024-06-30T22:47:32Z

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Add arguments to pass Ray cluster head and worker templates

0bc477d

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 21, 2024

KPostOffice reviewed Jun 21, 2024

openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 30, 2024

Bobbins228 mentioned this pull request Jul 1, 2024

RHOAIENG-8098 - ClusterConfiguration can be patched #564

Closed

astefanutti commented Jun 21, 2024

openshift-ci bot commented Jun 21, 2024

KPostOffice left a comment

KPostOffice Jun 21, 2024

astefanutti Jun 24, 2024 (edited)

KPostOffice Jun 24, 2024

astefanutti Jun 24, 2024

astefanutti Jun 24, 2024

varshaprasad96 Jun 24, 2024

KPostOffice Jul 9, 2024 (edited)

openshift-merge-robot commented Jun 30, 2024
