Support TFJob(kubeflow) in Multikueue #2626

Merged

Conversation

mszadkow
Contributor

@mszadkow mszadkow commented Jul 16, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR introduces a new MultiKueue adapter that handles the Kubeflow TFJob.
It extends MultiKueue's capabilities to meet the needs of early adopters.

Which issue(s) this PR fixes:

Relates #2552

Special notes for your reviewer:

Does this PR introduce a user-facing change?

MultiKueue: Support for the Kubeflow TFJob

@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. labels Jul 16, 2024
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 16, 2024
netlify bot commented Jul 16, 2024

Deploy Preview for kubernetes-sigs-kueue ready!

Name Link
🔨 Latest commit dd4dbe9
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/66a9dc76fcc1c900081a8911
😎 Deploy Preview https://deploy-preview-2626--kubernetes-sigs-kueue.netlify.app

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jul 16, 2024
@mszadkow mszadkow force-pushed the feature/support-kubeflow-in-mk branch from 51ef408 to 7845aa5 Compare July 16, 2024 13:53
@mszadkow
Contributor Author

/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Jul 17, 2024
@mszadkow mszadkow force-pushed the feature/support-kubeflow-in-mk branch 2 times, most recently from fed964d to c8ce51a Compare July 18, 2024 10:28
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jul 18, 2024
@mszadkow mszadkow force-pushed the feature/support-kubeflow-in-mk branch from c8ce51a to 4c252f9 Compare July 18, 2024 10:29
@mszadkow
Contributor Author

/retest

@mszadkow mszadkow force-pushed the feature/support-kubeflow-in-mk branch from 03e2736 to 7848c26 Compare July 22, 2024 11:14
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jul 22, 2024
@mszadkow
Contributor Author

/retest

@mszadkow mszadkow force-pushed the feature/support-kubeflow-in-mk branch from 7848c26 to 3e314a8 Compare July 22, 2024 14:54
@mszadkow
Contributor Author

/retest

pkg/controller/admissionchecks/multikueue/indexer_test.go Outdated Show resolved Hide resolved
hack/multikueue-e2e-test.sh Outdated Show resolved Hide resolved
test/e2e/multikueue/e2e_test.go Outdated Show resolved Hide resolved
@mszadkow mszadkow force-pushed the feature/support-kubeflow-in-mk branch from 3e314a8 to 74aeec5 Compare July 23, 2024 07:52
@mszadkow
Contributor Author

/retest

@mszadkow mszadkow force-pushed the feature/support-kubeflow-in-mk branch from 74aeec5 to 702e214 Compare July 23, 2024 08:11
@mszadkow
Contributor Author

/retest

@mszadkow mszadkow marked this pull request as ready for review July 23, 2024 16:37
@mimowo
Contributor

mimowo commented Jul 30, 2024

@mszadkow please ping us in a comment when the PR is ready for the second pass after the updates.

@mszadkow mszadkow force-pushed the feature/support-kubeflow-in-mk branch from 83ea5c9 to 3b81bb6 Compare July 30, 2024 10:44
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 30, 2024
@k8s-ci-robot k8s-ci-robot requested a review from trasc July 30, 2024 10:44
@mszadkow
Contributor Author

@tenzen-y @mimowo
I think it's ready

@mszadkow
Contributor Author

/retest

@mszadkow mszadkow force-pushed the feature/support-kubeflow-in-mk branch from 3b81bb6 to 63305cf Compare July 30, 2024 11:58
Contributor

@trasc trasc left a comment

some nits, otherwise LGTM

pkg/controller/jobs/job/job_multikueue_adapter.go Outdated Show resolved Hide resolved
pkg/controller/jobs/jobset/jobset_multikueue_adapter.go Outdated Show resolved Hide resolved
site/content/en/docs/concepts/multikueue.md Outdated Show resolved Hide resolved
@@ -74,6 +74,16 @@ kubectl apply --server-side -f https://raw.githubusercontent.com/kubernetes-sigs
```
{{% /alert %}}

### Kubeflow Installation
Contributor

Add a link to the Kubeflow installation guide, and note that the installation is to be done in the worker clusters.

test/e2e/multikueue/e2e_test.go Show resolved Hide resolved
@mszadkow mszadkow force-pushed the feature/support-kubeflow-in-mk branch from 63305cf to bd4e440 Compare July 30, 2024 14:59
Member

@tenzen-y tenzen-y left a comment

Some minor change suggestions.
/approve

@@ -68,6 +68,12 @@ The `managedBy` field is available as an Alpha feature staring Kubernetes 1.30.0

We recommend using JobSet v0.5.1 or newer.

### Kubeflow

The supported version of the Kubeflow Training Operator is v1.7.0.
Member

Suggested change
- The supported version of the Kubeflow Training Operator is v1.7.0.
+ The supported version of the Kubeflow Training Operator is v1.7.0, or a newer version.

### Kubeflow Installation

Install Kubeflow Training-operator in the Worker cluster (see [Kubeflow Training-operator Installation](https://www.kubeflow.org/docs/components/training/installation/)
for more details). Please use version v1.7.0 for MultiKueue.
Member

Suggested change
- for more details). Please use version v1.7.0 for MultiKueue.
+ for more details). Please use version v1.7.0 or a newer version for MultiKueue.

### Kubeflow Installation

{{% alert title="Warning" color="warning" %}}
Make sure to only install the Kubeflow TFJobs CRD of version v1.7.0 on the management cluster.
Member

Suggested change
- Make sure to only install the Kubeflow TFJobs CRD of version v1.7.0 on the management cluster.
+ Make sure to install only the Kubeflow TFJobs CRD of version v1.7.0 or newer on the management cluster.

Make sure to only install the Kubeflow TFJobs CRD of version v1.7.0 on the management cluster.

```bash
kubectl apply --server-side -f https://github.com/kubeflow/training-operator/blob/v1.7.0/manifests/base/crds/kubeflow.org_tfjobs.yaml
Member

Suggested change
- kubectl apply --server-side -f https://github.com/kubeflow/training-operator/blob/v1.7.0/manifests/base/crds/kubeflow.org_tfjobs.yaml
+ kubectl apply --server-side -f https://github.com/kubeflow/training-operator/blob/v1.8.0/manifests/base/crds/kubeflow.org_tfjobs.yaml

Let's use the latest Kubeflow Training Operator version.
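
A side note on the command in this suggestion: a github.com `/blob/` URL serves the HTML page for the file rather than the manifest itself, so `kubectl apply -f` is unlikely to parse it. Below is a minimal sketch, assuming the raw.githubusercontent.com form of the same path is what is intended. `to_raw_url` is a hypothetical helper introduced here for illustration and is not part of the PR.

```shell
# Hypothetical helper (not part of the PR): rewrite a github.com "blob" URL
# into the raw.githubusercontent.com form, which serves the file contents.
# URLs that do not match the blob pattern are printed unchanged.
to_raw_url() {
  printf '%s\n' "$1" | sed -E 's#^https://github\.com/([^/]+)/([^/]+)/blob/(.+)$#https://raw.githubusercontent.com/\1/\2/\3#'
}

# The v1.8.0 CRD URL from the suggestion above.
CRD_URL="https://github.com/kubeflow/training-operator/blob/v1.8.0/manifests/base/crds/kubeflow.org_tfjobs.yaml"
RAW_URL="$(to_raw_url "$CRD_URL")"
echo "$RAW_URL"

# Applying the manifest needs access to the management cluster,
# so the command is left commented out in this sketch:
# kubectl apply --server-side -f "$RAW_URL"
```

Only the host and the `/blob/` path segment change; the tag and file path stay as written in the suggestion.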

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mszadkow, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 31, 2024
Contributor

@trasc trasc left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 31, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: c69f1a8dde6f7c068ec57fb0f2838318a4fb45a7

@k8s-ci-robot k8s-ci-robot merged commit 3b8d828 into kubernetes-sigs:main Jul 31, 2024
16 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.9 milestone Jul 31, 2024
@trasc trasc deleted the feature/support-kubeflow-in-mk branch July 31, 2024 07:46
@tenzen-y
Member

/release-note-edit

MultiKueue: Support for the Kubeflow TFJob

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Jul 31, 2024
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
5 participants