Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Support Kubeflow MPIJob in MultiKueue #2880

Merged

Conversation

mszadkow
Copy link
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

The PR introduces a new MultiKueue adapter to handle MPIJob (Kubeflow).
We want to extend MultiKueue capabilities to satisfy the needs of early adopters.

Which issue(s) this PR fixes:

Fixes #2552

Special notes for your reviewer:

Does this PR introduce a user-facing change?

MultiKueue: Support for the Kubeflow MPIJob

@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. labels Aug 23, 2024
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Aug 23, 2024
Copy link

netlify bot commented Aug 23, 2024

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit 8daa5f7
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/66e1f6c187261800080eb51e

@mszadkow
Copy link
Contributor Author

/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Aug 23, 2024
@mszadkow mszadkow force-pushed the feature/support-kubeflow-mpijob-in-mk branch from 1182083 to 5537a91 Compare August 23, 2024 13:20
@mszadkow
Copy link
Contributor Author

/retest-required

1 similar comment
@mszadkow
Copy link
Contributor Author

/retest-required

@mbobrovskyi
Copy link
Contributor

/test pull-kueue-test-integration-main

Due to #2901.

@mszadkow
Copy link
Contributor Author

/retest-required

@mszadkow mszadkow force-pushed the feature/support-kubeflow-mpijob-in-mk branch from bfebd01 to e231672 Compare August 27, 2024 10:50
@mszadkow
Copy link
Contributor Author

/retest-required

@mszadkow
Copy link
Contributor Author

/retest

@mszadkow mszadkow force-pushed the feature/support-kubeflow-mpijob-in-mk branch from b27cba7 to 11f84a7 Compare August 27, 2024 12:23
@mszadkow
Copy link
Contributor Author

/retest

@mszadkow mszadkow force-pushed the feature/support-kubeflow-mpijob-in-mk branch from 4af4abb to 25c5547 Compare August 27, 2024 13:03
@mszadkow
Copy link
Contributor Author

/retest

@mszadkow
Copy link
Contributor Author

mszadkow commented Sep 6, 2024

@alculquicondor I think everything from @mbobrovskyi was addressed

Copy link
Contributor

@mbobrovskyi mbobrovskyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM. Only this warning left:

# Warning: 'patchesStrategicMerge' is deprecated. Please use 'patches' instead. Run 'kustomize edit fix' to update your Kustomization automatically.

But I think we can fix it on follow-up.

Thanks!

hack/multikueue-e2e-test.sh Outdated Show resolved Hide resolved
mkdir -p $(EXTERNAL_CRDS_DIR)/training-operator/
cp -f $(KF_TRAINING_ROOT)/manifests/base/crds/* $(EXTERNAL_CRDS_DIR)/training-operator/
cp -prf $(KF_TRAINING_ROOT)/manifests/* $(EXTERNAL_CRDS_DIR)/training-operator/
## Removing kubeflow.org_mpijobs.yaml is required as the version of MPIJob is conflicting between training-operator and mpi-operator - in integration tests.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the details!

It makes me think that we need a note on this page https://kueue.sigs.k8s.io/docs/tasks/run/kubeflow/mpijobs/ explaining that you need to disable the MPI from training-operator if using both.

hack/e2e-common.sh Show resolved Hide resolved
hack/multikueue-e2e-test.sh Show resolved Hide resolved
pkg/controller/jobframework/validation.go Outdated Show resolved Hide resolved
pkg/controller/jobs/mpijob/mpijob_multikueue_adapter.go Outdated Show resolved Hide resolved
pkg/util/testingjobs/mpijob/wrappers_mpijob.go Outdated Show resolved Hide resolved
Comment on lines 1 to 4
apiVersion: v1
kind: Namespace
metadata:
name: kubeflow
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this namespace? Are the tests creating any objects in it?

test/integration/multikueue/multikueue_test.go Outdated Show resolved Hide resolved
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Sep 9, 2024
# 2. Training-operator deployment is modified to enable all kubeflow jobs except for mpi - https://github.com/kubeflow/training-operator/issues/1777

# Modify the `newTag` for the `kubeflow/training-operator` to use the one training-operator version
$YQ eval '(.images[] | select(.name == "kubeflow/training-operator").newTag) = env(KUBEFLOW_IMAGE_VERSION)' -i "$KUBEFLOW_MANIFEST_MANAGER/kustomization.yaml"
Copy link
Contributor

@mbobrovskyi mbobrovskyi Sep 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you reverting these changes after testing?

Like here:

(cd config/components/manager && $KUSTOMIZE edit set image controller="$IMAGE_TAG")

kueue/hack/e2e-common.sh

Lines 97 to 99 in 467b4ee

function restore_managers_image {
(cd config/components/manager && $KUSTOMIZE edit set image controller="$INITIAL_IMAGE")
}

If not, users may accidentally commit unnecessary changes to this file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't, good point I will add the restore

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this file outside of the tracked files?

Copy link
Contributor

@mbobrovskyi mbobrovskyi Sep 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. It's on the git and yq will edit this file.

@mszadkow mszadkow force-pushed the feature/support-kubeflow-mpijob-in-mk branch from 10f0399 to 37f9a07 Compare September 11, 2024 19:53
@mszadkow mszadkow force-pushed the feature/support-kubeflow-mpijob-in-mk branch from 37f9a07 to 8daa5f7 Compare September 11, 2024 19:59
@mszadkow
Copy link
Contributor Author

/retest

Copy link
Contributor

@alculquicondor alculquicondor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, mszadkow

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 11, 2024
@alculquicondor
Copy link
Contributor

Leaving LGTM to @mbobrovskyi

@mbobrovskyi
Copy link
Contributor

/lgtm

Thanks!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 12, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: e6c09ab9b855734fcb0824b61368ab7fe86456f5

@k8s-ci-robot k8s-ci-robot merged commit fde44ba into kubernetes-sigs:main Sep 12, 2024
16 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.9 milestone Sep 12, 2024
@mbobrovskyi mbobrovskyi deleted the feature/support-kubeflow-mpijob-in-mk branch September 12, 2024 02:03
@mbobrovskyi
Copy link
Contributor

/cherry-pick website

@k8s-infra-cherrypick-robot

@mbobrovskyi: #2880 failed to apply on top of branch "website":

Applying: Introduce MultiKueue adapter for MPIJob
Applying: Add MPIJobs integration tests
Using index info to reconstruct a base tree...
M	test/integration/multikueue/multikueue_test.go
M	test/integration/multikueue/suite_test.go
Falling back to patching base and 3-way merge...
Auto-merging test/integration/multikueue/suite_test.go
CONFLICT (content): Merge conflict in test/integration/multikueue/suite_test.go
Auto-merging test/integration/multikueue/multikueue_test.go
CONFLICT (content): Merge conflict in test/integration/multikueue/multikueue_test.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0002 Add MPIJobs integration tests
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick website

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@mbobrovskyi
Copy link
Contributor

@mszadkow IIUC we have changes on site. Could you please prepare cherry-pick to website branch?

@tenzen-y
Copy link
Member

@mszadkow IIUC we have changes on site. Could you please prepare cherry-pick to website branch?

We should not merge this PR to the website branch since this feature has not been released yet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support Kubeflow Jobs in MultiKueue
6 participants