Support Kubeflow Jobs in MultiKueue #2552

Open
1 of 3 tasks
alculquicondor opened this issue Jul 8, 2024 · 8 comments · Fixed by #2880
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@alculquicondor
Contributor

What would you like to be added:

Support for Kubeflow Jobs in MultiKueue, in particular, for TFJob and PyTorchJob.
Ideally, the implementation should be mostly common among all job types.

Kubeflow Jobs don't support managedBy yet, so, for now, we can only support the scenario where the manager cluster doesn't have the Kubeflow controller installed.
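
To illustrate why the missing managedBy field forces this restriction, here is a minimal sketch in Python. It is not code from Kueue or the training-operator: `JobSpec`, `operator_should_reconcile`, and `runs_on_manager` are hypothetical names, and the two controller-name strings are assumed values used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed controller names, for illustration only; check the Kueue and
# training-operator sources for the actual constants.
TRAINING_OPERATOR = "kubeflow.org/training-operator"
MULTIKUEUE = "kueue.x-k8s.io/multikueue"


@dataclass
class JobSpec:
    """Hypothetical, stripped-down stand-in for a TFJob/PyTorchJob spec."""
    name: str
    managed_by: Optional[str] = None  # None models an unset managedBy field


def operator_should_reconcile(job: JobSpec) -> bool:
    """A managedBy-aware operator acts only on jobs it owns (or unowned ones)."""
    return job.managed_by is None or job.managed_by == TRAINING_OPERATOR


def runs_on_manager(job: JobSpec, operator_installed: bool,
                    operator_honors_managed_by: bool) -> bool:
    """Would the manager cluster start pods for this job?"""
    if not operator_installed:
        # The scenario supported today: nothing on the manager reconciles the job.
        return False
    if not operator_honors_managed_by:
        # Without managedBy support, an installed operator reconciles every job.
        return True
    return operator_should_reconcile(job)
```

In this sketch, a job handed to MultiKueue stays idle on the manager cluster either when no operator is installed there, or when the installed operator honors managedBy. An installed operator that ignores the field would run the job on the manager as well as on the worker cluster.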

Why is this needed:

To continue the incremental improvement of MK and satisfy the needs of early adopters.

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

@alculquicondor added the kind/feature label Jul 8, 2024
@alculquicondor
Contributor Author

/assign @mszadkow

@kannon92
Contributor

The hope is to implement a new version of the training-operator, called TrainJob, that would use JobSet as the base.

I think if that is done, then one could use the managedBy field in JobSet for this.

@alculquicondor
Contributor Author

Right, but we have users requesting this feature today.

@kannon92
Contributor

Should there be some work done to add a managedBy field to the Kubeflow API? Or are we trying to avoid that in favor of a solution that doesn't depend on Kubeflow releases?

@alculquicondor
Contributor Author

We don't need it in the current version of MultiKueue. We can just recommend users not to install the operator in the dispatcher cluster.

@mimowo
Contributor

mimowo commented Sep 12, 2024

/reopen
Let's close it when the ongoing effort of supporting managedBy is complete for the training-operator and MPIJob.

@k8s-ci-robot reopened this Sep 12, 2024
@k8s-ci-robot
Contributor

@mimowo: Reopened this issue.

@tenzen-y
Member

FYI: kubeflow/training-operator#2203 was just merged.
We will include the feature in the next training-operator minor release.

RC.0 with the managedBy feature will be released on January 20th, 2025.

6 participants