Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Support managed by external controller #2203

Conversation

mszadkow
Copy link
Contributor

What this PR does / why we need it:
Add support for ManagedBy to be able delegate reconciliation process to external controller.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #2193

Checklist:

  • Docs included if any changes are user facing

Copy link

google-cla bot commented Aug 12, 2024

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@coveralls
Copy link

coveralls commented Aug 12, 2024

Pull Request Test Coverage Report for Build 10849366606

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 100.0%

Totals Coverage Status
Change from base Build 10793288019: 0.0%
Covered Lines: 66
Relevant Lines: 66

💛 - Coveralls

pkg/common/util/webhooks.go Outdated Show resolved Hide resolved
Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this effort @mszadkow!
We will review it soon!
/ok-to-test
/assign @kubeflow/wg-training-leads

@andreyvelich
Copy link
Member

/rerun-all

@mszadkow
Copy link
Contributor Author

mszadkow commented Aug 28, 2024

/rerun-all

@andreyvelich
Copy link
Member

@mszadkow @mimowo I think, I have only one question here: #2203 (comment).

@kubeflow/wg-training-leads Please take a look at this change.

It would be nice if you could also open PR to update the Kubeflow docs with that API change: https://www.kubeflow.org/docs/components/training/user-guides/.

Maybe we can introduce another user-guide on how to modify managedBy API and why it is needed.

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pkg/apis/kubeflow.org/v1/common_types.go Outdated Show resolved Hide resolved
pkg/apis/kubeflow.org/v1/common_types.go Outdated Show resolved Hide resolved
pkg/common/util/webhooks.go Outdated Show resolved Hide resolved
pkg/controller.v1/mpi/mpijob_controller.go Outdated Show resolved Hide resolved
pkg/apis/kubeflow.org/v1/common_types.go Outdated Show resolved Hide resolved
pkg/common/util/webhooks.go Outdated Show resolved Hide resolved
pkg/apis/kubeflow.org/v1/common_types.go Outdated Show resolved Hide resolved
pkg/apis/kubeflow.org/v1/common_types.go Outdated Show resolved Hide resolved
pkg/common/util/webhooks.go Show resolved Hide resolved
pkg/controller.v1/mpi/mpijob_controller.go Outdated Show resolved Hide resolved
@mszadkow mszadkow force-pushed the feature/support-managed-by-external-controller branch from c4f4278 to 34c0b24 Compare September 4, 2024 10:48
@mszadkow
Copy link
Contributor Author

mszadkow commented Sep 4, 2024

Hey @tenzen-y @andreyvelich, I have addressed most of the comments.
What left is:

  1. e2e tests - I would be very grateful for any help or advice on how to run these tests locally.
    From what I see, they are being run in docker sdk/python/Dockerfile.conformance - but seems that they need running cluster or something because that's what I see in result:
test/e2e/test_e2e_xgboostjob.py:41: in <module>
    TRAINING_CLIENT = TrainingClient(job_kind=constants.XGBOOSTJOB_KIND)
kubeflow/training/api/training_client.py:79: in __init__
    config.load_kube_config(config_file=config_file, context=context)
/usr/local/lib/python3.10/site-packages/kubernetes/config/kube_config.py:819: in load_kube_config
    loader = _get_kube_config_loader(
/usr/local/lib/python3.10/site-packages/kubernetes/config/kube_config.py:776: in _get_kube_config_loader
    raise ConfigException(
E   kubernetes.config.config_exception.ConfigException: Invalid kube-config file. No configuration found.
=========================== short test summary info ============================
ERROR test/e2e/test_e2e_mpijob.py - kubernetes.config.config_exception.Config...
ERROR test/e2e/test_e2e_paddlejob.py - kubernetes.config.config_exception.Con...
ERROR test/e2e/test_e2e_pytorchjob.py - kubernetes.config.config_exception.Co...
ERROR test/e2e/test_e2e_tfjob.py - kubernetes.config.config_exception.ConfigE...
ERROR test/e2e/test_e2e_xgboostjob.py - kubernetes.config.config_exception.Co...
!!!!!!!!!!!!!!!!!!! Interrupted: 5 errors during collection !!!!!!!!!!!!!!!!!!!!
============================== 5 errors in 0.33s ===============================
  1. Where should I locate the skip in the end: reconcile, predicate or both?
    There are some situations that @mimowo mentioned that it would be beneficial to have it in both places.
    [Feature] Support managed by external controller #2203 (comment)

Would you agree to make those changes in the follow up possibly?

@tenzen-y
Copy link
Member

tenzen-y commented Sep 4, 2024

  1. e2e tests - I would be very grateful for any help or advice on how to run these tests locally.
    From what I see, they are being run in docker sdk/python/Dockerfile.conformance - but seems that they need running cluster or something because that's what I see in result:

Actually, the conformance container image does not related to our E2E testing.
So, you can reproduce the CI step locally in the following:

  1. Create K8s Cluster:
    - name: Create k8s Kind Cluster
    uses: helm/[email protected]
    with:
    node_image: kindest/node:${{ matrix.kubernetes-version }}
    cluster_name: training-operator-cluster
    kubectl_version: ${{ matrix.kubernetes-version }}
  2. Build and load new container image to K8s cluster:
    - name: Build training-operator
    run: |
    ./scripts/gha/build-image.sh
    env:
    TRAINING_CI_IMAGE: kubeflowtraining/training-operator:test
  3. Deploy the training-operator with new image:
    - name: Deploy training operator
    run: |
    ./scripts/gha/setup-training-operator.sh
    env:
    KIND_CLUSTER: training-operator-cluster
    TRAINING_CI_IMAGE: kubeflowtraining/training-operator:test
    GANG_SCHEDULER_NAME: ${{ matrix.gang-scheduler-name }}
    KUBERNETES_VERSION: ${{ matrix.kubernetes-version }}
  4. install Python library, and perform tests:
    - name: Run tests
    run: |
    pip install pytest
    python3 -m pip install -e sdk/python; pytest -s sdk/python/test --log-cli-level=debug --namespace=default

Signed-off-by: Michal Szadkowski <[email protected]>
@mszadkow mszadkow force-pushed the feature/support-managed-by-external-controller branch from a822a17 to 7d3bb89 Compare September 12, 2024 09:33
@mszadkow mszadkow force-pushed the feature/support-managed-by-external-controller branch from 64c6452 to 40b902d Compare September 12, 2024 12:27
pkg/common/util/webhooks.go Outdated Show resolved Hide resolved
pkg/common/util/webhooks.go Outdated Show resolved Hide resolved
@mszadkow mszadkow force-pushed the feature/support-managed-by-external-controller branch from 40b902d to 353d505 Compare September 12, 2024 12:59
Signed-off-by: Michal Szadkowski <[email protected]>
@mszadkow mszadkow force-pushed the feature/support-managed-by-external-controller branch from e57b7c3 to 80576f6 Compare September 12, 2024 13:41
@mszadkow mszadkow force-pushed the feature/support-managed-by-external-controller branch from c94ab34 to 18a5313 Compare September 13, 2024 10:57
pkg/common/util/webhooks.go Outdated Show resolved Hide resolved
pkg/webhooks/pytorch/pytorchjob_webhook.go Outdated Show resolved Hide resolved
@mszadkow mszadkow force-pushed the feature/support-managed-by-external-controller branch from f6e749a to 2f920a8 Compare September 13, 2024 12:18
pkg/controller.v1/common/job_test.go Outdated Show resolved Hide resolved
pkg/controller.v1/common/job_test.go Outdated Show resolved Hide resolved
Signed-off-by: Michal Szadkowski <[email protected]>
Copy link

@mimowo mimowo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
Reviewed implementation and tests.
PTAL @tenzen-y and @andreyvelich

Copy link

@mimowo: changing LGTM is restricted to collaborators

In response to this:

/lgtm
Reviewed implementation and tests.
PTAL @tenzen-y and @andreyvelich

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mszadkow Thank you for driving this feature!
@mimowo Additionally, I would like to express my appreciation to help review this PR!

/lgtm
/approve

Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit d8b8b34 into kubeflow:master Sep 18, 2024
39 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for the managedBy field
5 participants