[proposal] Provide an evolvable End to End Solution for Koordinator Device Management #2181

Open
4 tasks
ZiMengSheng opened this issue Aug 29, 2024 · 5 comments

Comments

@ZiMengSheng
Contributor

What is your proposal:

Provide an evolvable End to End Solution for Koordinator Device Management

Why is this needed:

Koordinator already supports two device-related functions in the scheduler: GPU shared scheduling and GPU & RDMA joint allocation. Users can request GPU or RDMA resources through Kubernetes extended resources and hints defined in Pod annotations. Extended resources were originally introduced into Kubernetes mainly to describe discrete, countable node resources, and the Kubelet Device Plugin interface is the main way the Kubernetes community supports reporting and allocating such resources.

However, the allocation logic of the Kubelet Device Manager does not support fine-grained joint allocation of multiple resources according to device topology, for example when a GPU and an RDMA device must be allocated under the same PCIe switch. The only topology-aware allocation Kubelet supports is NUMA alignment. Even when only NUMA alignment is required, Kubelet steps in too late: it checks NUMA affinity only after the Pod has already been scheduled to the node, so users can see a successfully scheduled Pod fail to start because of a topology mismatch.
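To make the PCIe switch scenario concrete, here is a minimal Go sketch of topology-aware joint allocation. It is not Koordinator's implementation: the `Device` and `Allocation` types, the `JointAllocate` function, and the device and switch names are all hypothetical, and the sketch assumes each device reports the PCIe switch it sits under. It only illustrates the constraint that a GPU and an RDMA device must be picked from the same switch.

```go
package main

import (
	"errors"
	"fmt"
)

// Device is a hypothetical, simplified view of a node device.
type Device struct {
	Name       string
	Kind       string // "gpu" or "rdma"
	PCIeSwitch string // ID of the PCIe switch the device sits under
	Free       bool
}

// Allocation pairs one GPU with one RDMA device under the same PCIe switch.
type Allocation struct {
	GPU        string
	RDMA       string
	PCIeSwitch string
}

// ErrNoJointFit is returned when no single switch can serve both requests.
var ErrNoJointFit = errors.New("no PCIe switch has both a free GPU and a free RDMA device")

// JointAllocate returns the first free GPU/RDMA pair that shares a PCIe switch.
// Switches are tried in the order they first appear in the input.
func JointAllocate(devices []Device) (Allocation, error) {
	freeGPU := map[string]string{}  // switch ID -> first free GPU seen
	freeRDMA := map[string]string{} // switch ID -> first free RDMA device seen
	seen := map[string]bool{}
	var order []string // switch IDs in first-seen order, for deterministic results

	for _, d := range devices {
		if !d.Free {
			continue
		}
		if !seen[d.PCIeSwitch] {
			seen[d.PCIeSwitch] = true
			order = append(order, d.PCIeSwitch)
		}
		switch d.Kind {
		case "gpu":
			if _, ok := freeGPU[d.PCIeSwitch]; !ok {
				freeGPU[d.PCIeSwitch] = d.Name
			}
		case "rdma":
			if _, ok := freeRDMA[d.PCIeSwitch]; !ok {
				freeRDMA[d.PCIeSwitch] = d.Name
			}
		}
	}

	for _, sw := range order {
		gpu, hasGPU := freeGPU[sw]
		rdma, hasRDMA := freeRDMA[sw]
		if hasGPU && hasRDMA {
			return Allocation{GPU: gpu, RDMA: rdma, PCIeSwitch: sw}, nil
		}
	}
	return Allocation{}, ErrNoJointFit
}

func main() {
	// Illustrative inventory: gpu-0 is busy, so only pcie-sw-1 can serve both.
	devices := []Device{
		{Name: "gpu-0", Kind: "gpu", PCIeSwitch: "pcie-sw-0", Free: false},
		{Name: "mlx5_0", Kind: "rdma", PCIeSwitch: "pcie-sw-0", Free: true},
		{Name: "gpu-1", Kind: "gpu", PCIeSwitch: "pcie-sw-1", Free: true},
		{Name: "mlx5_1", Kind: "rdma", PCIeSwitch: "pcie-sw-1", Free: true},
	}
	alloc, err := JointAllocate(devices)
	if err != nil {
		fmt.Println("allocation failed:", err)
		return
	}
	fmt.Printf("%+v\n", alloc)
}
```

In this sketch, allocating each resource independently could hand out `mlx5_0` with `gpu-1`, which straddles two switches; grouping free devices by switch before choosing rules that out. Making this decision at scheduling time, rather than on the node, is also what lets an unsatisfiable request be rejected before the Pod is bound.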

To solve this problem, Koordinator moved the device allocation logic from Kubelet to the scheduler and used cri-runtime-proxy on the node side to set up device isolation and visibility. However, the cri-runtime-proxy approach is heavyweight and inconvenient to install. In addition, although the Koordinator scheduler provides GPU & RDMA joint allocation, there is no complete end-to-end solution: on the node side in particular, it has not yet been integrated with the community-standard RDMA logic. This proposal attempts to solve these problems and provide a feasible end-to-end solution for Koordinator.

Finally, in the field of device management, the community proposed Dynamic Resource Allocation (DRA) after the Device Plugin interface to overcome the limitations of the current Device Plugin solution. This proposal will also show how Koordinator's GPU sharing and GPU & RDMA joint allocation can be implemented in DRA mode, and how the current solution evolves toward DRA.

Key Results:

  • Provide a convenient end-to-end solution for Koordinator device management, in particular a replacement for runtime-proxy. This will be verified on the GPU Share use case.
  • Collaborate with device sharing and isolation solution providers, such as HAMi, cGPU, MIG and vGPU, to enhance Koordinator's GPU Share solutions. This will be verified on the GPU Share use case with a strict GPU usage limit.
  • Collaborate with RDMA and SR-IOV solution providers, such as sriov-device-plugin and Multus, to provide an end-to-end solution for GPU & RDMA joint allocation.
  • Provide an end-to-end solution with DRA structured parameters on Kubernetes 1.31. This will be verified on the GPU Share use case and the GDR use case.
ZiMengSheng added the kind/proposal label Aug 29, 2024
@ZiMengSheng
Contributor Author

/area koord-scheduler

@saintube
Member

saintube commented Sep 2, 2024

ref #2187 GPU & RDMA Joint Allocation

@ZiMengSheng
Contributor Author

ZiMengSheng commented Sep 21, 2024

ref #2171 GPU monitoring cannot perceive the central scheduling results

@ZiMengSheng
Contributor Author

ref #583 GPU sharing and isolation solution

@ZiMengSheng
Contributor Author

/area koordlet /area koord-manager
