Enable resource naming in config #68

Merged
merged 1 commit into from
Sep 5, 2024

@MondayCha MondayCha commented Aug 1, 2024

Motivation

Volcano v1.9.0 introduces capacity scheduling, which makes it possible to give a queue a separate quota for each type of GPU (important in production environments). For example:

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: queue1
spec:
  reclaimable: true
  deserved: # set the deserved field.
    cpu: 2
    memory: 8Gi
    nvidia.com/t4: 40
    nvidia.com/a100: 20

However, the default NVIDIA device plugin reports every GPU under the single resource name nvidia.com/gpu, so it cannot report different GPU models as separate resources as the example requires.

To address this, we need to customize the device plugin.
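
For illustration, once per-model resource names are reported, a workload could request a specific GPU model from the queue above. The following is a minimal sketch, assuming the plugin exposes nvidia.com/t4; the Job name, task name, and container image are hypothetical placeholders and not part of this PR.

```yaml
# Hypothetical Volcano Job; metadata.name, task name, and image are placeholders.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: t4-example                # hypothetical name
spec:
  schedulerName: volcano
  queue: queue1                   # the queue defined in the example above
  minAvailable: 1
  tasks:
    - name: worker                # hypothetical task name
      replicas: 1
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: main
              image: example.com/cuda-app:latest   # placeholder image
              resources:
                limits:
                  # Assumes the customized plugin reports T4 GPUs under this
                  # name instead of nvidia.com/gpu.
                  nvidia.com/t4: 1
```

With a request like this, the job's usage would count against the nvidia.com/t4 entry in the queue's deserved field rather than against a shared GPU quota.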

Change Details

The NVIDIA community has already had discussions about this issue:

This PR is modified based on the above discussion.
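
As a rough idea of what resource naming in the plugin config might look like, here is a sketch only. The field names (resources, gpus, pattern, name) and the pattern strings are assumptions based on the upstream device plugin config schema as I understand it, and should be verified against the documentation added in this PR.

```yaml
# Sketch of a device plugin config file; every field under `resources` is an
# assumption and may differ from what this PR actually implements.
version: v1
flags:
  migStrategy: none
resources:
  gpus:
    # `pattern` is assumed to match the GPU product name reported by the driver.
    - pattern: "*T4*"
      name: t4        # intended to be advertised as nvidia.com/t4
    - pattern: "*A100*"
      name: a100      # intended to be advertised as nvidia.com/a100
```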

Further Impact

Renaming GPU resources prevents the DCGM Exporter from collecting pod-level GPU usage metrics, because the DCGM Exporter only recognizes resources named exactly nvidia.com/gpu or prefixed with nvidia.com/mig-.

@volcano-sh-bot
Collaborator

Welcome @MondayCha!

It looks like this is your first PR to volcano-sh/devices.

Thank you, and welcome to Volcano. 😃

@Monokaix
Member

Monokaix commented Aug 1, 2024

Hi, please add more description to this PR, and use git commit -s to sign off your commit.

@william-wang
Member

Thanks for your contribution. I opened issue #69 for this PR.

@william-wang
Member

@MondayCha Would you like to add a doc explaining how to configure and use it?

@MondayCha MondayCha changed the title from "Enable resource naming in config" to "[WIP] Enable resource naming in config" on Aug 7, 2024
@hzxuzhonghu
Collaborator

/ok-to-test

@MondayCha MondayCha force-pushed the capacity-0.16.1 branch 3 times, most recently from 07d89d1 to c613eca, on August 16, 2024 07:24
@MondayCha MondayCha changed the title from "[WIP] Enable resource naming in config" to "Enable resource naming in config" on Aug 22, 2024
Once this repo is updated and the ConfigMap is prepared, you can begin installing packages from it to deploy the `nvidia-device-plugin` helm chart.

```shell
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
```

Member

If a device plugin is already installed, would helm upgrade succeed?

Author

This section is adapted from the installation instructions in the official documentation.

If the user previously installed a GPU device plugin by some other means, helm upgrade may fail. In that case, the user should uninstall the existing plugin and then install it again.

@MondayCha MondayCha force-pushed the capacity-0.16.1 branch 4 times, most recently from de7ced4 to 2ddccac, on September 4, 2024 06:53
@Monokaix
Member

Monokaix commented Sep 4, 2024

/lgtm

Member

@william-wang william-wang left a comment

/approve

@volcano-sh-bot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: william-wang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Monokaix
Member

Monokaix commented Sep 5, 2024

/lgtm

@volcano-sh-bot volcano-sh-bot merged commit adef565 into volcano-sh:release-1.1 Sep 5, 2024
2 checks passed
@MondayCha MondayCha deleted the capacity-0.16.1 branch September 5, 2024 07:52

5 participants