Commit

gpu: add notes about gpu-plugin modes
Fixes: #1381

Signed-off-by: Tuomas Katila <[email protected]>
tkatila committed Apr 18, 2023
1 parent 2d19520 commit 0c7fd81
Showing 1 changed file with 11 additions and 1 deletion.
12 changes: 11 additions & 1 deletion cmd/gpu_plugin/README.md
@@ -20,6 +20,7 @@ Table of Contents
* [Fractional resources details](#fractional-resources-details)
* [Verify Plugin Registration](#verify-plugin-registration)
* [Testing and Demos](#testing-and-demos)
* [Use Cases for Different Modes](#use-cases-for-different-modes)
* [Issues with media workloads on multi-GPU setups](#issues-with-media-workloads-on-multi-gpu-setups)
* [Workaround for QSV and VA-API](#workaround-for-qsv-and-va-api)

@@ -48,7 +49,7 @@ backend libraries can offload compute operations to GPU.
| -enable-monitoring | - | disabled | Enable 'i915_monitoring' resource that provides access to all Intel GPU devices on the node |
| -resource-manager | - | disabled | Enable fractional resource management, [see also dependencies](#fractional-resources) |
| -shared-dev-num | int | 1 | Number of containers that can share the same GPU device |
| -allocation-policy | string | none | 3 possible values: balanced, packed, none. It is meaningful when shared-dev-num > 1, balanced mode is suitable for workload balance among GPU devices, packed mode is suitable for making full use of each GPU device, none mode is the default. Allocation policy does not have effect when resource manager is enabled. |
| -allocation-policy | string | none | 3 possible values: balanced, packed, none. It is meaningful when shared-dev-num > 1: balanced mode is suitable for balancing workloads among GPU devices, packed mode is suitable for making full use of each GPU device, and none selects the first available device from kubelet. None mode is the default. Allocation policy has no effect when the resource manager is enabled. |

The plugin also accepts a number of other arguments (common to all plugins) related to logging.
Please use the -h option to see the complete list of logging related options.
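
The difference between the `balanced` and `packed` allocation policies in the flag table above can be illustrated with a minimal Go sketch. This is not the plugin's actual implementation: `pickDevice`, the `used` map, and the card names are hypothetical, and the fallback for `none` (first free device by name) only stands in for "first available device from kubelet".

```go
package main

import (
	"fmt"
	"sort"
)

// Policy names mirror the values of the -allocation-policy flag.
const (
	PolicyBalanced = "balanced"
	PolicyPacked   = "packed"
)

// pickDevice is a hypothetical helper, not the plugin's real code.
// used maps a GPU name to the number of containers currently sharing it,
// and sharedDevNum is the per-GPU container limit (-shared-dev-num).
// It returns "" when no GPU has a free slot.
func pickDevice(policy string, used map[string]int, sharedDevNum int) string {
	names := make([]string, 0, len(used))
	for name := range used {
		names = append(names, name)
	}
	sort.Strings(names) // deterministic order and tie-breaking

	best := ""
	for _, name := range names {
		if used[name] >= sharedDevNum {
			continue // this GPU has no free slot
		}
		if best == "" {
			best = name // first free GPU; any other policy value keeps this one
			continue
		}
		switch policy {
		case PolicyBalanced:
			// Prefer the least loaded GPU to spread containers out.
			if used[name] < used[best] {
				best = name
			}
		case PolicyPacked:
			// Prefer the most loaded GPU that still has room.
			if used[name] > used[best] {
				best = name
			}
		}
	}
	return best
}

func main() {
	used := map[string]int{"card0": 2, "card1": 0, "card2": 1}
	fmt.Println(pickDevice(PolicyBalanced, used, 3)) // prints "card1"
	fmt.Println(pickDevice(PolicyPacked, used, 3))   // prints "card0"
}
```

With three GPUs and `shared-dev-num` set to 3, the balanced choice goes to the idle GPU, while the packed choice fills the busiest GPU that still has a free slot.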
@@ -315,6 +316,15 @@ The GPU plugin functionality can be verified by deploying an [OpenCL image](../.
Warning FailedScheduling <unknown> default-scheduler 0/1 nodes are available: 1 Insufficient gpu.intel.com/i915.
```

## Use Cases for Different Modes

The Intel GPU plugin supports a few different operation modes. Depending on the workloads the cluster is running, some modes make less sense than others. The table below explains the differences between the modes and suggests workload types for each mode. The mode selection requires forethought, as it is cluster-wide.

| Mode | Sharing | Workload examples | Time critical |
|:---- |:-------- |:------- |:------- |
| shared-dev-num == 1 | No, 1 container per GPU | AI/ML training, or an application that can fully utilize a GPU, e.g. by packing multiple workloads. | Yes |
| shared-dev-num > 1 | Yes, >1 containers per GPU | Inference, media transcode, media analytics. Workloads that only require part of the GPU. See also [allocation policies](#modes-and-configuration-options). | No |
| shared-dev-num > 1 && resource-manager | Yes and no, >=1 containers per GPU | All workloads, but requires GPU resource allocations. GPUs can be dedicated or shared based on GPU resources like memory and millicores. Requires [GAS](https://github.com/intel/platform-aware-scheduling/tree/master/gpu-aware-scheduling). See also [fractional use](#fractional-resources-details). | Depends on the Pod spec |
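
As a rough illustration of the table, the Go sketch below maps a flag combination to its mode and estimates how many `gpu.intel.com/i915` resources a node would offer. Both `describeMode` and `advertisedResources` are hypothetical helpers, and the sketch assumes that each GPU is advertised `shared-dev-num` times.

```go
package main

import "fmt"

// advertisedResources is a hypothetical helper. It assumes every GPU on the
// node is offered sharedDevNum times as a gpu.intel.com/i915 resource.
func advertisedResources(gpus, sharedDevNum int) int {
	if sharedDevNum < 1 {
		sharedDevNum = 1 // the flag defaults to 1
	}
	return gpus * sharedDevNum
}

// describeMode maps a flag combination to the matching row of the table.
func describeMode(sharedDevNum int, resourceManager bool) string {
	switch {
	case sharedDevNum > 1 && resourceManager:
		return "fractional: dedicated or shared by GPU resources, requires GAS"
	case sharedDevNum > 1:
		return "shared: more than one container per GPU"
	case sharedDevNum == 1 && !resourceManager:
		return "exclusive: one container per GPU"
	default:
		return "combination not covered by the table"
	}
}

func main() {
	// Example node with 2 GPUs and -shared-dev-num=4.
	fmt.Println(describeMode(4, false))
	fmt.Println(advertisedResources(2, 4))
}
```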

## Issues with media workloads on multi-GPU setups
