[RHOAIENG-11850] Update manifests #302

mholder6 · 2024-09-09T18:59:41Z

Motivation

Modifications

Updated the manifests to include resource requests and limits.

Result

PR checklist

Checklist items below are applicable for development targeted to both fast and stable branches/tags

Unit tests pass locally
FVT tests pass locally
If the PR adds a new container image or updates the tag of an existing image (not built within cpaas), is the corresponding change made in live-builder and cpaas-midstream to add/update the image tag in the operator CSV? Link the PRs if applicable

Checklist items below are applicable for development targeted to both fast and stable branches/tags

Tested modelmesh serving deployment with odh-manifests and ran odh-manifests-e2e tests locally

Jooho · 2024-09-19T17:15:34Z

@mholder6
The root cause of the issue is that the pod has no resource specifications. When a ResourceQuota applies to the namespace and a pod does not specify resources, the pod won't start. Therefore, we should first check whether any of the manifests in the model serving components (modelmesh/kserve/odh-model-controller) are missing resource specifications. If any are missing, we need to add them.
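
As a rough illustration of that check, here is a minimal sketch that flags containers without complete resource specifications. It assumes a Deployment-style manifest already loaded into a Python dict; the function name and the example manifest are hypothetical and not taken from this PR. It also assumes both requests and limits are wanted for CPU and memory, which may be stricter than a given quota requires.

```python
from typing import Any


def containers_missing_resources(manifest: dict[str, Any]) -> list[str]:
    """Return names of containers lacking CPU/memory requests or limits."""
    # Deployment-style manifests keep the pod spec under spec.template.spec.
    pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
    missing = []
    for container in pod_spec.get("containers", []):
        resources = container.get("resources") or {}
        for section in ("requests", "limits"):
            values = resources.get(section) or {}
            if "cpu" not in values or "memory" not in values:
                missing.append(container["name"])
                break  # report each container at most once
    return missing


# Hypothetical manifest: one container is fully specified, one is not.
example = {
    "kind": "Deployment",
    "metadata": {"name": "example-controller"},
    "spec": {"template": {"spec": {"containers": [
        {"name": "manager", "resources": {
            "requests": {"cpu": "50m", "memory": "96Mi"},
            "limits": {"cpu": "1", "memory": "512Mi"}}},
        {"name": "sidecar"},
    ]}}},
}

print(containers_missing_resources(example))
```

Running something like this over each manifest in the three components would give a list of containers to fix before picking values.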

However, the values for each pod (CPU, memory) should be calculated properly. I’m not sure where the values in the PR came from, but it would be a good idea to determine the correct values for each pod.

Here’s the suggested approach:

Deploy all components and leave them running for 1 hour.
Check the CPU and memory usage metrics. These can be used to set the request values.
Create and delete several InferenceService (ISVC) instances multiple times.
Check the metrics again and compare the resource usage before and after ISVC instances are created. Based on this, we can calculate the required resources and set appropriate limits for CPU and memory.
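
The steps above could be turned into numbers along these lines. This is only a sketch: the 95th-percentile rule for requests and the 20% headroom for limits are assumptions chosen for illustration, not values agreed in this thread, and the sample figures are made up. Both sample lists are expected in the same unit (for example CPU millicores or memory MiB).

```python
import math


def percentile(samples: list[int], pct: int) -> int:
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggest_resources(idle_samples: list[int],
                      load_samples: list[int],
                      headroom_pct: int = 20) -> dict[str, int]:
    """Derive a request from idle usage and a limit from peak usage.

    idle_samples: usage observed while the components sit idle.
    load_samples: usage observed while ISVCs are created and deleted.
    headroom_pct: assumed safety margin added on top of the observed peak.
    """
    request = percentile(idle_samples, 95)
    peak = max(idle_samples + load_samples)
    limit = math.ceil(peak * (100 + headroom_pct) / 100)
    # A limit below the request would be rejected, so clamp it.
    return {"request": request, "limit": max(limit, request)}


# Made-up CPU samples in millicores.
idle_cpu = [40, 42, 45, 41, 60]
load_cpu = [120, 180, 150]
print(suggest_resources(idle_cpu, load_cpu))
```

The same function can be applied separately to memory samples; the percentile and headroom should be tuned per component rather than taken from this sketch.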

@israel-hdez @spolti any other ideas?

spolti · 2024-09-19T17:32:44Z

Sounds good @Jooho, that would be a better idea than just adding limits to the YAMLs. Plus, this would need to be configurable, so that the configured values can be changed if the defaults are not good.
By the way, I suggested that @mholder6 take these values from an estimate made after deploying some modelmesh models and watching the etcd resource consumption.

openshift-ci · 2024-09-20T18:39:55Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mholder6, spolti

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot requested review from spolti and VedantMahabaleshwarkar September 9, 2024 18:59

mholder6 force-pushed the rhoaieng-11850 branch from 7c5f9ed to 5bbb513 Compare September 9, 2024 19:02

mholder6 force-pushed the rhoaieng-11850 branch 2 times, most recently from 9f94c7c to d6b7105 Compare September 17, 2024 17:23

spolti approved these changes Sep 17, 2024

openshift-ci bot added the approved label Sep 17, 2024

mholder6 requested a review from hdefazio September 18, 2024 14:22

mholder6 closed this Sep 20, 2024

mholder6 force-pushed the rhoaieng-11850 branch from a937860 to ead96c3 Compare September 20, 2024 18:39

openshift-ci bot removed the approved label Sep 20, 2024

mholder6 deleted the rhoaieng-11850 branch September 20, 2024 18:40
