Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RHOAIENG-11850] Update manifests #302

Closed
wants to merge 0 commits into from

Conversation

mholder6
Copy link

@mholder6 mholder6 commented Sep 9, 2024

Motivation

Modifications

Updated the manifests to include resource requests and limits.

Result

PR checklist

Checklist items below are applicable for development targeted to both fast and stable branches/tags

  • Unit tests pass locally
  • FVT tests pass locally
  • If the PR adds a new container image or updates the tag of an existing image (not build within cpaas), is the corresponding change made in live-builder and cpaas-midstream to add/update the image tag in the operator CSV? Link the PRs if applicable

Checklist items below are applicable for development targeted to both fast and stable branches/tags

  • Tested modelmesh serving deployment with odh-manifests and ran odh-manifests-e2e tests locally

@Jooho
Copy link

Jooho commented Sep 19, 2024

@mholder6
The root cause of the issue is that there are no resource specifications for the pod. If no resources are specified with a resourceQuota, the pod won't start. Therefore, we should first check if any of the manifests in the model serving components (modelmesh/kserve/odh-model-controller) are missing resource specifications. If any are missing, we need to add them.

However, the values for each pod (CPU, memory) should be calculated properly. I’m not sure where the values in the PR came from, but it would be a good idea to determine the correct values for each pod.

Here’s the suggested approach:

  • Deploy all components and leave them running for 1 hour.
  • Check the CPU and memory usage metrics. This can be used to set the request values.
  • Create and delete several InferenceService (ISVC) instances multiple times.
  • Check the metrics again and compare the resource usage before and after ISVC instances are created. Based on this, we can calculate the required resources and set appropriate limits for CPU and memory.

@israel-hdez @spolti any other ideas?

@spolti
Copy link
Member

spolti commented Sep 19, 2024

Sounds good @Jooho that would be a better idea than just adding limits to the yamls. Plus, this would need to be configurable assuring that the values configured can be changed if the defaults are not good.
by the way, I suggested @mholder6 take these values from an estimation after deploying some modelmesh models and watch the etcd resource consumption.

Copy link

openshift-ci bot commented Sep 20, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mholder6, spolti

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot removed the approved label Sep 20, 2024
@mholder6 mholder6 deleted the rhoaieng-11850 branch September 20, 2024 18:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants