InferenceService should recover from FailedToLoad #208

Open · cfchase opened this issue Sep 20, 2023 · 3 comments
Comments

cfchase (Member) commented Sep 20, 2023
There was a storage issue when creating an InferenceService (through the UI). This resulted in status.modelStatus.states.activeModelState: FailedToLoad. The InferenceService never reconciled back to a good state, even after the storage error was fixed; it didn't retry downloading the model. To fix it, I had to update the InferenceService, which triggered a reload of the model (a sketch of that workaround follows the resource YAML below).

Perhaps it would be useful to periodically retry loading InferenceServices that are in the FailedToLoad state? Or perhaps the UI could offer a refresh/reload action?

[Screenshot from 2023-09-20 12-31-36]

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: >
      {"apiVersion":"serving.kserve.io/v1beta1","kind":"InferenceService","metadata":{"annotations":{"openshift.io/display-name":"stocks","serving.kserve.io/deploymentMode":"ModelMesh"},"labels":{"name":"stocks","opendatahub.io/dashboard":"true"},"name":"stocks","namespace":"pipelines-tutorial"},"spec":{"predictor":{"model":{"modelFormat":{"name":"onnx","version":"1"},"runtime":"stocks","storage":{"key":"minio-connection","path":"stocks.onnx"}}}}}
    openshift.io/display-name: stocks
    serving.kserve.io/deploymentMode: ModelMesh
  name: stocks
  namespace: pipelines-tutorial
  labels:
    name: stocks
    opendatahub.io/dashboard: 'true'
spec:
  predictor:
    model:
      modelFormat:
        name: onnx
        version: '1'
      runtime: stocks
      storage:
        key: minio-connection
        path: stocks.onnx
status:
  conditions:
    - lastTransitionTime: '2023-09-20T15:15:46Z'
      status: 'False'
      type: PredictorReady
    - lastTransitionTime: '2023-09-20T15:15:46Z'
      status: 'False'
      type: Ready
  modelStatus:
    copies:
      failedCopies: 1
      totalCopies: 1
    lastFailureInfo:
      location: 94c77f-9djbq
      message: "Failed to pull model from storage due to error: unable to list objects in bucket 'models': NoSuchBucket: The specified bucket does not exist\n\tstatus code: 404, request id: 1786A448FC13E392, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8"
      modelRevisionName: stocks__isvc-58f42146d7
      reason: ModelLoadFailed
      time: '2023-09-20T15:15:43Z'
    states:
      activeModelState: FailedToLoad
      targetModelState: ''
    transitionStatus: UpToDate
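The manual workaround described above amounts to making an edit to the InferenceService that the controller reconciles, so the model gets pulled again. A minimal sketch of such a touch is below; the annotation key and timestamp are purely illustrative (not an existing KServe/ModelMesh API), and the thread only confirms that updating the resource triggered a reload, not which specific field has to change.

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: stocks
  namespace: pipelines-tutorial
  annotations:
    # Illustrative only: bump any value so the resource changes and the
    # controller reconciles it, triggering a fresh attempt to load the model.
    example.local/reload-requested-at: '2023-09-20T16:00:00Z'
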
israel-hdez commented

Although self-healing sounds right, the current behavior seems to be on the safe side.

Retries are, perhaps, something that IMO should be off by default, but that the user should be able to enable if desired (even per ISVC via annotations/fields, if needed).

We don't want to retry indefinitely to the point that the cloud bill "scales" accordingly.
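As a purely hypothetical sketch of the per-ISVC opt-in mentioned above (neither of the retry annotations below exists in KServe/ModelMesh today; the names and values only illustrate the idea, they are not an API):

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: stocks
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
    # Hypothetical, not implemented: opt in to a bounded number of reload
    # retries for models stuck in FailedToLoad, instead of retrying forever.
    serving.kserve.io/retry-failed-load: 'true'
    serving.kserve.io/max-load-retries: '3'
spec:
  predictor:
    model:
      modelFormat:
        name: onnx
        version: '1'
      runtime: stocks
      storage:
        key: minio-connection
        path: stocks.onnx
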


heyselbi commented Dec 5, 2023

@cfchase thoughts?

cfchase (Member, Author) commented Dec 5, 2023

So, a bounded number of retries with a sane default would probably fill the need. There probably still needs to be a way to trigger a reload through the UI as well.
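A sketch of what such a "sane default" retry budget might look like if it were exposed as cluster-wide configuration. The ConfigMap name and namespace follow ModelMesh Serving conventions, but the retry keys are hypothetical and not part of any current schema.

apiVersion: v1
kind: ConfigMap
metadata:
  name: model-serving-config
  namespace: modelmesh-serving   # illustrative; wherever the controller reads its config
data:
  config.yaml: |
    # Hypothetical keys, not part of the current ModelMesh Serving config:
    # retry a FailedToLoad model a few times with backoff before giving up,
    # leaving the UI/manual trigger for anything beyond that.
    failedLoadRetries: 3
    failedLoadRetryBackoff: 1m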
