
The 7_get_data_train_upload.py generated pipeline gets stuck on ROSA-hosted OCP cluster #22

Open · adelton opened this issue Mar 12, 2024 · 2 comments

adelton commented Mar 12, 2024

I am following the tutorial at https://access.redhat.com/documentation/en-us/red_hat_openshift_ai_self-managed/2.7/html-single/openshift_ai_tutorial_-_fraud_detection_example/index, which uses the repo https://github.com/rh-aiservices-bu/fraud-detection.

The section https://access.redhat.com/documentation/en-us/red_hat_openshift_ai_self-managed/2.7/html-single/openshift_ai_tutorial_-_fraud_detection_example/index#running-a-pipeline-generated-from-python-code shows how to use pipeline/7_get_data_train_upload.py to build pipeline/7_get_data_train_upload.yaml. (A small issue with that section is reported in https://issues.redhat.com/browse/RHOAIENG-4448.)

However, when I import the generated pipeline YAML file, the triggered run stays shown as Running in the OpenShift AI dashboard, with the get-data task stuck as Pending.

There sadly seems to be no way to debug this from the OpenShift AI dashboard. However, in the OpenShift Console, the TaskRuns view shows a stream of events:

0/2 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling..

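The underlying problem can also be inspected with plain oc commands; a generic sketch, with <project> standing in for the data science project's namespace and <pvc-name> for the claim reported in the events:

    # The get-data pod should show up as Pending
    oc get pods -n <project>

    # The Events section of the PVC describe output explains why it
    # cannot be bound (e.g. an unknown storage class)
    oc get pvc -n <project>
    oc describe pvc <pvc-name> -n <project>

    # Recent events in the namespace, newest last
    oc get events -n <project> --sort-by=.metadata.creationTimestamp
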

Checking the YAML of the imported pipeline back in the OpenShift AI dashboard shows:

  workspaces:
    - name: train-upload-stock-kfp
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 2Gi

Logging in to the OpenShift Console as admin, I see that both the fraud-detection PVC and the one created for MinIO are of the class gp3-csi, not gp3.
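Listing the storage classes confirms this mismatch; the cluster's default class is marked "(default)" in the output:

    # List the storage classes actually available on the cluster
    oc get storageclass
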

Should pipeline/7_get_data_train_upload.py avoid forcing the storage class in the first place?

Should the tutorial text be updated to document the DEFAULT_STORAGE_CLASS environment variable that pipeline/7_get_data_train_upload.py consumes? The workshop version at https://rh-aiservices-bu.github.io/fraud-detection/fraud-detection-workshop/running-a-pipeline-generated-from-python-code.html does not mention storage classes either.
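For what it's worth, regenerating the YAML with that variable set to a class that actually exists on the cluster should avoid the mismatch; a sketch, assuming the script is run directly with python and reads the variable at compile time:

    # Compile the pipeline against the storage class the ROSA cluster provides
    DEFAULT_STORAGE_CLASS=gp3-csi python pipeline/7_get_data_train_upload.py
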

adelton (Author) commented Mar 13, 2024

Data point: removing the storageClassName: gp3 line from the YAML and reimporting the pipeline makes the get-data task pass.
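For reference, that leaves the workspace stanza as below, so the PVC falls back to the cluster's default storage class:

  workspaces:
    - name: train-upload-stock-kfp
      volumeClaimTemplate:
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 2Gi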

erwangranger (Collaborator) commented
@rcarrata, I know you were going to pass through this content. If you happen to run through this one, can you test and apply this change?
