
The 7_get_data_train_upload.py generated pipeline gets stuck on ROSA-hosted OCP cluster #22

Open · adelton opened this issue Mar 12, 2024 · 2 comments

adelton commented Mar 12, 2024

I am following the tutorial at https://access.redhat.com/documentation/en-us/red_hat_openshift_ai_self-managed/2.7/html-single/openshift_ai_tutorial_-_fraud_detection_example/index, which uses the repo https://github.com/rh-aiservices-bu/fraud-detection.

The section https://access.redhat.com/documentation/en-us/red_hat_openshift_ai_self-managed/2.7/html-single/openshift_ai_tutorial_-_fraud_detection_example/index#running-a-pipeline-generated-from-python-code shows how to use pipeline/7_get_data_train_upload.py to build pipeline/7_get_data_train_upload.yaml. (A small issue with that section is reported in https://issues.redhat.com/browse/RHOAIENG-4448.)

However, when I import the generated pipeline YAML file, the triggered run stays shown as Running in the OpenShift AI dashboard, with the get-data task stuck as Pending.

There sadly seems to be no way to debug this from the OpenShift AI dashboard. However, in the OpenShift Console, the TaskRuns view shows a stream of events:

0/2 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling..

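The underlying problem can also be inspected with plain oc commands; a generic sketch, with <project> standing in for the data science project's namespace and <pvc-name> for the claim reported in the events:

    # The get-data pod should show up as Pending
    oc get pods -n <project>

    # The Events section of the PVC describe output explains why it
    # cannot be bound (e.g. an unknown storage class)
    oc get pvc -n <project>
    oc describe pvc <pvc-name> -n <project>

    # Recent events in the namespace, newest last
    oc get events -n <project> --sort-by=.metadata.creationTimestamp
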

Checking the YAML of the imported pipeline back in the OpenShift AI dashboard shows:

  workspaces:
    - name: train-upload-stock-kfp
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 2Gi

Logging in to the OpenShift Console as admin, I see that both the fraud-detection PVC and the one created for MinIO are of the class gp3-csi, not gp3.
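Listing the storage classes confirms this mismatch; the cluster's default class is marked "(default)" in the output:

    # List the storage classes actually available on the cluster
    oc get storageclass
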

Should pipeline/7_get_data_train_upload.py avoid forcing the storage class in the first place?

Should the tutorial text be updated to document the DEFAULT_STORAGE_CLASS environment variable that pipeline/7_get_data_train_upload.py consumes? The workshop version at https://rh-aiservices-bu.github.io/fraud-detection/fraud-detection-workshop/running-a-pipeline-generated-from-python-code.html does not mention storage classes either.
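For what it's worth, regenerating the YAML with that variable set to a class that actually exists on the cluster should avoid the mismatch; a sketch, assuming the script is run directly with python and reads the variable at compile time:

    # Compile the pipeline against the storage class the ROSA cluster provides
    DEFAULT_STORAGE_CLASS=gp3-csi python pipeline/7_get_data_train_upload.py
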

adelton (Author) commented Mar 13, 2024

Data point: removing the storageClassName: gp3 line from the YAML and reimporting the pipeline makes the get-data task pass.
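For reference, that leaves the workspace stanza as below, so the PVC falls back to the cluster's default storage class:

  workspaces:
    - name: train-upload-stock-kfp
      volumeClaimTemplate:
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 2Gi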

erwangranger (Collaborator) commented
@rcarrata, I know you were going to pass through this content. If you happen to run through this one, can you test and apply this change?
