CCEE CSI Driver POC #474

Closed
tmilos77 opened this issue Aug 14, 2024 · 2 comments
tmilos77 commented Aug 14, 2024

tmilos77 self-assigned this Aug 14, 2024
tmilos77 commented Aug 15, 2024

Modified the shoot spec on the dev SKR c-8a33e0c as described in the given docs. At first, reconciliation failed with:

        task "Waiting until shoot infrastructure has been reconciled" failed:
        Error while waiting for Infrastructure
        shoot--kyma-dev--c-8a33e0c/c-8a33e0c to become ready: error during
        reconciliation: Error reconciling infrastructure: failed to apply the
        terraform config: Terraform execution for command 'apply' could not be
        completed:


        * Error creating sharenetwork: Request forbidden: [POST
        https://share-3.eu-de-1.cloud.sap/v2/share-networks], error message:
        {"forbidden": {"code": 403, "message": "Policy doesn't allow
        share_network:create to be performed."}}
          with openstack_sharedfilesystem_sharenetwork_v2.cluster,
          on main.tf line 112, in resource "openstack_sharedfilesystem_sharenetwork_v2" "cluster":
         112: resource "openstack_sharedfilesystem_sharenetwork_v2" "cluster" {

Checked the shoot's secret binding and found it uses the TKYMA_DEV_001 principal. Checked its roles and found it did not have the sharedfilesystem_admin permission. After assigning that permission, reconciliation succeeded.

The following storage classes were created:

  • csi-manila-nfs
  • csi-manila-nfs-auto
  • csi-manila-nfs-constrain-eu-de-1a
  • csi-manila-nfs-constrain-eu-de-1b
  • csi-manila-nfs-constrain-eu-de-1d
  • csi-manila-nfs-eu-de-1a
  • csi-manila-nfs-eu-de-1b
  • csi-manila-nfs-eu-de-1d

Created a PVC with storage class csi-manila-nfs:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-csi
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 200Mi 
  storageClassName: csi-manila-nfs

The PVC stayed Pending, with a status message like "waiting for workload to be created".

Created a pod:

apiVersion: v1
kind: Pod
metadata:
  name: test-csi2
spec:
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: test-csi
  containers:
    - name: cloud1
      image: ubuntu
      imagePullPolicy: IfNotPresent
      volumeMounts:
        - mountPath: "/mnt/data1"
          name: data
      command:
        - "/bin/bash"
        - "-c"
        - "--"
      args:
        - "sleep 86400 & wait"
  restartPolicy: Never

The PV was created along with a share in CCEE, the PVC became Bound, and the pod became Ready. After exec'ing into the pod, df gave:

10.250.1.126:/share_02a6a8a3_df08_4d89_8ffb_775c693a97c7  1.0G  256K  1.0G   1% /mnt/data1

Wrote to the mounted share path. Created a second pod, which was able to read and write to the same share.
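For reference, a second pod sharing the volume could look roughly like the sketch below. It reuses the PVC test-csi and the mount path from the first pod. The pod name test-csi3, the container name cloud2, and the file name from-second-pod.txt are assumptions made for illustration, not the manifest that was actually applied.

```yaml
# Hypothetical second pod for the ReadWriteMany check.
# Names test-csi3, cloud2 and from-second-pod.txt are illustrative assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: test-csi3
spec:
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: test-csi   # same PVC the first pod mounts
  containers:
    - name: cloud2
      image: ubuntu
      imagePullPolicy: IfNotPresent
      volumeMounts:
        - mountPath: "/mnt/data1"
          name: data
      command:
        - "/bin/bash"
        - "-c"
        - "--"
      args:
        # List what the first pod wrote, write a file from this pod, then stay up.
        - "ls -l /mnt/data1; echo hello > /mnt/data1/from-second-pod.txt; sleep 86400 & wait"
  restartPolicy: Never
```

If the share behaves as described above, the file written here should also be visible under /mnt/data1 in the first pod.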

Noticed the CSI driver mounts the share on its primary endpoint. Trying to get info from CCEE on what happens to volumes when the primary endpoint goes down and the secondary is promoted.

The NetApp storage box is active-active, and both IPs can be used to access the share. The primary is used because it is the "optimal path": the controller owning the share also owns the LIF (the virtual NetApp IP). If the other LIF is addressed, traffic goes through the other controller, which redirects it to the primary one, so the path is "sub-optimal" but works with minimal performance degradation. If the secondary is promoted due to maintenance activity, the primary LIF also moves to the other cluster node, and after the activity the primary takes back control. So NFS access will not be interrupted.

@tmilos77

Gardener installs into the SKR the credentials secret that the CSI driver uses to authenticate to CCEE. So this is a definite showstopper.
