
ClickHouse Operator Deletes the PVC #1473

Open
karthik-thiyagarajan opened this issue Aug 5, 2024 · 4 comments

Comments

@karthik-thiyagarajan

Here is the scenario. I have seen this issue happen twice in my QA cluster, where my PVCs were deleted by the ClickHouse Operator.

  1. In a 3-shard, 3-replica setup, when one of the pods went down due to some issue (disk space or "TOO_MANY_PARTS"), I had to manually scale down the StatefulSets for a few pods. When I brought them back up, I suspect (I may be wrong) that the 1-0-0 pod got attached to a different PVC (possibly 2-0-1's), and so on.
  2. In this inconsistent state, when I redeploy ClickHouse through the operator, it fails stating "Failed to reconcile 1-0-0 pod" along with some CRUD error, and asks me to delete and recreate the services.
  3. I did that and redeployed. This time the 1-0-0 pod got created, but the same error happened for the 1-0-1 pod and the CHI stayed in the InProgress state. I then deleted the CHI by mistake, and it deleted the two PVCs that were in progress.

Though I have not seen this issue happen very often, I suspect it only occurs when the StatefulSet is scaled manually rather than through the operator. Is it possible to fix this, and why would the operator delete the PVCs at all?

@alex-zaitsev
Member

@karthik-thiyagarajan, what you describe in point 1 is not possible. All PVCs have unique names with the shard and replica ids at the end.
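
For example, listing the PVCs together with their shard/replica labels should show the mapping directly. This is only a sketch: the namespace, CHI name and label keys below assume a default Altinity operator install and may need adjusting.

```sh
# Sketch: namespace, CHI name and label keys are assumptions for a default install.
kubectl -n clickhouse get pvc \
  -l clickhouse.altinity.com/chi=clickhouse-server \
  -L clickhouse.altinity.com/shard,clickhouse.altinity.com/replica
# PVC names end with the shard/replica ids of the pod they belong to, e.g.
#   data-chi-clickhouse-server-cnxcluster-01-1-0-0  -> shard 1, replica 0
```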

A few things to clarify:
0. Operator version

  1. What is your Kubernetes environment (EKS, self-managed, etc.)? What are the storage class and storage provider?
  2. Are you using ArgoCD?
  3. Are you using operator- or StatefulSet-managed persistence? A CHI specification may help.

Do you have operator logs by any chance?
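
If not, something like the following should capture both the operator version and recent logs. This is a sketch assuming the operator is deployed as a deployment named clickhouse-operator in kube-system; adjust the namespace and name to wherever Spinnaker installed it.

```sh
# Assumption: deployment "clickhouse-operator" in namespace "kube-system".
kubectl -n kube-system get deployment clickhouse-operator \
  -o jsonpath='{.spec.template.spec.containers[*].image}'; echo
kubectl -n kube-system logs deployment/clickhouse-operator \
  --all-containers=true --since=48h > clickhouse-operator.log
```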

@karthik-thiyagarajan
Author

I was able to resolve the issue by restarting the operator (deleting the clickhouse-operator pod).

The Docker image version of the "clickhouse-operator" is 0.24.0.

  1. It is Kubernetes 1.26.6, hosted on AWS but not as EKS. The storage class is a custom GP3 (AWS) class with 16k IOPS.
  2. We are using Spinnaker to deploy the operator.
  3. We are using the ClickHouse Operator, and all the pods are deployed as Kubernetes StatefulSets.
  4. Below is the describe chi output, though I do not have the complete operator log available. I have seen this issue a couple of times in the past but could not replicate it consistently. The output was similar to the following; I will share the complete operator log the next time I run into this issue.
```
    Status:
    Action: reconcile completed UNSUCCESSFULLY, task id: 52712520-75e1-4c16-9ed3-4fca9147fae6
    Actions:
    2024-08-12T11:28:48.828624122Z reconcile completed UNSUCCESSFULLY, task id: 52712520-75e1-4c16-9ed3-4fca9147fae6
    2024-08-12T11:23:29.95212056Z reconcile started, task id: 52712520-75e1-4c16-9ed3-4fca9147fae6
    2024-08-12T10:39:22.538428851Z reconcile completed UNSUCCESSFULLY, task id: 7cad48d6-418a-4d29-a74f-247cee5102df
    2024-08-12T10:33:35.842378311Z reconcile started, task id: 7cad48d6-418a-4d29-a74f-247cee5102df
    2024-08-09T09:03:36.199989865Z reconcile completed successfully, task id: fe474e10-a458-4164-86cf-462e699b18f3
    2024-08-09T08:31:40.443801769Z reconcile started, task id: fe474e10-a458-4164-86cf-462e699b18f3
    2024-08-08T07:42:45.238725268Z reconcile completed successfully, task id: 484d0d9c-915b-4b6b-b564-f885622b1da2
    2024-08-08T06:54:21.338104199Z reconcile started, task id: 484d0d9c-915b-4b6b-b564-f885622b1da2
    2024-08-07T12:06:17.925595524Z reconcile completed successfully, task id: 1464d1d5-2f0b-4c3d-9bbd-cb7ca0289414
    2024-08-07T11:06:46.980131315Z reconcile started, task id: 1464d1d5-2f0b-4c3d-9bbd-cb7ca0289414
    Chop - Commit: 4763e9d
    Chop - Date: 2024-08-11T15:42:40
    Chop - Version: 0.24.0
    Clusters: 1
    Endpoint: clickhouse-clickhouse-server.clickhouse.svc.cluster.local
    Error: FAILED to reconcile CHI clickhouse/clickhouse-server, err: crud error - should abort
    Errors:
    2024-08-12T11:28:48.777015704Z FAILED to reconcile CHI clickhouse/clickhouse-server, err: crud error - should abort
    2024-08-12T11:28:48.714002622Z FAILED to reconcile StatefulSet for host: 0-0
    2024-08-12T10:39:22.486244668Z FAILED to reconcile CHI clickhouse/clickhouse-server, err: crud error - should abort
    2024-08-12T10:39:22.425185875Z FAILED to reconcile StatefulSet for host: 0-0
    2024-08-02T11:02:32.090573445Z Update Service: clickhouse/chi-clickhouse-server-cnxcluster-01-2-2 failed with error: just recreate the service in case of service type change ''=>'ClusterIP'
    2024-08-02T10:59:53.95533682Z Update Service: clickhouse/chi-clickhouse-server-cnxcluster-01-2-1 failed with error: just recreate the service in case of service type change ''=>'ClusterIP'
    2024-08-02T10:55:53.299787593Z Update Service: clickhouse/chi-clickhouse-server-cnxcluster-01-2-0 failed with error: just recreate the service in case of service type change ''=>'ClusterIP'
    2024-08-02T10:53:07.937532163Z Update Service: clickhouse/chi-clickhouse-server-cnxcluster-01-1-2 failed with error: just recreate the service in case of service type change ''=>'ClusterIP'
    2024-08-02T10:50:35.054453061Z Update Service: clickhouse/chi-clickhouse-server-cnxcluster-01-1-1 failed with error: just recreate the service in case of service type change ''=>'ClusterIP'
    2024-08-02T10:46:52.219577538Z Update Service: clickhouse/chi-clickhouse-server-cnxcluster-01-1-0 failed with error: just recreate the service in case of service type change ''=>'ClusterIP'
    Fqdns:
    chi-clickhouse-server-cnxcluster-01-0-0.clickhouse.svc.cluster.local
    chi-clickhouse-server-cnxcluster-01-0-1.clickhouse.svc.cluster.local
    chi-clickhouse-server-cnxcluster-01-0-2.clickhouse.svc.cluster.local
    chi-clickhouse-server-cnxcluster-01-1-0.clickhouse.svc.cluster.local
    chi-clickhouse-server-cnxcluster-01-1-1.clickhouse.svc.cluster.local
    chi-clickhouse-server-cnxcluster-01-1-2.clickhouse.svc.cluster.local
    chi-clickhouse-server-cnxcluster-01-2-0.clickhouse.svc.cluster.local
    chi-clickhouse-server-cnxcluster-01-2-1.clickhouse.svc.cluster.local
    chi-clickhouse-server-cnxcluster-01-2-2.clickhouse.svc.cluster.local
```

@karthik-thiyagarajan
Author

karthik-thiyagarajan commented Aug 19, 2024

Can you let me know how to find the CHI version? I see the version below, in case it helps.

```
kubectl get chi -n clickhouse -o wide

NAME                VERSION   CLUSTERS   SHARDS   HOSTS   TASKID                                  STATUS    UPDATED   ADDED   DELETED   DELETE   ENDPOINT
clickhouse-server   0.24.0    1          3        9       52712520-75e1-4c16-9ed3-4fca9147fae6    Aborted                                        clickhouse-clickhouse-server.clickhouse.svc.cluster.local
```
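
For reference, the VERSION column above appears to be the operator (chop) version and matches the Chop - Version field in the status earlier. The same value should show up in the operator pod's image tag, e.g. (the label selector and namespace-wide query below are assumptions; adjust them to your install):

```sh
# Assumption: operator pods carry the app=clickhouse-operator label.
kubectl get pods --all-namespaces -l app=clickhouse-operator \
  -o jsonpath='{range .items[*]}{.spec.containers[0].image}{"\n"}{end}'
```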

@alex-zaitsev
Member

Hi @karthik-thiyagarajan, the error messages indicate that the operator cannot UPDATE StatefulSets and Services, so it has to DELETE and re-CREATE them. The same might happen to PVCs. This is not right; it looks like some permissions are missing.

Could you check the operator's ServiceAccount and the corresponding Role and RoleBinding? Perhaps something is not right when deploying the operator with Spinnaker.
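
As a quick check, kubectl auth can-i can be run against the operator's service account. The names below are assumptions for a default install (service account clickhouse-operator in kube-system); substitute your actual namespace and service account.

```sh
# Assumption: service account "clickhouse-operator" in namespace "kube-system".
SA=system:serviceaccount:kube-system:clickhouse-operator
for verb in get create update patch delete; do
  for res in statefulsets services persistentvolumeclaims; do
    printf '%s %s: ' "$verb" "$res"
    kubectl -n clickhouse auth can-i "$verb" "$res" --as="$SA"
  done
done
```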
