
ClickHouse Operator Deletes the PVC #1473

Open
karthik-thiyagarajan opened this issue Aug 5, 2024 · 4 comments

Comments

@karthik-thiyagarajan

Here is the scenario. I have seen this issue happen twice in my QA cluster, where my PVCs were deleted by the ClickHouse Operator.

  1. In a 3-shard, 3-replica setup, when one of the pods went down due to some issue (disk space or "TOO_MANY_PARTS"), I had to manually scale down the StatefulSets for a few pods. When I brought them back up, I suspect (I may be wrong) that the 1-0-0 pod got attached to a different PVC (possibly 2-0-1's), and so on.
  2. In this inconsistent state, when I redeploy ClickHouse through the operator, it fails stating "Failed to reconcile 1-0-0 pod" along with some CRUD error, and asks me to delete and recreate the services.
  3. I did that and redeployed. This time the 1-0-0 pod got created, but the same error happened for the 1-0-1 pod and the CHI stayed in the InProgress state. I then deleted the CHI by mistake, and it deleted the two PVCs that were in progress.

Though I have not seen this issue happen very often, I suspect it only occurs when the StatefulSet is scaled manually rather than through the operator. Is it possible to fix this, and why would the operator delete the PVCs at all?

@alex-zaitsev
Member

@karthik-thiyagarajan, what you describe in point 1 is not possible. All PVCs have unique names with the shard and replica ids at the end.
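
For example, listing the PVCs together with their shard/replica labels should show the mapping directly. This is only a sketch: the namespace, CHI name and label keys below assume a default Altinity operator install and may need adjusting.

```sh
# Sketch: namespace, CHI name and label keys are assumptions for a default install.
kubectl -n clickhouse get pvc \
  -l clickhouse.altinity.com/chi=clickhouse-server \
  -L clickhouse.altinity.com/shard,clickhouse.altinity.com/replica
# PVC names end with the shard/replica ids of the pod they belong to, e.g.
#   data-chi-clickhouse-server-cnxcluster-01-1-0-0  -> shard 1, replica 0
```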

A few things to clarify:
0. Operator version

  1. What is your Kubernetes environment (EKS, self-managed, etc.)? What are the storage class and storage provider?
  2. Are you using ArgoCD?
  3. Are you using operator- or StatefulSet-managed persistence? A CHI specification may help.

Do you have operator logs by any chance?
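
If not, something like the following should capture both the operator version and recent logs. This is a sketch assuming the operator is deployed as a deployment named clickhouse-operator in kube-system; adjust the namespace and name to wherever Spinnaker installed it.

```sh
# Assumption: deployment "clickhouse-operator" in namespace "kube-system".
kubectl -n kube-system get deployment clickhouse-operator \
  -o jsonpath='{.spec.template.spec.containers[*].image}'; echo
kubectl -n kube-system logs deployment/clickhouse-operator \
  --all-containers=true --since=48h > clickhouse-operator.log
```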

@karthik-thiyagarajan
Author

I was able to resolve the issue by restarting the operator (deleting the clickhouse-operator pod).

The Docker image version of the "clickhouse-operator" is 0.24.0.

  1. It is Kubernetes 1.26.6, hosted on AWS but not as EKS. The storage class is a custom GP3 (AWS) class with 16k IOPS.
  2. We are using Spinnaker to deploy the operator.
  3. We are using the ClickHouse Operator, and all the pods are deployed as Kubernetes StatefulSets.
  4. Below is the describe chi output, though I do not have the complete operator log available. I have seen this issue a couple of times in the past but could not replicate it consistently. The output was similar to the following; I will share the complete operator log the next time I run into this issue.
```
    Status:
    Action: reconcile completed UNSUCCESSFULLY, task id: 52712520-75e1-4c16-9ed3-4fca9147fae6
    Actions:
    2024-08-12T11:28:48.828624122Z reconcile completed UNSUCCESSFULLY, task id: 52712520-75e1-4c16-9ed3-4fca9147fae6
    2024-08-12T11:23:29.95212056Z reconcile started, task id: 52712520-75e1-4c16-9ed3-4fca9147fae6
    2024-08-12T10:39:22.538428851Z reconcile completed UNSUCCESSFULLY, task id: 7cad48d6-418a-4d29-a74f-247cee5102df
    2024-08-12T10:33:35.842378311Z reconcile started, task id: 7cad48d6-418a-4d29-a74f-247cee5102df
    2024-08-09T09:03:36.199989865Z reconcile completed successfully, task id: fe474e10-a458-4164-86cf-462e699b18f3
    2024-08-09T08:31:40.443801769Z reconcile started, task id: fe474e10-a458-4164-86cf-462e699b18f3
    2024-08-08T07:42:45.238725268Z reconcile completed successfully, task id: 484d0d9c-915b-4b6b-b564-f885622b1da2
    2024-08-08T06:54:21.338104199Z reconcile started, task id: 484d0d9c-915b-4b6b-b564-f885622b1da2
    2024-08-07T12:06:17.925595524Z reconcile completed successfully, task id: 1464d1d5-2f0b-4c3d-9bbd-cb7ca0289414
    2024-08-07T11:06:46.980131315Z reconcile started, task id: 1464d1d5-2f0b-4c3d-9bbd-cb7ca0289414
    Chop - Commit: 4763e9d
    Chop - Date: 2024-08-11T15:42:40
    Chop - Version: 0.24.0
    Clusters: 1
    Endpoint: clickhouse-clickhouse-server.clickhouse.svc.cluster.local
    Error: FAILED to reconcile CHI clickhouse/clickhouse-server, err: crud error - should abort
    Errors:
    2024-08-12T11:28:48.777015704Z FAILED to reconcile CHI clickhouse/clickhouse-server, err: crud error - should abort
    2024-08-12T11:28:48.714002622Z FAILED to reconcile StatefulSet for host: 0-0
    2024-08-12T10:39:22.486244668Z FAILED to reconcile CHI clickhouse/clickhouse-server, err: crud error - should abort
    2024-08-12T10:39:22.425185875Z FAILED to reconcile StatefulSet for host: 0-0
    2024-08-02T11:02:32.090573445Z Update Service: clickhouse/chi-clickhouse-server-cnxcluster-01-2-2 failed with error: just recreate the service in case of service type change ''=>'ClusterIP'
    2024-08-02T10:59:53.95533682Z Update Service: clickhouse/chi-clickhouse-server-cnxcluster-01-2-1 failed with error: just recreate the service in case of service type change ''=>'ClusterIP'
    2024-08-02T10:55:53.299787593Z Update Service: clickhouse/chi-clickhouse-server-cnxcluster-01-2-0 failed with error: just recreate the service in case of service type change ''=>'ClusterIP'
    2024-08-02T10:53:07.937532163Z Update Service: clickhouse/chi-clickhouse-server-cnxcluster-01-1-2 failed with error: just recreate the service in case of service type change ''=>'ClusterIP'
    2024-08-02T10:50:35.054453061Z Update Service: clickhouse/chi-clickhouse-server-cnxcluster-01-1-1 failed with error: just recreate the service in case of service type change ''=>'ClusterIP'
    2024-08-02T10:46:52.219577538Z Update Service: clickhouse/chi-clickhouse-server-cnxcluster-01-1-0 failed with error: just recreate the service in case of service type change ''=>'ClusterIP'
    Fqdns:
    chi-clickhouse-server-cnxcluster-01-0-0.clickhouse.svc.cluster.local
    chi-clickhouse-server-cnxcluster-01-0-1.clickhouse.svc.cluster.local
    chi-clickhouse-server-cnxcluster-01-0-2.clickhouse.svc.cluster.local
    chi-clickhouse-server-cnxcluster-01-1-0.clickhouse.svc.cluster.local
    chi-clickhouse-server-cnxcluster-01-1-1.clickhouse.svc.cluster.local
    chi-clickhouse-server-cnxcluster-01-1-2.clickhouse.svc.cluster.local
    chi-clickhouse-server-cnxcluster-01-2-0.clickhouse.svc.cluster.local
    chi-clickhouse-server-cnxcluster-01-2-1.clickhouse.svc.cluster.local
    chi-clickhouse-server-cnxcluster-01-2-2.clickhouse.svc.cluster.local
```

@karthik-thiyagarajan
Author

karthik-thiyagarajan commented Aug 19, 2024

Can you let me know how to find the CHI version? I see the version below, in case it helps.

```
kubectl get chi -n clickhouse -o wide

NAME                VERSION   CLUSTERS   SHARDS   HOSTS   TASKID                                  STATUS    UPDATED   ADDED   DELETED   DELETE   ENDPOINT
clickhouse-server   0.24.0    1          3        9       52712520-75e1-4c16-9ed3-4fca9147fae6    Aborted                                        clickhouse-clickhouse-server.clickhouse.svc.cluster.local
```
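
For reference, the VERSION column above appears to be the operator (chop) version and matches the Chop - Version field in the status earlier. The same value should show up in the operator pod's image tag, e.g. (the label selector and namespace-wide query below are assumptions; adjust them to your install):

```sh
# Assumption: operator pods carry the app=clickhouse-operator label.
kubectl get pods --all-namespaces -l app=clickhouse-operator \
  -o jsonpath='{range .items[*]}{.spec.containers[0].image}{"\n"}{end}'
```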

@alex-zaitsev
Member

Hi @karthik-thiyagarajan, the error messages indicate that the operator cannot UPDATE StatefulSets and Services, so it has to DELETE and re-CREATE them. The same might happen to PVCs. This is not right; it looks like some permissions are missing.

Could you check the operator's ServiceAccount and the corresponding Role and RoleBinding? Perhaps something is not right when deploying the operator with Spinnaker.
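
As a quick check, kubectl auth can-i can be run against the operator's service account. The names below are assumptions for a default install (service account clickhouse-operator in kube-system); substitute your actual namespace and service account.

```sh
# Assumption: service account "clickhouse-operator" in namespace "kube-system".
SA=system:serviceaccount:kube-system:clickhouse-operator
for verb in get create update patch delete; do
  for res in statefulsets services persistentvolumeclaims; do
    printf '%s %s: ' "$verb" "$res"
    kubectl -n clickhouse auth can-i "$verb" "$res" --as="$SA"
  done
done
```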
