Cluster service deleted on upgrade due to reconcile failure #1452

Open
mbrancato opened this issue Jul 10, 2024 · 3 comments

@mbrancato

While performing an upgrade via Helm from 0.23.2 to 0.23.6, I ran across a problem where the cluster service disappeared. I also included a minor upgrade of the altinitystable image, but I don't think that is related.

The important bits in my CHI resource:

spec:
  defaults:
    templates:
      podTemplate: default-clickhouse-pod
      dataVolumeClaimTemplate: default-data-volume
      logVolumeClaimTemplate: default-log-volume
      clusterServiceTemplate: default-service-template
  configuration:
    settings:
      logger/level: information
    clusters:
      - name: events
        layout:
          shardsCount: 1
          replicasCount: 3
        secret:
          auto: "true"
  templates:
    serviceTemplates:
      - name: default-service-template
        generateName: clickhouse-{chi}
        metadata:
          annotations:
            cloud.google.com/load-balancer-type: "Internal"
            service.beta.kubernetes.io/aws-load-balancer-internal: "true"
            service.beta.kubernetes.io/azure-load-balancer-internal: "true"
            service.beta.kubernetes.io/openstack-internal-load-balancer: "true"
            service.beta.kubernetes.io/cce-load-balancer-internal-vpc: "true"
        spec:
          ports:
            - name: http
              port: 8123
            - name: tcp
              port: 9000
          type: LoadBalancer

When the operator upgraded, it appeared to get stuck attempting to convert clickhouse-events from a LoadBalancer to a ClusterIP. I believe this is somehow related to this commit that changes the default from LoadBalancer to ClusterIP. However, this CHI has always explicitly set the template to use LoadBalancer.
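To illustrate the suspected failure mode (this is only a sketch, not the operator's code): if a reconcile pass renders the service from something other than the explicit template, an empty type falls back to ClusterIP, and a comparison against the existing LoadBalancer service then reports a type change. The function names `effectiveType` and `typeChange` are hypothetical.

```go
package main

import "fmt"

// effectiveType mimics a defaulting step: an empty type falls back to
// ClusterIP, which is also what Kubernetes does for a Service without spec.type.
// Hypothetical helper, not taken from the operator.
func effectiveType(templateType string) string {
	if templateType == "" {
		return "ClusterIP"
	}
	return templateType
}

// typeChange reports whether a reconcile would see a type change between the
// existing service and the one rendered from the template, and describes it.
func typeChange(existing, templateType string) (bool, string) {
	desired := effectiveType(templateType)
	return existing != desired, fmt.Sprintf("'%s'=>'%s'", existing, desired)
}

func main() {
	// Template type is honored: no change detected.
	fmt.Println(typeChange("LoadBalancer", "LoadBalancer"))
	// Template type is lost somewhere along the way: the default wins.
	fmt.Println(typeChange("LoadBalancer", ""))
}
```

If something like this is happening, the explicit `type: LoadBalancer` in the template would be honored on the first pass and lost on a later one, but I have not confirmed that in the operator source.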

On startup, I saw this in the logs:

I0710 05:05:48.675757       1 service.go:86] CreateServiceCluster():foo/events/c857c1dd-66bf-4182-96eb-3e45f61664ee:foo/clickhouse-events
I0710 05:05:48.676889       1 worker-chi-reconciler.go:907] reconcileService():foo/events/c857c1dd-66bf-4182-96eb-3e45f61664ee:Service: foo/clickhouse-events not found. err: service "clickhouse-events" not found
I0710 05:05:48.840035       1 deleter.go:322] deleteServiceIfExists():foo/clickhouse-events:Not Found Service: foo/clickhouse-events err: services "clickhouse-events" not found
I0710 05:05:49.062109       1 worker.go:1480] createService():foo/events/c857c1dd-66bf-4182-96eb-3e45f61664ee:OK Create Service: foo/clickhouse-events
I0710 05:05:49.883043       1 worker-chi-reconciler.go:922] reconcileService():foo/events/c857c1dd-66bf-4182-96eb-3e45f61664ee:Service reconcile successful: foo/clickhouse-events

...

I0710 05:06:25.213119       1 worker-chi-reconciler.go:900] reconcileService():foo/events/c857c1dd-66bf-4182-96eb-3e45f61664ee:Service found: foo/clickhouse-events. Will try to update
E0710 05:06:25.213168       1 worker-chi-reconciler.go:914] reconcileService():foo/events/c857c1dd-66bf-4182-96eb-3e45f61664ee:Update Service: foo/clickhouse-events failed with error: just recreate the service in case of service type change 'LoadBalancer'=>'ClusterIP'
I0710 05:06:26.384478       1 deleter.go:329] deleteServiceIfExists():foo/clickhouse-events:OK delete Service: foo/clickhouse-events
E0710 05:06:26.584816       1 worker.go:1486] createService():foo/events/c857c1dd-66bf-4182-96eb-3e45f61664ee:FAILED Create Service: foo/clickhouse-events err: object is being deleted: services "clickhouse-events" already exists
E0710 05:06:27.422151       1 worker-chi-reconciler.go:928] reconcileService():foo/events/c857c1dd-66bf-4182-96eb-3e45f61664ee:FAILED to reconcile Service: foo/clickhouse-events CHI: events

The service is now recreated whenever I force a restart of the operator, and is deleted again about a minute later. It is not recreated until the operator restarts again.

I0710 05:16:25.276854       1 service.go:86] CreateServiceCluster():foo/events/a84245ce-80bc-4fb9-ad58-c69a77019a70:foo/clickhouse-events
I0710 05:16:25.278246       1 worker-chi-reconciler.go:907] reconcileService():foo/events/a84245ce-80bc-4fb9-ad58-c69a77019a70:Service: foo/clickhouse-events not found. err: service "clickhouse-events" not found
I0710 05:16:25.435511       1 deleter.go:322] deleteServiceIfExists():foo/clickhouse-events:Not Found Service: foo/clickhouse-events err: services "clickhouse-events" not found
I0710 05:16:25.805221       1 worker.go:1480] createService():foo/events/a84245ce-80bc-4fb9-ad58-c69a77019a70:OK Create Service: foo/clickhouse-events
I0710 05:16:26.468825       1 worker-chi-reconciler.go:922] reconcileService():foo/events/a84245ce-80bc-4fb9-ad58-c69a77019a70:Service reconcile successful: foo/clickhouse-events

...

I0710 05:17:26.904518       1 worker-chi-reconciler.go:900] reconcileService():foo/events/a84245ce-80bc-4fb9-ad58-c69a77019a70:Service found: foo/clickhouse-events. Will try to update
E0710 05:17:26.904648       1 worker-chi-reconciler.go:914] reconcileService():foo/events/a84245ce-80bc-4fb9-ad58-c69a77019a70:Update Service: foo/clickhouse-events failed with error: just recreate the service in case of service type change 'LoadBalancer'=>'ClusterIP'
I0710 05:17:28.073703       1 deleter.go:329] deleteServiceIfExists():foo/clickhouse-events:OK delete Service: foo/clickhouse-events
E0710 05:17:28.274057       1 worker.go:1486] createService():foo/events/a84245ce-80bc-4fb9-ad58-c69a77019a70:FAILED Create Service: foo/clickhouse-events err: object is being deleted: services "clickhouse-events" already exists
E0710 05:17:29.119358       1 worker-chi-reconciler.go:928] reconcileService():foo/events/a84245ce-80bc-4fb9-ad58-c69a77019a70:FAILED to reconcile Service: foo/clickhouse-events CHI: events 

Note: When the service is created, it is created correctly as a LoadBalancer, but the second reconciliation then attempts to change it to a ClusterIP again.

@Slach
Collaborator

Slach commented Jul 10, 2024

@mbrancato
Author

@Slach I had not updated the CRD. I have done so now, and it is still happening. Do I need to manually set a status.hostsUnchanged value in the CHI status?

% kubectl -n clickhouse get deploy chop-altinity-clickhouse-operator -o yaml | grep "image:"
        image: altinity/clickhouse-operator:0.23.6
        image: altinity/metrics-exporter:0.23.6
% kubectl get crd clickhouseinstallations.clickhouse.altinity.com -o yaml | grep "clickhouse.altinity.com/chop"
    clickhouse.altinity.com/chop: 0.23.6

I0710 17:16:03.682328       1 service.go:86] CreateServiceCluster():foo/events/e425f007-f928-4649-92ed-3fb1598f1915:foo/clickhouse-events
I0710 17:16:03.683552       1 worker-chi-reconciler.go:907] reconcileService():foo/events/e425f007-f928-4649-92ed-3fb1598f1915:Service: foo/clickhouse-events not found. err: service "clickhouse-events" not found
I0710 17:16:03.850464       1 deleter.go:322] deleteServiceIfExists():foo/clickhouse-events:Not Found Service: foo/clickhouse-events err: services "clickhouse-events" not found
I0710 17:16:04.074642       1 worker.go:1480] createService():foo/events/e425f007-f928-4649-92ed-3fb1598f1915:OK Create Service: foo/clickhouse-events
I0710 17:16:04.882621       1 worker-chi-reconciler.go:922] reconcileService():foo/events/e425f007-f928-4649-92ed-3fb1598f1915:Service reconcile successful: foo/clickhouse-events


I0710 17:16:17.088227       1 worker-chi-reconciler.go:900] reconcileService():foo/events/e425f007-f928-4649-92ed-3fb1598f1915:Service found: foo/clickhouse-events. Will try to update
E0710 17:16:17.088280       1 worker-chi-reconciler.go:914] reconcileService():foo/events/e425f007-f928-4649-92ed-3fb1598f1915:Update Service: foo/clickhouse-events failed with error: just recreate the service in case of service type change 'LoadBalancer'=>'ClusterIP'
I0710 17:16:18.254178       1 deleter.go:329] deleteServiceIfExists():foo/clickhouse-events:OK delete Service: foo/clickhouse-events
E0710 17:16:18.453646       1 worker.go:1486] createService():foo/events/e425f007-f928-4649-92ed-3fb1598f1915:FAILED Create Service: foo/clickhouse-events err: object is being deleted: services "clickhouse-events" already exists
E0710 17:16:19.295985       1 worker-chi-reconciler.go:928] reconcileService():foo/events/e425f007-f928-4649-92ed-3fb1598f1915:FAILED to reconcile Service: foo/clickhouse-events CHI: events 

@mbrancato
Author

mbrancato commented Jul 10, 2024

I tried adding a value to status.hostsUnchanged (that field was the only change compared to the old installed CRD), and it made no difference. The CHOP is still constantly deleting the cluster service.

--- deploy/operatorhub/0.23.2/clickhouseinstallations.clickhouse.altinity.com.crd.yaml	2024-07-10 14:26:54
+++ deploy/operatorhub/0.23.6/clickhouseinstallations.clickhouse.altinity.com.crd.yaml	2024-07-10 14:26:54
@@ -4,14 +4,14 @@
 # SINGULAR=clickhouseinstallation
 # PLURAL=clickhouseinstallations
 # SHORT=chi
-# OPERATOR_VERSION=0.23.2
+# OPERATOR_VERSION=0.23.6
 #
 apiVersion: apiextensions.k8s.io/v1
 kind: CustomResourceDefinition
 metadata:
   name: clickhouseinstallations.clickhouse.altinity.com
   labels:
-    clickhouse.altinity.com/chop: 0.23.2
+    clickhouse.altinity.com/chop: 0.23.6
 spec:
   group: clickhouse.altinity.com
   scope: Namespaced
@@ -53,6 +53,11 @@
           type: string
           description: CHI status
           jsonPath: .status.status
+        - name: hosts-unchanged
+          type: integer
+          description: Unchanged hosts count
+          priority: 1 # show in wide view
+          jsonPath: .status.hostsUnchanged
         - name: hosts-updated
           type: integer
           description: Updated hosts count
@@ -172,6 +177,10 @@
                   nullable: true
                   items:
                     type: string
+                hostsUnchanged:
+                  type: integer
+                  minimum: 0
+                  description: "Unchanged Hosts count"
                 hostsUpdated:
                   type: integer
                   minimum: 0
