Upgrading GKE Autopilot to 1.25 breaks Otel Operator daemonset #1005

Closed
mohaldu opened this issue Nov 6, 2023 · 4 comments

mohaldu commented Nov 6, 2023

We're using GKE Autopilot. We recently upgraded the chart to 0.86.1 and then upgraded GKE to Kubernetes 1.25, with the following values and Chart:

splunk-otel-collector:
  clusterName: development-us-central1
  environment: development
  agent:
    config:
      processors:
        resource/delete:
          attributes:
            - key: "telemetry.auto.version"
              action: delete
            - key: "net.protocol.version"
              action: delete
            - key: "telemetry.sdk.version"
              action: delete
            - key: "telemetry.sdk.language"
              action: delete
            - key: "telemetry.sdk.name"
              action: delete
            - key: "process.executable.path"
              action: delete
            - key: "process.command_args"
              action: delete
            - key: "process.runtime.name"
              action: delete
            - key: "process.runtime.version"
              action: delete
      exporters:
        signalfx:
          include_metrics:
            - metric_names: [cpu.interrupt, cpu.user, cpu.system]
            - metric_name: system.cpu.time
              dimensions:
                state: [interrupt, user, system]
          access_token: ${SPLUNK_OBSERVABILITY_ACCESS_TOKEN}
          api_url: https://api.us1.signalfx.com
          correlation: null
          ingest_url: https://ingest.us1.signalfx.com
          sync_host_metadata: true
      service:
        pipelines:
          metrics:
            exporters:
            - signalfx
            processors:
            - memory_limiter
            - batch
            - resourcedetection
            - resource
            - resource/delete
            - resource/add_environment
            receivers:
            - hostmetrics
            - kubeletstats
            - otlp
            - receiver_creator
            - signalfx
  secret:
    name: "splunk-otel-secret"
    create: false
    validateSecret: false
  splunkObservability:
    realm: us1
    profilingEnabled: true
  certmanager:
    enabled: false
    global:
      leaderElection:
        namespace: "cert-manager"
    installCRDs: false
  operator:
    enabled: true
    admissionWebhooks:
      certManager:
        certificateAnnotations:
          "helm.sh/hook": pre-install,post-upgrade
          "helm.sh/hook-weight": "1"
        issuerAnnotations:
          "helm.sh/hook": pre-install,post-upgrade
          "helm.sh/hook-weight": "1"
  distribution: gke/autopilot
  cloudProvider: gcp

Chart:

apiVersion: v2
name: otel-operator
description: A Helm library chart for Splunk Otel Operator in Kubernetes
type: application
version: 1.0.0
dependencies:
  - name: splunk-otel-collector 
    version: 0.86.1
    repository: https://signalfx.github.io/splunk-otel-collector-chart

This gives us the following error when the daemonset tries to deploy:

Error creating: admission webhook "warden-validating.common-webhooks.networking.gke.io" denied the request: GKE Warden rejected the request because it violates one or more constraints. Violations details: {"[denied by autogke-disallow-hostnamespaces]":["enabling hostNetwork is not allowed in Autopilot."],"[denied by autogke-no-host-port]":["container otel-collector specifies host ports [14250 14268 4317 4318 55681 9080 9943 9411], which are disallowed in Autopilot."],"[denied by autogke-no-write-mode-hostpath]":["hostPath volume host-dev used in container otel-collector uses path /dev which is not allowed in Autopilot. Allowed path prefixes for hostPath volumes are: [/var/log/].","hostPath volume host-etc used in container otel-collector uses path /etc which is not allowed in Autopilot. Allowed path prefixes for hostPath volumes are: [/var/log/].","hostPath volume host-proc used in container otel-collector uses path /proc which is not allowed in Autopilot. Allowed path prefixes for hostPath volumes are: [/var/log/].","hostPath volume host-run-udev-data used in container otel-collector uses path /run/udev/data which is not allowed in Autopilot. Allowed path prefixes for hostPath volumes are: [/var/log/].","hostPath volume host-sys used in container otel-collector uses path /sys which is not allowed in Autopilot. Allowed path prefixes for hostPath volumes are: [/var/log/].","hostPath volume host-var-run-utmp used in container otel-collector uses path /var/run/utmp which is not allowed in Autopilot. Allowed path prefixes for hostPath volumes are: [/var/log/]."]} Requested by user: 'system:serviceaccount:kube-system:daemon-set-controller', groups: 'system:serviceaccounts,system:serviceaccounts:kube-system,system:authenticated'.
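As a quick way to see which parts of a rendered chart Autopilot will reject, you can grep the output of `helm template` for the fields named in the violations above (hostNetwork, hostPort, and non-`/var/log` hostPath volumes). This is a minimal sketch; `manifest.yaml` stands in for the real `helm template` output so the commands run end to end:

```shell
# Sketch: scan a rendered manifest for fields GKE Autopilot disallows.
# In practice, manifest.yaml would come from:
#   helm template <release> <chart> > manifest.yaml
# Here we create a minimal stand-in daemonset spec for illustration.
cat > manifest.yaml <<'EOF'
kind: DaemonSet
spec:
  template:
    spec:
      hostNetwork: true
      volumes:
        - name: host-dev
          hostPath:
            path: /dev
EOF

# Any hits here correspond to the autogke-* constraint violations:
# hostNetwork triggers autogke-disallow-hostnamespaces, and hostPath
# volumes outside /var/log/ trigger autogke-no-write-mode-hostpath.
grep -nE 'hostNetwork: true|hostPath:' manifest.yaml
```

Running the same grep against the chart's actual rendered output should show whether a given chart version still requests host access on an Autopilot cluster.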
dmitryax (Contributor) commented Nov 6, 2023

Hi @mohaldu. Can you help identify which particular change caused this? Did the same configuration work on an older helm chart version?

mohaldu (Author) commented Nov 7, 2023

The latest version of the chart worked on GKE 1.24, but after upgrading to 1.25 it broke. I'll see if I can downgrade my way back to a working state. But from what I see in your chart here - https://github.com/signalfx/splunk-otel-collector-chart/blob/main/helm-charts/splunk-otel-collector/templates/daemonset.yaml#L57

it looks like hostNetwork is always set to true, even on Autopilot? Are we just meant to use a deployment method other than a daemonset for GKE Autopilot?

mohaldu (Author) commented Nov 9, 2023

Hi @dmitryax, we found that chart v0.82 works perfectly fine.

atoulme (Contributor) commented Nov 17, 2023

We are working on this. Please open a support case to follow up with us directly. Thanks!

atoulme closed this as completed Nov 17, 2023