[BUG] CrashLoopBackOff after upgrading Datadog Agent to version 7.57.0 on AKS with security context #29427

Open
LQss11 opened this issue Sep 18, 2024 · 2 comments

Comments

LQss11 commented Sep 18, 2024

Agent Environment

  • Agent version: 7.57.0
  • Cluster Agent version: 7.57.0
  • Operating System: Linux
  • Cloud Provider: Azure

Describe what happened:
We upgraded both the Cluster Agent and Datadog Agent from version 7.56.2 to 7.57.0. After the upgrade, the Datadog Agent began failing with a CrashLoopBackOff error. We were able to resolve the issue by downgrading the Datadog Agent back to version 7.56.2. However, we are concerned about compatibility issues when running different versions of the Cluster Agent and Datadog Agent.

The error encountered in the Datadog Agent logs is as follows:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x7642b8f]

goroutine 668 [running]:
github.com/DataDog/datadog-agent/pkg/logs/launchers/integration.(*Launcher).run(0x0)
        /omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/logs/launchers/integration/launcher.go:78 +0x2f
created by github.com/DataDog/datadog-agent/pkg/logs/launchers/integration.(*Launcher).Start in goroutine 345
        /omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/logs/launchers/integration/launcher.go:66 +0x4f
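
For readers unfamiliar with this kind of trace: the `(0x0)` after `(*Launcher).run` suggests the method was invoked on a nil `*Launcher`. The sketch below is not the agent's code. The names `launcher`, `newLauncher`, `tryRun`, and `safeRun` are hypothetical and only illustrate, under the assumption that a constructor returned nil after a setup failure, how this class of panic arises and how a nil check avoids it.

```go
package main

import (
	"errors"
	"fmt"
)

// launcher is a hypothetical stand-in for a component whose constructor can fail.
type launcher struct {
	stop chan struct{}
}

// newLauncher mimics a constructor that returns nil when setup fails
// (for example, when a required directory cannot be created).
func newLauncher(setupErr error) *launcher {
	if setupErr != nil {
		return nil
	}
	return &launcher{stop: make(chan struct{})}
}

// run reads a struct field, so calling it on a nil receiver panics
// with a nil pointer dereference.
func (l *launcher) run() {
	close(l.stop)
}

// tryRun calls run without a nil check and converts any panic into an
// error so that this demo can print it instead of crashing.
func tryRun(l *launcher) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	l.run()
	return nil
}

// safeRun checks for a nil receiver first and returns an error instead.
func safeRun(l *launcher) error {
	if l == nil {
		return errors.New("launcher was not initialized")
	}
	l.run()
	return nil
}

func main() {
	// Simulate a failed setup step (assumed here to be a mkdir failure).
	l := newLauncher(errors.New("mkdir: permission denied"))
	fmt.Println("unguarded:", tryRun(l))
	fmt.Println("guarded:", safeRun(l))
}
```

In the real agent the panic occurs in a separate goroutine, where nothing recovers it, so the whole process exits and Kubernetes restarts the container.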

The issue occurs when applying the following security context to the agent container:

securityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  runAsNonRoot: true
  runAsUser: 100

Describe what you expected:
We expected the Datadog Agent to run successfully on version 7.57.0 with the specified security context applied. Additionally, we expect both the Cluster Agent and Datadog Agent to work on version 7.57.0 without encountering the CrashLoopBackOff error.

Steps to reproduce the issue:

  1. Upgrade both the Datadog Agent and Cluster Agent to version 7.57.0.
  2. Apply the following configuration for the agents and security context:
agents:
  containers:
    agent:
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        runAsNonRoot: true
        runAsUser: 100
    initContainers:
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        runAsNonRoot: true
        runAsUser: 100
    processAgent:
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        runAsNonRoot: true
        runAsUser: 100
    traceAgent:
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        runAsNonRoot: true
        runAsUser: 100
  image:
    doNotCheckTag: true
    tag: 7.57.0

clusterAgent:
  admissionController:
    configMode: service
    enabled: true
    mutateUnlabelled: true
  containers:
    clusterAgent:
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        runAsNonRoot: true
        runAsUser: 100
    initContainers:
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        runAsNonRoot: true
        runAsUser: 100
  createPodDisruptionBudget: true
  env:
  - name: DD_ADMISSION_CONTROLLER_AUTO_INSTRUMENTATION_INIT_SECURITY_CONTEXT
    value: '{"capabilities":{"drop":["ALL"]},"runAsNonRoot":true,"runAsUser":10000,"readOnlyRootFilesystem":true,"allowPrivilegeEscalation":false,"seccompProfile":{"type":"RuntimeDefault"}}'
  - name: DD_APM_INSTRUMENTATION_VERSION
    value: v1
  image:
    doNotCheckTag: true
    tag: 7.57.0
  replicas: 2

clusterChecksRunner:
  enabled: true
  securityContext:
    runAsNonRoot: true
    runAsUser: 100

datadog:
  apiKeyExistingSecret: datadog-secret
  apm:
    instrumentation:
      enabled: false
    portEnabled: true
    socketEnabled: false
  kubeStateMetricsCore:
    useClusterCheckRunners: true
  kubelet:
    tlsVerify: false
  logLevel: DEBUG
  logs:
    containerCollectAll: true
    enabled: true
  processAgent:
    enabled: true
    processCollection: true
  secretBackend:
    command: /readsecret_multiple_providers.sh
  securityContext:
    runAsNonRoot: true
    runAsUser: 100
  site: datadoghq.com
  tags:
  - env:prod

providers:
  aks:
    enabled: true

targetSystem: linux

Additional environment details (Operating System, Cloud provider, etc):

  • Cluster: AKS

LQss11 commented Sep 18, 2024

Full log below for further investigation:

2024-09-18 14:50:24 UTC | CORE | INFO | (pkg/util/log/log.go:841 in func1) | Starting to load the configuration
2024-09-18 14:50:24 UTC | CORE | WARN | (pkg/util/log/log.go:886 in func1) | Unknown environment variable: DD_GIT_REPOSITORY_URL
2024-09-18 14:50:24 UTC | CORE | WARN | (pkg/util/log/log.go:886 in func1) | Unknown environment variable: DD_GIT_COMMIT_SHA
2024-09-18 14:50:24 UTC | CORE | INFO | (pkg/util/log/log.go:841 in func1) | Loading proxy settings
2024-09-18 14:50:24 UTC | CORE | DEBUG | (pkg/util/log/log.go:806 in func1) | 'use_proxy_for_cloud_metadata' is enabled: adding cloud provider URL to the no_proxy list
2024-09-18 14:50:24 UTC | CORE | INFO | (pkg/util/log/log.go:841 in func1) | Starting to resolve secrets
2024-09-18 14:50:24 UTC | CORE | WARN | (pkg/util/log/log.go:886 in func1) | Agent configuration relax permissions constraint on the secret backend cmd, Group can read and exec
2024-09-18 14:50:24 UTC | CORE | INFO | (pkg/util/log/log.go:841 in func1) | Finished resolving secrets
...
2024-09-18 14:50:28 UTC | CORE | DEBUG | (pkg/config/utils/trace.go:28 in GetTraceAgentDefaultEnv) | Setting DefaultEnv to "prod" (from `env:` entry under the 'tags' config option: "env:prod")
2024-09-18 14:50:28 UTC | CORE | INFO | (pkg/config/remote/service/util.go:48 in recreate) | Clear remote configuration database
2024-09-18 14:50:28 UTC | CORE | ERROR | (comp/remote-config/rcservice/rcserviceimpl/rcservice.go:59 in newRemoteConfigServiceOptional) | remote config service not initialized or started: unable to create remote config service: open /opt/datadog-agent/run/remote-config.db: permission denied
2024-09-18 14:50:28 UTC | CORE | INFO | (comp/core/gui/guiimpl/gui.go:105 in newGui) | GUI server port -1 specified: not starting the GUI.
2024-09-18 14:50:28 UTC | CORE | INFO | (comp/core/agenttelemetry/impl/agenttelemetry.go:113 in createAtel) | Agent telemetry is disabled
2024-09-18 14:50:28 UTC | CORE | WARN | (pkg/config/model/viper.go:225 in checkKnownKey) | config key runtime_security_config.sbom.enabled is unknown
2024-09-18 14:50:28 UTC | CORE | INFO | (comp/core/workloadmeta/impl/store.go:100 in start) | workloadmeta store initialized successfully
...
2024-09-18 14:50:28 UTC | CORE | INFO | (comp/core/autodiscovery/providers/config_reader.go:171 in read) | Searching for configuration files at: /opt/datadog-agent/bin/agent/dist/conf.d
2024-09-18 14:50:28 UTC | CORE | WARN | (comp/core/autodiscovery/providers/config_reader.go:175 in read) | Skipping, open /opt/datadog-agent/bin/agent/dist/conf.d: no such file or directory
2024-09-18 14:50:28 UTC | CORE | INFO | (comp/core/autodiscovery/providers/config_reader.go:171 in read) | Searching for configuration files at: 
2024-09-18 14:50:28 UTC | CORE | WARN | (comp/core/autodiscovery/providers/config_reader.go:175 in read) | Skipping, open : no such file or directory
2024-09-18 14:50:28 UTC | CORE | INFO | (pkg/config/autodiscovery/autodiscovery.go:111 in DiscoverComponentsFromEnv) | Adding KubeContainer provider from environment       
2024-09-18 14:50:28 UTC | CORE | INFO | (pkg/config/autodiscovery/autodiscovery.go:121 in DiscoverComponentsFromEnv) | Adding Kubelet listener from environment
...
2024-09-18 14:50:28 UTC | CORE | DEBUG | (comp/metadata/inventoryagent/inventoryagentimpl/inventoryagent.go:389 in Set) | setting inventory agent metadata 'logs_transport': 'HTTP'
2024-09-18 14:50:28 UTC | CORE | WARN | (pkg/logs/launchers/integration/launcher.go:49 in NewLauncher) | Unable to make integrations logs directory:  mkdir /opt/datadog-agent/run/integrations: permission denied
2024-09-18 14:50:28 UTC | CORE | INFO | (pkg/logs/auditor/auditor.go:203 in recoverRegistry) | Could not find state file at "/opt/datadog-agent/run/registry.json", will start with default offsets
2024-09-18 14:50:28 UTC | CORE | DEBUG | (pkg/logs/sds/scanner.go:65 in CreateScanner) | creating a new SDS scanner (internal id: 0xc001cea540)
2024-09-18 14:50:28 UTC | CORE | DEBUG | (pkg/logs/sds/scanner.go:65 in CreateScanner) | creating a new SDS scanner (internal id: 0xc001cea6c0)
2024-09-18 14:50:28 UTC | CORE | DEBUG | (pkg/logs/sds/scanner.go:65 in CreateScanner) | creating a new SDS scanner (internal id: 0xc001cea840)
2024-09-18 14:50:28 UTC | CORE | DEBUG | (pkg/logs/sds/scanner.go:65 in CreateScanner) | creating a new SDS scanner (internal id: 0xc001cea9c0)
2024-09-18 14:50:28 UTC | CORE | INFO | (comp/logs/agent/agentimpl/agent.go:198 in start) | logs-agent started
...
2024-09-18 14:50:28 UTC | CORE | INFO | (comp/forwarder/defaultforwarder/default_forwarder.go:394 in Start) | Forwarder started, sending to 1 endpoint(s) with 1 worker(s) each: "https://process.datadoghq.com" (1 api key(s))
2024-09-18 14:50:28 UTC | CORE | ERROR | (comp/dogstatsd/server/server.go:376 in start) | Can't init UDS listener on path /var/run/datadog/dsd.socket: can't listen: listen unixgram /var/run/datadog/dsd.socket: bind: permission denied
2024-09-18 14:50:28 UTC | CORE | DEBUG | (comp/dogstatsd/listeners/udp.go:100 in NewUDPListener) | dogstatsd-udp: [::]:8125 successfully initialized
2024-09-18 14:50:28 UTC | CORE | INFO | (pkg/aggregator/demultiplexer.go:224 in getDogStatsDWorkerAndPipelineCount) | Dogstatsd workers and pipelines count:  2  workers,  1  pipelines
2024-09-18 14:50:28 UTC | CORE | INFO | (pkg/aggregator/demultiplexer.go:142 in GetDogStatsDWorkerAndPipelineCount) | Dogstatsd configured to run with 2 workers and 1 pipelines
2024-09-18 14:50:28 UTC | CORE | DEBUG | (comp/dogstatsd/server/server.go:523 in handleMessages) | DogStatsD will run 2 workers
2024-09-18 14:50:28 UTC | CORE | INFO | (pkg/aggregator/demultiplexer.go:224 in getDogStatsDWorkerAndPipelineCount) | Dogstatsd workers and pipelines count:  2  workers,  1  pipelines
2024-09-18 14:50:28 UTC | CORE | INFO | (pkg/aggregator/demultiplexer.go:142 in GetDogStatsDWorkerAndPipelineCount) | Dogstatsd configured to run with 2 workers and 1 pipelines
....
2024-09-18 14:50:28 UTC | CORE | DEBUG | (pkg/util/containers/metrics/provider/collector.go:35 in bestCollector) | Using collector id: kubelet for type: provider.CollectorRef[github.com/DataDog/datadog-agent/pkg/util/containers/metrics/provider.ContainerIDForPodUIDAndContNameRetriever] and runtime: docker
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x7642b8f]

goroutine 668 [running]:
github.com/DataDog/datadog-agent/pkg/logs/launchers/integration.(*Launcher).run(0x0)
        /omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/logs/launchers/integration/launcher.go:78 +0x2f
created by github.com/DataDog/datadog-agent/pkg/logs/launchers/integration.(*Launcher).Start in goroutine 345
        /omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/logs/launchers/integration/launcher.go:66 +0x4f

@FlorentClarret
Member

Hello @LQss11, thanks for opening this issue.

I think it's the same issue mentioned in #29285. We are already preparing a new patch release with a fix for this; it will ship with Agent 7.57.2.
