[Question] Trouble with HA for LAPI Pod #181

Closed
ImranR98 opened this issue Aug 26, 2024 · 7 comments · Fixed by #186
Labels
kind/documentation Improvements or additions to documentation needs/triage Needs triage

Comments

@ImranR98

I've been trying to get this to work in a small testing environment with Traefik. My current config seems to work fine with a single LAPI pod backed by a Postgres DB and connected to 2 agents on 2 nodes.

But if I try setting the lapi.replicas value to 2, I get the following error in one of the two pods when I try to run a cscli command (like cscli decisions list):
level=fatal msg="unable to retrieve decisions: performing request: Get \"http://localhost:8080/v1/alerts?has_active_decision=true&include_capi=false&limit=100\": API error: incorrect Username or Password" command terminated with exit code 1

This is my values.yaml:

config:
  config.yaml.local: |
    db_config:
      type:     postgresql
      user:     ${DB_USERNAME}
      password: ${DB_PASSWORD}
      db_name:  ${DB_NAME}
      host:     crowdsec-db.production.svc.cluster.local
      port:     5432
      sslmode:  disable

container_runtime: containerd

agent:
  acquisition:
    - namespace: production
      podName: traefik-*
      program: traefik
  env:
    - name: COLLECTIONS
      value: "crowdsecurity/traefik"
    - name: LEVEL_DEBUG
      value: "false"

lapi:
  replicas: 2 # Seems to not work with multiple replicas
  dashboard:
    enabled: true
  env:
    - name: BOUNCER_KEY_traefik
      value: "<some long value>"
    - name: DB_NAME
      valueFrom:
        secretKeyRef:
          name: crowdsec-db-secret
          key: POSTGRES_DB
    - name: DB_USERNAME
      valueFrom:
        secretKeyRef:
          name: crowdsec-db-secret
          key: POSTGRES_USER
    - name: DB_PASSWORD
      valueFrom:
        secretKeyRef:
          name: crowdsec-db-secret
          key: POSTGRES_PASSWORD
  persistentVolume:
    config:
      enabled: false
    data:
      enabled: false
  secrets:
    csLapiSecret: "<some long value>" # I set this to try and fix the issue (it didn't)

My assumption was that since I have disabled persistent volumes and configured a DB instead, both LAPI instances would connect to the same DB and have no issues. But I've clearly misunderstood how everything fits together. Would appreciate anyone pointing me in the right direction!

@github-actions github-actions bot added the needs/triage Needs triage label Aug 26, 2024

@ImranR98: Thanks for opening an issue, it is currently awaiting triage.

If you haven't already, please provide the following information:

  • kind: bug, enhancement or documentation
  • area : agent, appsec, configuration, cscli, local-api

In the meantime, you can:

  1. Check Crowdsec Documentation to see if your issue can be self resolved.
  2. You can also join our Discord.
  3. Check Releases to make sure your agent is on the latest version.

I am a bot created to help the crowdsecurity developers manage community feedback and contributions. You can check out my manifest file to understand my behavior and what I can do. If you want to use this for your project, you can check out the forked project rr404/oss-governance-bot repository.

@github-actions github-actions bot added the needs/kind Kind label required label Aug 26, 2024

@ImranR98: There is no 'kind' label on this issue. You need a 'kind' label to start the triage process.

  • /kind bug
  • /kind documentation
  • /kind enhancement

@ImranR98
Author

/kind documentation
/area local-api

@github-actions github-actions bot added kind/documentation Improvements or additions to documentation and removed needs/kind Kind label required labels Aug 26, 2024
@crowdsecurity crowdsecurity deleted a comment Aug 26, 2024
@crowdsecurity crowdsecurity deleted a comment Aug 26, 2024
@crowdsecurity crowdsecurity deleted a comment Aug 26, 2024
@crowdsecurity crowdsecurity deleted a comment Aug 26, 2024
@he2ss
Member

he2ss commented Aug 27, 2024

Hi, the solution is for the chart to check whether replicas are enabled (more than 1) and, if so, suffix the CUSTOM_HOSTNAME env var with an index.

Discussed with @blotus.

@ImranR98
Author

I'm not sure I understand, but glad to see there's a PR to fix it 🚀
Just to clarify, does this mean that - even without the PR you made - Crowdsec is actually working as expected aside from cscli availability? I assumed the lack of cscli access meant there was something else wrong with the pod.

@LaurenceJJones
Contributor

LaurenceJJones commented Aug 30, 2024

> I'm not sure I understand, but glad to see there's a PR to fix it 🚀 Just to clarify, does this mean that - even without the PR you made - Crowdsec is actually working as expected aside from cscli availability? I assumed the lack of cscli access meant there was something else wrong with the pod.

So, a not-so-TL;DR:

When the LAPI pods come up, they need working credentials, so they execute a direct machine add command. By default the container chooses the name "localhost", which is the default value for CUSTOM_HOSTNAME. Since both LAPIs use the same name, the startup script in each one deletes the LAPI credentials that the other just registered (each LAPI believes itself to be unique, so if the name already exists it assumes the LAPI pod was deleted and its credentials were lost). That is why you have one LAPI that works with cscli and another that does not.

The side effect is that one of the LAPIs will keep working for a couple of hours because its JWT token is still valid. Once the token expires, that LAPI will start to get authentication errors, since the username and password it registered no longer exist in the database.

The fix: we now force each LAPI to have a unique name by using the randomly generated pod name from the pod metadata, which stops the name collision.
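
For readers on a chart version without the fix, the same idea can probably be approximated from values.yaml. The fragment below is a minimal sketch, not the chart's actual change: it assumes the chart passes lapi.env entries through to the pod spec unchanged (as it appears to do for the secretKeyRef entries in the values.yaml above) and that the chart does not already set CUSTOM_HOSTNAME itself. It uses the Kubernetes Downward API to give each LAPI pod its own pod name as CUSTOM_HOSTNAME.

```yaml
lapi:
  replicas: 2
  env:
    # Assumption: lapi.env entries are copied into the LAPI pod spec as-is.
    # CUSTOM_HOSTNAME is the variable named in this thread; "localhost" is
    # its default, which is what causes the collision between replicas.
    - name: CUSTOM_HOSTNAME
      valueFrom:
        fieldRef:
          # Kubernetes Downward API: resolves to the pod's own generated name,
          # so every replica registers under a different machine name.
          fieldPath: metadata.name
```

On a chart version that includes the fix from #186, this override should not be needed.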

@ImranR98
Author

Okay that makes sense, thanks for the explanation!

@he2ss he2ss closed this as completed in #186 Sep 3, 2024
5 participants
@blotus @he2ss @LaurenceJJones @ImranR98 and others