
The ability to define GUNICORN_WORKERS and CONN_MAX_AGE env vars via helm chart's values #74

Open
Bregor opened this issue Aug 20, 2018 · 12 comments · May be fixed by teamhephy/workflow#66

Comments

@Bregor (Contributor) commented Aug 20, 2018

Following this code, without the ability to set GUNICORN_WORKERS the number of workers equals number_of_cpus * 4 + 1.
Django's default behaviour is to open one connection per worker (GUNICORN_WORKERS connections in total) and keep each one open for CONN_MAX_AGE seconds.
CONN_MAX_AGE is hard-coded right there to 600 seconds, and the max_connections parameter of PostgreSQL in deis-database is likewise hard-coded to 100.
So, let's imagine we have a big metal chassis with 40 CPUs.
Our GUNICORN_WORKERS value will be 40 * 4 + 1 = 161 (already greater than 100), and CONN_MAX_AGE is 600 seconds.
This way, after 500 seconds (the health check runs once per 5 seconds, so 100 times) the SELECT 0 queries issued by the health check will own all 100 connections to PostgreSQL.
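The arithmetic above can be sketched in a few lines of Python. The worker formula, health-check period, connection TTL, and connection limit are the values quoted in this thread; the script itself is only illustrative:

```python
# Illustrative sketch of the numbers described above.
# Assumed values (all taken from this thread): workers = cpus * 4 + 1,
# health check every 5 s, CONN_MAX_AGE = 600 s, max_connections = 100.

def gunicorn_workers(cpus: int) -> int:
    """Default worker count used by the controller: number_of_cpus * 4 + 1."""
    return cpus * 4 + 1

CPUS = 40
MAX_CONNECTIONS = 100
CONN_MAX_AGE = 600       # seconds a pooled connection stays open
HEALTHCHECK_PERIOD = 5   # seconds between SELECT 0 health checks

workers = gunicorn_workers(CPUS)
print(workers)           # 161 -- already more workers than allowed connections

# Each health check can cause one more worker to open a connection, and none
# is released before CONN_MAX_AGE elapses.  The pool is exhausted once
# max_connections health checks have run:
seconds_to_exhaustion = MAX_CONNECTIONS * HEALTHCHECK_PERIOD
print(seconds_to_exhaustion)   # 500 -- well inside the 600 s TTL
```

Because 500 seconds is shorter than the 600-second TTL, all 100 connections are still alive when the pool runs dry.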

@kingdonb (Member) commented Aug 20, 2018

So do you need to change CONN_MAX_AGE? I'm not sure why.

It's clear at least that you probably don't really want 161 workers on a 24-CPU controller node, unless it's intended to serve a lot of devs concurrently... but I'm not sure how changing CONN_MAX_AGE will help anything.

You need to either reduce the number of workers or increase the number of allowed connections.

It makes sense to me that any workers that can't get a database handle can't help at all. Timing them out won't make things better (well, maybe that's wrong... even with 161 healthy workers, you probably wouldn't want 161 idle database connections held open at all times unless you actually had 161 workers actively serving client requests).

@kingdonb (Member) commented

I don't see any reason we can't add both of these to the controller params; I'm just not sure whether we should also add a configuration option for the number of connections allowed to the database while we're in there. (You might actually want to serve thousands of concurrent app developers, in which case the default value of GUNICORN_WORKERS may be exactly what you wanted.)

That's probably a separate issue, though.

@Cryptophobia (Member) commented

@Bregor So you are currently fixing this by setting ENV vars on the deis-controller deployment object? We should alternatively allow users to define GUNICORN_WORKERS and CONN_MAX_AGE in the helm values passed to the workflow install.

However, are the default values correct in most cases, when there are not that many users?

@Bregor (Contributor, Author) commented Aug 21, 2018

> you are currently setting using ENV vars on the deis-controller deployment object to fix this issue

Yes. The Deis database is a server with only one client (deis-controller), so this client should respect its settings. It should not be possible to run more gunicorn workers than max_connections allows in deis-database.

> However, are the default values set correct in most cases when there are not that many users?

TL;DR: for now, nodes cannot have more than 24 CPUs.

It's pretty easy to calculate:

  • healthcheck period: 5 seconds
  • TTL of a pooled connection: 600 seconds
  • actual limit of connections: 100
  • current number of workers: cpu_count * 4 + 1

This way, every 5 seconds a new worker wakes up and tries to run SELECT 0 against deis-database.
As long as no pooled connection is older than 600 seconds AND the pool (whose size equals GUNICORN_WORKERS) still has open slots, a new connection is opened and kept for 600 seconds.
Since one worker connects every 5 seconds, 120 workers (600/5) will have connected before the first connection in the pool is freed.
Given the worker-count formula and the fact that we cannot change the connection TTL, there is a strict limit of 24 CPUs per node for deis-controller to work: 24 * 4 + 1 = 97 workers fit within 100 connections, while 25 * 4 + 1 = 101 does not.
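The claimed 24-CPU ceiling can be checked directly with the same formula. All the constants come from the thread; the script is only a sanity check, not project code:

```python
# Sanity-check the 24-CPU limit: the worker count must stay at or below
# PostgreSQL's max_connections (hard-coded to 100 in deis-database).

def gunicorn_workers(cpus: int) -> int:
    """Worker formula from the thread: cpu_count * 4 + 1."""
    return cpus * 4 + 1

MAX_CONNECTIONS = 100
CONN_MAX_AGE = 600       # seconds
HEALTHCHECK_PERIOD = 5   # seconds

# Workers that connect before the first pooled connection is freed:
print(CONN_MAX_AGE // HEALTHCHECK_PERIOD)   # 120

# Largest CPU count whose worker count still fits within 100 connections:
limit = max(c for c in range(1, 128) if gunicorn_workers(c) <= MAX_CONNECTIONS)
print(limit, gunicorn_workers(limit))       # 24 CPUs -> 97 workers
print(gunicorn_workers(25))                 # 101 -- over the limit
```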

@kingdonb (Member) commented

Do you think this is an important knob? I kind of think we should just set a cap for the maximum allowed value of GUNICORN_WORKERS, unless we're also going to allow users to tune max_connections in deis-database.

Been reading this: https://github.com/concourse/concourse/wiki/Anti-Patterns#knobs

@kingdonb (Member) commented

I'd guess 64 as a reasonable high-end limit. If we're making it tuneable, it should be tuneable both ways.

@kingdonb (Member) commented

(Personally I use deis-database for something other than deis-controller so I don't want it to consume all 100 connections, if the limit is 100. I know that's against the advice, but I'm not sure it is documented anywhere as such...)

@Cryptophobia (Member) commented

> unless we're also going to allow users to tune max_connections in deis-database.

Doesn't max_connections in deis-database also depend on where you are running the PostgreSQL database? For example, you can tune the max_connections setting in AWS RDS and restart the db. This seems like a very custom request to support larger servers. I support making it tunable, but with sane defaults for users who are not going to dig that deep at first.

@Bregor (Contributor, Author) commented Aug 25, 2018

A tuneable max_connections is the cherry on the cake. Most of the time it would be absolutely useless. To paraphrase B. Gates: "100 connections ought to be enough for everyone" ;)
With short-lived pooled connections, 100 actual connections are plenty to serve hundreds of developers on a site.

@kingdonb (Member) commented Aug 25, 2018

I'm just trying to avoid adding an unnecessary knob, because it's possible to tune a knob wrong.

I'm not actually in favor of adding either knob if we can simply write a rule that guarantees it never fails on machines with >24 CPUs. That's a good point, though: we can't always guarantee max_connections will default to 100. From a quick Google, it actually looks like the default setting of max_connections on RDS Postgres is...

uhh, parameterized[1] based on the amount of memory in your database? Well, that's super... so my idea is out the window. There is no rule we could write that is guaranteed to be correct, at least not without additional knowledge about your off-cluster database, if that's the case.

I'll consider that due diligence done. I'm satisfied that we should add a way to tune GUNICORN_WORKERS and CONN_MAX_AGE, and skip any further consideration of max_connections for now, since nobody here is really asking for that.

@kingdonb (Member) commented

started in #75

@kingdonb (Member) commented

I haven't tested teamhephy/workflow#66 yet, but it should work
