
Speeding up scaling operations #58

Open
skuda opened this issue Oct 19, 2017 · 4 comments

skuda commented Oct 19, 2017

Hi,

This is not a bug, sorry I didn't find a better way to communicate this!

I have been using the autoscaler and it's working great, but it's somewhat slow: for us, adding new nodes takes approximately 10 minutes.

Maybe our use case is a bit special, but we usually have very light load that sometimes spikes very fast; the specific service I am speaking about acts as a precomputed cache.
If the cache is full, hits are very cheap; if the cache is purged, something that happens 2 or 3 times per week, the load skyrockets for about 2 to 3 hours.

I understand that creating the nodes, installing everything, and adding them to the cluster takes time, but I have been thinking: why not keep a specific number of pre-configured nodes around, only deallocated? It would be much faster to just bring existing servers back online than to destroy them and recreate them from scratch every time, no?

Best,
Miguel.
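To make the idea concrete, here is a minimal, hypothetical sketch (not part of the autoscaler) of what such a warm pool could look like. The `start_vm` and `deallocate_vm` callbacks are placeholders for whatever provider calls would actually be used (e.g. wrappers around `az vm start` / `az vm deallocate`); the names are illustrative only.

```python
class WarmPool:
    """Hypothetical warm pool: scale by starting/stopping
    pre-configured VMs instead of creating/deleting them."""

    def __init__(self, node_names, start_vm, deallocate_vm):
        self.offline = set(node_names)  # deallocated: no compute billing
        self.online = set()
        self._start = start_vm          # e.g. wraps `az vm start`
        self._stop = deallocate_vm      # e.g. wraps `az vm deallocate`

    def scale_up(self, count):
        """Bring up to `count` deallocated nodes online; return their names."""
        started = []
        for _ in range(min(count, len(self.offline))):
            name = self.offline.pop()
            self._start(name)  # starting is much faster than provisioning
            self.online.add(name)
            started.append(name)
        return started

    def scale_down(self, count):
        """Deallocate up to `count` online nodes; return their names."""
        stopped = []
        for _ in range(min(count, len(self.online))):
            name = self.online.pop()
            self._stop(name)
            self.offline.add(name)
            stopped.append(name)
        return stopped
```

The complexity wbuchwalter mentions below is real, though: the scaler would also have to keep the deallocated nodes' kubelet configuration and certificates valid while they are offline.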

wbuchwalter (Owner) commented Oct 20, 2017

Hi @skuda,

This is the best place to discuss this :)
Keeping deallocated nodes is not a bad idea, but I think it would be quite complex to implement correctly.
What would be an acceptable scaling time in your case?


oryagel commented Oct 20, 2017

We would like to see something like that as well, to reduce the starvation time. I was thinking of a different approach: just keep extra nodes alive. For example, if I configure extra-nodes=2, the autoscaler will always keep extra cores and memory matching the resources of two nodes.
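That headroom approach could be sketched as a simple target calculation (again hypothetical, not something the autoscaler does today): compute the nodes needed for current demand, then always add `extra_nodes` on top, clamped to the configured pool maximum. The function and parameter names below are made up for illustration.

```python
import math

def target_node_count(pending_cores, cores_per_node, extra_nodes, max_nodes):
    """Nodes needed for current demand, plus a fixed buffer of spare nodes.

    pending_cores:  total CPU cores requested by running + pending pods
    cores_per_node: usable cores on one agent node
    extra_nodes:    spare capacity to keep alive (e.g. extra-nodes=2)
    max_nodes:      hard cap on the agent pool size
    """
    needed = math.ceil(pending_cores / cores_per_node)
    return min(needed + extra_nodes, max_nodes)
```

For example, with 9 requested cores on 4-core nodes and extra-nodes=2, the target would be ceil(9/4) + 2 = 5 nodes, so a sudden spike can land on the two idle nodes while replacements are provisioned.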

skuda (Author) commented Oct 20, 2017

@wbuchwalter For me, 2 or even 3 minutes would be fast enough.

alexquintero commented
I'm not sure this is the fault of the autoscaler itself, but rather a function of how long it takes Azure to spin up a VM for your cluster. In my general tests it takes anywhere from 7 to 13 minutes to get a new VM in an Availability Set. This is in westus, by the way. I wonder if VM spin-up time differs by region?

I agree with @skuda that a shorter spin-up time of a few minutes would be ideal. I personally don't think this is possible without using VM Scale Sets. Maybe when those are supported by acs-engine we will get the shorter spin-up time.

Or... once we are able to use a stable ACI connector, or some other way of having a virtual kubelet with infinite capacity (serverless containers), there shouldn't be any VM spin-up time at all.
