
ClusterScaler.py Overview


BinPackedFit: Jobs are represented as objects with mem/core/disk/preemptive/wallTime attributes called jobShapes. Instances are likewise represented as objects with mem/core/disk/preemptive/wallTime attributes called nodeShapes. The nodeShapes are kept in a sorted list; each jobShape is compared against the first nodeShape in that list to see whether it fits, and if it doesn't, the next nodeShape is tried, and so on through the list until a match is found. The first instance type capable of running the jobShape is used. The nodeShape list is currently sorted by memory (which roughly correlates with price); it could be further optimized by sorting on price directly.
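
The first-fit search is simple enough to sketch. The snippet below is illustrative only: the Shape tuple and helper names are made up here rather than taken from clusterScaler.py, but the preemptive/resource check and the walk through the sorted list mirror the description above.

```python
# Minimal sketch of the first-fit idea; the Shape fields and helper names are
# hypothetical, not the actual Toil implementation.
from collections import namedtuple

Shape = namedtuple('Shape', ['wallTime', 'memory', 'cores', 'disk', 'preemptive'])

def fits(jobShape, nodeShape):
    # A job fits a node type if the preemptive flags match and the job's
    # resource requirements are within what the node offers.
    return (jobShape.preemptive == nodeShape.preemptive
            and jobShape.memory <= nodeShape.memory
            and jobShape.cores <= nodeShape.cores
            and jobShape.disk <= nodeShape.disk)

def chooseNodeShape(jobShape, nodeShapes):
    # nodeShapes is sorted (currently by memory, which roughly tracks price);
    # the first node type that can run the job wins.
    for nodeShape in nodeShapes:
        if fits(jobShape, nodeShape):
            return nodeShape
    return None  # no configured node type can run this job
```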

A jobShape "fits" a nodeShape if it first matches the preemptive requirement and then fits within the mem/core/disk resources. JobShapes are packed into a nodeShape until one of mem/core/disk becomes limiting and the node cannot hold any more. The packer then decides whether to keep packing in time, based on the autoscaler's targetTime and each jobShape's wallTime. targetTime roughly represents when queued jobs ought to have started by; each jobShape's wallTime roughly represents how long Toil thinks that job will take to run. If a jobShape's wallTime is lower than the targetTime, Toil thinks there is still enough time left on the node and will schedule another batch of jobs to fill up the rest of that targetTime block.
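
To make the targetTime/wallTime decision concrete, here is a rough sketch that stacks identical jobs onto a single node type, reusing the hypothetical Shape tuple from the previous snippet. The simplifications (identical jobs, greedy rows) are mine, not Toil's.

```python
def stackJobsOnNode(jobShapes, nodeShape, targetTime):
    """Count how many identical jobs can be stacked onto one node.

    Each "row" runs as many jobs side by side as the node's resources allow;
    a new row is only started while the wall time already reserved is below
    targetTime, so every stacked job is still expected to start in time.
    """
    job = jobShapes[0]
    perRow = min(nodeShape.memory // job.memory,
                 nodeShape.cores // job.cores,
                 nodeShape.disk // job.disk)
    if perRow == 0:
        return 0  # the job does not fit this node type at all
    stacked, remaining, reservedTime = 0, len(jobShapes), 0
    while remaining > 0 and reservedTime < targetTime:
        take = min(perRow, remaining)
        stacked += take
        remaining -= take
        reservedTime += job.wallTime
    return stacked
```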

In practice, lower targetTimes mean faster, more parallelized runs with more launched instances; higher targetTimes pack more jobs per instance and parallelize less.

NodeReservation Object: Represents a series of Shape objects, with each Shape object representing the resources used for a reserved portion of time on a given instance. This list is linear in time, so self.shape represents the resources of the current reservation, and self.nReservation points to the reservation that will be used next, for a different length of time. To walk the list, self.nReservation is followed recursively to fetch each reservation in order.
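
A toy version of that linked-list-of-time-slices idea is sketched below; the attribute names are modeled on the description above rather than copied from clusterScaler.py.

```python
class NodeReservation(object):
    """One time slice of reserved resources on a node, linked to the next slice."""

    def __init__(self, shape):
        self.shape = shape          # resources reserved during this time slice
        self.nReservation = None    # the next time slice, or None at the end

    def shapes(self):
        # Walk the list by following nReservation, yielding each slice in order,
        # which gives the projected resource usage at each point in time.
        reservation = self
        while reservation is not None:
            yield reservation.shape
            reservation = reservation.nReservation
```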

ClusterScaler: Manages automatically scaling the number of worker nodes. It is run as a Thread via the ScalerThread wrapper. Every config.scaleInterval it runs a "check" against the batch system and calls the provisioner to add or remove nodes as appropriate. It also ignores or unignores nodes, and generally keeps track of the queue, which jobs have completed, and what is currently running.
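
Very roughly, a single scaling "check" compares the node count the bin packer wants against what is currently provisioned and asks the provisioner to make up the difference. All of the method names in this sketch are invented for illustration; only the shape of the logic follows the description above.

```python
def check(scaler):
    queuedJobShapes = scaler.getQueuedJobShapes()        # hypothetical accessor
    wantedNodes = scaler.binPackJobs(queuedJobShapes)    # BinPackedFit estimate
    currentNodes = scaler.getCurrentNodeCount()          # hypothetical accessor
    if wantedNodes > currentNodes:
        scaler.provisioner.addNodes(wantedNodes - currentNodes)
    elif wantedNodes < currentNodes:
        scaler.provisioner.removeNodes(currentNodes - wantedNodes)
```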

ScalerThread: Wraps the ClusterScaler as a thread.
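
A minimal sketch of what wrapping the scaler in a thread looks like, re-running the check every scaleInterval seconds; again this is illustrative rather than the real ScalerThread.

```python
import threading

class ScalerThreadSketch(threading.Thread):
    def __init__(self, scaler, scaleInterval):
        super(ScalerThreadSketch, self).__init__()
        self.scaler = scaler
        self.scaleInterval = scaleInterval
        self.stopEvent = threading.Event()

    def run(self):
        # Periodically run the check sketched above until asked to stop.
        while not self.stopEvent.is_set():
            check(self.scaler)
            self.stopEvent.wait(self.scaleInterval)

    def shutdown(self):
        self.stopEvent.set()
        self.join()
```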

Notes!

  • Service jobs are one case where the degree of parallelization really, really matters. If you have multiple services, they may all need to be running simultaneously before any real work can be done. Previously, if your services had a fairly short runtime, the autoscaler would believe that it could pack them serially, one after the other. This would lead to service deadlocks in some situations.

  • Reservations are sort of like "time-slices" of resource usage at a node, so you can find the projected resource usage at each point in time.

  • A long-running job, which "extends" a node reservation for a very long time, could end up causing other jobs to be packed into its extended reservations, even though they may be projected to start several hours in the future. This ends up underprovisioning nodes if you have a few long-running jobs and many more short jobs. (I've noticed the old pre-multiple-node-types scaler underestimating the number of needed nodes with Cactus, though I don't know whether that's caused by this particular bug or not.)

  • When running services or other long-running jobs, the ignored nodes functionality could occasionally lead to deadlocks, where all N nodes were ignored even though the scaler wanted to scale up to N nodes. Now the scaler will "unignore" nodes if it changes its mind about scaling up or down.

  • This scaler works entirely on the basis of queued jobs, and the "alpha" parameter is a time. The packing attempts to spawn enough nodes that all the queued jobs will be started by that time. Intuitively this is just a parameter which controls the degree of parallelization while (hopefully) avoiding deadlocks where it can.

  • The scaler basically attempts to set the rate of job completion (assuming homogeneous jobs for simplicity) to n/a, n being the number of jobs and a being the alpha factor. But the scaler will run again in (by default) 60 seconds, and assuming some jobs have been completed, it will set the rate of job completion lower (because n is lower). The total number of jobs remaining, n, will decrease from the starting point N as roughly n = N/(t/a + 1). So the end result isn't the local goal that all jobs are started within alpha seconds; quite the contrary, they are almost guaranteed not to be unless they can all fit into one node. This isn't a new problem: the old scalers had the exact same behavior, but ironically it didn't really matter because they couldn't scale down very well. We should solve this properly at some point, but for now I recommend increasing the scaler interval.

  • The beta parameter's current behavior is less useful than it could be, and can sometimes lead to problematic under-provisioning. Right now it is basically used to ignore small fluctuations, which is fine. But the usual problem with the scaler isn't small fluctuations, it's fluctuations that are too large, too fast. Usually this happens when thousands of small jobs are dumped all at once, right after a series of long jobs, causing the autoscaler to freak out and hit its max; it then rapidly figures out the jobs are pretty short and adjusts to something smaller than the max within 5 minutes, wasting some money. So I'd prefer the inertia parameter to be something much more like a moving average, which could give the autoscaler a chance to adjust while still being able to hit the max in a reasonable time. (The moving average could still have some minor fluctuations, so maybe the existing beta parameter behavior would still be needed.) This was converted to an exponentially weighted moving average (see the sketch after this list). No idea if having betaInertia is actually a good idea or not.

  • It's a useful discussion to have, because the current algorithm is obviously sub-optimal. It's easily possible that betaInertia should be set to 0.0 by default, or just plain removed.
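
To show what the exponentially weighted moving average mentioned above does in practice, here is a small illustrative sketch. The function name is made up, and the exact formula Toil uses for betaInertia may differ; the point is just how the smoothing dampens a sudden spike in the requested node count.

```python
def smoothEstimate(previousEstimate, rawEstimate, betaInertia):
    # betaInertia near 1.0 means heavy smoothing (slow reactions);
    # betaInertia of 0.0 means no smoothing at all.
    return betaInertia * previousEstimate + (1.0 - betaInertia) * rawEstimate

# Example: a sudden spike from ~0 queued jobs to wanting 100 nodes.
estimate = 0.0
for step in range(5):
    estimate = smoothEstimate(estimate, 100.0, betaInertia=0.5)
    # 50.0, 75.0, 87.5, 93.75, 96.875: the scaler ramps toward the max
    # over several check intervals instead of jumping straight to it.
```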