Implement cross-shard consumption fairness #2294

Open
wants to merge 7 commits into
base: master

Commits on Jun 20, 2024

  1. fair_queue: Scale fixed-point factor

    The value is used to convert request tokens from a floating-point number to
    some "sane" integer. Currently, on a ~200k IOPS, ~2GB/s disk, it produces the
    following token values for requests of different sizes:
    
        512:    150k
        1024:   160k
        2048:   170k
        ...
        131072: 1.4M
    
    These values are pretty huge, and when accumulated even in a 64-bit counter
    they can overflow it on a timescale of months. Current code sort of accounts
    for that by checking for the overflow and resetting the counters, but in the
    future there will be a need to reset counters on different shards, and
    that's going to be problematic.
    
    This patch reduces the factor by 8 times, so that the costs are now:
    
        512:      19k
        1024:     20k
        2048:     21k
        ...
        131072:  170k
    
    That's much more friendly to accumulating counters (the overflow is now
    on a timescale of years, which is pretty comfortable). Reducing it even
    further is problematic, here's why.
    
    In order to provide cross-class fairness, the costs are divided by class
    shares for accumulation. Given a class of 1000 shares, a 512-byte
    request becomes indistinguishable from a 1k one with a smaller factor. That
    said, even with the new factor it's worth taking more care when dividing
    the cost by shares and using div-roundup math.
    
    Signed-off-by: Pavel Emelyanov <[email protected]>
    xemul committed Jun 20, 2024
    5991f91
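
The arithmetic behind the last two paragraphs of this commit message can be sketched as follows. This is a minimal illustration, not the patch's code: the helper names are hypothetical, the token values are the rounded figures quoted above, and the "8 times smaller again" factor is an assumed example of reducing the factor further. One property of rounding up that the sketch shows is that a small positive cost never collapses to zero after division.

```python
def cost_floor(tokens, shares):
    """Plain integer division: fractional cost is silently dropped."""
    return tokens // shares


def cost_roundup(tokens, shares):
    """Div-roundup: any positive remainder bumps the result up by one."""
    return (tokens + shares - 1) // shares


SHARES = 1000

# Rounded token values quoted in the commit message for the new factor.
tokens_512, tokens_1k = 19_000, 20_000

# With the new factor the two request sizes stay distinguishable.
print(cost_floor(tokens_512, SHARES), cost_floor(tokens_1k, SHARES))  # 19 20

# Assumed example: a factor another 8 times smaller.
small_512, small_1k = tokens_512 // 8, tokens_1k // 8  # 2375 and 2500
print(cost_floor(small_512, SHARES), cost_floor(small_1k, SHARES))    # 2 2

# Rounding up keeps a tiny positive cost from being accounted as zero.
print(cost_floor(500, SHARES), cost_roundup(500, SHARES))             # 0 1
```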
  2. fair_queue tests: Remember it is time-based

    Current tests on fair queue try to make the queue submit requests in
    an extremely controlled way -- one by one. However, the fair queue
    nowadays is driven by a rated token bucket and is very sensitive to time
    and durations. It's better to teach the test to accept the fact that it
    cannot control fair-queue request submission at per-request
    granularity and to tune its accounting instead.
    
    The change affects two places.
    
    Main loop. Before the change it called fair_queue::dispatch_requests()
    as many times as the number of requests the test case wants to pass, then
    performed the necessary checks. Now the method is called in an unbounded
    loop, and the handling only processes the requested number of requests. The
    rest is ignored.
    
    Drain. Before the change it called dispatch_requests() in a loop for as
    long as it returned anything. Now it's called in a loop until the fair
    queue explicitly reports that it's empty.
    
    Signed-off-by: Pavel Emelyanov <[email protected]>
    xemul committed Jun 20, 2024
    67c0eb1
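
The two test changes described in this commit can be illustrated with a toy model. Everything below is a hypothetical stand-in written for this page (ToyQueue, run_and_check and drain are not Seastar names), and time is simulated by replenishing a fixed number of tokens on every poll. The point is only that a single poll may dispatch zero, one or several requests, so the test loops until it has seen enough and drains until the queue says it is empty.

```python
from collections import deque


class ToyQueue:
    """Hypothetical rate-limited queue; not the Seastar fair_queue API."""

    def __init__(self, tokens_per_poll):
        self._tokens_per_poll = tokens_per_poll
        self._tokens = 0
        self._pending = deque()

    def queue(self, request_id, cost):
        self._pending.append((request_id, cost))

    def empty(self):
        return not self._pending

    def dispatch_requests(self):
        """One poll: replenish tokens, then dispatch whatever fits."""
        self._tokens += self._tokens_per_poll
        dispatched = []
        while self._pending and self._pending[0][1] <= self._tokens:
            request_id, cost = self._pending.popleft()
            self._tokens -= cost
            dispatched.append(request_id)
        return dispatched


def run_and_check(q, wanted):
    """Poll without a fixed iteration count; keep only the first `wanted`."""
    seen = []
    while len(seen) < wanted and not q.empty():
        seen.extend(q.dispatch_requests())
    return seen[:wanted]  # anything dispatched beyond `wanted` is ignored


def drain(q):
    """Poll until the queue explicitly reports that it is empty."""
    polls = 0
    while not q.empty():
        q.dispatch_requests()
        polls += 1
    return polls


q = ToyQueue(tokens_per_poll=3)
for request_id in range(10):
    q.queue(request_id, cost=2)

checked = run_and_check(q, wanted=2)  # the second poll dispatches two requests
polls_to_drain = drain(q)
```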
  3. fair_queue: Define signed_capacity_t type in fair_group

    For convenience
    
    Signed-off-by: Pavel Emelyanov <[email protected]>
    xemul committed Jun 20, 2024
    ecedc02
  4. fair_queue: Introduce group-wide capacity balancing

    On each shard, classes compete with each other by accumulating the sum of
    request costs that have been dispatched from them so far. Cost is the
    request capacity divided by the class shares. The dispatch loop then selects
    the class with the smallest accumulated value, thus providing
    shares-aware fairness -- the larger the shares value is, the slower the
    accumulator grows, and the more requests are picked from the class for
    dispatch.
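
The per-shard class selection described in the paragraph above can be sketched with a small heap-based model. It is an illustration under stated assumptions, not the fair_queue code: dispatch_order is a hypothetical helper, every class is assumed to always have a request queued, and all requests of a class have the same capacity.

```python
import heapq


def dispatch_order(classes, n):
    """classes maps a class name to (shares, request_capacity).

    Returns the class names of the first n dispatched requests.
    """
    # Heap entries are (accumulated cost, name); ties are broken by name.
    heap = [(0, name) for name in sorted(classes)]
    heapq.heapify(heap)
    order = []
    for _ in range(n):
        accumulated, name = heapq.heappop(heap)  # smallest accumulator wins
        shares, capacity = classes[name]
        cost = (capacity + shares - 1) // shares  # capacity / shares, rounded up
        heapq.heappush(heap, (accumulated + cost, name))
        order.append(name)
    return order


# Class "b" has twice the shares of "a", so its accumulator grows half as fast.
order = dispatch_order({"a": (100, 20_000), "b": (200, 20_000)}, n=300)
print(order.count("a"), order.count("b"))  # 100 200
```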
    
    This patch implements a similar approach across shards. For that, each
    shard accumulates the dispatched cost from all classes. The IO group keeps
    track of a vector with the accumulated cost of each shard. When a shard
    wants to dispatch, it first checks whether it has run too far ahead of all
    other shards, and if it has, it skips the dispatch loop.
    
    Corner case -- when a queue gets drained, it "withdraws" itself from
    other shards' decisions by advancing its group counter to infinity.
    Correspondingly, when the queue comes back it may forward its accumulator
    so as not to get too large an advantage over other shards.
    
    When scheduling classes, a shard has exclusive access to them and uses a
    heap with logarithmic complexity to pick the one with the smallest
    consumption counter. Cross-shard balancing cannot afford that. Instead,
    each shard manipulates its own counter only, and to compare it with other
    shards' counters it scans the whole vector, which is not very
    cache-friendly and is race-prone.
    
    Signed-off-by: Pavel Emelyanov <[email protected]>
    xemul committed Jun 20, 2024
    bd90f9f
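
A single-threaded toy model of the group-wide balancing described in this commit might look as follows. All names (ToyGroup, max_lead, simulate) are hypothetical, and several details are assumptions made for the sketch: "too far ahead" is measured against the smallest counter in the vector, a returning shard forwards its accumulator to that smallest value, and there is no concurrency, so the races mentioned in the commit message do not show up here.

```python
import math


class ToyGroup:
    """Hypothetical model of the per-shard accumulator vector."""

    def __init__(self, nr_shards, max_lead):
        self.local = [0] * nr_shards             # each shard's own accumulator
        self.balance = [math.inf] * nr_shards    # inf means "withdrawn"
        self.max_lead = max_lead                 # assumed tolerance

    def activate(self, shard):
        """A shard with new work forwards its accumulator if it lags behind."""
        active = [b for b in self.balance if b != math.inf]
        if active:
            self.local[shard] = max(self.local[shard], min(active))
        self.balance[shard] = self.local[shard]

    def withdraw(self, shard):
        """A drained shard stops affecting other shards' decisions."""
        self.balance[shard] = math.inf

    def may_dispatch(self, shard):
        """Scan the whole vector; only valid for an active shard."""
        return self.balance[shard] - min(self.balance) <= self.max_lead

    def account(self, shard, cost):
        self.local[shard] += cost
        self.balance[shard] = self.local[shard]


def simulate(costs, rounds, max_lead):
    """Each round every shard tries to dispatch one request of its own cost."""
    group = ToyGroup(len(costs), max_lead)
    for shard in range(len(costs)):
        group.activate(shard)
    dispatched = [0] * len(costs)
    for _ in range(rounds):
        for shard, cost in enumerate(costs):
            if group.may_dispatch(shard):
                group.account(shard, cost)
                dispatched[shard] += 1
    return dispatched


# Shard 1 pays half the cost per request (twice the shares), so it gets
# twice the dispatches.
print(simulate([200, 100], rounds=200, max_lead=0))  # [100, 200]
```

In this model a shard that was idle while another one accumulated 1000 units joins at 1000 rather than at its stale value, which is one way to read the "forward its accumulator" remark.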
  5. fair_queue: Drop per-dispatch-loop threshold

    The value is used to limit the number of requests one shard is
    allowed to dispatch in one poll. This is to prevent it from consuming
    the whole capacity in one go and to let other shards get their portion.
    
    Group-wide balancing (previous patch) made this fuse obsolete.
    
    Signed-off-by: Pavel Emelyanov <[email protected]>
    xemul committed Jun 20, 2024
    17309f3
  6. fair_queue: Amortize cross-shard balance checking

    Looking at group balance counters is not very lightweight and is best
    avoided when possible. For that, when balance is achieved, arm a
    timer for a quiescent period and check again only after it expires. When
    the group is not balanced, check balance more frequently.
    
    Signed-off-by: Pavel Emelyanov <[email protected]>
    xemul committed Jun 20, 2024
    2d444d0
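
The amortization idea in this commit can be sketched as a cached verdict with a deadline. The class below is a hypothetical illustration: the names, the two period parameters and the use of a plain timestamp instead of a real timer are all assumptions, and the expensive vector scan is represented by an arbitrary callable.

```python
class BalanceChecker:
    """Hypothetical wrapper that rate-limits an expensive balance scan."""

    def __init__(self, is_balanced, quiescent_period, retry_period):
        self._is_balanced = is_balanced    # callable doing the expensive scan
        self._quiescent = quiescent_period  # wait this long while balanced
        self._retry = retry_period          # shorter wait while unbalanced
        self._next_check = 0
        self._balanced = True
        self.scans = 0                      # how many real scans were done

    def may_dispatch(self, now):
        """Return the cached verdict, rescanning only after the deadline."""
        if now >= self._next_check:
            self.scans += 1
            self._balanced = self._is_balanced()
            period = self._quiescent if self._balanced else self._retry
            self._next_check = now + period
        return self._balanced


# Balanced group: 100 polls at times 0..99 trigger only 10 scans.
calm = BalanceChecker(lambda: True, quiescent_period=10, retry_period=1)
for now in range(100):
    calm.may_dispatch(now)
print(calm.scans)  # 10

# Unbalanced group: the same polls rescan every time unit.
busy = BalanceChecker(lambda: False, quiescent_period=10, retry_period=1)
for now in range(100):
    busy.may_dispatch(now)
print(busy.scans)  # 100
```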
  7. test: Add manual test for cross-shard balancing

    It's pretty long, so not for automatic execution.
    
    2-shards tests:
    
    {'shard_0': {'iops': 88204.3828, 'shares': 100}, 'shard_1': {'iops': 89686.5156, 'shares': 100}}
    IOPS ratio 1.02, expected 1.0, deviation 1%
    
    {'shard_0': {'iops': 60321.3125, 'shares': 100}, 'shard_1': {'iops': 117566.406, 'shares': 200}}
    IOPS ratio 1.95, expected 2.0, deviation 2%
    
    {'shard_0': {'iops': 37326.2422, 'shares': 100}, 'shard_1': {'iops': 140555.062, 'shares': 400}}
    IOPS ratio 3.77, expected 4.0, deviation 5%
    
    {'shard_0': {'iops': 21547.6152, 'shares': 100}, 'shard_1': {'iops': 156309.891, 'shares': 800}}
    IOPS ratio 7.25, expected 8.0, deviation 9%
    
    3-shards tests:
    
    {'shard_0': {'iops': 45211.9336, 'shares': 100}, 'shard_1': {'iops': 45211.9766, 'shares': 100}, 'shard_2': {'iops': 87412.9453, 'shares': 200}}
    shard-1 IOPS ratio 1.0, expected 1.0, deviation 0%
    shard-2 IOPS ratio 1.93, expected 2.0, deviation 3%
    
    {'shard_0': {'iops': 30992.2188, 'shares': 100}, 'shard_1': {'iops': 30992.2812, 'shares': 100}, 'shard_2': {'iops': 115887.609, 'shares': 400}}
    shard-1 IOPS ratio 1.0, expected 1.0, deviation 0%
    shard-2 IOPS ratio 3.74, expected 4.0, deviation 6%
    
    {'shard_0': {'iops': 19279.6348, 'shares': 100}, 'shard_1': {'iops': 19279.6934, 'shares': 100}, 'shard_2': {'iops': 139316.828, 'shares': 800}}
    shard-1 IOPS ratio 1.0, expected 1.0, deviation 0%
    shard-2 IOPS ratio 7.23, expected 8.0, deviation 9%
    
    {'shard_0': {'iops': 26505.9082, 'shares': 100}, 'shard_1': {'iops': 53011.9922, 'shares': 200}, 'shard_2': {'iops': 98369.4453, 'shares': 400}}
    shard-1 IOPS ratio 2.0, expected 2.0, deviation 0%
    shard-2 IOPS ratio 3.71, expected 4.0, deviation 7%
    
    {'shard_0': {'iops': 17461.8145, 'shares': 100}, 'shard_1': {'iops': 34923.8438, 'shares': 200}, 'shard_2': {'iops': 125470.43, 'shares': 800}}
    shard-1 IOPS ratio 2.0, expected 2.0, deviation 0%
    shard-2 IOPS ratio 7.19, expected 8.0, deviation 10%
    
    {'shard_0': {'iops': 14812.3037, 'shares': 100}, 'shard_1': {'iops': 58262, 'shares': 400}, 'shard_2': {'iops': 104794.633, 'shares': 800}}
    shard-1 IOPS ratio 3.93, expected 4.0, deviation 1%
    shard-2 IOPS ratio 7.07, expected 8.0, deviation 11%
    
    Signed-off-by: Pavel Emelyanov <[email protected]>
    xemul committed Jun 20, 2024
    3ef9817
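
The "IOPS ratio / expected / deviation" lines in the last commit message can be recomputed from the quoted dictionaries roughly as follows. This is a sketch, not the test's own script: compare_to_baseline is a hypothetical helper, shard_0 is assumed to be the baseline, and the deviation is truncated to a whole percent only because that matches the numbers quoted above.

```python
def compare_to_baseline(result, baseline="shard_0"):
    """Return {shard: (iops_ratio, expected_ratio, deviation_percent)}."""
    base = result[baseline]
    report = {}
    for name, stats in result.items():
        if name == baseline:
            continue
        ratio = stats["iops"] / base["iops"]
        expected = stats["shares"] / base["shares"]
        # Assumption: percent deviation is truncated, not rounded.
        deviation_pct = int(abs(ratio - expected) / expected * 100)
        report[name] = (round(ratio, 2), expected, deviation_pct)
    return report


# Second 2-shard result quoted in the commit message.
result = {
    "shard_0": {"iops": 60321.3125, "shares": 100},
    "shard_1": {"iops": 117566.406, "shares": 200},
}
print(compare_to_baseline(result))  # {'shard_1': (1.95, 2.0, 2)}
```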