Rewrite job pool to support nested parallelism for improved scalability #339

Open · wants to merge 3 commits into master

Commits on Jan 2, 2023

  1. Rewrite job pool to support nested parallelism for improved scalability

    Before this change, given N images to encode, the caller had two
    options:
    
    - Use basis_compress for each image; internally this API would use multiple
      threads (when possible) to process large images
    - Use basis_parallel_compress for the group of images; internally this
      API would encode each image on one thread
    
    Neither of these would be optimal. For example, given 4 4K images and a
    16-core system, basis_parallel_compress is suboptimal because it would
    only use 4 out of 16 cores for encoding; basis_compress would be better,
    but it would still be suboptimal because encoding of a large image
    doesn't scale perfectly, and there would be gaps of inactivity where only
    one core does the work (notably, decoding the image from PNG inputs or
    rescaling to compute mips). Also, even for one image, scaling efficiency
    would decrease for the smaller mips.
    
    To fully saturate the system, you could use basis_compress and call it
    from multiple threads. This solves the problem of having enough
    concurrent work, but results in overhead due to context switches and
    thread preemption. It's also difficult to size the two pools required in
    this case (outer and inner) with respect to how many threads each gets -
    especially if the goal is to reach a given level of total parallelism
    that is less than what the system provides (e.g. -j16 on a 24-thread
    system) to avoid system-wide performance issues due to oversubscription.
    
    For this to work well, we ideally need basis_parallel_compress to use a
    single job pool - both for "outer" processing (handling the flow of
    image decoding) and for "inner" processing (threading the work required
    for one image). This gets us the best of both worlds - a predictable
    total system fanout limited by the job pool capacity, and full
    saturation of said pool.
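
    To illustrate the intended structure, here is a hypothetical usage
    sketch (not code from this change - image, encode_block and num_blocks
    are placeholder names, and the job_pool/job_token calls anticipate the
    token API sketched further below): every image becomes an "outer" job in
    the shared pool, and each outer job fans out "inner" jobs into the same
    pool, waiting only on its own per-image token.

    ```cpp
    #include <cstddef>
    #include <vector>

    // Placeholder declarations for illustration only.
    struct image { size_t num_blocks() const; };
    void encode_block(const image& img, size_t block);

    void encode_all(job_pool& pool, const std::vector<image>& images)
    {
        job_token all_images;

        for (const image& img : images)
        {
            pool.add_job([&pool, &img]
            {
                job_token this_image; // counts only this image's inner jobs

                for (size_t block = 0; block < img.num_blocks(); ++block)
                    pool.add_job([&img, block] { encode_block(img, block); }, &this_image);

                // Wait for (and help run) only the jobs tagged with this_image,
                // so outer jobs don't steal each other's work while waiting.
                pool.wait_for_all(&this_image);
            }, &all_images);
        }

        pool.wait_for_all(&all_images);
    }
    ```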
    
    To make this work, we need to change the mechanics of job dispatch.
    Notably, when we wait for jobs to finish, it's important that the wait
    only interacts with the jobs launched for the current image - both in
    terms of which jobs we wait for and which jobs we help process. Without
    this, there would be cross-talk between individual "outer" jobs, which
    can result in poor scalability.
    
    To solve this, the job_pool API gets augmented with optional tokens -
    these are just outstanding-job counters, and a pointer to the counter is
    saved in the job struct, which allows us to selectively steal only the
    jobs relevant to a given token. The token is kept optional in case
    simpler uses of this API are useful elsewhere in the project, although
    all image encoding really should use tokens to work with nested
    parallelism.
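
    A minimal sketch of what this could look like (the names and layout
    here are assumptions for illustration, not the actual code in this
    change): a token is just an outstanding-job counter, and each queued
    job remembers which token it belongs to.

    ```cpp
    #include <condition_variable>
    #include <cstdint>
    #include <deque>
    #include <functional>
    #include <mutex>

    struct job_token
    {
        uint64_t num_outstanding = 0; // guarded by the pool mutex, not atomic
    };

    class job_pool
    {
    public:
        // The token is optional; pass nullptr for simple untracked jobs.
        void add_job(std::function<void()> func, job_token* token = nullptr)
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            if (token)
                token->num_outstanding++;
            m_queue.push_back({ std::move(func), token });
            m_has_work.notify_one();
        }

        void wait_for_all(job_token* token = nullptr); // see sketch below

    private:
        struct job
        {
            std::function<void()> func;
            job_token* token; // null for untracked jobs
        };

        std::mutex m_mutex;
        std::condition_variable m_has_work; // wakes idle worker threads
        std::condition_variable m_job_done; // wakes waiters when a job completes
        std::deque<job> m_queue;
    };
    ```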
    
    Note that all operations on tokens are protected by the mutex, so we
    don't need to use atomics. In fact, the use of atomics in the old code
    contained a bug that led to a rare deadlock on shutdown, see
    https://cohost.org/zeux/post/520125-condition-variables.
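
    Continuing the same sketch (again an assumption-level illustration, not
    the actual implementation), the token-aware wait could look roughly like
    this - the waiter helps run matching jobs instead of blocking, and all
    token bookkeeping happens strictly under the pool mutex, so the final
    decrement and the notify can't race the waiter's check:

    ```cpp
    #include <algorithm>

    void job_pool::wait_for_all(job_token* token)
    {
        std::unique_lock<std::mutex> lock(m_mutex);

        for (;;)
        {
            // Steal only jobs tagged with our token (or any job when untracked).
            auto it = std::find_if(m_queue.begin(), m_queue.end(),
                [&](const job& j) { return !token || j.token == token; });

            if (it != m_queue.end())
            {
                job j = std::move(*it);
                m_queue.erase(it);

                lock.unlock();
                j.func(); // run the stolen job without holding the lock
                lock.lock();

                if (j.token)
                {
                    j.token->num_outstanding--;
                    m_job_done.notify_all();
                }
                continue;
            }

            // No matching queued work left; we're done once all tracked jobs
            // have finished. (Worker threads do the same decrement-and-notify
            // when they complete a tracked job; the worker loop and the
            // untracked wait are simplified away in this sketch.)
            if (!token || token->num_outstanding == 0)
                break;

            m_job_done.wait(lock); // a worker finished one of our jobs
        }
    }
    ```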
    
    This code has been extensively tested as part of gltfpack. Performance
    numbers on an example glTF model with 12 2K textures using ETC1S
    encoding on AMD Ryzen 5900X (12-core/24-thread system):
    
    - basis_compress one image at a time, no threading: 44s
    - basis_compress one image at a time: 12.5s
    - basis_parallel_compress before this change: 9.0s
    - basis_parallel_compress after this change: 4.2s
    
    (This change was tested on a variety of models; it should never degrade
    performance and almost always results in an improvement - the above is
    just an example. Notably, even though all textures there are the same
    size, they don't take the same amount of time to compress, so even with
    12 serially compressed textures, which is what basis_parallel_compress
    did by default, the new code yields a very significant improvement in
    overall time.)
    zeux committed Jan 2, 2023 · 0f80a2d

Commits on Oct 6, 2024

  1. 853b4a3
  2. Add &token usage for HDR encoding

    This does not matter for gltfpack, as it doesn't support HDR, but it's
    valuable for consistency.
    zeux committed Oct 6, 2024 · 4f7d849