Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite job pool to support nested parallelism for improved scalability #339

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

zeux
Copy link
Contributor

@zeux zeux commented Jan 2, 2023

Before this change, given N images to encode, the caller had two options:

  • Use basis_compress for each image; internally this API would use multiple threads (when possible) to process large images on multiple threads
  • Use basis_parallel_compress for the group of images; internally this API would encode each image on one thread

Neither of these would be optimal. For example, given 4 4K images and a 16-core system, basis_parallel_compress is suboptimal because it would only use 4 out of 16 cores for encoding; basis_compress would be better, but it still would be suboptimal because encoding of a large image doesn't scale perfectly and there would be gaps of inactivity where only one core would do the work (notably, decoding the image from PNG inputs or rescaling to compute mips). Also, scaling would decrease in efficiency for smaller mips even for one image.

To fully saturate the system, you could use basis_compress and call it from multiple threads. This solves the problem of having enough concurrent work, but results in overhead due to context switches/thread preemption. It's difficult to size the two pools required in this case (outer and inner) wrt how many threads each has - especially if the goal is to reach a given level of total parallelism that is less than the system provides (e.g. -j16 on a 24-thread system) to avoid system wide performance issues due to oversubscription.

For this to work well we ideally need to have basis_parallel_compress use a single job pool - both for "outer" processing, handling the flow of image decoding, and for "inner" processing (threading the work required for one image). This would get us the best of both worlds - a predictable total system fanout limited by the job pool capacity, and a full saturation of the said pool.

For this to work well, we need to change the mechanics of job dispatch. Notably, when we wait for the jobs to finish, it's important that this only interacts with the jobs launched for the current image - both in terms of which jobs we wait for, and which jobs we help process. Without this there's going to be cross-talk between individual "outer" jobs which can result in poor scalability.

To solve this, the job_pool API gets augmented with optional tokens - these are just outstanding job counters, and the pointer to the counter is saved in the job struct, which allows us to selectively steal only jobs relevant for a given token. The token is kept optional just in case simpler uses of this API are interesting elsewhere in this project, although all image encoding really should use tokens to work with nested parallelism - forgetting it in the nested calls may lead to deadlocks as two running jobs would wait for each other with no forward progress.

Note that all operations on tokens are protected by the mutex so we don't need to use atomics. In fact, the use of atomics in the old code contained a bug that lead to a rare deadlock on shutdown, see https://cohost.org/zeux/post/520125-condition-variables.

This code has been extensively tested as part of gltfpack. Performance numbers on an example glTF model with 12 2K textures using ETC1S encoding on AMD Ryzen 5900X (12-core/24-thread system):

  • basis_compress one image at a time, no threading: 44s
  • basis_compress one image at a time: 12.5s
  • basis_parallel_compress before this change: 9.0s
  • basis_parallel_compress after this change: 4.2s

(this change was tested on a variety of models and should never degrade performance and almost always result in an improvement, the above is just an example - notably, even though all textures there are the same size, they don't take the same amount of time to compress, so even with 12 serially compressed textures which is what basis_parallel_compress did by default, we have a very significant improvement in overall time with the new code)

Before this change, given N images to encode, the caller had two
options:

- Use basis_compress for each image; internally this API would use multiple
  threads (when possible) to process large images on multiple threads
- Use basis_parallel_compress for the group of images; internally this
  API would encode each image on one thread

Neither of these would be optimal. For example, given 4 4K images and a
16-core system, basis_parallel_compress is suboptimal because it would
only use 4 out of 16 cores for encoding; basis_compress would be better,
but it still would be suboptimal because encoding of a large image
doesn't scale perfectly and there would be gaps of inactivity where only
one core would do the work (notably, decoding the image from PNG inputs
or rescaling to compute mips). Also, scaling would decrease in
efficiency for smaller mips even for one image.

To fully saturate the system, you could use basis_compress and call it
from multiple threads. This solves the problem of having enough
concurrent work, but results in overhead due to context switches/thread
preemption. It's difficult to size the two pools required in this
case (outer and inner) wrt how many threads each has - especially if the
goal is to reach a given level of total parallelism that is less than
the system provides (e.g. -j16 on a 24-thread system) to avoid system
wide performance issues due to oversubscription.

For this to work well we ideally need to have basis_parallel_compress
use a single job pool - both for "outer" processing, handling the flow
of image decoding, and for "inner" processing (threading the work
required for one image). This would get us the best of both worlds - a
predictable total system fanout limited by the job pool capacity, and a
full saturation of the said pool.

For this to work well, we need to change the mechanics of job dispatch.
Notably, when we wait for the jobs to finish, it's important that this
only interacts with the jobs launched for the current image - both in
terms of which jobs we wait for, and which jobs we help process. Without
this there's going to be cross-talk between individual "outer" jobs
which can result in poor scalability.

To solve this, the job_pool API gets augmented with optional tokens -
these are just outstanding job counters, and the pointer to the counter
is saved in the job struct, which allows us to selectively steal only
jobs relevant for a given token. The token is kept optional just in case
simpler uses of this API are interesting elsewhere in this project,
although all image encoding really should use tokens to work with nested
parallelism.

Note that all operations on tokens are protected by the mutex so we
don't need to use atomics. In fact, the use of atomics in the old code
contained a bug that lead to a rare deadlock on shutdown, see
https://cohost.org/zeux/post/520125-condition-variables.

This code has been extensively tested as part of gltfpack. Performance
numbers on an example glTF model with 12 2K textures using ETC1S
encoding on AMD Ryzen 5900X (12-core/24-thread system):

- basis_compress one image at a time, no threading: 44s
- basis_compress one image at a time: 12.5s
- basis_parallel_compress before this change: 9.0s
- basis_parallel_compress after this change: 4.2s

(this change was tested on a variety of models and should never degrade
performance and almost always result in an improvement, the above is
just an example - notably, even though all textures there are the same
size, they don't take the same amount of time to compress, so even with
12 serially compressed textures which is what basis_parallel_compress
did by default, we have a very significant improvement in overall time
with the new code)
This does not matter for gltfpack as we don't support HDR but it's
valuable for consistency.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant