-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Release GIL when doing the compute-intensive CRC computation #47
Conversation
Reviewer's Guide by SourceryThis pull request optimizes the CRC computation by releasing the Global Interpreter Lock (GIL) during the compute-intensive part of the operation. This change allows for better multi-threading performance, significantly reducing the computation time when using multiple threads. File-Level Changes
Tips
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @jonded94 - I've reviewed your changes and they look great!
Here's what I looked at during the review
- 🟡 General issues: 1 issue found
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟢 Complexity: all looks good
- 🟢 Documentation: all looks good
Help me be more useful! Please click 👍 or 👎 on each comment to tell me if it was helpful.
_crc32c.c
Outdated
Py_BEGIN_ALLOW_THREADS | ||
crc ^= 0xffffffff; | ||
result = crc_fn(crc, bin_data, pbin.len); | ||
result ^= 0xffffffff; | ||
Py_END_ALLOW_THREADS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (performance): Consider profiling the GIL release impact on performance
While releasing the GIL can improve performance for CPU-bound operations, it's important to verify that this change actually provides a measurable benefit in real-world scenarios. Consider profiling the function with and without these changes to ensure the overhead of releasing and reacquiring the GIL doesn't outweigh the potential benefits, especially if the CRC calculation is typically fast.
#ifdef PROFILE_GIL_IMPACT
Py_BEGIN_ALLOW_THREADS
crc ^= 0xffffffff;
result = crc_fn(crc, bin_data, pbin.len);
result ^= 0xffffffff;
Py_END_ALLOW_THREADS
#else
crc ^= 0xffffffff;
result = crc_fn(crc, bin_data, pbin.len);
result ^= 0xffffffff;
#endif
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually benchmarked it against a ~1GB byte buffer and there releasing the GIL gave me an ~6x improvement of performance already on my developer machine.
On an even more multi-threaded compute server, the impact surely is even larger.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for sharing your benchmark results. It's great to hear that you've already observed a significant performance improvement on your developer machine. Given your findings, it seems that releasing the GIL is indeed beneficial for this use case.
However, it might still be valuable to include a profiling option in the codebase, as it can help future developers understand the impact of this change in different environments and workloads. This can be particularly useful for maintaining and optimizing the library over time.
Would you be open to adding a compile-time option (like the PROFILE_GIL_IMPACT
macro) to enable or disable the GIL release? This way, users can easily benchmark and profile the impact in their specific scenarios without modifying the source code.
e881500
to
2e8b05a
Compare
@jonded94 thanks for this contribution. It's been a long time since I wrote the package originally, and after so long I had assumed we were releasing the GIL already. I agree with the change in principle; as you point out it unnecessary hinders multithreaded computation. Having said that, I'd like to have a look at the performance penalty as a function of input buffer size. Surely there's a cost associated with releasing/acquiring the GIL, which might affect performance in single-threaded cases, specially with small buffer sizes. I seem to remember I'll run some benchmarks locally and post results here. But again, thanks again for the contribution, most definitely a version of these changes will make it in. |
@jonded94 thanks again for taking interest in this issue, and providing the patch. I wrote the following benchmark to measure the overhead of releasing the GIL as a function of input buffer size, both for the software and hardware modes. import time
import seaborn as sns
from pandas import DataFrame
from matplotlib import pyplot as plt
from crc32c import crc32c
def _run(buf, n_iter: int, release_gil: bool, sw_mode: bool) -> float:
start = time.monotonic()
[crc32c(buf, 0, release_gil, sw_mode) for _ in range(n_iter)]
return time.monotonic() - start
def _run_with_bufsizes(n_iter: int) -> DataFrame:
bufsizes = (2 ** n for n in range(1, 20))
buffers = [b'0' * size for size in bufsizes]
return DataFrame.from_records(
[(len(buffer), _run(buffer, n_iter, False, sw_mode), _run(buffer, n_iter, True, sw_mode), sw_mode) for buffer in buffers for sw_mode in (False, True)],
columns=("bufsize", "time_no_release", "time_release", "sw_mode")
)
n_iter = 10000
df = _run_with_bufsizes(n_iter)
df["release_overhead"] = (df["time_release"] / df["time_no_release"]) - 1
sns.set_theme()
sns.catplot(df, x="bufsize", y="release_overhead", hue="sw_mode", kind="bar")
plt.show() You'll note that the --- a/_crc32c.c
+++ b/_crc32c.c
@@ -41,6 +41,8 @@ PyObject* crc32c_crc32c(PyObject *self, PyObject *args) {
Py_buffer pbin;
unsigned char *bin_data = NULL;
uint32_t crc = 0U, result;
+ int release_gil = 0;
+ int sw_mode = 0;
/* In python 3 we accept only bytes-like objects */
const char *format =
@@ -49,17 +51,26 @@ PyObject* crc32c_crc32c(PyObject *self, PyObject *args) {
#else
"s*"
#endif
- "|I:crc32";
+ "|Ipp:crc32c";
- if (!PyArg_ParseTuple(args, format, &pbin, &crc) )
+ if (!PyArg_ParseTuple(args, format, &pbin, &crc, &release_gil, &sw_mode) )
return NULL;
+ crc_function the_crc_fn = (sw_mode ? _crc32c_sw_slicing_by_8 : crc_fn);
+
bin_data = pbin.buf;
+ if (release_gil) {
Py_BEGIN_ALLOW_THREADS
crc ^= 0xffffffff;
- result = crc_fn(crc, bin_data, pbin.len);
+ result = the_crc_fn(crc, bin_data, pbin.len);
result ^= 0xffffffff;
Py_END_ALLOW_THREADS
+ }
+ else {
+ crc ^= 0xffffffff;
+ result = the_crc_fn(crc, bin_data, pbin.len);
+ result ^= 0xffffffff;
+ }
PyBuffer_Release(&pbin);
return PyLong_FromUnsignedLong(result); These are the results on my system (AMD Ryzen 7 5825U, CPython 3.12): As expected, the penalty is greater at smaller buffer sizes, and for the HW mode. Very roughly, and of course very specifically to my system, the penalty in HW mode decreases to ~2% at 32KB, and ~1% at 128KB. For SW those happen at somewhere around ~4KB and at ~8KB. Also, as I remembered, the In summary, I think we should restrict the releasing of the GIL to a minimum buffer size. I'm a bit torn on the actual value though: it should be small enough that it covers most user cases, but big enough that the overhead of releasing the GIL isn't too big. A buffer size that I think would be suitable would be similar to that used for file or socket reading operations, since the data fed into this package comes likely from those. These buffer sizes are usually in the single or double-digit KBs. And given the numbers I got in my system, I'm inclined to make the cut at 32 KB. @jonded94 would you want to put that change in? If we also wanted more flexibility, we could offer a module-level function that altered this limit, or even a flag like in the diff above that can be used on a per-call basis. |
Thank you very much for your in-depth analysis, @rtobar ! Yes, it probably makes sense only releasing GIL after a certain threshold. I actually was unaware that it creates so much of an overhead. TODOs:
Will do this as soon as I find the time for it (probably this weekend or slightly later) :) |
I now made this package PEP 561 compliant and added typehint stub files. This seems to have worked, now even a Also, I added a new keyword argument
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much @jonded94 for the further changes :)
Two things that I'd like to point out here:
- The new GIL releasing behavior, and the new argument, need to be described in the documentation (i.e., the
README
file). - Similarly, let's add a new entry to the
CHANGELOG
file, under the Development section. - I'm happy to have annotations added, but let's leave that for a different PR to avoid mixing things together.
39f93b3
to
685a413
Compare
Thanks rtobar for your in-depth review! :) I hopefully addressed all of your concerns and pushed a bunch of new commits. Specifically, I also removed the adding of typehints from this PR and moved them into a separate one (#49). Please let me know if you see anything that still needs further refinement. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First of all, thank you so much @jonded94 for the continuing effort and patience while going through this review. This is looking excellent! Just a few more minor comments. Could you also rebase on top of the latest master? If possible I'd also like to push just one or two commits instead of 11 so far.
3919766
to
bd99609
Compare
Implemented the suggestions from your review and squashed the changes into one commit. Should be fine now? 😄 |
Many thanks again @jonded94 for the effort and patience! I'll wait for CI to check that everything's fine and I'll merge. I'm more than happy to publish a new release on PyPI after this, would you want to get the other PR through first? |
Yes, surely, as we're using Setting "sw mode" as an additional kwarg could be interesting for a later PR, after the release. |
Unfortunately, this library captures the Python GIL for the entirety of its computation. I chose this library because it's quite fast compared to others, but this is pretty worthless if it's forcing one to use multiprocessing for a realtively mundane task such as computing a hash of a given bytebuffer.
I think this should work. On a very simply test benchmark, I saw a sizeable performance improvement using this fork (~46s vs. ~8s using 16 threads on my 13th Gen Intel(R) Core(TM) i7-1370P):
Summary by Sourcery
Enhance the CRC computation by releasing the GIL during the compute-intensive operation, resulting in significant performance improvements in multi-threaded scenarios.
Enhancements: