Release GIL when doing the compute-intensive CRC computation #47

jonded94 · 2024-08-02T18:24:05Z

Unfortunately, this library holds the Python GIL for the entirety of its computation. I chose this library because it's quite fast compared to others, but that advantage is largely lost if it forces one to use multiprocessing for a relatively mundane task such as computing a hash of a given byte buffer.

I think this should work. On a very simple test benchmark, I saw a sizeable performance improvement using this fork (~46s single-threaded vs. ~8s using 16 threads on my 13th Gen Intel(R) Core(TM) i7-1370P):

import crc32c
import concurrent.futures
import io
import time


def calc_crc(inp: io.BytesIO) -> int:
    return crc32c.crc32c(inp.getbuffer())


N_DATA = 1_000_000_000
N_COUNT = 1000

data = io.BytesIO(b"0"*N_DATA)

ts = time.perf_counter()
res = [calc_crc(data) for _ in range(N_COUNT)]
te = time.perf_counter()
print(te-ts)  # 46.33s


with concurrent.futures.ThreadPoolExecutor(max_workers=16) as executor:
    ts = time.perf_counter()
    res_threaded = list(executor.map(calc_crc, [data for _ in range(N_COUNT)]))
    te = time.perf_counter()
    print(te-ts)  # 8.02s
    assert res == res_threaded

Summary by Sourcery

Enhance the CRC computation by releasing the GIL during the compute-intensive operation, resulting in significant performance improvements in multi-threaded scenarios.

Enhancements:

Release the Global Interpreter Lock (GIL) during the compute-intensive CRC computation to improve performance in multi-threaded environments.

sourcery-ai · 2024-08-02T18:24:11Z

Reviewer's Guide by Sourcery

This pull request optimizes the CRC computation by releasing the Global Interpreter Lock (GIL) during the compute-intensive part of the operation. This change allows for better multi-threading performance, significantly reducing the computation time when using multiple threads.

File-Level Changes

Files	Changes
`_crc32c.c`	Introduced GIL release and reacquisition around the CRC computation to improve multi-threading performance.

sourcery-ai

Hey @jonded94 - I've reviewed your changes and they look great!

Here's what I looked at during the review

🟡 General issues: 1 issue found
🟢 Security: all looks good
🟢 Testing: all looks good
🟢 Complexity: all looks good
🟢 Documentation: all looks good

sourcery-ai · 2024-08-02T18:25:21Z

_crc32c.c

+	Py_BEGIN_ALLOW_THREADS
 	crc ^= 0xffffffff;
 	result = crc_fn(crc, bin_data, pbin.len);
 	result ^= 0xffffffff;
+	Py_END_ALLOW_THREADS


suggestion (performance): Consider profiling the GIL release impact on performance

While releasing the GIL can improve performance for CPU-bound operations, it's important to verify that this change actually provides a measurable benefit in real-world scenarios. Consider profiling the function with and without these changes to ensure the overhead of releasing and reacquiring the GIL doesn't outweigh the potential benefits, especially if the CRC calculation is typically fast.

#ifdef PROFILE_GIL_IMPACT
	Py_BEGIN_ALLOW_THREADS
	crc ^= 0xffffffff;
	result = crc_fn(crc, bin_data, pbin.len);
	result ^= 0xffffffff;
	Py_END_ALLOW_THREADS
#else
	crc ^= 0xffffffff;
	result = crc_fn(crc, bin_data, pbin.len);
	result ^= 0xffffffff;
#endif

I actually benchmarked it against a ~1GB byte buffer, and there releasing the GIL already gave me a ~6x performance improvement on my developer machine.
On a compute server with even more cores, the impact is surely even larger.

Thank you for sharing your benchmark results. It's great to hear that you've already observed a significant performance improvement on your developer machine. Given your findings, it seems that releasing the GIL is indeed beneficial for this use case.

However, it might still be valuable to include a profiling option in the codebase, as it can help future developers understand the impact of this change in different environments and workloads. This can be particularly useful for maintaining and optimizing the library over time.

Would you be open to adding a compile-time option (like the PROFILE_GIL_IMPACT macro) to enable or disable the GIL release? This way, users can easily benchmark and profile the impact in their specific scenarios without modifying the source code.

rtobar · 2024-08-03T02:55:50Z

@jonded94 thanks for this contribution. It's been a long time since I wrote the package originally, and after so long I had assumed we were releasing the GIL already.

I agree with the change in principle; as you point out, holding the GIL unnecessarily hinders multithreaded computation. Having said that, I'd like to have a look at the performance penalty as a function of input buffer size. Surely there's a cost associated with releasing/acquiring the GIL, which might affect performance in single-threaded cases, especially with small buffer sizes. I seem to remember binascii.crc32 does (or did) something similar.

I'll run some benchmarks locally and post results here. Thanks again for the contribution; a version of these changes will most definitely make it in.

rtobar · 2024-08-03T14:39:47Z

@jonded94 thanks again for taking interest in this issue, and providing the patch.

I wrote the following benchmark to measure the overhead of releasing the GIL as a function of input buffer size, both for the software and hardware modes.

import time

import seaborn as sns
from pandas import DataFrame
from matplotlib import pyplot as plt

from crc32c import crc32c

def _run(buf, n_iter: int, release_gil: bool, sw_mode: bool) -> float:
    start = time.monotonic()
    [crc32c(buf, 0, release_gil, sw_mode) for _ in range(n_iter)]
    return time.monotonic() - start

def _run_with_bufsizes(n_iter: int) -> DataFrame:
    bufsizes = (2 ** n for n in range(1, 20))
    buffers = [b'0' * size for size in bufsizes]
    return DataFrame.from_records(
        [(len(buffer), _run(buffer, n_iter, False, sw_mode), _run(buffer, n_iter, True, sw_mode), sw_mode) for buffer in buffers for sw_mode in (False, True)],
        columns=("bufsize", "time_no_release", "time_release", "sw_mode")
    )


n_iter = 10000
df = _run_with_bufsizes(n_iter)
df["release_overhead"] = (df["time_release"] / df["time_no_release"]) - 1
sns.set_theme()
sns.catplot(df, x="bufsize", y="release_overhead", hue="sw_mode", kind="bar")
plt.show()

You'll note that the crc32c function used for this accepts two extra options to release/not-release the GIL, and force using the SW implementation or not. This is the diff on top of your changes:

--- a/_crc32c.c
+++ b/_crc32c.c
@@ -41,6 +41,8 @@ PyObject* crc32c_crc32c(PyObject *self, PyObject *args) {
        Py_buffer pbin;
        unsigned char *bin_data = NULL;
        uint32_t crc = 0U, result;
+       int release_gil = 0;
+       int sw_mode = 0;
 
        /* In python 3 we accept only bytes-like objects */
        const char *format =
@@ -49,17 +51,26 @@ PyObject* crc32c_crc32c(PyObject *self, PyObject *args) {
 #else
        "s*"
 #endif
-       "|I:crc32";
+       "|Ipp:crc32c";
 
-       if (!PyArg_ParseTuple(args, format, &pbin, &crc) )
+       if (!PyArg_ParseTuple(args, format, &pbin, &crc, &release_gil, &sw_mode) )
                return NULL;
 
+       crc_function the_crc_fn = (sw_mode ? _crc32c_sw_slicing_by_8 : crc_fn);
+
        bin_data = pbin.buf;
+       if (release_gil) {
        Py_BEGIN_ALLOW_THREADS
        crc ^= 0xffffffff;
-       result = crc_fn(crc, bin_data, pbin.len);
+       result = the_crc_fn(crc, bin_data, pbin.len);
        result ^= 0xffffffff;
        Py_END_ALLOW_THREADS
+       }
+       else {
+       crc ^= 0xffffffff;
+       result = the_crc_fn(crc, bin_data, pbin.len);
+       result ^= 0xffffffff;
+       }
 
        PyBuffer_Release(&pbin);
        return PyLong_FromUnsignedLong(result);

These are the results on my system (AMD Ryzen 7 5825U, CPython 3.12):

As expected, the penalty is greater at smaller buffer sizes, and for the HW mode. Very roughly, and of course very specifically to my system, the penalty in HW mode decreases to ~2% at 32KB, and ~1% at 128KB. For SW those happen at somewhere around ~4KB and at ~8KB.
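As a rough illustration of how a cut-off can be read from such measurements, the sketch below picks the smallest buffer size from which the overhead stays at or below a target. The helper name `smallest_size_within` is hypothetical, and the data points are only the approximate figures quoted above, not the full benchmark output.

```python
from typing import Optional, Sequence, Tuple

# (buffer size in bytes, relative overhead of releasing the GIL).
# Only the rough figures quoted in the comment above; system-specific.
HW_POINTS = [(32 * 1024, 0.02), (128 * 1024, 0.01)]
SW_POINTS = [(4 * 1024, 0.02), (8 * 1024, 0.01)]


def smallest_size_within(
    points: Sequence[Tuple[int, float]], max_overhead: float
) -> Optional[int]:
    """Smallest measured size from which all larger sizes stay within max_overhead."""
    candidate = None
    # Walk from the largest buffer downwards and stop at the first violation.
    for size, overhead in sorted(points, reverse=True):
        if overhead > max_overhead:
            break
        candidate = size
    return candidate


print(smallest_size_within(HW_POINTS, 0.02))  # 32768
print(smallest_size_within(SW_POINTS, 0.02))  # 4096
```

With a 2% budget this gives 32 KiB for the HW mode and 4 KiB for the SW mode, which is the trade-off weighed in the rest of this comment.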

Also, as I remembered, the binascii.crc32 function does indeed conditionally release the GIL, see https://github.com/python/cpython/blob/cc6839a1810290e483e5d5f0786d9b46c4294d47/Modules/binascii.c#L772-L798. They put the limit at 5 KB, which more or less matches the point where the SW mode overhead falls to somewhere around 1-2%.
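For illustration, a minimal sketch of what such a conditional release means for callers, using the standard library's binascii.crc32 as a stand-in for crc32c. Because it releases the GIL for inputs above its limit, worker threads can checksum large buffers concurrently. The buffer sizes and worker count are arbitrary assumptions, and the sketch only checks correctness; it makes no timing claim.

```python
import binascii
from concurrent.futures import ThreadPoolExecutor

# Eight 1 MiB buffers, each well above the 5 KB limit mentioned above.
buffers = [bytes([i]) * (1 << 20) for i in range(8)]

# Reference results computed serially on the main thread.
serial = [binascii.crc32(buf) for buf in buffers]

# Same work spread over a small thread pool; map() preserves input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(binascii.crc32, buffers))

assert serial == threaded
```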

In summary, I think we should restrict the releasing of the GIL to a minimum buffer size.

I'm a bit torn on the actual value though: it should be small enough that it covers most use cases, but big enough that the overhead of releasing the GIL isn't too big. A suitable buffer size would be similar to that used for file or socket reading operations, since the data fed into this package likely comes from those. These buffer sizes are usually in the single or double-digit KBs. Given the numbers I got on my system, I'm inclined to make the cut at 32 KB. @jonded94 would you want to put that change in?

If we also wanted more flexibility, we could offer a module-level function that altered this limit, or even a flag like in the diff above that can be used on a per-call basis.

jonded94 · 2024-08-03T16:23:43Z

Thank you very much for your in-depth analysis, @rtobar! Yes, it probably makes sense to release the GIL only above a certain threshold. I was actually unaware that it creates so much overhead.

TODOs:

- Implement dynamic GIL release behaviour, probably with this final function header: crc32c(buf: bytes, checksum: int = 0, release_gil: bool | None = None)
  - My idea is that release_gil = None dynamically determines whether to release the GIL (e.g. above a 32 KiB threshold). Users who know better can override this by passing a boolean explicitly.
- Make this package PEP 561 (https://peps.python.org/pep-0561/) compliant
  - Because the stub files are missing, mypy fails with this message: error: Skipping analyzing "crc32c": module is installed, but missing library stubs or py.typed marker [import-untyped]
  - Since there are practically only two functions in this package (one of them deprecated), I think it should be fine to include this in the PR? I thought of this because we're planning to slightly change the function header anyway, so proper type hints would be nice.

Will do this as soon as I find the time for it (probably this weekend or slightly later) :)

jonded94 · 2024-08-03T20:40:11Z

I now made this package PEP 561 compliant and added typehint stub files. This seems to have worked, now even a crc32c.crc32c(io.BytesIO(b"123").getbuffer()) properly typechecks with mypy --strict.

Also, I added a new keyword argument release_gil. This can be bool | int, and has this behaviour:

(default) -1 (or less): Automatically decide whether to release GIL or not. Will do it starting from buffer sizes >= 32KiB
0 or False: Never release GIL
1 or True: Always release GIL
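The three cases above can be modelled in a few lines of Python. This is only a sketch of the decision rule as described in this comment, not the C implementation from the PR; the names `should_release_gil` and `GIL_RELEASE_THRESHOLD` are hypothetical, and treating any positive value like 1 is an assumption.

```python
GIL_RELEASE_THRESHOLD = 32 * 1024  # 32 KiB, the cut-off mentioned above


def should_release_gil(release_gil: int, buf_len: int) -> bool:
    """Model of the tri-state release_gil argument (bool is accepted as int)."""
    mode = int(release_gil)
    if mode < 0:
        # Automatic mode: release only for buffers at or above the threshold.
        return buf_len >= GIL_RELEASE_THRESHOLD
    # 0/False never releases; 1/True (assumed: any positive value) always does.
    return mode > 0


print(should_release_gil(-1, 16 * 1024))  # False
print(should_release_gil(-1, 32 * 1024))  # True
print(should_release_gil(True, 16))       # True
```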

rtobar

Thank you very much @jonded94 for the further changes :)

Two things that I'd like to point out here:

The new GIL releasing behavior, and the new argument, need to be described in the documentation (i.e., the README file).
Similarly, let's add a new entry to the CHANGELOG file, under the Development section.
I'm happy to have annotations added, but let's leave that for a different PR to avoid mixing things together.

jonded94 · 2024-08-05T10:12:58Z

Thanks rtobar for your in-depth review! :)

I hopefully addressed all of your concerns and pushed a bunch of new commits. Specifically, I also removed the adding of typehints from this PR and moved them into a separate one (#49).

Please let me know if you see anything that still needs further refinement.

rtobar

First of all, thank you so much @jonded94 for the continuing effort and patience while going through this review. This is looking excellent! Just a few more minor comments. Could you also rebase on top of the latest master? If possible, I'd also like to push just one or two commits instead of the 11 so far.

jonded94 · 2024-08-06T08:47:41Z

Implemented the suggestions from your review and squashed the changes into one commit. Should be fine now? 😄

rtobar · 2024-08-06T10:59:21Z

Many thanks again @jonded94 for the effort and patience! I'll wait for CI to check that everything's fine and I'll merge.

I'm more than happy to publish a new release on PyPI after this, would you want to get the other PR through first?

jonded94 · 2024-08-06T11:31:56Z

> would you want to get the other PR through first?

Yes, surely, as we're using mypy to a broad degree in our entire codebase and this would be very helpful (I'm having to add # type: ignore[import-untyped] everywhere where I import this library right now).

Setting "sw mode" as an additional kwarg could be interesting for a later PR, after the release.

sourcery-ai bot reviewed Aug 2, 2024

jonded94 force-pushed the make-crc32c-multithreaded branch from e881500 to 2e8b05a on August 2, 2024 18:27

rtobar reviewed Aug 4, 2024

jonded94 mentioned this pull request Aug 5, 2024: Make package PEP 561 compliant; add typehint stub file #49 (Merged)

jonded94 force-pushed the make-crc32c-multithreaded branch from 39f93b3 to 685a413 on August 5, 2024 10:12

rtobar reviewed Aug 6, 2024

Commit bd99609: Introduce gil_release_mode for releasing GIL during computation of CRC32C hash

jonded94 force-pushed the make-crc32c-multithreaded branch from 3919766 to bd99609 on August 6, 2024 08:46

rtobar merged commit c67c95b into ICRAR:master Aug 6, 2024
7 checks passed
