Release GIL when doing the compute-intensive CRC computation #47

Merged: 1 commit, Aug 6, 2024

Contributor

@jonded94 jonded94 commented Aug 2, 2024

Unfortunately, this library holds the Python GIL for the entirety of its computation. I chose this library because it's quite fast compared to others, but that speed is of little use if it forces one to use multiprocessing for a relatively mundane task such as computing a hash of a given byte buffer.

I think this should work. On a very simple test benchmark, I saw a sizeable performance improvement using this fork (~46s serially vs. ~8s using 16 threads on my 13th Gen Intel(R) Core(TM) i7-1370P):

import crc32c
import concurrent.futures
import io
import time


def calc_crc(inp: io.BytesIO) -> int:
    return crc32c.crc32c(inp.getbuffer())


N_DATA = 1_000_000_000
N_COUNT = 1000

data = io.BytesIO(b"0"*N_DATA)

ts = time.perf_counter()
res = [calc_crc(data) for _ in range(N_COUNT)]
te = time.perf_counter()
print(te-ts)  # 46.33s


with concurrent.futures.ThreadPoolExecutor(max_workers=16) as executor:
    ts = time.perf_counter()
    res_threaded = list(executor.map(calc_crc, [data for _ in range(N_COUNT)]))
    te = time.perf_counter()
    print(te-ts)  # 8.02s
    assert res == res_threaded

Summary by Sourcery

Enhance the CRC computation by releasing the GIL during the compute-intensive operation, resulting in significant performance improvements in multi-threaded scenarios.

Enhancements:

  • Release the Global Interpreter Lock (GIL) during the compute-intensive CRC computation to improve performance in multi-threaded environments.

sourcery-ai bot commented Aug 2, 2024

Reviewer's Guide by Sourcery

This pull request optimizes the CRC computation by releasing the Global Interpreter Lock (GIL) during the compute-intensive part of the operation. This change allows for better multi-threading performance, significantly reducing the computation time when using multiple threads.

File-Level Changes

| Files | Changes |
| --- | --- |
| _crc32c.c | Introduced GIL release and reacquisition around the CRC computation to improve multi-threading performance. |

@sourcery-ai sourcery-ai bot left a comment

Hey @jonded94 - I've reviewed your changes and they look great!

Here's what I looked at during the review
  • 🟡 General issues: 1 issue found
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟢 Complexity: all looks good
  • 🟢 Documentation: all looks good

_crc32c.c Outdated
Comment on lines 58 to 62
Py_BEGIN_ALLOW_THREADS
crc ^= 0xffffffff;
result = crc_fn(crc, bin_data, pbin.len);
result ^= 0xffffffff;
Py_END_ALLOW_THREADS

suggestion (performance): Consider profiling the GIL release impact on performance

While releasing the GIL can improve performance for CPU-bound operations, it's important to verify that this change actually provides a measurable benefit in real-world scenarios. Consider profiling the function with and without these changes to ensure the overhead of releasing and reacquiring the GIL doesn't outweigh the potential benefits, especially if the CRC calculation is typically fast.

#ifdef PROFILE_GIL_IMPACT
    Py_BEGIN_ALLOW_THREADS
    crc ^= 0xffffffff;
    result = crc_fn(crc, bin_data, pbin.len);
    result ^= 0xffffffff;
    Py_END_ALLOW_THREADS
#else
    crc ^= 0xffffffff;
    result = crc_fn(crc, bin_data, pbin.len);
    result ^= 0xffffffff;
#endif

Contributor Author

I actually benchmarked it against a ~1 GB byte buffer, where releasing the GIL already gave me a ~6x performance improvement on my developer machine.
On a compute server with more cores, the impact is likely even larger.

Thank you for sharing your benchmark results. It's great to hear that you've already observed a significant performance improvement on your developer machine. Given your findings, it seems that releasing the GIL is indeed beneficial for this use case.

However, it might still be valuable to include a profiling option in the codebase, as it can help future developers understand the impact of this change in different environments and workloads. This can be particularly useful for maintaining and optimizing the library over time.

Would you be open to adding a compile-time option (like the PROFILE_GIL_IMPACT macro) to enable or disable the GIL release? This way, users can easily benchmark and profile the impact in their specific scenarios without modifying the source code.

Contributor

rtobar commented Aug 3, 2024

@jonded94 thanks for this contribution. It's been a long time since I wrote the package originally, and after so long I had assumed we were releasing the GIL already.

I agree with the change in principle; as you point out, holding the GIL unnecessarily hinders multithreaded computation. Having said that, I'd like to have a look at the performance penalty as a function of input buffer size. Surely there's a cost associated with releasing/acquiring the GIL, which might affect performance in single-threaded cases, especially with small buffer sizes. I seem to remember binascii.crc32 does (or did) something similar.

I'll run some benchmarks locally and post results here. But again, thanks for the contribution; a version of these changes will most definitely make it in.

Contributor

rtobar commented Aug 3, 2024

@jonded94 thanks again for taking interest in this issue, and providing the patch.

I wrote the following benchmark to measure the overhead of releasing the GIL as a function of input buffer size, both for the software and hardware modes.

import time

import seaborn as sns
from pandas import DataFrame
from matplotlib import pyplot as plt

from crc32c import crc32c

def _run(buf, n_iter: int, release_gil: bool, sw_mode: bool) -> float:
    start = time.monotonic()
    [crc32c(buf, 0, release_gil, sw_mode) for _ in range(n_iter)]
    return time.monotonic() - start

def _run_with_bufsizes(n_iter: int) -> DataFrame:
    bufsizes = (2 ** n for n in range(1, 20))
    buffers = [b'0' * size for size in bufsizes]
    return DataFrame.from_records(
        [(len(buffer), _run(buffer, n_iter, False, sw_mode), _run(buffer, n_iter, True, sw_mode), sw_mode) for buffer in buffers for sw_mode in (False, True)],
        columns=("bufsize", "time_no_release", "time_release", "sw_mode")
    )


n_iter = 10000
df = _run_with_bufsizes(n_iter)
df["release_overhead"] = (df["time_release"] / df["time_no_release"]) - 1
sns.set_theme()
sns.catplot(df, x="bufsize", y="release_overhead", hue="sw_mode", kind="bar")
plt.show()

You'll note that the crc32c function used for this accepts two extra options to release/not-release the GIL, and force using the SW implementation or not. This is the diff on top of your changes:

--- a/_crc32c.c
+++ b/_crc32c.c
@@ -41,6 +41,8 @@ PyObject* crc32c_crc32c(PyObject *self, PyObject *args) {
        Py_buffer pbin;
        unsigned char *bin_data = NULL;
        uint32_t crc = 0U, result;
+       int release_gil = 0;
+       int sw_mode = 0;
 
        /* In python 3 we accept only bytes-like objects */
        const char *format =
@@ -49,17 +51,26 @@ PyObject* crc32c_crc32c(PyObject *self, PyObject *args) {
 #else
        "s*"
 #endif
-       "|I:crc32";
+       "|Ipp:crc32c";
 
-       if (!PyArg_ParseTuple(args, format, &pbin, &crc) )
+       if (!PyArg_ParseTuple(args, format, &pbin, &crc, &release_gil, &sw_mode) )
                return NULL;
 
+       crc_function the_crc_fn = (sw_mode ? _crc32c_sw_slicing_by_8 : crc_fn);
+
        bin_data = pbin.buf;
+       if (release_gil) {
        Py_BEGIN_ALLOW_THREADS
        crc ^= 0xffffffff;
-       result = crc_fn(crc, bin_data, pbin.len);
+       result = the_crc_fn(crc, bin_data, pbin.len);
        result ^= 0xffffffff;
        Py_END_ALLOW_THREADS
+       }
+       else {
+       crc ^= 0xffffffff;
+       result = the_crc_fn(crc, bin_data, pbin.len);
+       result ^= 0xffffffff;
+       }
 
        PyBuffer_Release(&pbin);
        return PyLong_FromUnsignedLong(result);

These are the results on my system (AMD Ryzen 7 5825U, CPython 3.12):

[Figure: benchmark results, release_overhead vs. bufsize, grouped by sw_mode]

As expected, the penalty is greater at smaller buffer sizes, and for the HW mode. Very roughly, and of course very specifically to my system, the penalty in HW mode decreases to ~2% at 32KB, and ~1% at 128KB. In SW mode those levels are reached at around ~4KB and ~8KB respectively.

Also, as I remembered, the binascii.crc32 function does indeed conditionally release the GIL, see https://github.com/python/cpython/blob/cc6839a1810290e483e5d5f0786d9b46c4294d47/Modules/binascii.c#L772-L798. They put the limit at 5 KB, which more or less matches the point where the SW mode overhead is somewhere around ~1-2%.
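As a rough illustration of why a conditional release matters to callers, here is a minimal sketch that runs the standard library's binascii.crc32 over several buffers from a thread pool and checks that the results match a serial run. Note that binascii.crc32 computes CRC-32, not CRC-32C; it is used here only as a stdlib stand-in. The chunk size, chunk count and worker count are assumptions chosen for illustration, and the sketch checks correctness only, not speed.

```python
import binascii
from concurrent.futures import ThreadPoolExecutor

# Assumed size for illustration: well above the 5 KB limit mentioned above,
# so each call falls on the "release the GIL" side of that limit.
CHUNK_SIZE = 1024 * 1024
chunks = [bytes([i % 256]) * CHUNK_SIZE for i in range(32)]

# Serial reference results.
serial = [binascii.crc32(chunk) for chunk in chunks]

# Same work spread over a small thread pool; map() preserves input order.
with ThreadPoolExecutor(max_workers=4) as executor:
    threaded = list(executor.map(binascii.crc32, chunks))

print(serial == threaded)
```

Whether the threaded run is actually faster depends on the machine and the buffer sizes, which is exactly what the benchmark above tries to quantify.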

In summary, I think we should restrict the releasing of the GIL to a minimum buffer size.

I'm a bit torn on the actual value though: it should be small enough that it covers most use cases, but big enough that the overhead of releasing the GIL isn't too big. A buffer size that I think would be suitable would be similar to that used for file or socket reading operations, since the data fed into this package likely comes from those. These buffer sizes are usually in the single or double-digit KBs. And given the numbers I got on my system, I'm inclined to make the cut at 32 KB. @jonded94 would you want to put that change in?

If we also wanted more flexibility, we could offer a module-level function that altered this limit, or even a flag like in the diff above that can be used on a per-call basis.

Contributor Author

jonded94 commented Aug 3, 2024

Thank you very much for your in-depth analysis, @rtobar! Yes, it probably makes sense to release the GIL only above a certain threshold. I actually was unaware that it creates so much overhead.

TODOs:

  • Implement dynamic GIL release behaviour, probably with this final function header: crc32c(buf: bytes, checksum: int = 0, release_gil: bool | None = None)
    • My idea would be that release_gil = None dynamically determines whether to release the GIL (e.g. above a 32KiB threshold). Users who know better can override this by passing a boolean explicitly.
  • Make this package a PEP 561 (https://peps.python.org/pep-0561/) compliant package
    • Because of missing stub files, mypy errors with this message: error: Skipping analyzing "crc32c": module is installed, but missing library stubs or py.typed marker [import-untyped]
    • Since there practically are only two functions in this package (with one being deprecated) I think it should be fine to include this in the PR? I thought of this because we're planning to slightly change the function header anyways, so proper type hints could be nice.
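
To make the two TODO items concrete, a stub for the proposed function header might look roughly like the sketch below. This is only an illustration: the BytesLike alias is an assumption, the parameter names are taken from the header proposed in the list above, and the stub that actually ships with the package may differ. Under PEP 561, a package that ships inline stubs also includes an empty py.typed marker file.

```python
from typing import Optional, Union

# Assumed alias for "bytes-like" inputs; the real stub may use a different type.
BytesLike = Union[bytes, bytearray, memoryview]


# Stub-style declaration mirroring the header proposed above.
# The body is intentionally empty, as in a .pyi file.
def crc32c(buf: BytesLike, checksum: int = 0, release_gil: Optional[bool] = None) -> int: ...
```
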

Will do this as soon as I find the time for it (probably this weekend or slightly later) :)

Contributor Author

jonded94 commented Aug 3, 2024

I have now made this package PEP 561 compliant and added type hint stub files. This seems to have worked: even crc32c.crc32c(io.BytesIO(b"123").getbuffer()) now type-checks properly with mypy --strict.

Also, I added a new keyword argument release_gil. This can be bool | int, and has this behaviour:

  • (default) -1 (or less): Automatically decide whether to release the GIL. It is released for buffer sizes >= 32KiB
  • 0 or False: Never release GIL
  • 1 or True: Always release GIL
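
The three modes above can be sketched in Python as follows. The real decision is made in the C extension, so this is only a hedged model of the described behaviour; the helper name should_release_gil and the constant GIL_RELEASE_THRESHOLD are hypothetical and not part of the package.

```python
# Hypothetical constant: 32 KiB, the cut-off discussed in this thread.
GIL_RELEASE_THRESHOLD = 32 * 1024


def should_release_gil(buf_len: int, release_gil: int = -1) -> bool:
    """Hypothetical helper modelling the three release_gil modes."""
    if release_gil < 0:
        # Automatic mode: release only at or above the threshold.
        return buf_len >= GIL_RELEASE_THRESHOLD
    # 0/False never releases; 1/True always releases.
    # Other positive values are treated as "always" here (an assumption).
    return bool(release_gil)


print(should_release_gil(4 * 1024))         # small buffer, automatic mode
print(should_release_gil(64 * 1024))        # large buffer, automatic mode
print(should_release_gil(4 * 1024, True))   # forced release
```

Keeping the automatic mode as the default means small single-threaded calls avoid the release/acquire cost, while callers who know their workload can still force either behaviour.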

Contributor

@rtobar rtobar left a comment

Thank you very much @jonded94 for the further changes :)

A few things that I'd like to point out here:

  • The new GIL releasing behavior, and the new argument, need to be described in the documentation (i.e., the README file).
  • Similarly, let's add a new entry to the CHANGELOG file, under the Development section.
  • I'm happy to have annotations added, but let's leave that for a different PR to avoid mixing things together.

Contributor Author

jonded94 commented Aug 5, 2024

Thanks rtobar for your in-depth review! :)

I have hopefully addressed all of your concerns and pushed a bunch of new commits. Specifically, I also removed the type hints from this PR and moved them into a separate one (#49).

Please let me know if you see anything that still needs further refinement.

Contributor

@rtobar rtobar left a comment

First of all, thank you so much @jonded94 for the continuing effort and patience while going through this review. This is looking excellent! Just a few more minor comments. Could you also rebase on top of the latest master? If possible I'd also like to push just one or two commits instead of the 11 so far.

Contributor Author

jonded94 commented Aug 6, 2024

Implemented the suggestions from your review and squashed the changes into one commit. Should be fine now? 😄

Contributor

rtobar commented Aug 6, 2024

Many thanks again @jonded94 for the effort and patience! I'll wait for CI to check that everything's fine and I'll merge.

I'm more than happy to publish a new release on PyPI after this, would you want to get the other PR through first?

Contributor Author

jonded94 commented Aug 6, 2024

> would you want to get the other PR through first

Yes, surely, as we're using mypy extensively across our entire codebase and this would be very helpful (I'm having to add # type: ignore[import-untyped] everywhere I import this library right now).

Setting "sw mode" as an additional kwarg could be interesting for a later PR, after the release.

@rtobar rtobar merged commit c67c95b into ICRAR:master Aug 6, 2024
7 checks passed