Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a hashlib-esque wrapper class + slight general cleanup which was necessary #50

Merged
merged 8 commits into from
Aug 9, 2024

Conversation

jonded94
Copy link
Contributor

@jonded94 jonded94 commented Aug 7, 2024

Now that this project is implemented as a python package (instead of a single-file distribution), I thought about adding some nice pythonic sugar on to it to make using it a bit nicer.

Internally, we're using this project to calculate hashes of large files and wrap it in something which already basically conforms to the interfaces given by the stdlib library hashlib.

This means that there are some wrapper classes with helper functions such as digest, hexdigest, update, ..
More info can be found here:

Instead of implementing this purely on our side, I asked myself why not to bring it upstream. The tests already are using modern pytest syntax and I also included additional type information in the stub file.

Let me know what you think about it.

Summary by Sourcery

Introduce a Crc32cHash class with a hashlib-like interface for CRC32C hashing, enhance type annotations, and add comprehensive tests for the new class.

New Features:

  • Introduce a Crc32cHash class that provides a hashlib-like interface for CRC32C hashing, including methods like digest, hexdigest, update, and base64.

Enhancements:

  • Add type annotations to the Crc32cHash class in the stub file for better type checking and code clarity.

Tests:

  • Add new test cases for the Crc32cHash class, including tests for digest size, name, copy functionality, and specific value checks using pytest.

Copy link

sourcery-ai bot commented Aug 7, 2024

Reviewer's Guide by Sourcery

This pull request introduces a new Crc32cHash class that wraps the crc32c functionality with a hashlib-like interface. The class includes methods for updating data, generating digests, and copying instances. Corresponding type annotations have been added. Additionally, comprehensive tests have been implemented to ensure the correctness of the new class.

File-Level Changes

Files Changes
src/crc32c/__init__.py
src/crc32c/__init__.pyi
Introduced a new Crc32cHash class with methods and properties to wrap the crc32c functionality, and added corresponding type annotations.
test/test_crc32c.py Added comprehensive tests for the new Crc32cHash class, including basic property checks, copy functionality, and specific value tests.

Tips
  • Trigger a new Sourcery review by commenting @sourcery-ai review on the pull request.
  • Continue your discussion with Sourcery by replying directly to review comments.
  • You can change your review settings at any time by accessing your dashboard:
    • Enable or disable the Sourcery-generated pull request summary or reviewer's guide;
    • Change the review language;
  • You can always contact us if you have any questions or feedback.

Copy link

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @jonded94 - I've reviewed your changes and they look great!

Here's what I looked at during the review
  • 🟡 General issues: 1 issue found
  • 🟢 Security: all looks good
  • 🟡 Testing: 4 issues found
  • 🟢 Complexity: all looks good
  • 🟢 Documentation: all looks good

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment to tell me if it was helpful.

def base64(self):
return base64.b64encode(self.digest()).decode(encoding="ascii")

def digest(self):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Document the use of big-endian byte order

The digest method uses big-endian byte order. Consider adding a comment to explicitly document this choice, as it can be important for interoperability and understanding the implementation.

Suggested change
def digest(self):
def digest(self) -> bytes:
"""
Returns the CRC-32C checksum as a 4-byte big-endian byte string.
"""
return self._checksum.to_bytes(4, "big")

test/test_crc32c.py Show resolved Hide resolved
test/test_crc32c.py Show resolved Hide resolved

assert crc32c_hash.digest() != crc32c_hash_copy.digest()

@pytest.mark.parametrize("data,digest,hexdigest,base64", [
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (testing): Consider adding more diverse test data for parameterized tests

The parameterized tests currently cover a few specific cases. It would be beneficial to add more diverse test data, such as inputs with special characters, different lengths, and binary data, to ensure the Crc32cHash class handles a wide range of inputs correctly.

Suggested change
@pytest.mark.parametrize("data,digest,hexdigest,base64", [
@pytest.mark.parametrize("data,digest,hexdigest,base64", [
(b"", b"\x00\x00\x00\x00", "00000000", "AAAAAA=="),
(b"23456789", b"\xbf\xe9\x2a\x83", "bfe92a83", "v+kqgw=="),
(b"Hello, World!", b"\x6f\xa1\x1e\x6c", "6fa11e6c", "b6EebA=="),
(b"\x00\xff\x55\xaa", b"\x51\x66\x29\x6c", "5166296c", "UWYpbA=="),
(b"a" * 1000, b"\x20\x4b\xc5\x39", "204bc539", "IEvFOQ=="),
])

crc32c_hash.update(x)
self._check_values(crc32c_hash, digest, hexdigest, base64)

def test_all(self, data: bytes, digest: bytes, hexdigest: str, base64: str) -> None:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (testing): Consider renaming test_all to a more descriptive name

The method name test_all is somewhat generic. Consider renaming it to something more descriptive, such as test_single_update, to better convey the purpose of the test.

        def test_single_update(self, data: bytes, digest: bytes, hexdigest: str, base64: str) -> None:

Copy link
Contributor

@rtobar rtobar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jonded94! Yes, I do like the idea. Even though hashlib contains only secure hashes, I do recognise the utility of having this class in place.

Could you make sure the README and CHANGELOG files are updated accordingly? We also need some small, concise docstrings for the new code.

src/crc32c/__init__.py Outdated Show resolved Hide resolved
src/crc32c/__init__.py Outdated Show resolved Hide resolved
src/crc32c/__init__.py Outdated Show resolved Hide resolved
src/crc32c/__init__.py Outdated Show resolved Hide resolved
src/crc32c/__init__.py Outdated Show resolved Hide resolved
test/test_crc32c.py Outdated Show resolved Hide resolved
test/test_crc32c.py Show resolved Hide resolved
test/test_crc32c.py Show resolved Hide resolved
src/crc32c/__init__.pyi Outdated Show resolved Hide resolved
@jonded94
Copy link
Contributor Author

jonded94 commented Aug 8, 2024

Okay, so, as you can see, a bit more commits flew in.

I took the "bold" decision to really drop support for Python <3.7 because it enables quite a bunch of much cleaner implementation structures.

Note that Python 3.6 is EOL since 2021-12-23 (and Python 3.7 is EOL since 2023-06-27, but supporting it is not that hard), so I think this really is not a too bad thing to do. Rocking a Python version this old is bad even just from a security perspective.

This enabled me to use stub files only for the shared library part which simply has no other means of type hinting. All the other Python code is type hinted directly.

The collections.abc.Buffer thing is solved by using typing_extensions, but behind a if typing.TYPE_CHECKING block. So no runtime dependency was introduced here.

You will notice that there are some other changes that are more of a general cleanup. I'd really like to see them included and not again be moved out. Separating every little change in separate PRs is just of no value here (IMHO), compared to just do one, slightly larger cleanup in one go.

EDIT: Notice that mypy, with the changes I introduced, is now also able to test the entire src and test directory instead of a single file.

@jonded94
Copy link
Contributor Author

jonded94 commented Aug 8, 2024

I'm a bit unsure about this .travis.yml file. AFAIK, it's not used anywhere. I touched it in this PR, but can it just be removed entirely?

@jonded94 jonded94 changed the title Add a hashlib-esque wrapper class Add a hashlib-esque wrapper class + slight general cleanup which was necessary Aug 8, 2024
@rtobar
Copy link
Contributor

rtobar commented Aug 9, 2024

@jonded94 thanks for the further updates.

Indeed I would ideally have requested some of changes to come in separate PRs -- at least the general black + cleanup, and the Python 3.7 update. Unlike you, I do see value in doing these separately, as it allows discussions and changesets to be more focused, and helps with better bookkeeping. You are moving fast though, are very eager, and I'm currently on parental leave, so I can't act as fast as I'd like to. I thus prefer to be practical and don't hinder progress.

I see nothing wrong with the further changes in principle. The push for 3.7+ is fine, and indeed I did ponder about doing it when dropping 2.7 the other day, but seeing how we still only had the C extension to annotate there was no further gain from restricting 3.x versions. I'll again take the liberty to force-push some small things (again, just trying to be practical). For instance, now that there's formatting, CI should enforce it, otherwise there's no point. I'll also add some text to the CHANGELOG and README, fix some minor things around docs, things like that. I'll also delete the .travis.yaml file, it hasn't been used since we moved from Travis to GHA.

rtobar and others added 6 commits August 9, 2024 11:37
We moved from Travis to GHA a long time ago, this was a leftover from
old times.

Signed-off-by: Rodrigo Tobar <[email protected]>
This will allow us more modern features in python code to be added to
this package, like type annotations (3.5+) and dataclasses (3.7+). 3.7
is EOL already, but it's the most popular Python version our users are
based on.

Signed-off-by: Rodrigo Tobar <[email protected]>
rtobar: Let's also add it to CI so we enforce its usage.

Signed-off-by: Rodrigo Tobar <[email protected]>
This will allow us to then add annotations in __init__ directly

Signed-off-by: Rodrigo Tobar <[email protected]>
@rtobar rtobar force-pushed the add-hashlib-esque-wrapper-class branch from e8451c4 to 43d0179 Compare August 9, 2024 07:03
Comment from rtobar: while crc32c is no not strictly a secure hash, it
is useful to have a hashlib-like "hash" class for our checksum
algorithm, as it allows it to more easily plug into code using hashlib
hash objects.

Signed-off-by: Rodrigo Tobar <[email protected]>
@rtobar rtobar force-pushed the add-hashlib-esque-wrapper-class branch from 43d0179 to 5cb7d90 Compare August 9, 2024 07:14
@rtobar
Copy link
Contributor

rtobar commented Aug 9, 2024

OK, I've addressed the points I mentioned above and force pushed, see diff in https://github.com/jonded94/crc32c/compare/e8451c4..add-hashlib-esque-wrapper-class. I re-arranged commits slightly so they are self-contained, and in a way that changes are introduced in the correct order (declare 3.7+ compatibility first, then use 3.7+ features). Other than what I mentioned before, I also removed the __all__ declaration in __init__.py, and opted for using the from X import Y as Y that makes mypy happy regarding names being exported, while removing the ability to do from crc32c import *.

Let me know if you're happy with the changes, or if there's something else you think is missing. If happy, I'll merge and release.

@jonded94
Copy link
Contributor Author

jonded94 commented Aug 9, 2024

Hi @rtobar, thank you very much for your very fast responses and much effort that goes into these PRs!

I re-arranged commits slightly so they

Sorry that you had to do this, I usually push a lot of WIP commits, then let the review happen, implement necessary changes and then squash/split up into self-contained commits. Thank you for cleaning up my work!

This all looks great, very happy to have this in. Thank you for the great, productive time doing this!

Hope you can concentrate on your probably well deserved parental leave time.

@rtobar rtobar merged commit 5cb7d90 into ICRAR:master Aug 9, 2024
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants