Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add wrappers for zarr v3 #524

Open
wants to merge 31 commits into
base: main
Choose a base branch
from
Open

Add wrappers for zarr v3 #524

wants to merge 31 commits into from

Conversation

normanrz
Copy link
Contributor

@normanrz normanrz commented May 6, 2024

The Zarr v3 specification only lists a few codecs that are officially supported. However, it is desirable to expose the codecs in numcodecs for use with v3 arrays as well. This PR adds wrapper classes for numcodecs support.

The name of the codecs is prefixed with numcodecs. to avoid naming collisions in case some codecs of numcodecs get added to the Zarr spec. Also, there is a warning that numcodecs codecs are not officially supported and will likely not work in any other Zarr implementation.

Most array-to-array ("filters") and bytes-to-bytes codecs are supported. Absent are the variable-length codecs as well as json, msgpack and pickle.

Here is an example of the persisted configuration:

{
  "name": "numcodecs.fixedoffsetscale",
  "configuration": {"offset": 0, "scale": 51, "astype": "uint16"}
}

Use of numcodecs in v2 arrays is not affected.

Fixes #502

@pep8speaks
Copy link

pep8speaks commented May 6, 2024

Hello @normanrz! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2024-05-08 20:53:05 UTC

@normanrz normanrz changed the title Add wrappers for zarr v3 [DRAFT] Add wrappers for zarr v3 May 6, 2024
@MSanKeys963 MSanKeys963 requested a review from jakirkham May 8, 2024 16:20
@normanrz normanrz changed the title [DRAFT] Add wrappers for zarr v3 Add wrappers for zarr v3 May 8, 2024
@rabernat
Copy link
Contributor

The name of the codecs is prefixed with https://zarr.dev/numcodecs/ to avoid naming collisions in case some codecs of numcodecs get added to the Zarr spec

I am not sure about the idea of using a URL that does not actually resolve to anything useful.

@rabernat
Copy link
Contributor

pcodec is actually an "Array to Bytes" codec: https://github.com/zarr-developers/numcodecs/blob/main/numcodecs/pcodec.py

How would that fit in here?

@martindurant
Copy link
Member

Any thoughts about what to do with numcodecs codecs not defined in this repo, but currently used via entrypoints?

@d-v-b
Copy link
Contributor

d-v-b commented Jun 19, 2024

The name of the codecs is prefixed with https://zarr.dev/numcodecs/ to avoid naming collisions in case some codecs of numcodecs get added to the Zarr spec

I am not sure about the idea of using a URL that does not actually resolve to anything useful.

seconding this sentiment, a URL that doesn't resolve to anything is rather confusing. I think numcodecs.<codec_name> or numcodecs/<codec_name> are simpler templates for a numcodecs-qualified name.

@rabernat
Copy link
Contributor

Any thoughts about what to do with numcodecs codecs not defined in this repo, but currently used via entrypoints?

Could we ask those codecs to implement Zarr codec entrypoints directly? Which codecs do you have in mind?

The challenge is that the V3 codecs are quite a bit more explicit in their typing (Array to Bytes, Bytes to Bytes, etc.) than legacy numcodecs codecs. So automatically translating an arbitrary numcodecs codec to a V3 codec is not possible.

@martindurant
Copy link
Member

I am thinking of https://github.com/fsspec/kerchunk/blob/main/kerchunk/codecs.py and imagecodecs. There are probably others.

@normanrz
Copy link
Contributor Author

The name of the codecs is prefixed with https://zarr.dev/numcodecs/ to avoid naming collisions in case some codecs of numcodecs get added to the Zarr spec

I am not sure about the idea of using a URL that does not actually resolve to anything useful.

I had asked @MSanKeys963 to setup the respective redirects to the numcodecs docs. That should solve that.

@normanrz
Copy link
Contributor Author

pcodec is actually an "Array to Bytes" codec: https://github.com/zarr-developers/numcodecs/blob/main/numcodecs/pcodec.py

How would that fit in here?

Must have missed pcodec. I'll add it.

Copy link

codecov bot commented Jun 24, 2024

Codecov Report

Attention: Patch coverage is 12.45283% with 232 lines in your changes missing coverage. Please review.

Project coverage is 90.97%. Comparing base (cabecae) to head (20aa698).
Report is 6 commits behind head on main.

Files with missing lines Patch % Lines
numcodecs/_zarr3.py 8.52% 161 Missing ⚠️
numcodecs/tests/test_zarr3.py 7.79% 71 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #524      +/-   ##
==========================================
- Coverage   99.91%   90.97%   -8.94%     
==========================================
  Files          59       62       +3     
  Lines        2328     2593     +265     
==========================================
+ Hits         2326     2359      +33     
- Misses          2      234     +232     
Files with missing lines Coverage Δ
numcodecs/tests/test_zarr3_import.py 100.00% <100.00%> (ø)
numcodecs/tests/test_zarr3.py 7.79% <7.79%> (ø)
numcodecs/_zarr3.py 8.52% <8.52%> (ø)

@dstansby
Copy link
Contributor

Could you say something a bit about why this code makes sense to be in numcodecs instead of zarr-python? My first thoughts are that numcodecs sees much less development/maintenance than zarr-python at the moment, so unless there's a good reason maybe this code should live in zarr-python?

@normanrz
Copy link
Contributor Author

The idea was that the zarr package contains the "official" Zarr3 codecs and other libraries can add other codecs via the entrypoint mechanism. The zarr package provides base classes for doing so.
numcodecs is one of these libraries that can provide additional codecs. There is quite a bit of glue code to make the existing v2 codecs work with the new v3 base classes. This glue code is tightly coupled with the codecs itself, which is why I think it makes more sense to have it in numcodecs rather than zarr. If we would add a new codec to numcodecs, we would need to make a new zarr release to support it.
I see that it is a bit weird because zarr itself depends on numcodecs.

@dstansby
Copy link
Contributor

That makes sense - I'll try and give this a proper review in the next couple of days!

Copy link
Contributor

@dstansby dstansby left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't quite looked at everything, but here's some initial comments

numcodecs/tests/test_zarr3.py Outdated Show resolved Hide resolved
numcodecs/tests/test_zarr3.py Outdated Show resolved Hide resolved
numcodecs/tests/test_zarr3.py Show resolved Hide resolved
numcodecs/tests/test_zarr3.py Outdated Show resolved Hide resolved
numcodecs/tests/test_zarr3.py Outdated Show resolved Hide resolved
numcodecs/zarr3.py Outdated Show resolved Hide resolved
class NumcodecsCodec:
codec_config: dict[str, JSON]

def __init__(self, *, codec_id: str | None = None, codec_config: dict[str, JSON]) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the codec_id type be narrowed to a Literal list of supported codecs?

Copy link
Contributor

@dstansby dstansby left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't reviewed the code in detail because I haven't got the time to get the context on zarr v3 I need to do that. I've left some comments on what I'd like to see addressed before this is merged though, mainly to ensure we maintain compatibility with zarr v2.

I also wonder if we should make all of the zarr3 api private, just exposing it via the entrypoints, to give more flexibility for modifying it in the future?

numcodecs/tests/test_zarr3.py Outdated Show resolved Hide resolved
.github/workflows/ci.yaml Outdated Show resolved Hide resolved
numcodecs/zarr3.py Outdated Show resolved Hide resolved
@normanrz
Copy link
Contributor Author

@dstansby I incorporated your feedback:

  • zarr3.py -> _zarr3.py
  • Added a test for trying to import zarr-python 2. I had to add another CI matrix run for installing zarr-python 2. Not sure if this test is worth the extra CI time.

Not sure what is going on with codecov. I think all lines should be tested by at least a few CI runs. Help appreciated.

@normanrz
Copy link
Contributor Author

Thinking about how users might interact with this PR in zarr-developers/zarr-python#2398, I think we should make numcodecs.zarr3 public. While the entrypoints are nice for reading data, there is no json-configuration-based method of creating arrays and codec pipelines in zarr-python. So, the only way to use numcodecs codecs is through the Python classes.

@dstansby
Copy link
Contributor

Making it public sounds good - the API reference will need adding to https://numcodecs--524.org.readthedocs.build/en/524/api.html if it's going to be public. I'll have a look and try and debug the codecov issue now.

@@ -15,6 +15,7 @@ jobs:
python-version: ["3.10", "3.11", "3.12", "3.13.0"]
# macos-12 is an intel runner, macos-14 is a arm64 runner
platform: [ubuntu-latest, windows-latest, macos-12, macos-14]
zarr-version: ["zarr>=2,<3", "zarr==3.0.0b0"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry if I lost track of this - I think you convinced me earlier in the comments of this PR that we could just install zarr==3 here and not bother testing with zarr v2?

@dstansby
Copy link
Contributor

The numcodecs issue is because the version check isn't working. If you check the logs of the runs with zarr v3, they contain the line

SKIPPED [2] numcodecs/tests/test_zarr3.py:11: module 'zarr' has __version__ '3.0.0b0', required is: '3.0.0'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Supporting Zarr-Python 3 Codec API
7 participants