Zarr-v3 Consolidated Metadata #2113

Merged

Conversation

TomAugspurger
Contributor

@TomAugspurger TomAugspurger commented Aug 23, 2024

Implements the optional Consolidated Metadata feature of zarr-v3: zarr-developers/zarr-specs#309, along with reading zarr v2 metadata.

The implementation defines a new dataclass: ConsolidatedMetadata. It's an optional field on the existing GroupMetadata object. Opening as a draft until the PR to the zarr-specs repo finishes up.

closes #1161

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@TomAugspurger
Contributor Author

This isn't handling nested groups correctly yet. My understanding (and I'll clarify this in the spec) is that given a structure like

/root
  /arr-1
  /arr-2
  /child-group
    /child-arr-1
    /child-arr-2

we'd expect that the metadata from child-group, child-arr-1, and child-arr-2 should end up in the consolidated metadata.
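To make the expectation concrete, here is a minimal sketch in plain Python. It does not use zarr-python at all: the nested `hierarchy` dict and the `consolidate` helper are hypothetical stand-ins, chosen only to show which keys should end up in the consolidated metadata for the structure above.

```python
from typing import Any

# Hypothetical in-memory stand-in for the hierarchy above: groups carry a
# "children" mapping, arrays do not.
hierarchy: dict[str, Any] = {
    "node_type": "group",
    "children": {
        "arr-1": {"node_type": "array"},
        "arr-2": {"node_type": "array"},
        "child-group": {
            "node_type": "group",
            "children": {
                "child-arr-1": {"node_type": "array"},
                "child-arr-2": {"node_type": "array"},
            },
        },
    },
}


def consolidate(group: dict[str, Any], prefix: str = "") -> dict[str, dict[str, Any]]:
    """Collect metadata for every descendant of ``group``, keyed by relative path."""
    out: dict[str, dict[str, Any]] = {}
    for name, node in group.get("children", {}).items():
        path = f"{prefix}{name}"
        # Record the node's own metadata, without its children.
        out[path] = {k: v for k, v in node.items() if k != "children"}
        if node["node_type"] == "group":
            # Recurse so that nested groups and their arrays are included too.
            out.update(consolidate(node, prefix=f"{path}/"))
    return out


consolidated = consolidate(hierarchy)
```

Here `consolidated` contains entries for child-group, child-group/child-arr-1, and child-group/child-arr-2 in addition to the two top-level arrays.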

Ensures that nested children are listed properly.
This PR adds a recursive=True flag to Group.members, for recursively
listing the members of some hierarchy.

This is useful for Consolidated Metadata, which needs to recursively
inspect children. IMO, it's useful (and simple) enough to include
in the public API.
@TomAugspurger TomAugspurger mentioned this pull request Aug 25, 2024
6 tasks
Implements the optional Consolidated Metadata feature of zarr-v3.
@TomAugspurger TomAugspurger force-pushed the user/tom/feature/consolidated-metadata branch from 4943be2 to 65a8bd4 Compare August 25, 2024 18:53
Member

@jhamman jhamman left a comment

Good start here Tom! I have a few questions about parts you haven't implemented yet that should help explore the design a bit more:

  1. How are you thinking about wiring this into read operations (e.g. getitem, listdir, etc.)?
  2. How should we think about consistency w.r.t. the consolidated metadata field? When should we invalidate the consolidated metadata and when is it safe to keep using it?
  3. How can this approach be used to support reading/writing v2 consolidated metadata (.zmetadata)?

@TomAugspurger
Contributor Author

Quick update here:

  • AsyncGroup now uses consolidated metadata in some operations (primarily getitem; I'm going through some more spots now). This means you can do AsyncGroup.getitem(key) to get a child node without any additional I/O.
  • fc901eb added support for reading zarr V2 consolidated metadata. Needs some cleanup and probably more testing, but the basics seem to work.
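As an illustration of the "no additional I/O" point, here is a toy sketch. `CountingStore` and `ToyGroup` are hypothetical, synchronous stand-ins and not the AsyncGroup implementation; the sketch only shows a lookup being answered from an in-memory mapping when consolidated metadata is present.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class CountingStore:
    """Hypothetical store that counts how many reads it serves."""

    data: dict[str, dict[str, Any]]
    reads: int = 0

    def get(self, key: str) -> dict[str, Any]:
        self.reads += 1
        return self.data[key]


@dataclass
class ToyGroup:
    store: CountingStore
    # Mapping of child name -> metadata, or None when nothing was consolidated.
    consolidated: Optional[dict[str, dict[str, Any]]] = None

    def getitem(self, key: str) -> dict[str, Any]:
        if self.consolidated is not None:
            # Served from memory: the store is never touched.
            return self.consolidated[key]
        return self.store.get(f"{key}/zarr.json")


store = CountingStore({"arr/zarr.json": {"node_type": "array"}})

plain = ToyGroup(store)
plain.getitem("arr")
reads_without = store.reads

with_consolidated = ToyGroup(store, consolidated={"arr": {"node_type": "array"}})
with_consolidated.getitem("arr")
reads_with = store.reads - reads_without
```

The second lookup adds nothing to the store's read counter, which is the property the first bullet describes.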

@TomAugspurger
Contributor Author

As I start to use this a bit, I'm rethinking the in-memory representation of consolidated metadata for a nested hierarchy. Specifically the consolidated_metadata.metadata dictionary which maps keys to ArrayMetadata | GroupMetadata objects. Our options are:

  1. A flat structure, where the keys include all path segments from the root group and the length of the root ConsolidatedMetadata.metadata dict equals the number of all descendant nodes (not just immediate children). This matches the representation on disk.
  2. A nested structure, where the keys include just the name (so not the leading segments of the path). The root ConsolidatedMetadata.metadata dict holds just immediate children nodes. Metadata for nested groups can still be accessed, but through the .consolidated_metadata on the children. This matches the logical representation.

The motivation for this rethink is that the current implementation has to be careful about using consolidated_metadata.metadata directly. If you want to propagate consolidated metadata, you need to use the Group.getitem method. See Group.members, where this becomes relevant.

Here's an example:

Given a hierarchy like

root/
  g0/
    c0/
      array0
      array1
    c1/
      array0
      array1
  g1/
    c0/
      array0
      array1
    c1/
      array0
      array1

We'll represent the consolidated metadata as a flat mapping of (store) keys to values.

"g0": {"attributes": ..., "node_type": "group"},
"g1": {"attributes": ..., "node_type": "group"},
"g0/c0": {"attributes": ..., "node_type": "group"},
...
"g0/c0/array0": {"shape": [...], "node_type": "array"},
...
"g1/c1/array1": {"shape": [...], "node_type": "array"}

But in memory, what is the Group.metadata.consolidated_metadata for each of these groups?

Should it match the flat structure on disk, where the keys imply the structure?

# g0
GroupMetadata(
    attributes=...,
    consolidated_metadata=ConsolidatedMetadata(
        metadata={
            "c0": GroupMetadata(...),
            "c0/array0": ArrayMetadata(...),
            ...,
            "c1/array1": ArrayMetadata(...),
        },
    ),
)

Or should it have a nested / tree-like structure, where just immediate children appear in group.metadata.consolidated_metadata.metadata, and nested members can be accessed through that dictionary?

# g0
GroupMetadata(
    attributes=...,
    consolidated_metadata=ConsolidatedMetadata(
        metadata={
            "c0": GroupMetadata(
                attributes=...,
                consolidated_metadata=ConsolidatedMetadata(
                    metadata={
                        "array0": ArrayMetadata(...),
                        "array1": ArrayMetadata(...),
                    }
                ),
            ),
            "c1": GroupMetadata(
                attributes=...,
                consolidated_metadata=ConsolidatedMetadata(
                    metadata={
                        "array0": ArrayMetadata(...),
                        "array1": ArrayMetadata(...),
                    }
                ),
            ),
        },
    ),
)

Right now I've implemented option 1. I'll try out option 2 today.
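For option 2, "nested members can be accessed through that dictionary" could look roughly like the sketch below. `ToyGroupMetadata`, `ToyArrayMetadata`, and `lookup` are hypothetical simplifications; the real GroupMetadata keeps children under consolidated_metadata.metadata, which is flattened to a single `children` mapping here.

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass
class ToyArrayMetadata:
    shape: tuple[int, ...]


@dataclass
class ToyGroupMetadata:
    # Immediate children only; deeper nodes live in the children's own mapping.
    children: dict[str, Union["ToyGroupMetadata", ToyArrayMetadata]] = field(
        default_factory=dict
    )


def lookup(
    group: ToyGroupMetadata, path: str
) -> Union[ToyGroupMetadata, ToyArrayMetadata]:
    """Resolve a path like "c0/array0" by walking one segment at a time."""
    node: Union[ToyGroupMetadata, ToyArrayMetadata] = group
    for segment in path.strip("/").split("/"):
        if not isinstance(node, ToyGroupMetadata):
            # Tried to descend into an array.
            raise KeyError(path)
        node = node.children[segment]
    return node


g0 = ToyGroupMetadata(
    children={
        "c0": ToyGroupMetadata(
            children={
                "array0": ToyArrayMetadata(shape=(2,)),
                "array1": ToyArrayMetadata(shape=(3,)),
            }
        ),
    }
)

array0 = lookup(g0, "c0/array0")
```

With this layout each child group already carries the metadata for its own subtree, so nothing needs to be re-keyed when a child is handed out.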

@d-v-b
Contributor

d-v-b commented Sep 12, 2024

do we have to pick just 1 in-memory representation? Over in pydantic-zarr I do something similar to metadata consolidation, and I use a tree representation or a flat dict[str, Array | Group] representation situationally (see the to_flat and from_flat functions).
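As a rough sketch of what converting between the two representations involves, the function below builds a tree from a flat mapping. It is not pydantic-zarr's API: the name `from_flat` is reused only for readability, the "members" key is a made-up convention, and the sketch assumes every intermediate group has its own entry in the flat mapping.

```python
from typing import Any


def from_flat(flat: dict[str, dict[str, Any]]) -> dict[str, Any]:
    """Build a tree in which each group holds its immediate children under "members"."""
    root: dict[str, Any] = {"node_type": "group", "members": {}}
    # Visit shallow paths first so a parent exists before its children are attached.
    for path in sorted(flat, key=lambda p: p.count("/")):
        *parents, name = path.split("/")
        node = root
        for segment in parents:
            # Raises KeyError if an intermediate group is missing from ``flat``.
            node = node["members"][segment]
        child = dict(flat[path])
        if child.get("node_type") == "group":
            child["members"] = {}
        node["members"][name] = child
    return root


flat = {
    "g0": {"node_type": "group"},
    "g1": {"node_type": "group"},
    "g0/c0": {"node_type": "group"},
    "g0/c0/array0": {"node_type": "array", "shape": [2]},
}
tree = from_flat(flat)
```

In the resulting `tree`, "g0" holds only "c0" as an immediate member, and "array0" is reachable through "c0".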

@TomAugspurger
Contributor Author

TomAugspurger commented Oct 1, 2024

There's a behavior change here that's causing some xarray tests to fail. It comes down to whether we interpret zarr.open_consolidated(store, path="path/to/group") as either:

  1. Open the consolidated metadata at the root of the store (.zmetadata) and then select the (nested) group path/to/group. Or,
  2. Open the consolidated metadata at "path/to/group/"

Zarr 2.x implemented the first one. This branch currently implements the second version.

The rough flow is

  1. create a Group / arrays at some level down in the store: e.g. a/b/group with arrays a/b/group/x, a/b/group/y, ...
  2. Consolidate the store at the root level with zarr.consolidate_metadata(store).
  3. open the (nested) group with zarr.open_consolidated(store=store, path="a/b/group")

I'll see how hard it is to implement what 2.x was doing, but I want to confirm that's the behavior we want.

A limitation of 2.x's way of doing things is that you can only have one consolidated metadata for the entire Store: at the root.
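The difference between the two interpretations can be sketched without zarr at all. Everything below is hypothetical: `consolidated_by_group` stands in for "which groups in the store have consolidated metadata written", and the two functions model interpretations 1 and 2 respectively.

```python
from typing import Any

# Only the root ("") was consolidated, as in step 2 of the flow above.
consolidated_by_group: dict[str, dict[str, dict[str, Any]]] = {
    "": {
        "a": {"node_type": "group"},
        "a/b": {"node_type": "group"},
        "a/b/group": {"node_type": "group"},
        "a/b/group/x": {"node_type": "array"},
        "a/b/group/y": {"node_type": "array"},
    }
}


def open_from_root(path: str) -> dict[str, dict[str, Any]]:
    """Interpretation 1: read the root's consolidated metadata, then select ``path``."""
    root = consolidated_by_group[""]
    path = path.strip("/")
    if path not in root:
        raise KeyError(path)
    prefix = path + "/"
    # Keep only descendants of ``path`` and make their keys relative to it.
    return {k[len(prefix):]: v for k, v in root.items() if k.startswith(prefix)}


def open_at_path(path: str) -> dict[str, dict[str, Any]]:
    """Interpretation 2: read consolidated metadata stored at ``path`` itself."""
    return consolidated_by_group[path.strip("/")]


members = open_from_root("a/b/group")
```

Under interpretation 1 the nested group resolves to its two arrays. Under interpretation 2 the same call fails in this sketch, because nothing was consolidated at a/b/group, which mirrors the failure mode described for the xarray tests.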

@TomAugspurger TomAugspurger added the downstream Downstream libraries using zarr label Oct 9, 2024
Member

@jhamman jhamman left a comment

@TomAugspurger -- this came together so nicely! Way to go 😄

I left a few comments of substance but I think this can go in very soon (tomorrow?)

Consolidated Metadata
=====================

zarr-python implements the `Consolidated Metadata_` extension to the Zarr Spec.
Member

Suggested change
zarr-python implements the `Consolidated Metadata_` extension to the Zarr Spec.
Zarr-Python implements the `Consolidated Metadata_` extension to the Zarr Spec.

metadata reads get child Group or Array nodes will *not* require reads from the store.

In Python, the consolidated metadata is available on the ``.consolidated_metadata``
attribute of the Group.
Member

Suggested change
attribute of the Group.
attribute of the Group metadata.

of the metadata, at the time they read the root node with its consolidated
metadata.

.. _Consolidated Metadata: https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#consolidated-metadata
Member

These docs are great @TomAugspurger! 👏

Comment on lines 315 to 319
async def open_consolidated(*args: Any, use_consolidated: bool = True, **kwargs: Any) -> AsyncGroup:
    """
    Alias for :func:`open_group` with ``use_consolidated=True``.
    """
    return await open_group(*args, use_consolidated=use_consolidated, **kwargs)
Member

Suggested change
async def open_consolidated(*args: Any, use_consolidated: bool = True, **kwargs: Any) -> AsyncGroup:
    """
    Alias for :func:`open_group` with ``use_consolidated=True``.
    """
    return await open_group(*args, use_consolidated=use_consolidated, **kwargs)
async def open_consolidated(*args: Any, use_consolidated: bool = True, **kwargs: Any) -> AsyncGroup:
    """
    Alias for :func:`open_group` with ``use_consolidated=True``.
    """
    return await open_group(*args, use_consolidated=True, **kwargs)

Seems to me that open_consolidated should always use use_consolidated=True

Contributor Author

My goal here is to avoid a user accidentally passing use_consolidated in **kwargs and us silently overwriting it. I'll update this to raise if use_consolidated isn't True.
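A minimal sketch of the "raise if use_consolidated isn't True" idea follows. The `open_group` below is a hypothetical stand-in that only echoes its arguments, and both the exception type (TypeError) and the message wording are assumptions rather than what was merged.

```python
import asyncio
from typing import Any


async def open_group(*args: Any, use_consolidated: Any = None, **kwargs: Any) -> dict[str, Any]:
    """Hypothetical stand-in for the real open_group; it only echoes its arguments."""
    return {"args": args, "use_consolidated": use_consolidated, "kwargs": kwargs}


async def open_consolidated(
    *args: Any, use_consolidated: bool = True, **kwargs: Any
) -> dict[str, Any]:
    # Keeping ``use_consolidated`` as a named parameter means a caller-supplied
    # value is caught here instead of being silently overwritten.
    if use_consolidated is not True:
        raise TypeError(
            "'use_consolidated' must be True in 'open_consolidated'; "
            "use 'open_group' to control it explicitly."
        )
    return await open_group(*args, use_consolidated=use_consolidated, **kwargs)


result = asyncio.run(open_consolidated("store", path="a/b/group"))
```

Calling `open_consolidated("store", use_consolidated=False)` raises in this sketch instead of quietly opening with consolidated metadata anyway.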

Comment on lines 162 to 166
if metadata_dict.get("node_type") != "array":
    # This KeyError is load bearing for `open`. That currently tries
    # to open the node as an `array` and then falls back to opening
    # as a group.
    raise KeyError
Member

perhaps raise NodeTypeValidationError from #2310
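To show why the exception type matters here, the sketch below models the fallback described in the code comment. All three functions are hypothetical stand-ins for the array-then-group flow of `open`; they operate on plain metadata dicts rather than on a store.

```python
from typing import Any


def open_array(metadata_dict: dict[str, Any]) -> tuple[str, dict[str, Any]]:
    if metadata_dict.get("node_type") != "array":
        # The caller below relies on exactly this exception type.
        raise KeyError("node is not an array")
    return ("array", metadata_dict)


def open_group(metadata_dict: dict[str, Any]) -> tuple[str, dict[str, Any]]:
    return ("group", metadata_dict)


def open_node(metadata_dict: dict[str, Any]) -> tuple[str, dict[str, Any]]:
    """Try to open as an array first, then fall back to opening as a group."""
    try:
        return open_array(metadata_dict)
    except KeyError:
        return open_group(metadata_dict)


kind_for_group, _ = open_node({"node_type": "group"})
kind_for_array, _ = open_node({"node_type": "array", "shape": [4]})
```

If `open_array` raised a different exception, such as the NodeTypeValidationError suggested above, the `except` clause in the caller would have to change in the same commit (unless that exception happens to subclass KeyError, which is not assumed here).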

Comment on lines +570 to +572
# We already read zattrs and zgroup. Should we ignore these?
v2_consolidated_metadata.pop(".zattrs")
v2_consolidated_metadata.pop(".zgroup")
Member

In the future, we may choose to inform the caller that the consolidated metadata does not match the root group metadata but for now, this seems fine.
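For context, a Zarr v2 `.zmetadata` document stores one entry per metadata file, keyed by its path, under a top-level "metadata" key. The sketch below uses a small made-up document and groups the remaining entries by node after dropping the root's own `.zattrs` and `.zgroup`. The `nodes` grouping is illustrative only and is not the parsing code in this PR; it also passes a default to `pop`, unlike the diff, so a missing root `.zattrs` does not raise.

```python
import json
from typing import Any

# A made-up v2 consolidated metadata document.
zmetadata = json.loads(
    """
{
  "zarr_consolidated_format": 1,
  "metadata": {
    ".zgroup": {"zarr_format": 2},
    ".zattrs": {"title": "root"},
    "x/.zarray": {"zarr_format": 2, "shape": [4]},
    "x/.zattrs": {"units": "m"},
    "sub/.zgroup": {"zarr_format": 2}
  }
}
"""
)

v2_consolidated_metadata: dict[str, Any] = dict(zmetadata["metadata"])

# The root group's own documents were already read separately, so drop them.
v2_consolidated_metadata.pop(".zattrs", None)
v2_consolidated_metadata.pop(".zgroup", None)

# Group the remaining documents by the node they belong to.
nodes: dict[str, dict[str, Any]] = {}
for key, value in v2_consolidated_metadata.items():
    path, _, filename = key.rpartition("/")
    nodes.setdefault(path, {})[filename] = value
```

After this, `nodes["x"]` holds both the `.zarray` and `.zattrs` documents for the array, and `nodes["sub"]` holds the `.zgroup` document for the child group.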

@@ -82,42 +89,310 @@ def _parse_async_node(node: AsyncArray | AsyncGroup) -> Array | Group:
raise TypeError(f"Unknown node type, got {type(node)}")


def _json_convert(o: object) -> Any:
Contributor Author

Forgot to actually delete this block.

@jhamman jhamman merged commit 3964eab into zarr-developers:v3 Oct 10, 2024
20 checks passed
@jhamman
Member

jhamman commented Oct 10, 2024

Huge @TomAugspurger 🎉 🎉 🎉 !!!

@d-v-b
Contributor

d-v-b commented Oct 10, 2024

awesome work!

Labels
downstream Downstream libraries using zarr V3 Affects the v3 branch
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

V3 Consolidated Metadata
3 participants