
Added consolidated metadata to spec #309

Open
wants to merge 5 commits into main

Conversation

TomAugspurger

@TomAugspurger TomAugspurger commented Aug 23, 2024

This PR adds a new optional field to the core metadata for consolidating the metadata of all child nodes under a single object. The motivation is similar to consolidated metadata in Zarr V2: without consolidated metadata, the time to load metadata for an entire hierarchy scales linearly with the number of nodes. This can be costly, especially for large hierarchies served over HTTP from a remote storage system (like Blob Storage).

The primary points to discuss:

  1. This currently proposes storing the consolidated metadata in the root zarr.json (kind="inline"). For *very* large hierarchies, this will bloat the size of the root zarr.json, slowing down operations that just want to open the metadata for the root.
  2. This overlaps slightly with Explicitly listing groups/arrays inside group metadata? #284, which proposes to store the child paths (but not the full metadata) in the zarr.json. That might be a nice alternative to consolidated metadata for very large hierarchies (you could do an initial read to get the list of nodes, and then load the metadata for each child concurrently).
  3. Like Zarr v2, consolidated metadata introduces the possibility of "inconsistent" metadata between the consolidated and non-consolidated forms. Should the spec take any stance on how to handle this? I've currently worded things to say that readers should always use the consolidated metadata if it's present.
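As a sketch of point 1, a root zarr.json with inline consolidated metadata might look like the following (illustrative only; the exact field names and shape are what this PR is proposing for discussion, and the child entries are made up):

```json
{
    "zarr_format": 3,
    "node_type": "group",
    "attributes": {},
    "consolidated_metadata": {
        "kind": "inline",
        "must_understand": false,
        "metadata": {
            "child_group": { "zarr_format": 3, "node_type": "group" },
            "child_group/array_0": { "zarr_format": 3, "node_type": "array" }
        }
    }
}
```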

Closes #136

xref zarr-developers/zarr-python#2113, which is a completed implementation in zarr-python, and LDeakin/zarrs#55, a draft implementation in zarrs.

@joshmoore
Member

joshmoore commented Aug 27, 2024

Going back to #136 (comment) (to ZEP or not to ZEP), I think this could potentially use some discussion around whether there is an opportunity to make the consolidated metadata one implementation of a wider interface.

The most general abstraction I've been considering is "metadata loader" (similar to the storage transformer that was initially proposed for sharding). Basically, could we have extensible (i.e. you can write your own) ways of looking up metadata that would be registered in the zarr.json?

I owe you a better description, but to get these down on paper before I get inundated again, here are some similar requirements/features that I could see:

  • load metadata from .zmetadata (to be backwards compatible)
  • load metadata from ro-crate-metadata.json currently under investigation in NGFF land
  • load metadata from a database, binary or even encrypted file
  • hell, even yaml.
  • store consolidated metadata per hierarchy level (rather than all at the top), suggested by @DennisHeimbigner
  • ...

In my mind, if one of these alternative mechanisms is active, it could be the sole source of information, rather than duplicating the metadata. Some of these are more useful for user attributes rather than the entire metadata, and certainly several may be more difficult without the concept of the root metadata, but it feels like this is an opportunity for us to go beyond the original consolidated metadata workaround.

@TomAugspurger
Author

Thanks Josh! I like the idea of metadata loading being an extension point.

We'll want to hash out some details before merging this, since a top-level consolidated_metadata key in the zarr.json would clash with this goal. Roughly, I'm hoping we can have something like

// zarr.json
{
  "metadata_loader": {
    "name": "consolidated_metadata",
    "kind": "inline",
    "metadata": { ... }
  }
}

We could have the name be inline_consolidated_metadata and drop kind if we wanted. The key thing is having a field in the serialized representation we could use to know how we should go about loading the rest of the metadata.

I'll play with that a bit to see how it feels in zarr-python.

@d-v-b
Contributor

d-v-b commented Aug 28, 2024

Thanks for working on this! What do you think about multiple instances of metadata consolidation in the same zarr hierarchy? E.g., Group A contains Group B, Group B generates consolidated metadata, then Group A generates consolidated metadata, which ends up containing 2 models of the hierarchy rooted at Group B. Because the consolidated metadata is part of the Group metadata, consolidating recursively will end up with a lot of repeated content. A few options:

  • just let this happen -- nest consolidated metadata at your own risk
  • don't let this happen -- disallow nested metadata consolidation
  • explicitly exclude the consolidated_metadata key from the metadata that gets consolidated. We could achieve this by defining the "metadata" of an array / group to be "everything except the consolidated_metadata key, if present", and then metadata consolidation is defined as an aggregation of said array / group metadata. Or we just say that the consolidated_metadata key should be skipped 🤷 .

The latter option seems best to me; curious to hear other perspectives.
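A minimal sketch of that latter option over plain dicts (a hypothetical helper, not zarr-python's actual API): consolidation aggregates each node's metadata with any pre-existing consolidated_metadata key dropped, so nested consolidation never duplicates content:

```python
def consolidate(nodes: dict[str, dict]) -> dict:
    """Build a consolidated_metadata object from a mapping of
    relative path -> node metadata document, dropping any
    consolidated_metadata key a child may already carry."""
    metadata = {
        path: {k: v for k, v in doc.items() if k != "consolidated_metadata"}
        for path, doc in nodes.items()
    }
    return {"kind": "inline", "must_understand": False, "metadata": metadata}


# Group B was consolidated earlier; consolidating the parent again
# must not embed B's own consolidated copy.
nodes = {
    "b": {
        "zarr_format": 3,
        "node_type": "group",
        "consolidated_metadata": {"metadata": {"x": {}, "y": {}}},
    },
    "b/x": {"zarr_format": 3, "node_type": "array"},
}
cm = consolidate(nodes)
assert "consolidated_metadata" not in cm["metadata"]["b"]
```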

@TomAugspurger
Author

Coming back to @joshmoore's "metadata loader" concept, the thing we need to solve now is what sort of metadata to put in the Zarr Group object to indicate "here's how you find more stuff about this hierarchy". And I'll focus that a bit more to "listing files on blob storage is messy / slow, so I want to avoid that if possible".

I'm going to make a couple of statements that I think are true:

  1. The user has some sort of Group object
  2. That Group object was loaded using some kind of Store, which knows how to interact with the "outside world" that's providing the metadata (files on local disk / object store, a database, objects in memory, etc.)

It's that interface with the store that's tripping me up when trying to generalize this. Why do we need to standardize that in the Zarr Group object? By the time the user has loaded some zarr Group, they have already specified their connection to the metadata store, right? That's how they loaded the group in the first place.

With something like the current proposal, we have a way for the Store concept (reading local / network files, a JSON API endpoint, etc.) to provide additional metadata for the common case of "also give me information about my children".

In short, I think my claim is that the "metadata loader" concept is mostly orthogonal to the consolidated metadata concept. Consolidated metadata is all about reducing the number of trips to the "Store", by giving a place for a Group to hold information about its children. I think I'm comfortable that a top-level consolidated_metadata key in the Group object will not conflict with the desire to load metadata from other sources.


One thing that probably should be addressed now / soon is the idea of storing subsets of the metadata. I'd propose something like a depth field on the consolidated_metadata object, which is an integer stating how far down has been consolidated. I think the two most common will be

  1. None / not specified: Everything is consolidated. Loading any arbitrarily nested child group can be done without additional I/O.
  2. 1: indicating that just my immediate children have been consolidated. Loading an immediate child can be done without I/O, but loading any of its children will require I/O.

@TomAugspurger
Author

sorry for missing your questions @d-v-b.

What do you think about multiple instances of metadata consolidation in the same zarr hierarchy?

I like the framing (which I think was yours originally?) of thinking about consolidated metadata as a cache. Since you can have arbitrarily nested Groups / Arrays, IMO you should be able to have arbitrary consolidated metadata at any level of the hierarchy and the spec shouldn't do anything to dissuade that.

As for the question about potentially duplicating data, I think that we won't have that in zarr-python. Working through your example:

Group A contains Group B, Group B generates consolidated metadata, then Group A generates consolidated metadata, which ends up containing 2 models of the hierarchy rooted at Group B. Because the consolidated metadata is part of the Group metadata, consolidating recursively will end up with a lot of repeated content

I think that we'll have "repeated content" across the files. But we won't have repeated content within a single zarr.json file (i.e. I don't think that the nested groups should repeat their child consolidated_metadata; I call this out below under "note"):

Given an example like

a/     # group
  b/   # group
    x  # array
    y  # array

The zarr.json for group b will look like

{
    "zarr_format": 3,
    "node_type": "group",
    "consolidated_metadata": {
        "metadata": {
            "x": {...},
            "y": {...}
        },
        ...
    }
}

And the zarr.json for a will look like

{
    "zarr_format": 3,
    "node_type": "group",
    "consolidated_metadata": {
        "metadata": {
            "b": {...},  # note: this probably shouldn't have consolidated_metadata
            "b/x": {...},
            "b/y": {...},
        },
        ...
    }
}

Note that this might be different from how zarr v2 treated consolidated metadata. In zarr-developers/zarr-python#2113 (comment), I think I found that zarr v2 only supported consolidated metadata at the root of a Store, not at nested paths within a store. I think this might be an area where we want to diverge from zarr v2.

@TomAugspurger
Author

I think this should be about ready to go.

@LDeakin I saw that you at least started on this at LDeakin/zarrs#55. Anything come up in your implementation that might adjust what you want to see in the spec?

@LDeakin

LDeakin commented Oct 2, 2024

@LDeakin I saw that you at least started on this at LDeakin/zarrs#55. Anything come up in your implementation that might adjust what you want to see in the spec?

Nope, the spec looks good, and no issues popped up. Thanks for writing it.

@joshmoore
Member

Sorry for the slow response, @TomAugspurger. To be clear lest it not come through here, I'm super excited to have CM going into the spec!

TomAugspurger commented 3 weeks ago
It's that interface with the store that's tripping me up when trying to generalize this. Why do we need to standardize that in the Zarr Group object? By the time the user has loaded some zarr Group, they have already specified their connection to the metadata store, right? That's how they loaded the group in the first place.

I definitely agree that this is a tricky (though critical) part of the bootstrapping. The use case I tend to be most concerned with is everything under attributes (i.e., "metadata" is an overloaded term) which is more tractable. Other use cases include: some of the group info is there; and none of the group info is there (i.e. it's solely a redirect). Is it possible that you're saying there's no use case for the some-info use case?

In short, I think my claim is that the "metadata loader" concept is mostly orthogonal to the consolidated metadata concept. Consolidated metadata is all about reducing the number of trips to the "Store", by giving a place for a Group to hold information about its children.

Can you help me see why it's orthogonal? My hope was that that would be one of the things this interface could allow a user to do.

I think I'm comfortable that a top-level consolidated_metadata key in the Group object will not conflict with the desire to load metadata from other sources.

Except then it's already done in one specific way, no? With the metadata, e.g., currently being duplicated?

One thing that probably should be addressed now / soon is the idea of storing subsets of the metadata. I'd propose something like a depth field on the consolidated_metadata object, which is an integer stating how far down has been consolidated. I think the two most common will be...

Agreed that this is another "parameter" for the consolidation, and I imagine there will be more.

I think this should be about ready to go.

Not that I necessarily agree, but what are the remaining (related) steps from your point-of-view? From my side, I'd be for trying to encourage some more feedback from implementers (huge kudos to @LDeakin)

@TomAugspurger
Author

Thanks @joshmoore.

I definitely agree that this is a tricky (though critical) part of the bootstrapping[...]

Sorry, I don't quite understand this set of questions and the "some" / "no" info concepts. Do you have an example of what you have in mind?

Can you help me see why it's orthogonal? My hope was that that would be one of the things this interface could allow a user to do.

To me it just goes back to the original motivation for consolidated metadata: A way for a Group in a Zarr hierarchy to know a bit about its children (to avoid potentially costly repeat trips to a remote Store for common scenarios like "load all the metadata for arrays in this Group / dataset"). How exactly you load that initial Group (your "metadata loader" concept) doesn't really bear on the question of "where do I store information about children", right?

Except then it's already done in one specific way, no? With the metadata, e.g., currently being duplicated, no?

Answered above, I think. All we're specifying here is how to present information about your children; there are no constraints on how that information gets there. In particular, the spec takes no opinion on how the consolidated metadata is actually stored. A database presenting a Zarr API could construct consolidated metadata on the fly if it wanted (and so no duplication).
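That on-the-fly construction could be sketched like this (a toy in-memory source over plain dicts; this is illustrative only and not zarr-python's Store API):

```python
class OnTheFlyStore:
    """Toy metadata source over a dict of documents. The root group's
    consolidated_metadata is synthesized at read time, so nothing is
    stored twice at rest."""

    def __init__(self, docs: dict[str, dict]):
        # e.g. {"zarr.json": {...}, "b/zarr.json": {...}}
        self._docs = docs

    def get(self, key: str) -> dict:
        doc = dict(self._docs[key])
        if key == "zarr.json":
            children = {
                path.removesuffix("/zarr.json"): meta
                for path, meta in self._docs.items()
                if path != "zarr.json"
            }
            doc["consolidated_metadata"] = {
                "kind": "inline",
                "must_understand": False,
                "metadata": children,
            }
        return doc


store = OnTheFlyStore({
    "zarr.json": {"zarr_format": 3, "node_type": "group"},
    "b/zarr.json": {"zarr_format": 3, "node_type": "group"},
})
root = store.get("zarr.json")
assert "b" in root["consolidated_metadata"]["metadata"]
```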

Agreed that this is another "parameter" for the consolidation, and I imagine there will be more.

This is the only one I can imagine now. If you have any others in mind, let me know. I'm not worried about making a depth field backwards compatible, but if there are breaking changes we have in mind, it'd be best to address them now.

what are the remaining (related) steps from your point-of-view?

Nothing else. We're merging the zarr-python implementation today.


Overall, I'd reiterate that this PR has a narrow scope: a standardized way for Groups to know about their children. Effectively, a cache for what you'd get from listing nodes under some prefix, which we already know is needed for serving Zarr data over object storage in a reasonable time.

@joshmoore
Member

Ok. Bear with me @TomAugspurger, I find myself unfortunately playing bad cop across the repo. First some immediate responses to your points and then I'll take a step back and try to explain.


I don't quite understand this set of questions and the "some" / "no" info concepts. Do you have an example of what you have in mind?

Looking back, I think I misunderstood you. Let's hold off on this for the moment.

How exactly you load that initial Group (your "metadata loader" concept) doesn't really bear on the question of "where do I store information about children", right?

I thought it did, but you may have found a hole in the plan. Ultimately, I'm trying to find an underlying abstraction for what you've built. But more on that in a second.

A database presenting a Zarr API could construct consolidated metadata on the fly if it wanted (and so no duplication).

I think this helps me see how the Store is becoming intertwined. Maybe "MetadataLoader" is the wrong metaphor, and "MetadataTransformer" fits better, to go with the "StorageTransformer". (If, however, CM can be achieved with the StorageTransformer, 👍) The difference in my mind is that the loader/mapper/transformer is config to tell us what keys to look up (whether in databases, filesystems, or buckets) rather than the existing JSON objects.

(and so no duplication).

except the current definition would still have 2 keys for the same information even if the storage and/or retrieval has been optimized, right?

we have in mind it'd be best to address them now.

This concerns me, more to that below.

We're merging the zarr-python implementation today.

And in fact, now it is merged. Congrats on that and all the work, but to be fair I will ignore it for this discussion: though a great validation of the work here, it's the cart before the horse.


Ok, finally, to back up:

Overall, I'd reiterate that this PR has a narrow scope: a standardized way for Groups to know about their children.

I understand the goal to get CM in ASAP. However, this is introducing a key into the metadata that could be with us for many years. Spec work is often about looking beyond the immediate scope, especially if we are asking others to implement. So a few, partially overlapping strategies to try not to get in your way:

Broadening the scope

This is what I was trying above and with my suggestion (somewhere?!) of going for a ZEP. By identifying an abstraction that would cover more use cases and that we would all feel comfortable having in https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#extension-points, I was hoping we would have more buy-in for the change. (Obviously, that's a big ask.)

Lowering risk

The addition of extensions to Zarr v3 is still pretty untested. What you are trying to do should be made possible: dev wants to hack out a new feature, implement it and move on. 👍 Your use of "must_understand" does the minimum required by the text of the spec, but what it doesn't do is allow us to change anything about this proposal down the road. The only way forward I can really see is to move to "Yet-Another-Key" if we realize that CM is broken. @rabernat brought up https://stac-extensions.github.io/ as a model that we should match but if you look at https://github.com/radiantearth/stac-spec/blob/master/extensions/README.md#extending-stac, you'll see that some semantic for separating suggestions like yours (there, via key prefixes ex:...) would be needed.

Preparing for change

If we see the need for an evolution of the configuration, it might make sense to introduce versioning internally.

@LDeakin

LDeakin commented Oct 10, 2024

@joshmoore

Your use of "must_understand" does the minimum required by the text of the spec, but what it doesn't do is allow us to change anything about this proposal down the road

This is not entirely true for consolidated_metadata. If a breaking change were introduced, an implementation can fail to understand it and parse it as an unknown metadata field that it need not understand. It can fall back to reading non-consolidated metadata until it is updated to support the new revision.

If we see the need for an evolution of the configuration, it might make sense to introduce versioning internally.

If the configuration changes, an implementation can support that without needing an explicit version key.
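The fallback described above might look like this in a reader (a hypothetical sketch, not any implementation's actual code; "sharded" stands in for some imagined future revision of the field):

```python
KNOWN_KINDS = {"inline"}


def load_children(group_doc: dict):
    """Return consolidated child metadata if we understand it, or None
    to signal 'fall back to reading each child's zarr.json directly'."""
    cm = group_doc.get("consolidated_metadata")
    if not isinstance(cm, dict) or cm.get("kind") not in KNOWN_KINDS:
        # Missing, or a future revision we don't recognize: since the
        # field is optional, we can ignore it rather than error out.
        return None
    return cm["metadata"]


current = {"consolidated_metadata": {"kind": "inline", "metadata": {"x": {}}}}
future = {"consolidated_metadata": {"kind": "sharded", "metadata": {}}}
assert load_children(current) == {"x": {}}
assert load_children(future) is None  # falls back gracefully
assert load_children({}) is None
```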

@TomAugspurger
Author

TomAugspurger commented Oct 10, 2024

(and so no duplication).

except the current definition would still have 2 keys for the same information even if the storage and/or retrieval has been optimized, right?

That's up to whatever storage is providing the Zarr metadata. In the case of the consolidated metadata produced by zarr-python, which would typically be written to a plain file / object store, yes. But that hypothetical database could store it however it wants.

Stepping back: why is duplicating the metadata a problem? That's fundamental to this. Maybe substitute "cache" for "duplicate" and see whether it still causes any concerns.

So a few, partially overlapping strategies to try not to get in your way:

"Broadening the scope" and "Lowering the risk" seem pretty much directly in tension with each other :) I'll push back pretty strongly on broadening the scope; this is deliberately scoped to narrowly solve the "open group and immediately inspect its children" use case, and I think that's a good thing.

As for lowering the risk, I think we can point to zarr-python 2.x's consolidated metadata. It seems to have worked pretty well, and this spec is essentially identical (aside from moving it into the single zarr.json object, along with the attributes). We've already got a lot of history showing that something like this is useful and practical.

The addition of extensions to Zarr v3 is still pretty untested[...]

I think STAC's extension mechanism is much nicer than what Zarr v3 provides today. When I first started on consolidated metadata, I actually started on the extension mechanism. JSON schemas for the core metadata plus a top-level zarr_extensions array would get you most of the way there. I think that's worth doing. But I don't think it needs to block this work, right? That's something that could be done independently. I can write that up as an issue, based on my experience with STAC, if you'd like.

If we see the need for an evolution of the configuration, it might make sense to introduce versioning internally.

I think that would be solved by the zarr spec having a better extension mechanism (previous point) or by the consolidated_metadata object having some kind of version key. I'm fine with either, but would lean slightly towards making a better extension mechanism. It seemed a bit strange to me to have a piece of the core metadata evolve separately from the rest of the metadata, which is why I think an extension system like STAC's makes more sense.

@joshmoore
Member

General 💯 for thinking through the future-proofing with you guys, so super brief comments for 00:00 CET:

If a breaking change were introduced, an implementation can fail...

My gut reaction is to say we're putting too much on the back of must_understand, but I see what you're saying: if this element is essentially optional then implementations can (MUST) be tolerant of changes. It doesn't feel like an optimal strategy though.

"Broadening the scope" and "Lowering the risk" seem pretty much directly in tension with each other :)

Largely, but you could of course draft a broader-scoped extension externally first, too ;)

As for lowering the risk, I think we can point to zarr-python 2.x's consolidated metadata... a lot of history showing that something like this is useful and practical.

But not across implementations, and I have the feeling it took a good deal of heuristics in consuming libraries like xarray to get the creation/detection in place. That a way to tune the metadata loading for different cases is needed is a given, but I would hope we could also improve on it.

JSON schemas for the core metadata plus a top-level zarr_extensions array would get you most of the way there

*nods* certainly something that doesn't give the sense that we might need a series of 3.x versions.

But I don't think it needs to block this work, right?

If we think we can sufficiently lower the risk then agreed, but as currently written my instinct is we're not there yet (though perhaps @LDeakin's suggestion will find traction).

a better extension mechanism

❤️

@TomAugspurger
Author

TomAugspurger commented Oct 11, 2024

That a way to tune the metadata loading for different cases is a given, but I would hope we could also improve on it.

Maybe, but I try not to get too caught up in hypotheticals. We have a very real use case where the layout of Zarr metadata on a remote file system causes issues. So we have a tailored solution for that. Why do we need to complicate it to handle other stuff (stuff that at least I don't have a clear grasp on)? If we come up with something better that does... something related to metadata loading, then great. Let's add that to the spec. And it can use consolidated_metadata to store information about its children (or not, since it's optional).


And just to reiterate again: this is a small change. A change to the core metadata, sure, but it's completely optional. And it really isn't inventing any new concepts, both because the use-case was discovered (and solved!) by users of zarr-python, and because all of the complicated objects going in the metadata object are already in the spec since we're just storing a mapping of key: node.

If that's not enough de-risking, then can you lay out exactly what you're looking for?

@TomAugspurger
Author

#316 has stuff on the extension side. Happy to discuss more there, but I don't think the limitations of Zarr's extension story should bear on this PR.

@joshmoore
Member

The reason I think we're in this discussion is that as things stand this adds a feature (i.e., if we were under semver a minor release). The two paths I see are:

  • make this less than a minor release but that embroils you with the extension story; or
  • get more buy-in from implementers so we have the confidence we need, i.e. a ZEP.

Let me try to give one scenario of its impact:

  • implementation A understands CM and creates a dataset with the configuration
  • implementation B does not grok CM, opens the dataset, modifies metadata
  • implementation A opens the dataset and does not see the modification

This is just a part of caches being hard, but I think it previously was less of an issue since:

  • there were fewer writing implementations; and
  • it was a user-driven option, i.e. had to be passed to the library to activate.

Users were then responsible for cleaning things. But that no longer holds, does it?
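The scenario above can be simulated over plain dicts (a toy model of a store, not either real implementation):

```python
# Toy model: implementation B edits a child's metadata without
# refreshing the parent's consolidated copy.
store = {
    "b/zarr.json": {"node_type": "group", "attributes": {"v": 1}},
    "zarr.json": {
        "node_type": "group",
        "consolidated_metadata": {
            "metadata": {"b": {"node_type": "group", "attributes": {"v": 1}}}
        },
    },
}

# Implementation B (no CM support) updates the child in place.
store["b/zarr.json"]["attributes"]["v"] = 2

# Implementation A trusts the consolidated copy and sees stale data.
cached = store["zarr.json"]["consolidated_metadata"]["metadata"]["b"]
assert cached["attributes"]["v"] == 1
assert store["b/zarr.json"]["attributes"]["v"] == 2
```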

@TomAugspurger
Author

(i.e., if we were under semver a minor release). The two paths I see are:

I avoid semver conversations as a matter of course (aside from linking to https://hynek.me/articles/semver-will-not-save-you/), and I don't know how Zarr versions its spec, but the key thing here is this is backwards compatible.

https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#extension-points does mention that top-level keys can be added with must_understand: False so this should be compatible with that, right?

get more buy-in from implementers so we have the confidence we need, i.e. a ZEP.

We've got two implementations from two very different languages, plus the experience of zarr-python 2.x. I think that's sufficient.

Let me try to give one scenario of its impact:

You don't need to bring in multiple implementations to hit that: you can get in that situation perfectly well using just zarr-python. And I think that's OK.

I don't think the spec can or should have much to say about how objects are actually stored, or how to synchronize concurrent changes among multiple objects. That should be left to the implementations to explore. Looking through https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#id21 and https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#storage it seems like the spec is restrained with respect to Stores and Storage. Let's keep it that way.

it was a user-driven option, i.e. had to be passed to the library to activate.

That's again up to the library. In zarr-python we default to using it if present, and provide APIs for explicitly using it or not.

@joshmoore
Member

I avoid semver conversations as a matter of course

Ha! Probably a good idea, and I don't think it's a natural fit for a spec anyway, but major/minor/patch is a distinction that at least most people understand to simplify discussing the scope of changes.

I don't know how Zarr versions its spec, but the key thing here is this is backwards compatible.

(Per the zarr-python meeting) versioning is definitely an issue, but regardless, as things stand we also have to consider the forwards compatibility.

#extension-points does mention that top-level keys can be added with must_understand: False so this should be compatible with that, right?

It's not wrong for what's written in the spec, but I wonder if the zarr-python conversation just now changed your mind here at all.

We've got two implementations from two very different languages, plus the experience of zarr-python 2.x. I think that's sufficient.

For an extension (with a working extension mechanism) I'd call that more than fair! As things stand, I think I'm certainly still struggling with the fact that merging this is a minor release and so don't consider it sufficient. (Hence my call for a ZEP from the beginning.)

You don't need to bring in multiple implementations to hit that: you can get in that situation perfectly well using just zarr-python. And I think that's OK.

I disagree especially if it's happening silently.

I don't think the spec can or should have much to say about how objects are actually stored, or how to synchronize concurrent changes among multiple objects.

I see what you're saying, but this is why for such a core feature I was so intent on avoiding the need for synchronization!

@TomAugspurger
Author

but the key thing here is this is backwards compatible.

(Per the zarr-python meeting) versioning is definitely an issue, but regardless, as things stand we also have to consider the forwards compatibility.

What exactly is the forwards compatibility issue here?

top-level keys can be added with must_understand: False so this should be compatible with that, right?

It's not wrong for what's written in the spec, but I wonder if from the zarr-python conversation just now changed your mind here at all.

I don't think so, but what do you have in mind?

You don't need to bring in multiple implementations to hit that: you can get in that situation perfectly well using just zarr-python. And I think that's OK.

I disagree especially if it's happening silently.

How do you do that non-silently? The entire point of this consolidated metadata is to avoid looking at other objects in the hierarchy for static file systems. Regardless, this is surely, clearly, obviously out of scope for the spec, right? Why would it need to take a stance on how the underlying file system operates? Leave that up to the implementations.

I get the feeling that we have wildly different priors on the purpose and scope of this change. All I'm proposing is a spot for Groups to hold information about their children. Issues like how to synchronize multiple writes already affect any system built on top of Zarr using a distributed file store without support for transactions spanning multiple objects; they're not unique to consolidated metadata. Use dimension_names? The spec doesn't say anything about having to form a DAG and ensure that arrays used as dimension names are written before the ones referring to them. AFAIK, it doesn't say anything about what order you should write the metadata relative to the chunks. These challenges aren't unique to consolidated metadata.

@joshmoore
Member

I get the feeling that we have wildly different priors on the purpose and scope of this change.

:) Not on this change specifically, @TomAugspurger, it's unfortunately more pervasive. That was my point about the call: we talked through how the immutability of the spec, lack of versioning, and untested extension mechanism make even scoped changes nerve-wracking, and I've made the same mistake here as I did in #312 of not clearly spelling that out upfront -- apologies. You're coming into a loaded situation, and I was trying to help by doing what I would have with a ZEP. I still think my points (abstract interface, versioning, dealing with stale data) would ultimately make this a more robust (hopefully valuable 😄) addition, but as @rabernat describes in #312, what you are looking for at the moment seems to be a proper extension mechanism.

@TomAugspurger
Copy link
Author

TomAugspurger commented Oct 12, 2024

immutability of the spec, lack of versioning, and untested extension mechanism makes even scoped changes nerve-wracking

The changes here use the extension and change mechanism we have. Let's use it!

abstract interface,

I still don't have a great understanding of what exactly that proposal is, but I think #309 (comment) (from a month ago!) still captures my thoughts: that seems separate, and it can use consolidated_metadata to store information about children if it wants.

dealing with stale data

I'm surprised you raised that again... I assumed it was something you raised earlier as a fleeting "this would be nice to have". I had that thought too. But once you dig into the implementation, it should become clear that it's not really feasible for all storage systems. I can go into more detail about it if you want, but:

  • A system that provides optimistic concurrency for Zarr datasets over storage like S3, which don't provide multi-object transactions, can and should exist.
  • That's a very complicated, and different, system than the current Zarr spec.
  • That system can and should use consolidated_metadata for Groups to store information about their children, if only we could get it standardized.

As an analogy: S3 didn't even provide strong read-after-write consistency until a few years ago. If we wanted the spec to take that strong a stance on storage, like "you should be able to read data immediately after writing it", we wouldn't have Zarr in the first place. Let the Zarr spec focus on its core components, like where to place information and what information to put there; let the implementations handle the details of how that actually works in practice.

And now that I'm going, why stop: "stale" metadata has been valuable in Zarr v2. It offers data providers a way to update the metadata for many nodes in a hierarchy atomically.

what you are looking for at the moment seems to be a proper extension mechanism.

I am so not looking for that. I just want to ensure that workflows using xarray can use Zarr. I don't particularly care about Zarr itself.

would ultimately make this a more robust (hopefully valuable 😄) addition

I'm probably getting short, and this aside was probably an attempt to lighten the mood, but do you think the value of consolidated metadata is questionable at this stage? After all the experience with Zarr v2?


I'll try to summarize how I see things:

  1. We have an existing spec with a defined (if not ideal) way of handling changes to the spec (top-level keys with must_understand=False). This PR uses that change mechanism in a way that's compatible with its backwards-compatibility requirements.
  2. We have years of experience with Zarr v2 on the value of consolidated metadata as a concept, and of this implementation in particular.
  3. This PR doesn't really invent anything new: all the node objects in consolidated_metadata already exist in the spec, and we've future-proofed against the one early objection raised in review of this PR: storing the consolidated metadata inline in the Group document, like the user-defined attributes.
  4. We have two implementations. We really don't want zarr-python getting ahead of the spec, and we have the opportunity to keep that from happening with the click of a certain green button.
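Point 1's compatibility claim can be illustrated with a sketch of how a reader might resolve a child node: prefer the consolidated copy when the key is present (as the PR wording suggests readers should), and otherwise fall back to one storage request per node as today. The `store` here is assumed to be a simple dict-like mapping standing in for a real storage backend:

```python
def open_child(store, root_doc, path):
    """Resolve a child node's metadata, preferring the consolidated copy.

    `store` maps storage keys to parsed zarr.json documents; a real
    implementation would issue an HTTP/object-store request instead.
    """
    cm = root_doc.get("consolidated_metadata")
    if cm is not None and cm.get("kind") == "inline":
        # Consolidated metadata present: answer from the root document,
        # no extra storage round-trip needed.
        return cm["metadata"][path]
    # Key absent (or an implementation that ignores it, since
    # must_understand is False): one request per node, as today.
    return store[f"{path}/zarr.json"]
```

An implementation that never learns about `consolidated_metadata` simply always takes the second branch, which is why the change is compatible with existing readers.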

So, given all that, how do we move forward here?

@joshmoore
Copy link
Member

So, given all that, how do we move forward here?

Following #312 (comment), I'm hesitant to try my standard strategy, so I'll try to be blunt because it seems I haven't been communicating effectively: I'm strongly of the opinion that this should be a ZEP in order to gain wider endorsement and review, or (for the moment) should be a true extension not defined in the core metadata.

I'm surprised you raised that again... I assumed it was something you raised earlier because it was a fleeting "this would be nice to have". ... not really feasible for all storage systems.

Sorry, it wasn't fleeting, no. Minimally, for example, I'd spell out the semantics and expectations for an ideal storage system.

Let the Zarr spec focus on the core components of the spec like where to place information, what information to put there

Agreed. And even there, purely at the information level, I see the potential for corruption. (FWIW, I've pondered defrag- and invalidation-like steps, but none of them are really tractable.) I'd be much more comfortable if CM owned the metadata, i.e. it's less a cache and more a choice that you make when saving, e.g., read-only data, because you know your access patterns.

I just want ensure that workflows using xarray and can use Zarr. I don't particularly care about Zarr itself.

I appreciate that you are trying to fix an upstream library, but a large part of my concern is that the larger picture, including other implementations, is not being taken into account.

would ultimately make this a more robust (hopefully valuable 😄) addition

I'm probably getting short, and this aside was probably an attempt to lighten the mood, but do you think the value of consolidated metadata questionable at this stage? After all the experience with zarr V2?

Apologies, yes. That was an attempt at levity, and to be clear, the "more" applied to both, i.e. "make this more robust and hopefully more valuable as well."

I know how important consolidated metadata is; there's data waiting to be converted; libraries waiting to update; science waiting to be done.

But being blunt again, its initial implementation was written "without particular care to Zarr", and that's something I've struggled with in my other ecosystems. So perhaps I'm being overly cautious, but I'm concerned that I can't represent the voices of all the other implementations (@LDeakin, thanks for joining in!), and that's the real purpose of ZEPs -- to make sure that you are taking the wider community along with you.

@TomAugspurger
Copy link
Author

this should be a ZEP in order to gain wider endorsement and review, or (for the moment) should be a true extension not defined in the core metadata.

FWIW, as an outsider, the state of https://github.com/zarr-developers/zeps doesn't give me a ton of confidence that it's a functioning system. That impression, plus the sentence in the spec saying that the metadata can't contain other fields (before I understood what this must_understand thing was saying), led me to push it here.

I'd spell out the semantics and expectations for an ideal storage system.

Maybe do that for the existing spec, and we can incorporate it here? There won't be anything unique to consolidated metadata.

I'd be much more comfortable if CM owned the metadata

That sounds like a separate proposal.

that is a large part of my concern that the larger picture, including other implementations, is not being taken into account.

As long as we're discussing the larger picture: keep in mind that this experience hasn't been especially friendly to a newcomer. I'm used to developing open-source projects and establishing consensus, so it's fine. But I could easily see other newcomers being pushed away, given a similar experience.

including other implementations

And again, this change is compatible with existing implementations.


Who are the other implementers to reach out to? I see teams for @zep1-editors, @zep1-reviewers, @core-devs, @implementation-council. Which should I ping? implementation-council seems pretty on point; did I miss that in the contributing guidelines? I don't see that text in the repo.

@joshmoore
Copy link
Member

My time to respond now is unfortunately limited since I'm about to get on a bike for a week, so I'll focus on the most critical points:

doesn't give me a ton of confidence

Definitely a lot of that at the moment. See zarr-developers/governance#44 for more.

keep in mind that this experience hasn't been especially friendly to a newcomer

Very fair feedback.

I don't see that text in the repo.

I would hope that would have been more obvious starting from https://zarr.dev/zeps/active/ZEP0000.html

@TomAugspurger
Copy link
Author

My time to respond now is unfortunately limited since I'm about to get on a bike for a week,

Thanks, enjoy!
