Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[v3] support write_empty_chunks #2409

Open
Tracked by #2412
jhamman opened this issue Oct 18, 2024 · 3 comments · May be fixed by #2429
Open
Tracked by #2412

[v3] support write_empty_chunks #2409

jhamman opened this issue Oct 18, 2024 · 3 comments · May be fixed by #2429
Assignees
Labels
bug Potential issues with the zarr-python library V3 Affects the v3 branch
Milestone

Comments

@jhamman
Copy link
Member

jhamman commented Oct 18, 2024

Zarr version

3.0.0.beta

Numcodecs version

0.13

Python Version

3.11

Operating System

Mac

Installation

pip

Description

write_empty_chunks was a very useful write-time optimization that we should bring back to 3.0

https://zarr.readthedocs.io/en/stable/tutorial.html#empty-chunks

Steps to reproduce

In [23]: a = zarr.create(shape=(10, 10), chunks=(5, 5), write_empty_chunks
    ...: =True)
/Users/jhamman/miniforge3/envs/icechunk-demo/lib/python3.12/site-packages/zarr/core/sync.py:100: RuntimeWarning: write_empty_chunks is not yet implemented

Additional output

No response

@jhamman jhamman added the bug Potential issues with the zarr-python library label Oct 18, 2024
@jhamman jhamman added this to the 3.0.0 milestone Oct 18, 2024
@jhamman jhamman added the V3 Affects the v3 branch label Oct 18, 2024
@d-v-b d-v-b self-assigned this Oct 21, 2024
@d-v-b d-v-b linked a pull request Oct 22, 2024 that will close this issue
6 tasks
@d-v-b
Copy link
Contributor

d-v-b commented Oct 23, 2024

I opened #2429 to address this issue, but I think we should maybe take some time here to discuss what kind of API we want before moving forward with that PR. I think write_empty_chunks is just one of many possible runtime properties that we might want to support, and we should probably settle on a coherent strategy for all of them.

problems with runtime stuff as simple Array attributes.

The zarr-python 2 solution to this problem was to create a runtime attribute of the Array class, namely Array.write_empty_chunks, that would influence the the chunk-writing behavior of a given array. I have a few problems with this approach:

  • You can only specify the write_empty_chunks parameter when you are creating a zarr array with a function call. If you are using something like Group.__getitem__ / __setitem__ to get an array handle, then there's no way to specify write_empty_chunks (or any other runtime properties of the array).
  • Once you created your array, you were locked in to a specific setting of write_empty_chunks. The attribute itself could of course be mutated after calling __init__ but this is underhanded stuff we want to avoid. There was no easy way to create a new array handle with a different write_empty_chunks attribute.

putting runtime stuff in a global config

Those are my main complaints with the simple approach we took in zarr-python v2. In #2429 I'm opting for a very different approach by making write_empty_chunks a global config option that can be locally overridden in the context of individual array write operations. Instead of making write_empty_chunks an attribute of an array, it's an attribute of an array writing context. I see advantages to this approach, but it might also be net bad -- it might not match the way users think about write_empty_chunks. In particular, if the mental model is that a zarr array is "write-empty-chunks-typed", then disconnecting write_empty_chunks from array instances will be a frustrating violation of expectations.

another idea: Array-local config

What if we do something in between -- we give arrays a config or runtime attribute that's scoped to an individual array instance. A few behaviors I'm imagining:

  • Array.config is an immutable dataclass that can be defined from a dict.
    • The config attribute can evolve without cluttering the array creation constructor
  • If the array config is not specified, one is created from the global config values.
  • An array config can be defined for groups; unless overridden, arrays created from these groups will inherit that config. This solves the problem of controlling the write_empty_chunks attribute of arrays retrieved via group.__getitem__.
  • Once created, arrays ignore the global config values and always use the config defined on the array instance. This reduces surprising "action at a distance" for array operations.
  • Array handles with different config values are cheap to create, e.g. array.with_config({...}), which returns a new array object with a different config arrangement.

Does this seem reasonable? If so, I would steer #2429 in this direction. If not, I'm curious to hear other ideas.

@jhamman
Copy link
Member Author

jhamman commented Oct 23, 2024

Here's the situation I was objecting to in #2429:

with config.set({'array.write_empty_chunks': True}):
    arr = zarr.create(shape=(20, 20), chunks=(5, 5), fill_value=0)

arr[:] = 0  # will this use write_empty_chunks or not? how would I know?

Ultimately, I think the user experience is best if they have a way to check how an array is configured. Array.write_empty_chunks (or Array.config.write_empty_chunks as proposed below) provides that ability. In general, I like config for defaults but we should try to make it possible for users to override the config context by directly passing things like write_empty_chunks to the Array constructor.


tldr; I like the idea of a Array.config dataclass. I suspect there will be more options like this as we grow applications and extensions.

Array handles with different config values are cheap to create, e.g. array.with_config({...}), which returns a new array object with a different config arrangement.

I think we could hold off on this part of the proposal for now if you want reduce scope. Not saying its

@d-v-b
Copy link
Contributor

d-v-b commented Oct 23, 2024

Here's the situation I was objecting to in #2429:

with config.set({'array.write_empty_chunks': True}):
    arr = zarr.create(shape=(20, 20), chunks=(5, 5), fill_value=0)

arr[:] = 0  # will this use write_empty_chunks or not? how would I know?

In a universe where we take the path of #2429, users with zarr-python 2 experience would have to change their expectations about how write_empty_chunks works; instead of expecting a write_empty_chunks array attribute, they would need to expect that "write empty chunksness" is a runtime context thing, and so the flow would look like this:

arr1 = zarr.create(shape=(20, 20), chunks=(5, 5), fill_value=0)
arr2 = zarr.create(shape=(20, 20), chunks=(5, 5), fill_value=1)

default =  config.get('array.write_empty_chunks')
print(f'the current default for write_empy_chunks is {default}')
arr1[:] = 0  # this will use the current default

# this will use the opposite of the default, for BOTH array writes
with config.set({'array.write_empty_chunks': not default}):
  arr1[:] = 0  
  arr2[:] = 1

I'm not advocating that we do this, just sketching out how it would work. There are definitely pros and cons, and perhaps the largest is requiring a new mental model for array runtime properties. Maybe the array.config idea gives us a nice balance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Potential issues with the zarr-python library V3 Affects the v3 branch
Projects
Status: Todo
Development

Successfully merging a pull request may close this issue.

2 participants