Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial Sharding Prototype #1

Closed
wants to merge 5 commits into from
Closed

Initial Sharding Prototype #1

wants to merge 5 commits into from

Conversation

jstriebel
Copy link

@jstriebel jstriebel commented Nov 17, 2021

Internal only: Please provide any feedback you find, the below part is just the description for the PR on the official zarr repo.


This PR is for an early prototype of sharding support, as described in the corresponding issue TODO. It mainly be used to discuss the overall implementation approach for sharding. This PR is not meant to be merged.

This prototype

  • allows to specify shards as the number of chunks that should be contained in a shard (e.g. using arr.zeros((20, 3), chunks=(3, 3), shards=(2, 2), …)).
    One shard corresponds to one storage key, but can contain multiple chunks:
    sharding
  • ensures that this setting is persisted in the .zarray config and loaded when opening an array again,
  • adds a ShardedStore class that is used to wrap the chunk-store when sharding is enabled. This store handles the grouping of multiple chunks to one shard and transparently reads and writes them via the inner store. The original store API does not need to be adapted, it just stores shards instead of chunks, which are translated back to chunks by ShardedStore.
  • adds a small script chunking_test.py for demonstration purposes, this will not be part of the final PR.

If the overall direction of this PR is pursued, the following steps (and possibly more) are missing:

  • Functionality
    • Group reads and writes by their shards in the store by implementing getitems, setitems & delitems on the ShardedStore
      (also document such optimization possibilities on the Store or BaseStore class)
    • Group chunk-wise operations in Array where possible (e.g. in digest & _resize_nosync)
    • Consider locking mechanisms to guard against concurrency issues within a shard
    • Allow partial reads and writes if the wrapped store supports them
    • Allow compressed chunks by adding jumptable at the start of shards (only if chunk-sizes aren't constant)
    • Implement key deletion and cleanup of chunks which are completely deleted (tricky for uncompressed data)
    • Add support for prefixes before the chunk-dimensions in the storage key, e.g. for arrays that are contained in a group
    • Add warnings for inefficient reads/writes (might be configured)
    • Maybe allow to configure the order of chunks in a shard (C/Fortran/Morton-order)
    • Maybe use the new partial read method on the Store also for the current PartialReadBuffer usage (to detect if this is possible and reading via it)
  • Tests
    • Add unit tests and/or doctests in docstrings
    • Test coverage is 100% (Codecov passes)
  • Documentation
    • Add docstrings and API docs for any new/modified user-facing classes and functions
    • New/modified features documented in docs/tutorial.rst
    • Changes documented in docs/release.rst

@jstriebel jstriebel self-assigned this Nov 17, 2021
chunking_test.py Outdated
import zarr

store = zarr.DirectoryStore("data/chunking_test.zarr")
z = zarr.zeros((20, 3), chunks=(3, 3), shards=(2, 2), store=store, overwrite=True, compressor=None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shards is specified in units of chunks?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes 👍

Copy link
Member

@philippotto philippotto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very promising to me 👍

zarr/core.py Outdated Show resolved Hide resolved
zarr/core.py Show resolved Hide resolved
zarr/util.py Outdated Show resolved Hide resolved
zarr/core.py Outdated Show resolved Hide resolved
zarr/core.py Outdated Show resolved Hide resolved
zarr/_storage/sharded_store.py Outdated Show resolved Hide resolved
zarr/_storage/sharded_store.py Outdated Show resolved Hide resolved
zarr/_storage/sharded_store.py Outdated Show resolved Hide resolved
zarr/_storage/sharded_store.py Outdated Show resolved Hide resolved
zarr/_storage/sharded_store.py Outdated Show resolved Hide resolved
@jstriebel
Copy link
Author

jstriebel commented Dec 3, 2021

Closing this, all further action is happening at the original zarr-developers repo

@jstriebel jstriebel closed this Dec 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants