allow creating references for empty archival datasets #260

keewis · 2024-10-17T15:31:27Z

Uninitialized variables don't have chunks, but can have a fill value set:

import h5netcdf

with h5netcdf.File("test.nc", mode="w") as f:
    f.dimensions = {"x": 100, "y": 200}
    f.create_variable("var", dimensions=("x", "y"), dtype="int64", fillvalue=100, chunks=(50, 100))

xarray opens this file just fine, but kerchunk doesn't return any chunk references for it. virtualizarr then falls back to constructing the variable from the fill value, which fails because the variable is not actually 0D. A variable without chunks usually indicates a problem with the file, so I chose to raise an error (it doesn't have to be an error, though, so I'm not sure this is the best way forward).

Tests added
Tests passing
Full type hint coverage
Changes are documented in docs/releases.rst

mdsumner · 2024-10-17T15:41:47Z

Another example, fwiw, is this sea ice product: it started in 1978 and initially only had data every 2 days, but became daily later. The old binary files just skipped a day, but the recent-ish netCDF rewrite filled the blanks with these empty files. They only have the dims, attributes, and the dummy crs variable (which, now that I think about it, will probably be fine for virtual-ref).

https://n5eil01u.ecs.nsidc.org/PM/NSIDC-0051.002/1978.10.27/NSIDC0051_SEAICE_PS_S25km_19781027_v2.0.nc

that needs earthdata creds, so zipped and attached:

NSIDC0051_SEAICE_PS_S25km_19781027_v2.0.nc.zip

But, I wanted to put it out there because these empty files do occur for various reasons.

keewis · 2024-10-17T16:58:28Z

> I wanted to put it out there because these empty files do occur for various reasons.

thanks for confirming that these are not just broken files.

In that case I wonder how to best support these: in theory, writing .zarray / .zattrs should be enough, since zarr will also use the fill_value to fill missing chunks. In other words, how do we generate a ManifestArray that does not contain chunks (i.e., a size-0 ManifestArray, but with the shape / chunksizes from zarray)?
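To illustrate why metadata alone could be enough, here is a minimal sketch of the fill-value behaviour described above. It is not zarr's actual implementation; `read_with_fill` and its parameters are hypothetical names, and the stored chunks are modelled as a plain dict keyed by chunk index.

```python
import numpy as np


def read_with_fill(chunks_present, shape, chunk_shape, fill_value, dtype):
    """Assemble an array from stored chunks; absent chunks become fill_value.

    Hypothetical helper for illustration only, not part of zarr.
    """
    out = np.full(shape, fill_value, dtype=dtype)
    for index, block in chunks_present.items():
        # translate the chunk index into a slice of the full array
        region = tuple(
            slice(i * c, min((i + 1) * c, s))
            for i, c, s in zip(index, chunk_shape, shape)
        )
        # trim the block in case the chunk overhangs the array edge
        trimmed = block[tuple(slice(0, r.stop - r.start) for r in region)]
        out[region] = trimmed
    return out


# no chunks stored at all: every element is the fill value
empty = read_with_fill(
    {}, shape=(100, 200), chunk_shape=(50, 100), fill_value=100, dtype="int64"
)

# one stored chunk: only that region differs from the fill value
partial = read_with_fill(
    {(0, 1): np.zeros((50, 100), dtype="int64")},
    shape=(100, 200),
    chunk_shape=(50, 100),
    fill_value=100,
    dtype="int64",
)
```

The shape, chunk shape, and fill value match the h5netcdf example at the top of the thread.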

TomNicholas · 2024-10-17T17:13:11Z

> Size-0 ManifestArray

ManifestArrays are just 3 numpy arrays in a trenchcoat, and you can have length-0 numpy arrays right? So this should be possible, but might need some special casing.

We should also check if size-0 Zarr arrays are possible.

mdsumner · 2024-10-17T17:19:48Z

Also, fwiw, as a todo for me: the GDAL autotest suite has some metadata-only examples, and I wanted to explore how xarray treats them, as opposed to zarr-python itself, in case there is some misalignment in how GDAL should behave too:

https://github.com/OSGeo/gdal/tree/master/autotest/gdrivers/data/zarr/array_attrs.zarr

(xarray is fine with the empty-but-for-scalar-var netcdf, but not with the empty GDAL zarr)

keewis · 2024-10-17T17:42:13Z

> https://github.com/OSGeo/gdal/tree/master/autotest/gdrivers/data/zarr/array_attrs.zarr

There are multiple issues with that (I think): .zarray is not valid JSON (it uses single quotes instead of double quotes for the dtype), and !b1 is unknown to numpy.dtype (remove the exclamation mark, maybe?).

Once those are fixed, xarray uses zarr.open_group to open files, which complains about the directory being an array, not a group. To fix that, you'd have to move .zarray and .zattrs into a subdirectory (var) and create a .zgroup file one level up. And finally, xarray needs the _ARRAY_DIMENSIONS attribute to be set and contain dimension names.
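The layout described above can be sketched with the standard library alone. This is a rough illustration, not the GDAL test data: the array shape, the dimension names `y` / `x`, and the helper name `write_metadata_only_store` are made up, and `"|b1"` is used as the Zarr v2 spelling of a boolean dtype.

```python
import json
import tempfile
from pathlib import Path


def write_metadata_only_store(root: Path) -> None:
    """Write a Zarr v2 group holding one array that has metadata but no chunks.

    Hypothetical helper; shape and dimension names are placeholders.
    """
    # .zgroup sits one level above the array directory
    (root / ".zgroup").write_text(json.dumps({"zarr_format": 2}))

    var = root / "var"
    var.mkdir()
    zarray = {
        "zarr_format": 2,
        "shape": [2, 3],
        "chunks": [2, 3],
        "dtype": "|b1",  # valid JSON string, no exclamation mark
        "fill_value": None,
        "order": "C",
        "compressor": None,
        "filters": None,
    }
    (var / ".zarray").write_text(json.dumps(zarray))
    # xarray looks up dimension names in this attribute
    (var / ".zattrs").write_text(json.dumps({"_ARRAY_DIMENSIONS": ["y", "x"]}))


root = Path(tempfile.mkdtemp()) / "array_attrs_fixed.zarr"
root.mkdir()
write_metadata_only_store(root)
```

Whether xarray then opens the result without further changes has not been checked here.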

keewis · 2024-10-17T18:11:21Z

I've repurposed this PR to instead allow reading and writing variables without chunks (detected by no chunks and ndim > 0).

This appears to work properly, but I need help with the typing of ChunkManifest.empty: why does mypy think np.array((), dtype=np.dtypes.StringDType) is of type ndarray[Any, dtype[Any]]? Do I need to use cast for that?

TomNicholas · 2024-10-17T18:16:18Z

I think you might need to cast. I don't think numpy's handling of generics like this fully works yet. See also the PR I recently merged that fixed some similar typing errors.
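For reference, a minimal sketch of the `typing.cast` pattern. To keep it runnable on older NumPy versions it uses the fixed-width `np.str_` dtype as a stand-in for `np.dtypes.StringDType`; that substitution is an assumption of this example, not what the PR does.

```python
from typing import Any, cast

import numpy as np

# cast(target_type, value): the type comes first, the value second.
# The target type is given as a string so it is never evaluated at runtime.
paths = cast(
    "np.ndarray[Any, np.dtype[np.str_]]",
    np.array((), dtype=np.str_),
)
```

`cast` only informs the type checker; at runtime it returns the value unchanged, so the array itself is unaffected.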

virtualizarr/manifests/manifest.py

TomNicholas

So how do these get represented in the manifest? Size-0 numpy arrays? If so have you tried concatenating them or doing any other operations to make sure that ManifestArray doesn't break?

virtualizarr/readers/kerchunk.py

keewis · 2024-10-18T12:03:48Z

after some more investigation, I believe we won't be able to use entries={} (nor size-0 arrays) as-is: we need to somehow pass the chunk grid (which exists, there's just no data we can refer to) to ChunkManifest. One option would be to follow this comment:

VirtualiZarr/virtualizarr/manifests/manifest.py, lines 112 to 114 in e6407e0:

    # TODO should we actually optionally pass chunk grid shape in,
    # in case there are not enough chunks to give correct idea of full shape?
    shape = get_chunk_grid_shape(entries.keys())

and explicitly pass the chunk grid shape. This would allow us to translate entries={} to the suggestion below.

Another option would be to construct paths / offsets / lengths with the actual shape, but the values in paths would be the missing value marker (the na_object parameter to np.dtypes.StringDType) – then the concatenation would immediately work, but manifest.dict() would have to skip over any entry where the path is missing.
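The first option needs the chunk grid shape, which can be derived from the array shape and chunk shape without looking at any entries. A minimal sketch, assuming regular chunking; `chunk_grid_shape` is a hypothetical helper, not VirtualiZarr's `get_chunk_grid_shape`:

```python
def chunk_grid_shape(shape, chunks):
    """Number of chunks along each axis, counting a partial trailing chunk."""
    # -(-s // c) is integer ceiling division
    return tuple(-(-s // c) for s, c in zip(shape, chunks))


grid = chunk_grid_shape((100, 200), (50, 100))
ragged = chunk_grid_shape((10,), (3,))
```

Here `grid` is `(2, 2)` for the example variable at the top of the thread, and `ragged` is `(4,)` because the last chunk is only partially filled.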

keewis · 2024-10-18T12:38:06Z

I've done both (looks like path == "" already meant missing chunk, so I'm just using that), tell me what you think
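A rough sketch of the second approach, with `path == ""` as the missing-chunk marker. The function names, the object dtype for paths, and the dotted key format are assumptions made for illustration; the actual `ChunkManifest` code may differ.

```python
import numpy as np


def empty_manifest_arrays(grid_shape):
    """Build paths / offsets / lengths with the full chunk grid shape.

    Every path is "" (missing), so no chunk refers to any data.
    """
    paths = np.full(grid_shape, "", dtype=object)
    offsets = np.zeros(grid_shape, dtype="uint64")
    lengths = np.zeros(grid_shape, dtype="uint64")
    return paths, offsets, lengths


def manifest_to_dict(paths, offsets, lengths):
    """Convert the three arrays to a dict, skipping missing chunks."""
    entries = {}
    for index in np.ndindex(paths.shape):
        path = paths[index]
        if path == "":
            continue  # missing chunk: leave it out of the dict
        key = ".".join(str(i) for i in index)
        entries[key] = {
            "path": path,
            "offset": int(offsets[index]),
            "length": int(lengths[index]),
        }
    return entries


paths, offsets, lengths = empty_manifest_arrays((2, 2))
as_dict = manifest_to_dict(paths, offsets, lengths)
```

With every path missing, `as_dict` is an empty dict even though the arrays keep the `(2, 2)` grid shape.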

virtualizarr/readers/kerchunk.py

TomNicholas · 2024-10-18T14:05:55Z

virtualizarr/tests/test_manifests/test_array.py

+        assert all(
+            len_chunk <= len_arr
+            for len_arr, len_chunk in zip(expanded.shape, expanded.chunks)
+        )
+        assert expanded.manifest.dict() == {}


Really nice test!

TomNicholas · 2024-10-18T14:08:18Z

So basically the concatenation works okay because under the hood the manifest still contains numpy arrays of the correct shape, they just have path='' for all elements?

keewis · 2024-10-18T14:09:10Z

yes, that's it
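A small NumPy illustration of that point, using plain object arrays as stand-ins for the manifest's path array (an assumption of this sketch): grids filled with `""` keep their shape under concatenation, whereas size-0 arrays carry no chunk grid at all.

```python
import numpy as np

# two all-missing path grids of 2 x 2 chunks each
a = np.full((2, 2), "", dtype=object)
b = np.full((2, 2), "", dtype=object)
combined = np.concatenate([a, b], axis=0)

# size-0 arrays concatenate too, but the result says nothing about the grid
zero = np.empty((0, 0), dtype=object)
combined_zero = np.concatenate([zero, zero], axis=0)
```

`combined` has shape `(4, 2)`, matching the chunk grid of the concatenated variable, while `combined_zero` stays at `(0, 0)`.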

TomNicholas · 2024-10-18T14:10:16Z

Great. That's presumably less efficient than not storing them explicitly, but it should be robust.

keewis · 2024-10-18T14:11:19Z

probably faster, too, because we don't need to special-case empty chunk manifests (the memory footprint would be somewhat higher, I guess).

TomNicholas

This is great. Ready to merge?

TomNicholas · 2024-10-18T14:14:41Z

I guess maybe a note in the docs/releases.rst changelog.

TomNicholas · 2024-10-18T15:28:11Z

Thanks @keewis!

keewis added 2 commits October 17, 2024 15:49

raise a more user-friendly error for empty variables

3632372

add a test to make sure the error is raised

8a96480

keewis temporarily deployed to test-release October 17, 2024 15:32 — with GitHub Actions Inactive

keewis added 2 commits October 17, 2024 20:01

create a empty manifest array instead

8e5752c

also allow writing empty chunk manifests

ac700a9

keewis changed the title ~~improve the error message for empty archival datasets~~ allow creating references for empty archival datasets Oct 17, 2024

keewis temporarily deployed to test-release October 17, 2024 18:05 — with GitHub Actions Inactive

try using an annotated variable

bea33b7

keewis temporarily deployed to test-release October 17, 2024 18:38 — with GitHub Actions Inactive

explicitly cast instead

69c5ecb

keewis temporarily deployed to test-release October 17, 2024 18:40 — with GitHub Actions Inactive

switch the order of cast arguments

56e624e

keewis temporarily deployed to test-release October 17, 2024 18:42 — with GitHub Actions Inactive

TomNicholas reviewed Oct 17, 2024

virtualizarr/manifests/manifest.py (outdated)

use the main constructor instead

308e6ef

keewis temporarily deployed to test-release October 17, 2024 19:17 — with GitHub Actions Inactive

forgotten call of ChunkManifest.empty

2d4a9b3

keewis temporarily deployed to test-release October 17, 2024 19:23 — with GitHub Actions Inactive

TomNicholas added the references generation Reading byte ranges from archival files label Oct 17, 2024

TomNicholas reviewed Oct 17, 2024

virtualizarr/readers/kerchunk.py

mdsumner mentioned this pull request Oct 18, 2024

HDF5 support for compound datasets, character string datasets OSGeo/gdal#1348

Open

explanatory comment

624265f

keewis temporarily deployed to test-release October 18, 2024 10:00 — with GitHub Actions Inactive

check that broadcasting works

07518af

keewis temporarily deployed to test-release October 18, 2024 10:38 — with GitHub Actions Inactive

Merge branch 'main' into error-mismatching-shape

2335e9b

keewis temporarily deployed to test-release October 18, 2024 10:41 — with GitHub Actions Inactive

keewis added 3 commits October 18, 2024 14:37

use empty arrays instead of 0-sized if shape given

5033b57

pass the chunk grid shape for all empty chunk manifests

b61fb17

don't allow empty chunks if no chunk grid shape given

bd12745

keewis temporarily deployed to test-release October 18, 2024 12:38 — with GitHub Actions Inactive

keewis commented Oct 18, 2024

virtualizarr/readers/kerchunk.py (outdated)

move ujson to top-level

3ede926

keewis temporarily deployed to test-release October 18, 2024 12:45 — with GitHub Actions Inactive

TomNicholas reviewed Oct 18, 2024

replace the manual floor division

31dfe33

keewis temporarily deployed to test-release October 18, 2024 14:10 — with GitHub Actions Inactive

TomNicholas approved these changes Oct 18, 2024

keewis added 2 commits October 18, 2024 17:09

release note

8abd22c

fix a couple of changelog entries

03b9d9e

keewis temporarily deployed to test-release October 18, 2024 15:12 — with GitHub Actions Inactive

TomNicholas merged commit 7053bc0 into zarr-developers:main Oct 18, 2024
10 checks passed

keewis deleted the error-mismatching-shape branch October 18, 2024 15:30

keewis mentioned this pull request Oct 18, 2024

Split kerchunk reader up #261

Merged

7 tasks

sharkinsspatial mentioned this pull request Oct 21, 2024

[Draft] Non-kerchunk backend for HDF5/netcdf4 files. #87

Open
