Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix conversion of Zarr string dataset to HDF5 #1171

Open
wants to merge 4 commits into
base: dev
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 5 additions & 2 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,9 @@
- Adjusted stacklevel of warnings to point to user code when possible. @rly [#1166](https://github.com/hdmf-dev/hdmf/pull/1166)
- Improved "already exists" error message when adding a container to a `MultiContainerInterface`. @rly [#1165](https://github.com/hdmf-dev/hdmf/pull/1165)

### Bug fixes
- Fixed bug when converting string datasets from Zarr to HDF5. @oruebel [#1171](https://github.com/hdmf-dev/hdmf/pull/1171)

## HDMF 3.14.3 (July 29, 2024)

### Enhancements
Expand All @@ -22,10 +25,10 @@ is available on build (during the write process), but not on read of a dataset f
- Warn when unexpected keys are present in specs. @rly [#1134](https://github.com/hdmf-dev/hdmf/pull/1134)
- Support appending to zarr arrays. @mavaylon1 [#1136](https://github.com/hdmf-dev/hdmf/pull/1136)
- Support specifying "value" key in DatasetSpec. @rly [#1143](https://github.com/hdmf-dev/hdmf/pull/1143)
- Add support for numpy 2. @rly [#1139](https://github.com/hdmf-dev/hdmf/pull/1139)
- Added support for numpy 2. @rly [#1139](https://github.com/hdmf-dev/hdmf/pull/1139)

### Bug fixes
- Fix iterator increment causing an extra +1 added after the end of completion. @CodyCBakerPhD [#1128](https://github.com/hdmf-dev/hdmf/pull/1128)
- Fixed iterator increment causing an extra +1 added after the end of completion. @CodyCBakerPhD [#1128](https://github.com/hdmf-dev/hdmf/pull/1128)

## HDMF 3.14.1 (June 6, 2024)

Expand Down
6 changes: 6 additions & 0 deletions src/hdmf/backends/hdf5/h5tools.py
Original file line number Diff line number Diff line change
Expand Up @@ -923,6 +923,12 @@
# TODO: These values exist, but I haven't solved them yet
# binary
# number

# Use text dtype for Zarr datasets of strings. Zarr stores variable length strings
# as objects, so we need to detect this special case here
if hasattr(data, 'attrs') and 'zarr_dtype' in data.attrs and data.attrs['zarr_dtype'] == 'str':
return cls.__dtypes['text']

Check warning on line 930 in src/hdmf/backends/hdf5/h5tools.py

View check run for this annotation

Codecov / codecov/patch

src/hdmf/backends/hdf5/h5tools.py#L930

Added line #L930 was not covered by tests
Comment on lines +929 to +930
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we add:

try:
    from zarr import Array as ZarrArray
    import numcodecs
    ZARR_INSTALLED = True
except ImportError:
    ZARR_INSTALLED = False

def _is_zarr_array(value):
    return ZARR_INSTALLED and isinstance(value, ZarrArray)

Can we change these lines to:

        if _is_zarr_array(data) and data.filters:
            if numcodecs.VLenUTF8() in data.filters:
                return cls.__dtypes['text']
            elif numcodecs.VLenBytes() in data.filters:
                return cls.__dtypes['ascii']

It would be nice if HDF5IO can accept any Zarr array just like it accepts any Numpy array, instead of accepting only Zarr arrays that have been written with hdmf-zarr. The Zarr array filters should show that the array dtype is a variable-length string and its encoding.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That said, I don't seem to be able to create a test case where these lines (regardless of yours or mine) matters. Could you add one?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note from discussion: this probably can be removed and was there to be overly cautious. Ryan TODO: try a conversion and confirm.


dtype = cls.__resolve_dtype_helper__(dtype)
if dtype is None:
dtype = cls.get_type(data)
Expand Down
12 changes: 10 additions & 2 deletions src/hdmf/build/objectmapper.py
Original file line number Diff line number Diff line change
Expand Up @@ -209,10 +209,18 @@
if (isinstance(value, np.ndarray) or
(hasattr(value, 'astype') and hasattr(value, 'dtype'))):
if spec_dtype_type is _unicode:
ret = value.astype('U')
if hasattr(value, 'attrs') and 'zarr_dtype' in value.attrs:
# Zarr stores strings as objects, so we cannot convert to unicode dtype
ret = value

Check warning on line 214 in src/hdmf/build/objectmapper.py

View check run for this annotation

Codecov / codecov/patch

src/hdmf/build/objectmapper.py#L214

Added line #L214 was not covered by tests
else:
ret = value.astype('U')
ret_dtype = "utf8"
elif spec_dtype_type is _ascii:
ret = value.astype('S')
if hasattr(value, 'attrs') and 'zarr_dtype' in value.attrs:
# Zarr stores strings as objects, so we cannot convert to unicode dtype
ret = value

Check warning on line 221 in src/hdmf/build/objectmapper.py

View check run for this annotation

Codecov / codecov/patch

src/hdmf/build/objectmapper.py#L221

Added line #L221 was not covered by tests
else:
ret = value.astype('S')
ret_dtype = "ascii"
else:
dtype_func, warning_msg = cls.__resolve_numeric_dtype(value.dtype, spec_dtype_type)
Expand Down
Loading