Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix conversion of Zarr string dataset to HDF5 #1171

Open
wants to merge 4 commits into
base: dev
Choose a base branch
from

Conversation

oruebel
Copy link
Contributor

@oruebel oruebel commented Aug 15, 2024

Motivation

Fix hdmf-dev/hdmf-zarr#211

Zarr does not natively support variable length strings, instead the data is stored as objects with an encoder. When converting from Zarr to HDF5 the ObjectMapper tries to change the dtype in some cases for text datasets to unicode, which in turn can cause reading the data from Zarr to fail. This PR changes this behavior to: 1) not change the dtype for Zarr string datasets, and 2) detect the correct dtype to use in HDF5 in the backend.

Checklist

  • Did you update CHANGELOG.md with your changes?
  • Does the PR clearly describe the problem and the solution?
  • Have you reviewed our Contributing Guide?
  • Does the PR use "Fix #XXX" notation to tell GitHub to close the relevant issue numbered XXX when the PR is merged?

Copy link

codecov bot commented Aug 15, 2024

Codecov Report

Attention: Patch coverage is 25.00000% with 6 lines in your changes missing coverage. Please review.

Project coverage is 88.83%. Comparing base (316ec4b) to head (2b90d21).

Files Patch % Lines
src/hdmf/build/objectmapper.py 33.33% 2 Missing and 2 partials ⚠️
src/hdmf/backends/hdf5/h5tools.py 0.00% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##              dev    #1171      +/-   ##
==========================================
- Coverage   88.88%   88.83%   -0.06%     
==========================================
  Files          45       45              
  Lines        9835     9841       +6     
  Branches     2795     2798       +3     
==========================================
  Hits         8742     8742              
- Misses        776      779       +3     
- Partials      317      320       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@oruebel oruebel requested a review from rly August 15, 2024 08:02
@oruebel
Copy link
Contributor Author

oruebel commented Aug 15, 2024

@rly could you please take a look to see if this is right way to address this issue

Comment on lines +929 to +930
if hasattr(data, 'attrs') and 'zarr_dtype' in data.attrs and data.attrs['zarr_dtype'] == 'str':
return cls.__dtypes['text']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we add:

try:
    from zarr import Array as ZarrArray
    import numcodecs
    ZARR_INSTALLED = True
except ImportError:
    ZARR_INSTALLED = False

def _is_zarr_array(value):
    return ZARR_INSTALLED and isinstance(value, ZarrArray)

Can we change these lines to:

        if _is_zarr_array(data) and data.filters:
            if numcodecs.VLenUTF8() in data.filters:
                return cls.__dtypes['text']
            elif numcodecs.VLenBytes() in data.filters:
                return cls.__dtypes['ascii']

It would be nice if HDF5IO can accept any Zarr array just like it accepts any Numpy array, instead of accepting only Zarr arrays that have been written with hdmf-zarr. The Zarr array filters should show that the array dtype is a variable-length string and its encoding.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That said, I don't seem to be able to create a test case where these lines (regardless of yours or mine) matters. Could you add one?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note from discussion: this probably can be removed and was there to be overly cautious. Ryan TODO: try a conversion and confirm.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]: NWB Zarr to HDMF export fails
2 participants