improve html representation of datasets #1100
base: dev
Conversation
for more information, see https://pre-commit.ci
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff            @@
##              dev    #1100      +/-   ##
==========================================
- Coverage   89.03%   88.96%   -0.07%
==========================================
  Files          45       45
  Lines        9883     9932      +49
  Branches     2813     2824      +11
==========================================
+ Hits         8799     8836      +37
- Misses        767      774       +7
- Partials      317      322       +5

☔ View full report in Codecov by Sentry.
This looks great! Thanks for the PR.
Could you add tests for the data html representation with hdf5 and zarr? I think we mainly have string equivalence tests for this kind of thing.
I'm also wondering if it would be nice to have the hdf5 dataset info displayed in a similar table format as the zarr arrays to make it more consistent across backends. I think we should be able to replicate this using the hdf5 dataset info as an input to a method like this: https://github.com/zarr-developers/zarr-python/blob/9d046ea0d2878af7d15b3de3ec3036fe31661340/zarr/util.py#L402
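As a rough illustration of that idea, a helper along these lines could take a plain dict of dataset properties (collected from either backend) and render it as a small HTML table, loosely mimicking the zarr array info report. The function names and the exact set of properties here are hypothetical, not part of zarr or HDMF:

    from html import escape

    def dataset_info_html(info: dict) -> str:
        # Render a {property: value} dict as a two-column HTML table.
        rows = "".join(
            f"<tr><th>{escape(str(key))}</th><td>{escape(str(value))}</td></tr>"
            for key, value in info.items()
        )
        return f"<table>{rows}</table>"

    def hdf5_dataset_info(dataset: "h5py.Dataset") -> dict:
        # Collect roughly the same fields that zarr reports for its arrays.
        return {
            "Type": "HDF5 Dataset",
            "Data type": str(dataset.dtype),
            "Shape": dataset.shape,
            "Chunk shape": dataset.chunks,
            "Compressor": dataset.compression,
        }

The same dataset_info_html helper could then be fed either the hdf5 info dict or the items from a zarr array's info, which would keep the two tables visually consistent.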
@stephprince I tried the following:

    import numpy as np
    from hdmf.container import Container

    container = Container(name="Container")
    container.__fields__ = {
        "name": "data",
        "description": "test data",
    }
    test_data = np.array([1, 2, 3, 4, 5])
    setattr(container, "data", test_data)
    container.fields

But the data is not added as a field. How can I move forward?
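For context, HDMF fields are normally declared as a class attribute on a Container subclass rather than set on an instance, since __fields__ is processed when the class is created. A minimal sketch of that pattern, under the assumption that a throwaway subclass is acceptable for the test (the class name is hypothetical):

    import numpy as np
    from hdmf.container import Container

    class MyContainer(Container):
        # __fields__ is read at class-creation time; setting it on an instance has no effect.
        __fields__ = ({"name": "data", "doc": "test data"},)

    container = MyContainer(name="Container")
    container.data = np.array([1, 2, 3, 4, 5])
    print(container.fields)  # expected: {'data': array([1, 2, 3, 4, 5])}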
Related:
I added the handling for division by zero. Also, check out what happens with external files (like Video). From this example:

    import remfile
    import h5py
    from pynwb import NWBHDF5IO

    # `dandiset` is assumed to have been created earlier with the dandi client
    asset_path = "sub-CSHL049/sub-CSHL049_ses-c99d53e6-c317-4c53-99ba-070b26673ac4_behavior+ecephys+image.nwb"
    recording_asset = dandiset.get_asset_by_path(path=asset_path)
    url = recording_asset.get_content_url(follow_redirects=True, strip_query=True)
    file_path = url

    # Stream the remote HDF5 file instead of downloading it
    rfile = remfile.File(file_path)
    file = h5py.File(rfile, "r")

    io = NWBHDF5IO(file=file, mode="r")
    nwbfile = io.read()
    nwbfile
There are still some failing tests for different Python versions; it looks like one of the reasons is that h5py only added the attribute we rely on in more recent versions.
I'm not sure if there's another way to access that information or if we would just want to optionally display it if available.
Checking
src/hdmf/container.py (Outdated)
    if isinstance(array, h5py.Dataset):
        hdf5_dataset = array
        chunks = hdf5_dataset.chunks
        compression = hdf5_dataset.compression
        uncompressed_size = hdf5_dataset.nbytes
        compression_opts = hdf5_dataset.compression_opts
        compressed_size = hdf5_dataset.id.get_storage_size()
        compression_ratio = uncompressed_size / compressed_size if compressed_size != 0 else "undefined"

        head = "HDF5 Dataset"
        hdf5_info_dict = {"chunks": chunks, "compression": compression, "compression_opts": compression_opts,
                          "compression_ratio": compression_ratio}
        backend_info_dict = {**basic_array_info_dict, **hdf5_info_dict}

    if hasattr(array, "store") and hasattr(array, "shape"):  # Duck typing for zarr array
        head = "Zarr Array"
        zarr_info_dict = {k: v for k, v in array.info_items()}
        backend_info_dict = zarr_info_dict
It would be nice to avoid having logic that is specific to a particular I/O backend in the Container. The reason is that this inhibits implementing backends in a self-contained manner and in stand-alone packages, and requires updating many places throughout HDMF.
The checks for h5py.Dataset and Zarr.array are really only relevant when a file was read from disk. To help disentangle the dependencies, I'm wondering whether we could do the following:
- Add a static method generate_dataset_html to HDMFIO that would then need to be implemented by HDF5IO and ZarrIO.
- In the Container you could then do something like:
    read_io = self.get_read_io()  # if the Container was read from file, this will give you the IO object that read it
    if read_io is not None:
        html_repr = read_io.generate_dataset_html(my_dataset)
    else:
        # The file was not read from disk so the dataset should be a numpy array or a list
        ...
It would be nice to avoid having logic that is specific to a particular I/O backend in the Container. The reason is that this inhibits implementing backends in a self-contained manner and in stand-alone packages, and requires updating many places throughout HDMF.

I see, yes, it would be nice if we could decouple this. On the other hand, right now, if they do implement their own backend they will just lose the representation for datasets, which is not critical.

The checks for h5py.Dataset and Zarr.array are really only relevant when a file was read from disk.

Is it? Right now in the test we are passing an hdf5 dataset as data without writing the nwbfile, to test the display. Is this not intended?
This proposal seems very good:

    read_io = self.get_read_io()  # if the Container was read from file, this will give you the IO object that read it
    if read_io is not None:
        html_repr = read_io.generate_dataset_html(my_dataset)
    else:
        # The file was not read from disk so the dataset should be a numpy array or a list
        ...
I see two downsides:
- Missing the extensive representation for in-memory files (it is nice to know what you will write!).
- Fragmenting the code base for html representations.
Is there any other backend in the works right now? If not, maybe we can do it the simpler way and add the complexity once we are closer to needing it?
I think it would be worth giving this a try here. If it works for HDF5, we can then easily move the logic for Zarr to hdmf-zarr. I don't think it should be too hard to make this work right now, but these things tend to get hard to change later on.
What about handling in-memory objects as well?
In-memory-only objects (i.e., numpy arrays and lists) can be handled here in the Container class.
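As a small sketch of what that could look like, the Container class could collect basic information for in-memory data without any backend involvement. The helper name and the exact fields are illustrative assumptions, not the final implementation:

    import numpy as np

    def basic_array_info(array) -> dict:
        # Works for both numpy arrays and plain Python lists.
        if isinstance(array, np.ndarray):
            return {
                "Data type": str(array.dtype),
                "Shape": array.shape,
                "Array size": f"{array.nbytes} bytes",
            }
        return {"Data type": type(array).__name__, "Length": len(array)}

    basic_array_info(np.arange(10))  # numpy array branch
    basic_array_info([1, 2, 3])      # list branch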
We can then follow your approach in another PR to add backend-related information extracted through the DataIO objects.

Could you point me to the PR you are referring to? I'm not sure what role DataIO plays for this PR.
I mean that we can implement the following strategy in another PR to add backend-specific information:

    read_io = self.get_read_io()  # if the Container was read from file, this will give you the IO object that read it
    if read_io is not None:
        html_repr = read_io.generate_dataset_html(my_dataset)
    else:
        # The file was not read from disk so the dataset should be a numpy array or a list
        ...
Yes that looks good 👍. Just to avoid confusion, read_io is an instance of HDMFIO (i.e., HDF5IO or ZarrIO) and not DataIO. To implement the logic we would then need to:
- Add HDMFIO.generate_dataset_html(dataset), which would implement just a minimalist representation.
- Implement HDF5IO.generate_dataset_html(h5py.Dataset) to represent h5py.Dataset.
- In a separate PR on hdmf_zarr, implement ZarrIO.generate_dataset_html(Zarr.array).
To simplify this implementation and generate consistent representations, we could make a utility function that takes information about a dataset (e.g., name, shape, data type, etc.) as input and generates the html representation, such that the individual generate_dataset_html methods on the I/O backends would just collect the information from the dataset and use the utility function to generate the actual html.
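A rough sketch of how that split could look. The method and helper names follow the discussion above, but the exact signatures and the property sets are assumptions, not the final API:

    def generate_array_html_repr(info: dict, head: str) -> str:
        # Shared utility: turn a dict of dataset properties into an HTML snippet.
        rows = "".join(f"<tr><th>{k}</th><td>{v}</td></tr>" for k, v in info.items())
        return f"<p>{head}</p><table>{rows}</table>"

    class HDMFIO:
        @staticmethod
        def generate_dataset_html(dataset) -> str:
            # Minimalist fallback used when a backend does not override this method.
            info = {"Data type": type(dataset).__name__}
            if hasattr(dataset, "shape"):
                info["Shape"] = dataset.shape
            return generate_array_html_repr(info, head="Dataset")

    class HDF5IO(HDMFIO):
        @staticmethod
        def generate_dataset_html(dataset: "h5py.Dataset") -> str:
            # HDF5-specific details are collected here instead of in Container.
            info = {
                "Data type": str(dataset.dtype),
                "Shape": dataset.shape,
                "Chunk shape": dataset.chunks,
                "Compression": dataset.compression,
            }
            return generate_array_html_repr(info, head="HDF5 Dataset")

With this layout, Container only needs to call read_io.generate_dataset_html(dataset), and hdmf-zarr can provide its own override without any zarr-specific code living in HDMF.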
Yes that looks good 👍 . Just to avoid confusion, read_io is an instance of HDMFIO (i.e., HDF5IO or ZarrIO) and not DataIO. To implement the logic we would then need to.
Yes, I realized afterwards that I was confusing these two objects.
Sounds good. Thanks for your hard work on this PR and the fruitful discussion.
It can be estimated from the dtype and the number of elements. I will do that when the attribute does not exist.
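For reference, a minimal sketch of that estimate, assuming the dataset exposes shape and dtype (as h5py datasets do); the helper name is hypothetical:

    import math
    import numpy as np

    def estimated_nbytes(dataset) -> int:
        # Prefer .nbytes when available; older h5py versions do not provide it,
        # so fall back to dtype itemsize times the number of elements.
        if hasattr(dataset, "nbytes"):
            return dataset.nbytes
        return math.prod(dataset.shape) * np.dtype(dataset.dtype).itemsize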
@stephprince when you have time, can you review this?
Motivation
Improve the display of the data in the html representation of containers. Note that this PR is focused on datasets that were already written. For the in-memory representation, the issue of what to do with things that are wrapped in an iterator or a DataIO subtype can be addressed in another PR, I think.

How to test the behavior?
HDF5
I have been using this script
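For reference, a self-contained variant of such a script might look like this; the file and container names are illustrative, not the exact script used here:

    import numpy as np
    from hdmf.common import DynamicTable, VectorData, get_manager
    from hdmf.backends.hdf5 import HDF5IO

    # Build a small table with one data column and write it to HDF5.
    table = DynamicTable(
        name="test_table",
        description="table for testing the html representation",
        columns=[VectorData(name="values", description="some values", data=np.arange(1000))],
    )
    with HDF5IO("test_html_repr.h5", manager=get_manager(), mode="w") as io:
        io.write(table)

    # Read it back so the column data is an h5py.Dataset, then inspect the html repr.
    io = HDF5IO("test_html_repr.h5", manager=get_manager(), mode="r")
    read_table = io.read()
    print(read_table._repr_html_())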
Zarr
Checklist
- Did you update CHANGELOG.md with your changes?