improve html representation of datasets #1100
base: dev
Conversation
for more information, see https://pre-commit.ci
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff            @@
##              dev    #1100      +/-   ##
==========================================
- Coverage   89.03%   88.96%   -0.07%
==========================================
  Files          45       45
  Lines        9883     9932      +49
  Branches     2813     2824      +11
==========================================
+ Hits         8799     8836      +37
- Misses        767      774       +7
- Partials      317      322       +5

☔ View full report in Codecov by Sentry.
This looks great! Thanks for the PR.
Could you add tests for the data html representation with hdf5 and zarr? I think we mainly have string equivalence tests for this kind of thing.
I'm also wondering if it would be nice to have the hdf5 dataset info displayed in a similar table format as the zarr arrays to make it more consistent across backends. I think we should be able to replicate this using the hdf5 dataset info as an input to a method like this: https://github.com/zarr-developers/zarr-python/blob/9d046ea0d2878af7d15b3de3ec3036fe31661340/zarr/util.py#L402
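As a rough illustration of that idea, a helper along these lines could take a plain dict of dataset properties (collected from either backend) and render it as a small HTML table, loosely mimicking the zarr array info report. The function names and the exact set of properties here are hypothetical, not part of zarr or HDMF:

    from html import escape

    def dataset_info_html(info: dict) -> str:
        # Render a {property: value} dict as a two-column HTML table.
        rows = "".join(
            f"<tr><th>{escape(str(key))}</th><td>{escape(str(value))}</td></tr>"
            for key, value in info.items()
        )
        return f"<table>{rows}</table>"

    def hdf5_dataset_info(dataset: "h5py.Dataset") -> dict:
        # Collect roughly the same fields that zarr reports for its arrays.
        return {
            "Type": "HDF5 Dataset",
            "Data type": str(dataset.dtype),
            "Shape": dataset.shape,
            "Chunk shape": dataset.chunks,
            "Compressor": dataset.compression,
        }

The same dataset_info_html helper could then be fed either the hdf5 info dict or the items from a zarr array's info, which would keep the two tables visually consistent.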
@stephprince I tried the following:

    import numpy as np
    from hdmf.container import Container

    container = Container(name="Container")
    container.__fields__ = {
        "name": "data",
        "description": "test data",
    }
    test_data = np.array([1, 2, 3, 4, 5])
    setattr(container, "data", test_data)
    container.fields

But the data is not added as a field. How can I move forward?
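For context, HDMF fields are normally declared as a class attribute on a Container subclass rather than set on an instance, since __fields__ is processed when the class is created. A minimal sketch of that pattern, under the assumption that a throwaway subclass is acceptable for the test (the class name is hypothetical):

    import numpy as np
    from hdmf.container import Container

    class MyContainer(Container):
        # __fields__ is read at class-creation time; setting it on an instance has no effect.
        __fields__ = ({"name": "data", "doc": "test data"},)

    container = MyContainer(name="Container")
    container.data = np.array([1, 2, 3, 4, 5])
    print(container.fields)  # expected: {'data': array([1, 2, 3, 4, 5])}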
Related:
I added the handling for division by zero. Also, check out what happens with external files (like Video). From this example:

    import remfile
    import h5py
    from pynwb import NWBHDF5IO

    # `dandiset` is assumed to have been created earlier with the dandi client
    asset_path = "sub-CSHL049/sub-CSHL049_ses-c99d53e6-c317-4c53-99ba-070b26673ac4_behavior+ecephys+image.nwb"
    recording_asset = dandiset.get_asset_by_path(path=asset_path)
    url = recording_asset.get_content_url(follow_redirects=True, strip_query=True)
    file_path = url

    # Stream the remote HDF5 file instead of downloading it
    rfile = remfile.File(file_path)
    file = h5py.File(rfile, "r")

    io = NWBHDF5IO(file=file, mode="r")
    nwbfile = io.read()
    nwbfile
There are still some failing tests for different Python versions; it looks like one of the reasons is that h5py only added the attribute we rely on in more recent versions.
I'm not sure if there's another way to access that information or if we would just want to optionally display it if available.
Checking
src/hdmf/container.py (Outdated)
    if isinstance(array, h5py.Dataset):
        hdf5_dataset = array
        chunks = hdf5_dataset.chunks
        compression = hdf5_dataset.compression
        uncompressed_size = hdf5_dataset.nbytes
        compression_opts = hdf5_dataset.compression_opts
        compressed_size = hdf5_dataset.id.get_storage_size()
        compression_ratio = uncompressed_size / compressed_size if compressed_size != 0 else "undefined"

        head = "HDF5 Dataset"
        hdf5_info_dict = {"chunks": chunks, "compression": compression, "compression_opts": compression_opts,
                          "compression_ratio": compression_ratio}
        backend_info_dict = {**basic_array_info_dict, **hdf5_info_dict}

    if hasattr(array, "store") and hasattr(array, "shape"):  # Duck typing for zarr array
        head = "Zarr Array"
        zarr_info_dict = {k: v for k, v in array.info_items()}
        backend_info_dict = zarr_info_dict
It would be nice to avoid having logic that is specific to a particular I/O backend in the Container. The reason is that this inhibits implementing backends in a self-contained manner and in stand-alone packages, and requires updating many places throughout HDMF.
The checks for h5py.Dataset and Zarr.array are really only relevant when a file was read from disk. To help disentangle the dependencies, I'm wondering whether we could do the following:
- Add a static method generate_dataset_html to HDMFIO that would then need to be implemented by HDF5IO and ZarrIO.
- In the Container you could then do something like:
    read_io = self.get_read_io()  # if the Container was read from file, this will give you the IO object that read it
    if read_io is not None:
        html_repr = read_io.generate_dataset_html(my_dataset)
    else:
        # The file was not read from disk so the dataset should be a numpy array or a list
        ...
It would be nice to avoid having logic that is specific to a particular I/O backend in the Container. The reason is that this inhibits implementing backends in a self-contained manner and in stand-alone packages, and requires updating many places throughout HDMF.

I see, yes, it would be nice if we could decouple this. On the other hand, right now, if they do implement their own backend they will just lose the representation for datasets, which is not critical.

The checks for h5py.Dataset and Zarr.array are really only relevant when a file was read from disk.

Is it? Right now in the test we are passing an hdf5 dataset as data without writing the nwbfile, to test the display. Is this not intended?
This proposal seems very good:

    read_io = self.get_read_io()  # if the Container was read from file, this will give you the IO object that read it
    if read_io is not None:
        html_repr = read_io.generate_dataset_html(my_dataset)
    else:
        # The file was not read from disk so the dataset should be a numpy array or a list
        ...
I see two downsides:
- Missing the extensive representation for in-memory files (it is nice to know what you will write!).
- Fragmenting the code base for html representations.
Is there any other backend in the works right now? If not, maybe we can do it the simpler way and add the complexity once we are closer to needing it?
I think it would be worth giving this a try here. If it works for HDF5, we can then easily move the logic for Zarr to hdmf-zarr. I don't think it should be too hard to make this work right now, but these things tend to get hard to change later on.
What about handling in-memory objects as well?
In-memory-only objects (i.e., numpy arrays and lists) can be handled here in the Container class.
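As a small sketch of what that could look like, the Container class could collect basic information for in-memory data without any backend involvement. The helper name and the exact fields are illustrative assumptions, not the final implementation:

    import numpy as np

    def basic_array_info(array) -> dict:
        # Works for both numpy arrays and plain Python lists.
        if isinstance(array, np.ndarray):
            return {
                "Data type": str(array.dtype),
                "Shape": array.shape,
                "Array size": f"{array.nbytes} bytes",
            }
        return {"Data type": type(array).__name__, "Length": len(array)}

    basic_array_info(np.arange(10))  # numpy array branch
    basic_array_info([1, 2, 3])      # list branch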
We can then follow your approach in another PR to add backend-related information extracted through the DataIO objects.

Could you point me to the PR you are referring to? I'm not sure what role DataIO plays for this PR.
I mean that we can implement the following strategy in another PR to add backend-specific information:

    read_io = self.get_read_io()  # if the Container was read from file, this will give you the IO object that read it
    if read_io is not None:
        html_repr = read_io.generate_dataset_html(my_dataset)
    else:
        # The file was not read from disk so the dataset should be a numpy array or a list
        ...
Yes that looks good 👍. Just to avoid confusion, read_io is an instance of HDMFIO (i.e., HDF5IO or ZarrIO) and not DataIO. To implement the logic we would then need to:
- Add HDMFIO.generate_dataset_html(dataset), which would implement just a minimalist representation.
- Implement HDF5IO.generate_dataset_html(h5py.Dataset) to represent h5py.Dataset.
- In a separate PR on hdmf_zarr, implement ZarrIO.generate_dataset_html(Zarr.array).
To simplify this implementation and generate consistent representations, we could make a utility function that takes information about a dataset (e.g., name, shape, data type, etc.) as input and generates the html representation, such that the individual generate_dataset_html methods on the I/O backends would just collect the information from the dataset and use the utility function to generate the actual html.
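A rough sketch of how that split could look. The method and helper names follow the discussion above, but the exact signatures and the property sets are assumptions, not the final API:

    def generate_array_html_repr(info: dict, head: str) -> str:
        # Shared utility: turn a dict of dataset properties into an HTML snippet.
        rows = "".join(f"<tr><th>{k}</th><td>{v}</td></tr>" for k, v in info.items())
        return f"<p>{head}</p><table>{rows}</table>"

    class HDMFIO:
        @staticmethod
        def generate_dataset_html(dataset) -> str:
            # Minimalist fallback used when a backend does not override this method.
            info = {"Data type": type(dataset).__name__}
            if hasattr(dataset, "shape"):
                info["Shape"] = dataset.shape
            return generate_array_html_repr(info, head="Dataset")

    class HDF5IO(HDMFIO):
        @staticmethod
        def generate_dataset_html(dataset: "h5py.Dataset") -> str:
            # HDF5-specific details are collected here instead of in Container.
            info = {
                "Data type": str(dataset.dtype),
                "Shape": dataset.shape,
                "Chunk shape": dataset.chunks,
                "Compression": dataset.compression,
            }
            return generate_array_html_repr(info, head="HDF5 Dataset")

With this layout, Container only needs to call read_io.generate_dataset_html(dataset), and hdmf-zarr can provide its own override without any zarr-specific code living in HDMF.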
Yes that looks good 👍 . Just to avoid confusion, read_io is an instance of HDMFIO (i.e., HDF5IO or ZarrIO) and not DataIO. To implement the logic we would then need to.
Yes, I realized afterwards that I was confusing these two objects.
Sounds good. Thanks for your hard work on this PR and the fruitful discussion.
It can be estimated from the dtype and the number of elements. I will do that when the attribute does not exist.
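For reference, a minimal sketch of that estimate, assuming the dataset exposes shape and dtype (as h5py datasets do); the helper name is hypothetical:

    import math
    import numpy as np

    def estimated_nbytes(dataset) -> int:
        # Prefer .nbytes when available; older h5py versions do not provide it,
        # so fall back to dtype itemsize times the number of elements.
        if hasattr(dataset, "nbytes"):
            return dataset.nbytes
        return math.prod(dataset.shape) * np.dtype(dataset.dtype).itemsize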
@stephprince when you have time, can you review this?
Motivation
Improve the display of the data in the html representation of containers. Note that this PR is focused on datasets that were already written. For the in-memory representation, the issue of what to do with things that are wrapped in an iterator or a DataIO subtype can be addressed in another PR, I think.

How to test the behavior?
HDF5
I have been using this script
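For reference, a self-contained variant of such a script might look like this; the file and container names are illustrative, not the exact script used here:

    import numpy as np
    from hdmf.common import DynamicTable, VectorData, get_manager
    from hdmf.backends.hdf5 import HDF5IO

    # Build a small table with one data column and write it to HDF5.
    table = DynamicTable(
        name="test_table",
        description="table for testing the html representation",
        columns=[VectorData(name="values", description="some values", data=np.arange(1000))],
    )
    with HDF5IO("test_html_repr.h5", manager=get_manager(), mode="w") as io:
        io.write(table)

    # Read it back so the column data is an h5py.Dataset, then inspect the html repr.
    io = HDF5IO("test_html_repr.h5", manager=get_manager(), mode="r")
    read_table = io.read()
    print(read_table._repr_html_())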
Zarr
Checklist
- Did you update CHANGELOG.md with your changes?