
[Bug]: Data of VectorData wrapped in DataChunkIterator produces roundtrip error on read #952

Closed
CodyCBakerPhD opened this issue Aug 23, 2023 · 4 comments · Fixed by #953
Assignees: oruebel
Labels: category: bug (errors in the code or code behavior); priority: high (impacts proper operation or use of feature important to most users)

Comments

@CodyCBakerPhD (Contributor)
What happened?

Discovered while writing tests for hdmf-dev/hdmf-zarr#111

Note that if I remove the DataChunkIterator wrapper around the data in the VectorData construction, the roundtrip issue goes away.

Steps to Reproduce

from pathlib import Path

import numpy as np
from numpy.testing import assert_array_equal
from hdmf_zarr import ZarrIO
from hdmf.common import DynamicTable, VectorData, get_manager
from hdmf.data_utils import DataChunkIterator

tmpdir = Path("/home/jovyan/Downloads")

data = np.array([1.0, 2.0, 3.0])
column = VectorData(name="TestColumn", description="", data=DataChunkIterator(data))
dynamic_table = DynamicTable(name="TestTable", description="", columns=[column])

zarr_top_level_path = str(tmpdir / "example_issue.zarr")
with ZarrIO(path=zarr_top_level_path, manager=get_manager(), mode="w") as io:
    io.write(dynamic_table)

with ZarrIO(path=zarr_top_level_path, manager=get_manager(), mode="r") as io:
    dynamic_table_roundtrip = io.read()
    data_roundtrip = dynamic_table_roundtrip["TestColumn"]
    assert_array_equal(data_roundtrip, data)

Traceback

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
File ~/GitHub/hdmf/src/hdmf/build/objectmapper.py:1256, in ObjectMapper.construct(self, **kwargs)
   1255 try:
-> 1256     obj = self.__new_container__(cls, builder.source, parent, builder.attributes.get(self.__spec.id_key()),
   1257                                  **kwargs)
   1258 except Exception as ex:

File ~/GitHub/hdmf/src/hdmf/build/objectmapper.py:1269, in ObjectMapper.__new_container__(self, cls, container_source, parent, object_id, **kwargs)
   1267 # obj has been created and is in construction mode, indicating that the object is being constructed by
   1268 # the automatic construct process during read, rather than by the user
-> 1269 obj.__init__(**kwargs)
   1270 obj._in_construct_mode = False  # reset to False to indicate that the construction of the object is complete

File ~/GitHub/hdmf/src/hdmf/utils.py:644, in docval.<locals>.dec.<locals>.func_call(*args, **kwargs)
    643 pargs = _check_args(args, kwargs)
--> 644 return func(args[0], **pargs)

File ~/GitHub/hdmf/src/hdmf/common/table.py:368, in DynamicTable.__init__(self, **kwargs)
    367         else:  # set ids to: 0 to length of columns - 1
--> 368             id.data.extend(range(lens[0]))
    370 self.id = id

AttributeError: 'Array' object has no attribute 'extend'

The above exception was the direct cause of the following exception:

ConstructError                            Traceback (most recent call last)
Cell In[15], line 2
      1 with ZarrIO(path=zarr_top_level_path, manager=get_manager(), mode="r") as io:
----> 2     dynamic_table_roundtrip = io.read()
      3     data_roundtrip = dynamic_table_roundtrip["TestColumn"]
      4     assert_array_equal(data_roundtrip, data)

File ~/GitHub/hdmf/src/hdmf/utils.py:644, in docval.<locals>.dec.<locals>.func_call(*args, **kwargs)
    642 def func_call(*args, **kwargs):
    643     pargs = _check_args(args, kwargs)
--> 644     return func(args[0], **pargs)

File ~/GitHub/hdmf/src/hdmf/backends/io.py:60, in HDMFIO.read(self, **kwargs)
     57 if all(len(v) == 0 for v in f_builder.values()):
     58     # TODO also check that the keys are appropriate. print a better error message
     59     raise UnsupportedOperation('Cannot build data. There are no values.')
---> 60 container = self.__manager.construct(f_builder)
     61 container.read_io = self
     62 if self.external_resources_path is not None:

File ~/GitHub/hdmf/src/hdmf/utils.py:644, in docval.<locals>.dec.<locals>.func_call(*args, **kwargs)
    642 def func_call(*args, **kwargs):
    643     pargs = _check_args(args, kwargs)
--> 644     return func(args[0], **pargs)

File ~/GitHub/hdmf/src/hdmf/build/manager.py:284, in BuildManager.construct(self, **kwargs)
    280     result = self.__type_map.construct(builder, self, parent)
    281 else:
    282     # we are at the top of the hierarchy,
    283     # so it must be time to resolve parents
--> 284     result = self.__type_map.construct(builder, self, None)
    285     self.__resolve_parents(result)
    286 self.prebuilt(result, builder)

File ~/GitHub/hdmf/src/hdmf/utils.py:644, in docval.<locals>.dec.<locals>.func_call(*args, **kwargs)
    642 def func_call(*args, **kwargs):
    643     pargs = _check_args(args, kwargs)
--> 644     return func(args[0], **pargs)

File ~/GitHub/hdmf/src/hdmf/build/manager.py:795, in TypeMap.construct(self, **kwargs)
    793     raise ValueError('No ObjectMapper found for builder of type %s' % dt)
    794 else:
--> 795     return obj_mapper.construct(builder, build_manager, parent)

File ~/GitHub/hdmf/src/hdmf/utils.py:644, in docval.<locals>.dec.<locals>.func_call(*args, **kwargs)
    642 def func_call(*args, **kwargs):
    643     pargs = _check_args(args, kwargs)
--> 644     return func(args[0], **pargs)

File ~/GitHub/hdmf/src/hdmf/build/objectmapper.py:1260, in ObjectMapper.construct(self, **kwargs)
   1258 except Exception as ex:
   1259     msg = 'Could not construct %s object due to: %s' % (cls.__name__, ex)
-> 1260     raise ConstructError(builder, msg) from ex
   1261 return obj

ConstructError: (root GroupBuilder {'attributes': {'.specloc': 'specifications', 'colnames': ['TestColumn'], 'data_type': 'DynamicTable', 'description': '', 'namespace': 'hdmf-common', 'object_id': '4bb712df-7363-4d25-a943-5e930b6a280e'}, 'groups': {'specifications': root/specifications GroupBuilder {'attributes': {}, 'groups': {'hdmf-common': root/specifications/hdmf-common GroupBuilder {'attributes': {}, 'groups': {'1.7.0': root/specifications/hdmf-common/1.7.0 GroupBuilder {'attributes': {}, 'groups': {}, 'datasets': {'base': root/specifications/hdmf-common/1.7.0/base DatasetBuilder {'attributes': {}, 'data': '{"datasets":[{"doc":"An abstract data type for a dataset.","data_type_def":"Data"}],"groups":[{"doc":"An abstract data type for a group storing collections of data and metadata. Base type for all data and metadata containers.","data_type_def":"Container"},{"groups":[{"doc":"Container objects held within this SimpleMultiContainer.","quantity":"*","data_type_inc":"Container"}],"datasets":[{"doc":"Data objects held within this SimpleMultiContainer.","quantity":"*","data_type_inc":"Data"}],"doc":"A simple Container for holding onto multiple containers.","data_type_inc":"Container","data_type_def":"SimpleMultiContainer"}]}'}, 'namespace': root/specifications/hdmf-common/1.7.0/namespace DatasetBuilder {'attributes': {}, 'data': '{"namespaces":[{"doc":"Common data structures provided by HDMF","schema":[{"source":"base"},{"source":"table"},{"source":"sparse"}],"name":"hdmf-common","full_name":"HDMF Common","version":"1.7.0","author":["Andrew Tritt","Oliver Ruebel","Ryan Ly","Ben Dichter"],"contact":["[email protected]","[email protected]","[email protected]","[email protected]"]}]}'}, 'sparse': root/specifications/hdmf-common/1.7.0/sparse DatasetBuilder {'attributes': {}, 'data': '{"groups":[{"datasets":[{"shape":[null],"dims":["number of non-zero values"],"dtype":"uint","doc":"The column indices.","name":"indices"},{"shape":[null],"dims":["number of rows in the 
matrix + 1"],"dtype":"uint","doc":"The row index pointer.","name":"indptr"},{"shape":[null],"dims":["number of non-zero values"],"doc":"The non-zero values in the matrix.","name":"data"}],"doc":"A compressed sparse row matrix. Data are stored in the standard CSR format, where column indices for row i are stored in indices[indptr[i]:indptr[i+1]] and their corresponding values are stored in data[indptr[i]:indptr[i+1]].","data_type_inc":"Container","data_type_def":"CSRMatrix","attributes":[{"doc":"The shape (number of rows, number of columns) of this sparse matrix.","name":"shape","dtype":"uint","shape":[2],"dims":["number of rows, number of columns"]}]}]}'}, 'table': root/specifications/hdmf-common/1.7.0/table DatasetBuilder {'attributes': {}, 'data': '{"datasets":[{"shape":[[null],[null,null],[null,null,null],[null,null,null,null]],"dims":[["dim0"],["dim0","dim1"],["dim0","dim1","dim2"],["dim0","dim1","dim2","dim3"]],"doc":"An n-dimensional dataset representing a column of a DynamicTable. If used without an accompanying VectorIndex, first dimension is along the rows of the DynamicTable and each step along the first dimension is a cell of the larger table. VectorData can also be used to represent a ragged array if paired with a VectorIndex. This allows for storing arrays of varying length in a single cell of the DynamicTable by indexing into this VectorData. The first vector is at VectorData[0:VectorIndex[0]]. The second vector is at VectorData[VectorIndex[0]:VectorIndex[1]], and so on.","data_type_inc":"Data","data_type_def":"VectorData","attributes":[{"doc":"Description of what these vectors represent.","name":"description","dtype":"text"}]},{"shape":[null],"dims":["num_rows"],"dtype":"uint8","doc":"Used with VectorData to encode a ragged array. An array of indices into the first dimension of the target VectorData, and forming a map between the rows of a DynamicTable and the indices of the VectorData. 
The name of the VectorIndex is expected to be the name of the target VectorData object followed by \\"_index\\".","data_type_inc":"VectorData","data_type_def":"VectorIndex","attributes":[{"doc":"Reference to the target dataset that this index applies to.","name":"target","dtype":{"target_type":"VectorData","reftype":"object"}}]},{"shape":[null],"dims":["num_elements"],"dtype":"int","doc":"A list of unique identifiers for values within a dataset, e.g. rows of a DynamicTable.","default_name":"element_id","data_type_inc":"Data","data_type_def":"ElementIdentifiers"},{"shape":[null],"dims":["num_rows"],"dtype":"int","doc":"DynamicTableRegion provides a link from one table to an index or region of another. The `table` attribute is a link to another `DynamicTable`, indicating which table is referenced, and the data is int(s) indicating the row(s) (0-indexed) of the target array. `DynamicTableRegion`s can be used to associate rows with repeated meta-data without data duplication. They can also be used to create hierarchical relationships between multiple `DynamicTable`s. 
`DynamicTableRegion` objects may be paired with a `VectorIndex` object to create ragged references, so a single cell of a `DynamicTable` can reference many rows of another `DynamicTable`.","data_type_inc":"VectorData","data_type_def":"DynamicTableRegion","attributes":[{"doc":"Reference to the DynamicTable object that this region applies to.","name":"table","dtype":{"target_type":"DynamicTable","reftype":"object"}},{"doc":"Description of what this table region points to.","name":"description","dtype":"text"}]}],"groups":[{"datasets":[{"shape":[null],"dims":["num_rows"],"dtype":"int","doc":"Array of unique identifiers for the rows of this dynamic table.","name":"id","data_type_inc":"ElementIdentifiers"},{"doc":"Vector columns, including index columns, of this dynamic table.","quantity":"*","data_type_inc":"VectorData"}],"doc":"A group containing multiple datasets that are aligned on the first dimension (Currently, this requirement if left up to APIs to check and enforce). These datasets represent different columns in the table. Apart from a column that contains unique identifiers for each row, there are no other required datasets. Users are free to add any number of custom VectorData objects (columns) here. DynamicTable also supports ragged array columns, where each element can be of a different size. To add a ragged array column, use a VectorIndex type to index the corresponding VectorData type. See documentation for VectorData and VectorIndex for more details. Unlike a compound data type, which is analogous to storing an array-of-structs, a DynamicTable can be thought of as a struct-of-arrays. This provides an alternative structure to choose from when optimizing storage for anticipated access patterns. Additionally, this type provides a way of creating a table without having to define a compound type up front. Although this convenience may be attractive, users should think carefully about how data will be accessed. 
DynamicTable is more appropriate for column-centric access, whereas a dataset with a compound type would be more appropriate for row-centric access. Finally, data size should also be taken into account. For small tables, performance loss may be an acceptable trade-off for the flexibility of a DynamicTable.","data_type_inc":"Container","data_type_def":"DynamicTable","attributes":[{"doc":"The names of the columns in this table. This should be used to specify an order to the columns.","name":"colnames","dtype":"text","shape":[null],"dims":["num_columns"]},{"doc":"Description of what is in this dynamic table.","name":"description","dtype":"text"}]},{"groups":[{"doc":"A DynamicTable representing a particular category for columns in the AlignedDynamicTable parent container. The table MUST be aligned with (i.e., have the same number of rows) as all other DynamicTables stored in the AlignedDynamicTable parent container. The name of the category is given by the name of the DynamicTable and its description by the description attribute of the DynamicTable.","quantity":"*","data_type_inc":"DynamicTable"}],"doc":"DynamicTable container that supports storing a collection of sub-tables. Each sub-table is a DynamicTable itself that is aligned with the main table by row index. I.e., all DynamicTables stored in this group MUST have the same number of rows. This type effectively defines a 2-level table in which the main data is stored in the main table implemented by this type and additional columns of the table are grouped into categories, with each category being represented by a separate DynamicTable stored within the group.","data_type_inc":"DynamicTable","data_type_def":"AlignedDynamicTable","attributes":[{"doc":"The names of the categories in this AlignedDynamicTable. Each category is represented by one DynamicTable stored in the parent group. 
This attribute should be used to specify an order of categories and the category names must match the names of the corresponding DynamicTable in the group.","name":"categories","dtype":"text","shape":[null],"dims":["num_categories"]}]}]}'}}, 'links': {}}}, 'datasets': {}, 'links': {}}, 'hdmf-experimental': root/specifications/hdmf-experimental GroupBuilder {'attributes': {}, 'groups': {'0.4.0': root/specifications/hdmf-experimental/0.4.0 GroupBuilder {'attributes': {}, 'groups': {}, 'datasets': {'experimental': root/specifications/hdmf-experimental/0.4.0/experimental DatasetBuilder {'attributes': {}, 'data': '{"datasets":[{"dtype":"uint8","doc":"Data that come from a fixed set of values. A data value of i corresponds to the i-th value in the VectorData referenced by the \'elements\' attribute.","data_type_inc":"VectorData","data_type_def":"EnumData","attributes":[{"doc":"Reference to the VectorData object that contains the enumerable elements","name":"elements","dtype":{"target_type":"VectorData","reftype":"object"}}]}]}'}, 'namespace': root/specifications/hdmf-experimental/0.4.0/namespace DatasetBuilder {'attributes': {}, 'data': '{"namespaces":[{"doc":"Experimental data structures provided by HDMF. 
These are not guaranteed to be available in the future.","schema":[{"namespace":"hdmf-common"},{"source":"experimental"},{"source":"resources"}],"name":"hdmf-experimental","full_name":"HDMF Experimental","version":"0.4.0","author":["Andrew Tritt","Oliver Ruebel","Ryan Ly","Ben Dichter","Matthew Avaylon"],"contact":["[email protected]","[email protected]","[email protected]","[email protected]","[email protected]"]}]}'}, 'resources': root/specifications/hdmf-experimental/0.4.0/resources DatasetBuilder {'attributes': {}, 'data': '{"groups":[{"datasets":[{"shape":[null],"dims":["num_rows"],"dtype":[{"doc":"The user term that maps to one or more resources in the `resources` table, e.g., \\"human\\".","name":"key","dtype":"text"}],"doc":"A table for storing user terms that are used to refer to external resources.","name":"keys","data_type_inc":"Data"},{"shape":[null],"dims":["num_rows"],"dtype":[{"doc":"The object id (UUID) of a file that contains objects that refers to external resources.","name":"file_object_id","dtype":"text"}],"doc":"A table for storing object ids of files used in external resources.","name":"files","data_type_inc":"Data"},{"shape":[null],"dims":["num_rows"],"dtype":[{"doc":"The compact uniform resource identifier (CURIE) of the entity, in the form [prefix]:[unique local identifier], e.g., \'NCBI_TAXON:9606\'.","name":"entity_id","dtype":"text"},{"doc":"The URI for the entity this reference applies to. This can be an empty string. 
e.g., https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=info&id=9606","name":"entity_uri","dtype":"text"}],"doc":"A table for mapping user terms (i.e., keys) to resource entities.","name":"entities","data_type_inc":"Data"},{"shape":[null],"dims":["num_rows"],"dtype":[{"doc":"The row index to the file in the `files` table containing the object.","name":"files_idx","dtype":"uint"},{"doc":"The object id (UUID) of the object.","name":"object_id","dtype":"text"},{"doc":"The data type of the object.","name":"object_type","dtype":"text"},{"doc":"The relative path from the data object with the `object_id` to the dataset or attribute with the value(s) that is associated with an external resource. This can be an empty string if the object is a dataset that contains the value(s) that is associated with an external resource.","name":"relative_path","dtype":"text"},{"doc":"The field within the compound data type using an external resource. This is used only if the dataset or attribute is a compound data type; otherwise this should be an empty string.","name":"field","dtype":"text"}],"doc":"A table for identifying which objects in a file contain references to external resources.","name":"objects","data_type_inc":"Data"},{"shape":[null],"dims":["num_rows"],"dtype":[{"doc":"The row index to the object in the `objects` table that holds the key","name":"objects_idx","dtype":"uint"},{"doc":"The row index to the key in the `keys` table.","name":"keys_idx","dtype":"uint"}],"doc":"A table for identifying which objects use which keys.","name":"object_keys","data_type_inc":"Data"},{"shape":[null],"dims":["num_rows"],"dtype":[{"doc":"The row index to the entity in the `entities` table.","name":"entities_idx","dtype":"uint"},{"doc":"The row index to the key in the `keys` table.","name":"keys_idx","dtype":"uint"}],"doc":"A table for identifying which keys use which entity.","name":"entity_keys","data_type_inc":"Data"}],"doc":"A set of five tables for tracking external resource 
references in a file. NOTE: this data type is experimental and is subject to change in a later version.","data_type_inc":"Container","data_type_def":"ExternalResources"}]}'}}, 'links': {}}}, 'datasets': {}, 'links': {}}}, 'datasets': {}, 'links': {}}}, 'datasets': {'TestColumn': root/TestColumn DatasetBuilder {'attributes': {'data_type': 'VectorData', 'description': '', 'namespace': 'hdmf-common', 'object_id': 'fd255425-aed6-44a5-936d-bd27008a48cd'}, 'data': <zarr.core.Array '/TestColumn' (3,) float64 read-only>}, 'id': root/id DatasetBuilder {'attributes': {'data_type': 'ElementIdentifiers', 'namespace': 'hdmf-common', 'object_id': '86350b98-de0e-456b-87e2-8a60f0cb57b0'}, 'data': <zarr.core.Array '/id' (0,) int32 read-only>}}, 'links': {}}, "Could not construct DynamicTable object due to: 'Array' object has no attribute 'extend'")

Operating System

Linux

Python Executable

Conda

Python Version

3.9

Package Versions

No response

@oruebel (Contributor)

oruebel commented Aug 30, 2023

@CodyCBakerPhD I think when creating a DynamicTable where all columns are specified by AbstractDataChunkIterators, we must also provide the values for the id column, because we can't determine the number of rows (we don't know the length until we have finished writing). I.e., I think specifying the id column explicitly in your example should fix the issue, e.g.,

dynamic_table = DynamicTable(name="TestTable", description="", columns=[column], id=np.array([0, 1, 2]))

I think specifying the ids via a DataChunkIterator should also be fine:

dynamic_table = DynamicTable(name="TestTable", description="", columns=[column], id=DataChunkIterator(np.array([0, 1, 2])))

I think this is ultimately an issue in HDMF. To be honest, I'm not sure we can easily fix it automatically, because we won't know how to construct the ids until after we have finished writing all the columns in this case. I think it may be best to make it explicit: require that id is provided when all columns are DataChunkIterators, and raise a ValueError indicating that id must be set because we can't determine the number of rows.
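The root of the problem can be illustrated without hdmf at all: a chunk iterator cannot report its length before it is consumed, so the row count, and hence an auto-generated id column, is unknowable at table-construction time. A minimal standalone sketch (the `iter_chunks` helper is a toy stand-in for DataChunkIterator, not hdmf code):

```python
def iter_chunks(data, chunk_size=1):
    """Toy generator standing in for DataChunkIterator: yields chunks,
    but cannot report its total length up front."""
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]

data = [1.0, 2.0, 3.0]
wrapped = iter_chunks(data)

# A plain sequence exposes its length, so ids can be generated up front:
assert list(range(len(data))) == [0, 1, 2]

# The generator does not -- len() raises before any data is read:
try:
    len(wrapped)
except TypeError:
    pass
else:
    raise AssertionError("generators have no length")

# Only after fully consuming the iterator is the row count known:
n_rows = sum(len(chunk) for chunk in wrapped)
assert n_rows == 3
```

This is why, when every column is wrapped this way, the length needed for `range(lens[0])` is simply not available at `__init__` time.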

Note that if I remove the DataChunkIterator wrapper around the data in the VectorData construction, the roundtrip issue goes away

My guess is that because you have only one column in the table, and that column is wrapped in a DataChunkIterator, DynamicTable doesn't know to initialize the id column, because it doesn't know how many rows there are (and we won't know how many rows there are until write). As a result, we end up with an empty id column on disk, so DynamicTable has to create the ids on read. Normally, we should never hit this part on read:

File ~/GitHub/hdmf/src/hdmf/common/table.py:368, in DynamicTable.__init__(self, **kwargs)
    367         else:  # set ids to: 0 to length of columns - 1
--> 368             id.data.extend(range(lens[0]))
    370 self.id = id

This condition is there to create a valid id column if the user did not set one. Normally, when reading from file, the id column should be valid and we should not need to modify it, but I think because we don't know the length of the DataChunkIterator, the ids are not generated correctly. This comment in DynamicTable.__init__ seems to confirm this:

# the first part of this conditional is needed in the
# event that all columns are AbstractDataChunkIterators
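The mechanics of the resulting AttributeError can also be reproduced without hdmf: the fallback calls `.extend()` on the id data, which works on a plain Python list at construction time but not on the read-only array handed back by the storage backend on read. In this sketch numpy stands in for `zarr.core.Array`; both lack `.extend`:

```python
import numpy as np

# On write, id.data starts as a plain Python list, so the fallback works:
id_data = []
id_data.extend(range(3))
assert id_data == [0, 1, 2]

# On read, id.data is a read-only on-disk array (numpy standing in for
# zarr.core.Array here); it has no .extend, hence the AttributeError:
readback = np.array([], dtype=np.int32)
try:
    readback.extend(range(3))
except AttributeError as err:
    assert "extend" in str(err)
```

Because the empty id dataset round-trips as such an array, the fallback path that is harmless at construction time fails hard at read time.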

@CodyCBakerPhD (Author)

Makes sense now that I understand.

As per usual, my only request is that something should have errored out more informatively earlier on; what caught me off guard is that I could write the initial file without issue.

So I say transfer this to HDMF, and the 'solution' is simply to catch this earlier and provide a better error during table write instructing that the IDs must be specified when using that approach, which is totally reasonable.

@oruebel (Contributor)

oruebel commented Aug 31, 2023

As per usual my only request for this then is that something should have errored out more informatively earlier on

I agree. To be honest, I wasn't aware of this issue, and the bigger problem is that this resulted in an invalid NWB file, because the id column was created incorrectly. I'll transfer the issue to HDMF and work on it next week.

@oruebel oruebel transferred this issue from hdmf-dev/hdmf-zarr Aug 31, 2023
@oruebel oruebel added the category: bug errors in the code or code behavior label Aug 31, 2023
@oruebel oruebel modified the milestones: Future, Next Release Aug 31, 2023
@oruebel oruebel added the priority: high impacts proper operation or use of feature important to most users label Aug 31, 2023
@oruebel (Contributor)

oruebel commented Aug 31, 2023

The suggested solution is to add an error check in DynamicTable.__init__ that raises when all columns are AbstractDataChunkIterators and no id column is supplied, because we cannot auto-generate the ids in that case.
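A hypothetical sketch of such a guard (the function name and signature are illustrative, not the hdmf API): raise early, at table construction, instead of failing with an opaque AttributeError on read.

```python
def check_ids_required(columns, ids):
    """If every column is an iterator of unknown length and no ids were
    given, the row count cannot be determined -- refuse up front."""
    def has_known_length(col):
        try:
            len(col)
            return True
        except TypeError:
            return False

    if ids is None and not any(has_known_length(c) for c in columns):
        raise ValueError(
            "All columns are data chunk iterators of unknown length; "
            "the 'id' column must be supplied explicitly."
        )

# A list column has a known length, so ids can be auto-generated:
check_ids_required([[1.0, 2.0, 3.0]], ids=None)

# An iterator-only table without explicit ids is rejected:
try:
    check_ids_required([iter([1.0, 2.0, 3.0])], ids=None)
except ValueError:
    pass
else:
    raise AssertionError("expected ValueError")
```

The actual fix (see #953) lives in DynamicTable.__init__; this sketch only shows the shape of the check, keyed on whether any column can report a length.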

@oruebel oruebel self-assigned this Aug 31, 2023
@rly rly removed this from the 3.14.0 milestone Apr 11, 2024