[Feature]: TermSet Restructure (Stage 4): Iterative Write #930

mavaylon1 · 2023-08-01T03:24:18Z

What would you like to see added to HDMF?

This part will round out the first phase of changes to TermSet by investigating how TermSet should interact with iterative write.

With large datasets, users can iteratively populate objects and then iteratively write them. How TermSet will fit into this is still a broad subject to be discussed as a group once initial merges have been made.

Is your feature request related to a problem?

No response

What solution would you like?

TBD

Do you have any interest in helping implement the feature?

Yes.

Code of Conduct

I agree to follow this project's Code of Conduct
Have you checked the Contributing document?
Have you ensured this change was not already requested?

oruebel · 2023-08-02T17:48:27Z

At first glance I can see three main approaches:

1. In AbstractDataChunkIterator.__next__ we could process each chunk with the term set before returning the chunk for writing. Advantages: 1) this should work across I/O backends, 2) it ensures chunks are processed on load. Disadvantages: 3) this may slow down I/O because we are processing data on write (we could use multiprocessing to return the chunk for I/O while the chunk is being processed for terms), 4) while we currently use DataChunkIterator only for I/O, we don't actually know who is calling the iterator to retrieve data or whether we are actually doing I/O.
2. In HDF5IODataChunkIteratorQueue.__write_chunk we could process each chunk with the term set before the chunk is written. This approach is very similar to 1, but places the processing squarely on the I/O backend rather than on the iterator that produces the data. The advantage compared to 1 is that we know we are in the I/O phase; the disadvantage is that this approach is backend-specific, so we would also need to update HDMF-Zarr.
3. We could do the processing for TermSets after write. That is, we would need to record which datasets were written iteratively (which is doable) and then, in HDMFIO.write, process all those datasets to check that they are valid and add terms to external resources. Advantages: 1) works across backends, 2) does not interfere with the data write. Disadvantages: 3) does not validate until after write, 4) requires loading the data after write (which should be done iteratively to avoid overloading memory).
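
To make the trade-off between the first and third approaches concrete, here is a minimal sketch in plain Python. It deliberately does not use the HDMF API: `ValidatingChunkIterator`, `validate_chunk`, `validate_after_write`, and `TermValidationError` are hypothetical names, and a plain `set` of strings stands in for a TermSet. It only illustrates where validation would happen, not how HDMF would implement it.

```python
from typing import Iterable, Sequence, Set


class TermValidationError(ValueError):
    """Raised when a chunk contains a value that is not a permitted term."""


def validate_chunk(chunk: Sequence[str], allowed_terms: Set[str]) -> None:
    """Check every value in one chunk against the set of permitted terms."""
    bad = [value for value in chunk if value not in allowed_terms]
    if bad:
        raise TermValidationError(f"values not in term set: {bad}")


class ValidatingChunkIterator:
    """Approach 1 (hypothetical): validate each chunk as the iterator yields it.

    Validation is backend-independent, but it runs on every call to
    __next__, whether or not the caller is actually doing I/O.
    """

    def __init__(self, chunks: Iterable[Sequence[str]], allowed_terms: Set[str]):
        self._chunks = iter(chunks)
        self._allowed = allowed_terms

    def __iter__(self):
        return self

    def __next__(self):
        chunk = next(self._chunks)  # StopIteration propagates to the caller
        validate_chunk(chunk, self._allowed)
        return chunk


def validate_after_write(
    stored: Sequence[str], allowed_terms: Set[str], chunk_size: int = 1024
) -> None:
    """Approach 3 (hypothetical): re-read written data in bounded slices.

    `stored` stands in for a dataset that was already written; slicing it
    keeps only `chunk_size` values in memory at a time.
    """
    for start in range(0, len(stored), chunk_size):
        validate_chunk(stored[start:start + chunk_size], allowed_terms)
```

With the iterator wrapper, an invalid chunk raises before it ever reaches the writer, so nothing invalid lands on disk, at the cost of doing the check inside the write loop. With `validate_after_write`, the write itself is untouched, but an invalid value is only detected once the data has already been stored.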

mavaylon1 · 2023-10-18T22:17:10Z

After talking with the team, we decided this is enough of a special case that it will be pushed back until the community requests it.

mavaylon1 added category: enhancement improvements of code or code behavior priority: low alternative solution already working and/or relevant to only specific user(s) labels Aug 1, 2023

mavaylon1 self-assigned this Aug 1, 2023

mavaylon1 changed the title ~~[Feature]: TermSet Restructure Part 4~~ [Feature]: TermSet Restructure (Stage 4): Iterative Write Aug 9, 2023

mavaylon1 mentioned this issue Oct 15, 2023

[Feature]: HERD/TermSet Expansion Tracker #966

Closed

11 tasks

mavaylon1 added priority: wontfix will not be fixed due to low priority and/or conflict with other feature/priority and removed priority: low alternative solution already working and/or relevant to only specific user(s) labels Oct 18, 2023

mavaylon1 closed this as completed Oct 18, 2023
