[Feature]: TermSet Restructure (Stage 4): Iterative Write #930

Closed
3 tasks done
mavaylon1 opened this issue Aug 1, 2023 · 2 comments
Assignees
Labels
category: enhancement improvements of code or code behavior priority: wontfix will not be fixed due to low priority and/or conflict with other feature/priority

Comments

@mavaylon1
Contributor

What would you like to see added to HDMF?

This part will round out the first phase of changes to TermSet by investigating how TermSet should interact with iterative write.

With large datasets, users can iteratively populate objects and then iteratively write them. How TermSet will fit into this is still a broad subject to be discussed as a group once initial merges have been made.

Is your feature request related to a problem?

No response

What solution would you like?

TBD

Do you have any interest in helping implement the feature?

Yes.

Code of Conduct

@mavaylon1 added the labels "category: enhancement" (improvements of code or code behavior) and "priority: low" (alternative solution already working and/or relevant to only specific user(s)) on Aug 1, 2023
@mavaylon1 self-assigned this on Aug 1, 2023
@oruebel
Contributor

oruebel commented Aug 2, 2023

At first glance I can see three main approaches:

  1. In AbstractDataChunkIterator.__next__ we could process each chunk with the term set before returning the chunk for writing. Advantages: 1) this should work across I/O backends, 2) it ensures chunks are processed as they are loaded. Disadvantages: 3) this may slow down I/O because we are processing data on write (we could use multiprocessing to return the chunk for I/O while processing the chunk for terms), 4) while we currently use DataChunkIterator only for I/O, we don't actually know who is calling the iterator to retrieve data or whether we are actually doing I/O.
  2. In HDF5IODataChunkIteratorQueue.__write_chunk we could process each chunk with the term set before writing it. This approach is very similar to 1, but places the processing squarely on the I/O backend rather than on the iterator that produces the data. The advantage compared to 1 is that we know we are in the I/O phase; the disadvantage is that this approach is backend-specific, so we would also need to update HDMF-Zarr.
  3. We could do the processing for TermSets after write. I.e., we would need to record which datasets were written iteratively (which is doable) and then in HDMFIO.write process all those datasets to check that they are valid and add terms to external resources. Advantages: 1) it works across backends, 2) it does not interfere with the data write. Disadvantages: 3) it does not validate until after write, 4) it requires loading the data after write (which should be done iteratively to avoid overloading memory).
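
To make the trade-off between approaches 1 and 3 concrete, here is a minimal, self-contained sketch. It does not use HDMF: `ValidatingChunkIterator`, `validate_after_write`, and the plain `set` standing in for a TermSet are all hypothetical names invented for illustration, and chunks are simply lists of strings.

```python
from typing import Iterable, List, Sequence, Set


class ValidatingChunkIterator:
    """Hypothetical stand-in for a data chunk iterator (not HDMF's class).

    Approach 1: each chunk is checked against the permitted terms inside
    __next__, before it is handed to whoever consumes the iterator.
    """

    def __init__(self, chunks: Iterable[Sequence[str]], permitted_terms: Set[str]):
        self._chunks = iter(chunks)
        self._permitted = permitted_terms

    def __iter__(self) -> "ValidatingChunkIterator":
        return self

    def __next__(self) -> Sequence[str]:
        chunk = next(self._chunks)  # raises StopIteration when exhausted
        bad = [value for value in chunk if value not in self._permitted]
        if bad:
            raise ValueError(f"terms not in term set: {bad}")
        return chunk


def validate_after_write(stored: Sequence[str], permitted_terms: Set[str],
                         chunk_size: int = 2) -> List[str]:
    """Approach 3: walk already-written data in fixed-size slices, mimicking
    an iterative read, and collect every value that is not a permitted term."""
    invalid: List[str] = []
    for start in range(0, len(stored), chunk_size):
        for value in stored[start:start + chunk_size]:
            if value not in permitted_terms:
                invalid.append(value)
    return invalid


# Assumed example terms; a real TermSet would come from a schema.
terms = {"Homo sapiens", "Mus musculus"}

# Approach 1: the write loop only ever sees validated chunks.
written: List[str] = []
for chunk in ValidatingChunkIterator([["Homo sapiens"], ["Mus musculus"]], terms):
    written.extend(chunk)

# Approach 3: the data is written first; problems surface only afterwards.
late_errors = validate_after_write(["Homo sapiens", "Rattus"], terms)
```

In this sketch approach 1 fails fast, raising on the first invalid chunk, at the cost of doing validation work inside the iteration. Approach 3 leaves the write untouched but reports invalid terms only once the data is already stored.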

@mavaylon1 changed the title from "[Feature]: TermSet Restructure Part 4" to "[Feature]: TermSet Restructure (Stage 4): Iterative Write" on Aug 9, 2023
@mavaylon1
Contributor Author

After talking with the team, we decided this is such a special case that it will be pushed back until it is requested by the community.

@mavaylon1 added the label "priority: wontfix" (will not be fixed due to low priority and/or conflict with other feature/priority) and removed the label "priority: low" (alternative solution already working and/or relevant to only specific user(s)) on Oct 18, 2023