Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Warn when adding ragged arrays to DynamicTable without index argument #1066

Merged
merged 8 commits into from
Mar 14, 2024

Conversation

stephprince
Copy link
Contributor

@stephprince stephprince commented Mar 6, 2024

Motivation

Fix #1065. When a user tries to add a ragged array to a DynamicTable using add_row or add_column, a warning will occur prompting the user to use the index argument.

I had some questions about these potential changes.

  1. I found a related issue discussion that suggested that doing a full inspection of the data to detect whether it was ragged might be a costly operation - Added unit_columns items are saved as strings NeurodataWithoutBorders/pynwb#1060. I was wondering if that's still a concern here or not.
  2. Numpy's warning refers to the requested array having an "inhomogeneous shape". I wasn't sure if that type of phrasing is preferred over "ragged."

How to test the behavior?

from pynwb import NWBHDF5IO
from pynwb.testing.mock.file import mock_NWBFile

# test adding individual trials
nwbfile = mock_NWBFile()
nwbfile.add_trial_column(name="test", description="")
nwbfile.add_trial(start_time=0.0, stop_time=2.0, test=[1, 2])
nwbfile.add_trial(start_time=0.0, stop_time=2.0, test=[3, 4])
nwbfile.add_trial(start_time=0.0, stop_time=2.0, test=[5, 6, 7])  # Warning here

### test adding full column
nwbfile = mock_NWBFile()
nwbfile.add_trial(start_time=0.0, stop_time=2.0)
nwbfile.add_trial(start_time=0.0, stop_time=2.0)
nwbfile.add_trial_column(name="test1", description="", data=[[1, 2], [1, 2]])
nwbfile.add_trial_column(name="test2", description="", data=[[1, 2], [1, 2, 3]])  # Warning here

with NWBHDF5IO('test.nwb', 'w') as f:
    f.write(nwbfile)  # Error here

Checklist

  • Did you update CHANGELOG.md with your changes?
  • Does the PR clearly describe the problem and the solution?
  • Have you reviewed our Contributing Guide?
  • Does the PR use "Fix #XXX" notation to tell GitHub to close the relevant issue numbered XXX when the PR is merged?

Copy link

codecov bot commented Mar 6, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.55%. Comparing base (e9c2737) to head (a44206d).
Report is 56 commits behind head on dev.

Additional details and impacted files
@@            Coverage Diff             @@
##              dev    #1066      +/-   ##
==========================================
+ Coverage   88.53%   88.55%   +0.01%     
==========================================
  Files          45       45              
  Lines        9605     9616      +11     
  Branches     2732     2738       +6     
==========================================
+ Hits         8504     8515      +11     
  Misses        778      778              
  Partials      323      323              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@stephprince stephprince marked this pull request as ready for review March 6, 2024 00:54
@stephprince stephprince requested review from rly and oruebel March 6, 2024 00:54
src/hdmf/utils.py Outdated Show resolved Hide resolved
@rly
Copy link
Contributor

rly commented Mar 8, 2024

Yes, please correct me if I am wrong - it looks like we are iterating over every element in the array to check for raggedness. That could incur performance issues. I think we should add a flag check_ragged (default True) so that a user can disable the check if it is causing issues.

I don't know how we would get around checking every element...

Note that this issue should only come up if 1) the data is a list or tuple (arrays and hdf5 datasets cannot be ragged unless you give it an object type) or 2) the user adds rows one by one to the table and a cell in that row has a different shape than the shapes of the cell in the previous rows. (the original post mentioned using this method)

Related to 1), I suggest limiting the is_ragged check in add_column to just list and tuple.

Related to 2), it looks like adding a check for add_row is not implemented. I think that would be good to do.

  • Numpy's warning refers to the requested array having an "inhomogeneous shape". I wasn't sure if that type of phrasing is preferred over "ragged."

It seems like numpy uses "ragged" internally. I wonder why they chose to use "inhomogeneous shape" publicly. Perhaps "inhomogeneous shape" is more clear to the non-programmer, but I feel like both terms are poor. Maybe we could say "data has elements that have different lengths and therefore data cannot be coerced into an N-dimensional array"?

@stephprince
Copy link
Contributor Author

I think we should add a flag check_ragged (default True) so that a user can disable the check if it is causing issues.

Sounds like a good option!

Note that this issue should only come up if 1) the data is a list or tuple (arrays and hdf5 datasets cannot be ragged unless you give it an object type)

Yes I wasn't sure how often people added arrays that were an object type but since it would still error when writing I think limiting this check make sense.

Related to 2), it looks like adding a check for add_row is not implemented. I think that would be good to do.

I had a check in add_row that inspects the whole column for raggedness every time a row was added, or do you mean additional tests for adding a row?

Perhaps "inhomogeneous shape" is more clear to the non-programmer, but I feel like both terms are poor.

I agree, I think the phrasing you suggested is good.

Copy link
Contributor

@rly rly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

@rly
Copy link
Contributor

rly commented Mar 12, 2024

I had a check in add_row that inspects the whole column for raggedness every time a row was added

Ah, I missed that.

@stephprince stephprince merged commit f092cbb into dev Mar 14, 2024
28 checks passed
@stephprince stephprince deleted the detect-ragged-arrays branch March 14, 2024 16:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature]: Warn to use index argument on call to add_row or add_column
3 participants