
[Bug]: Adding Large Stimulus Table with add_interval takes incredibly long #1946

Closed

rcpeene opened this issue Aug 13, 2024 · 3 comments
Labels
category: question (questions about code or code behavior)
priority: medium (non-critical problem and/or affecting only a small set of NWB users)
topic: HDMF (issues related to the use, depending on, or affecting HDMF)

Comments


rcpeene commented Aug 13, 2024

What happened?

I am trying to generate an NWB file containing a rather large stimulus table, built row by row with TimeIntervals.add_interval(). The stimulus table for our experiment happens to be very large (>40,000 rows). On two different machines, this takes more than 10 hours. The add_interval operation seems to be the bottleneck, and each call takes progressively longer as the table grows.

After digging through the code, it looks like the culprit might be __calculate_idx_count, perhaps the bisect call.

Is there a more direct way to generate a TimeIntervals table from an existing table (while ensuring that the type of each column is properly cast)? Or is there a fix for the slowness of the add_interval operation?

Steps to Reproduce

Run this snippet to generate a TimeIntervals object from a very large table:

presentation_interval = create_stimulus_presentation_time_interval(
    name=f"{stim_name}_presentations",
    description=interval_description,
    columns_to_add=cleaned_table.columns,
)

for i, row in enumerate(cleaned_table.itertuples(index=False)):
    row = row._asdict()
    row = {key: str(value) for key, value in row.items()}
    start_column = 'Start'  # Adjust this as per the actual column name in CSV
    end_column = 'End'  # Adjust this as per the actual column name in CSV
    start_time = float(row[start_column])
    end_time = float(row[end_column])
    presentation_interval.add_interval(
        **row,
        start_time=start_time, stop_time=end_time,
        tags="stimulus_time_interval", timeseries=ts
    )

nwbfile.add_time_intervals(presentation_interval)

Traceback

No traceback

Operating System

Windows

Python Executable

Conda

Python Version

3.10

Package Versions

pynwb==2.8.1



rcpeene commented Aug 14, 2024

The real bottleneck appears to be in DynamicTable.add_row().
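For anyone else chasing a slowdown like this, one generic way to confirm where the time goes is to profile the insertion loop with the standard library's cProfile. This is a self-contained sketch using a toy add_row stand-in (not hdmf's actual implementation); profiling a real pynwb loop works the same way.

```python
import cProfile
import io
import pstats

# Toy stand-in for a per-row insert whose cost grows with table size.
table = []

def add_row(row):
    # Simulate a validation pass over all existing rows on every insert.
    _ = [r for r in table]
    table.append(row)

profiler = cProfile.Profile()
profiler.enable()
for i in range(2000):
    add_row({"start_time": float(i), "stop_time": float(i) + 1.0})
profiler.disable()

# Print the top entries by cumulative time; the hot function shows up first.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(next(line for line in report.splitlines() if line.strip()))
```

Sorting by cumulative time makes the per-row helper that dominates the loop (here, the simulated validation pass) easy to spot in the report.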


stephprince commented Aug 14, 2024

Hi @rcpeene,

One way to speed up the add_interval operation would be to pass the argument check_ragged=False. We recently added this check to provide a better warning for ragged arrays, but it can cause performance issues for larger tables since it inspects the data on each call to add_row / add_interval.

presentation_interval.add_interval(
    **row,
    start_time=start_time, stop_time=end_time,
    tags="stimulus_time_interval", timeseries=ts, check_ragged=False
)

Could you try setting check_ragged to False and see if that improves your performance?
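The effect described above can be illustrated with a toy model (this is a sketch of the scaling behavior, not hdmf's actual code): a validation pass that scans the whole table on every insert turns n appends into O(n²) work, while skipping the check keeps them O(n).

```python
import time

def build(n, check_each_row):
    """Append n (start, stop) rows, optionally re-scanning the table each time."""
    rows = []
    for i in range(n):
        rows.append((float(i), float(i) + 1.0))
        if check_each_row:
            # Stand-in for a raggedness check over all existing rows.
            _ = all(len(r) == 2 for r in rows)
    return rows

n = 3000
t0 = time.perf_counter()
checked = build(n, check_each_row=True)
t_checked = time.perf_counter() - t0

t0 = time.perf_counter()
unchecked = build(n, check_each_row=False)
t_unchecked = time.perf_counter() - t0

print(f"with per-row check: {t_checked:.3f}s, without: {t_unchecked:.3f}s")
```

At 40,000+ rows the quadratic term dominates, which is consistent with the multi-hour runtimes reported above and the large speedup from check_ragged=False.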

@stephprince stephprince added priority: medium non-critical problem and/or affecting only a small set of NWB users category: question questions about code or code behavior topic: HDMF issues related to the use, depending on, or affecting HDMF labels Aug 14, 2024

rcpeene commented Aug 14, 2024

This was remarkably faster and completed in a few minutes. Thanks!
