Allow opening selected groups only #338

mraspaud · 2024-06-24T09:05:13Z

This PR allows opening selected groups only in open_datatree.

The use case is speeding up loading of files with many groups, in our case netcdf, where we actually need a handful of groups to be loaded.

Tests added
Passes pre-commit run --all-files
Changes are summarized in docs/source/whats-new.rst

This takes advantage of replacing the generator of paths in the _open_datatree_* functions
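To make the idea concrete, here is a minimal, hedged sketch of filtering a generator of group paths before any group is opened. It is not the code from this PR: the nested-dict stand-in for a file's group hierarchy and the names `iter_group_paths` and `select_group_paths` are hypothetical, chosen only for illustration.

```python
from typing import Iterable, Iterator, Mapping, Optional


def iter_group_paths(tree: Mapping[str, Mapping], parent: str = "/") -> Iterator[str]:
    """Yield every group path of a nested mapping, parents before children."""
    yield parent
    for name, children in tree.items():
        child = parent.rstrip("/") + "/" + name
        yield from iter_group_paths(children, child)


def select_group_paths(
    paths: Iterable[str], groups: Optional[Iterable[str]] = None
) -> Iterator[str]:
    """Pass through only the requested paths; `None` means keep everything."""
    if groups is None:
        yield from paths
        return
    # Normalise so that "data/measured" and "/data/measured/" compare equal.
    wanted = {"/" + g.strip("/") for g in groups}
    for path in paths:
        if path in wanted:
            yield path


# Hypothetical hierarchy standing in for the groups of one file.
tree = {"data": {"measured": {}, "quality": {}}, "state": {}}
selected = list(select_group_paths(iter_group_paths(tree), ["data/measured", "/state"]))
```

Because the filter wraps the generator, groups that are not requested are never visited by whatever code consumes the paths.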

mraspaud · 2024-06-24T09:09:52Z

There seem to be failing tests that I don't think are our doing, as we could reproduce them on the main branch (before our changes were added). Is that to be expected?

keewis · 2024-06-24T09:11:09Z

before you spend more time here: could you check if the version that was integrated into xarray does this already? And if not, open the PR there?

Edit: but yes, the failing tests seem unrelated, that's because of a change in the Dataset / DataArray repr.
Edit2: also, the version of open_datatree is much faster now, so we might not even need the manual optimization

mraspaud · 2024-06-24T09:48:19Z

@keewis thanks for the heads up.
We have checked the latest DataTree for the xarray integration, and while it is indeed much faster, it's still too slow for our needs.

We need to read batches of 80 files with around 70 groups each. On my laptop that now takes around 2 seconds per file, so almost three minutes to generate the datatrees. As this is for a process that needs to run in real time, with a new batch every 10 minutes, we are looking for all the performance gains we can get.
The optimisation we are looking for with this PR comes from the fact that some groups are duplicated across the 80 files (so we can just read them from one file and reuse them for the other files), and that we don't need some of the data in the files at all.
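As a rough illustration of the reuse pattern (80 files at about 2 s each is about 160 s when every group is read from every file), the sketch below reads the shared groups once and only the per-file groups from each file. The function `load_batch` and the `open_group(path, group)` callable are assumptions made for this example, not part of datatree or xarray.

```python
from typing import Any, Callable, Dict, Iterable, List


def load_batch(
    files: Iterable[str],
    shared_groups: Iterable[str],
    per_file_groups: Iterable[str],
    open_group: Callable[[str, str], Any],
) -> List[Dict[str, Any]]:
    """Open shared groups from the first file only and reuse them for the rest."""
    files = list(files)
    per_file_groups = list(per_file_groups)
    # Groups that are identical across the batch are read exactly once.
    shared = {g: open_group(files[0], g) for g in shared_groups} if files else {}
    batch = []
    for path in files:
        groups = dict(shared)  # shallow copy: shared objects are reused, not reread
        for g in per_file_groups:
            groups[g] = open_group(path, g)
        batch.append(groups)
    return batch
```

With one shared group and two per-file groups, a batch of 80 files needs 1 + 80 × 2 = 161 group reads instead of 80 × 70.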

keewis · 2024-06-24T10:21:06Z

okay, sure. I'd still recommend checking the version in xarray (which is not public API yet so may still change – though this is pretty unlikely at this point) to see if the group parameter already does what you need it to.

mraspaud · 2024-06-24T11:30:34Z

From what I understand, the group parameter just sets the root group, so it serves a different purpose.

TomNicholas · 2024-06-24T15:38:56Z

Hi @mraspaud - thanks for this contribution! I can see how this might be useful. I apologise for the indeterminate state of datatree right now.

> From what I understand, the group parameter just sets the root group, so different purpose.

This repository will soon be archived, so if you want this feature then your PR here will need to be reconciled with what's now in xarray main.

The recent PRs that @keewis mentioned are especially pertinent - they speed up opening DataTree objects by multiple orders of magnitude!

We should think about whether your use of the groups kwarg here can be made compatible with the interpretation of group upstream to mean "the root group". e.g. could the type of group be str | Iterable[str] | None?
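One possible reading of that signature, sketched under assumptions: a single string keeps the upstream meaning ("the root group"), while any other iterable of strings is treated as a selection of groups under the file's root. The helper name `normalize_group` and the `(root, selected)` return shape are hypothetical. Note that `str` is itself an iterable of strings, so it has to be checked first.

```python
from __future__ import annotations

from collections.abc import Iterable


def _norm(path: str) -> str:
    """Normalise a group path to a leading slash and no trailing slash."""
    return "/" + path.strip("/")


def normalize_group(
    group: str | Iterable[str] | None,
) -> tuple[str, tuple[str, ...] | None]:
    """Return (root, selected); selected is None when all groups are wanted."""
    if group is None:
        return "/", None
    # A str is also Iterable[str], so this branch must come before the generic one.
    if isinstance(group, str):
        return _norm(group), None
    return "/", tuple(_norm(g) for g in group)
```

Whether a one-element list should behave like a plain string is a design question this sketch leaves open.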

Another idea you might want to think about is whether the suggested open_dict_of_datasets function might be better suited for your use case (see pydata/xarray#9137). That's already "lower-level", so might be a more natural place to accept an argument that means you only open certain groups.

mraspaud added 2 commits June 24, 2024 10:59

Allow opening selected groups only

bc97a36

This takes advantage of replacing the generator of paths in the _open_datatree_* functions

Add to what's new

4b38d3b
