Description

I noticed a very significant performance hit when running missing_wmo with dask: the dask version is much slower than the in-memory one, and I had to load the array to get results in a reasonable amount of time.
Steps To Reproduce
No response
Additional context
No response
Contribution
I would be willing/able to open a Pull Request to address this bug.
Code of Conduct
I agree to follow this project's Code of Conduct
```python
import xclim as xc
from xclim.testing import open_dataset

# Open a dataset as a single chunk
ds = open_dataset('sdba/CanESM2_1950-2100.nc', chunks={'time': -1, 'location': -1})
pr_valid = xc.core.missing.missing_wmo(ds.pr, freq="YS")
```
The last line took me 115 s. More importantly, counting the number of tasks with `len(ds.pr.__dask_graph__().keys())`, I see an increase from 6 to 95304 tasks. This is insane!
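For reference, here is a toy example (independent of xclim) of how this task count is measured, and why graphs inflate: every chunked operation adds roughly one task per chunk, so a routine that builds many intermediate arrays multiplies the graph size quickly.

```python
import dask.array as da

# A toy array split into 4 chunks: the initial graph holds one task per chunk.
x = da.ones((100, 100), chunks=(50, 50))
n_tasks = len(x.__dask_graph__().keys())

# Each elementwise operation adds another layer of one task per chunk,
# so chaining many small operations inflates the graph fast.
y = (x + 1) * 2
n_tasks_after = len(y.__dask_graph__().keys())
```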
A probable solution would be to wrap as much of the computation as possible into a single apply_ufunc or map_blocks call to aggregate the tasks. We could also look into flox for help, since we are grouping and applying a function along the time axis.
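To illustrate the apply_ufunc route, here is a minimal sketch. The reducing function `frac_missing` is a hypothetical placeholder (it only computes the fraction of missing values; the real WMO criterion also checks runs of consecutive missing days and per-period grouping), but it shows the shape of the fix: one vectorized call over the whole time axis yields roughly one task per spatial chunk instead of thousands of tiny ones.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy daily data with a long gap of NaNs in the first series.
time = pd.date_range("2000-01-01", periods=365, freq="D")
data = np.random.rand(365, 10)
data[5:40, 0] = np.nan
pr = xr.DataArray(data, dims=("time", "location"),
                  coords={"time": time}).chunk({"time": -1, "location": 5})

def frac_missing(arr, axis):
    # Placeholder aggregation: fraction of missing values along time.
    # The real missing_wmo logic would go here, written against plain
    # NumPy arrays so the whole check runs inside one task per chunk.
    return np.isnan(arr).mean(axis=axis)

# "time" is a core dim (single chunk along it), so dask="parallelized"
# maps frac_missing once per spatial chunk rather than per element.
out = xr.apply_ufunc(
    frac_missing, pr,
    input_core_dims=[["time"]],
    kwargs={"axis": -1},
    dask="parallelized",
    output_dtypes=[float],
)
```

With this structure the graph stays proportional to the number of spatial chunks, and swapping `frac_missing` for a flox-backed grouped reduction would handle the `freq="YS"` grouping the same way.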