First execution of groupby on Xarray with Flox takes a lot of time #365

josephnowak opened this issue May 21, 2024 · 4 comments

josephnowak commented May 21, 2024

Hi, I have been using Flox with Xarray for some experiments and it works nicely in many cases. However, when I try to create unit tests with small data, execution takes longer than expected: the code below takes 60 s to run, but only the first time I run it, which I suppose is related to some just-in-time compilation. I have also noticed that it prints arrays multiple times (please see the screenshot showing the place in the code with a possibly unexpected print), which I suppose could be affecting performance; again, this only happens on the first run, which is even stranger.

Note: if I uninstall Flox, everything runs faster; that's why I'm reporting this here and not in the Xarray repo.

import time

import pandas as pd
import xarray as xr

dates = pd.to_datetime(
    [
        "2020-12-30",
        "2021-01-02",
        "2021-01-03",
        "2021-01-05",
        "2021-01-07",
        "2021-01-09",
        "2021-01-11",
    ]
)
# Small 7x7 array, chunked so the groupby runs through dask (and hence flox)
new_data = xr.DataArray(
    [
        [10, 9.1, 8, 7, 6, 2, 1],
        [6, 5, 7, 7, 4, 2, 1],
        [2, 9, 4, 5, 3, 2, 1],
        [-10, 2, 7, 7, 9, 2, 1],
        [4, 7, 3, -5, 6, 2, 1],
        [10, 1, 2, 7, 3, 2, 1],
        [12, 4, 5, -1, 0, 2, 1],
    ],
    coords={
        "date": dates,
        "a": ["a", "b", "c", "d", "e", "f", "g"],
    },
).chunk({"date": 2, "a": 5})

# Time the first groupby reduction; on the first run this includes any JIT cost
start = time.time()
new_data.groupby("a").max("a").compute()
print(time.time() - start)

[screenshot: first-run timing output]

Arrays printed during the execution:
[screenshot: arrays printed to the console during execution]

INSTALLED VERSIONS

commit: None
python: 3.11.7 | packaged by conda-forge | (main, Dec 23 2023, 14:43:09) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 4.14.275-207.503.amzn2.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.3
libnetcdf: None

xarray: 2024.5.0
pandas: 2.2.1
numpy: 1.26.3
scipy: 1.11.4
netCDF4: None
pydap: None
h5netcdf: None
h5py: 3.10.0
zarr: 2.16.1
cftime: None
nc_time_axis: None
iris: None
bottleneck: 1.3.7
dask: 2024.5.1
distributed: 2024.5.1
matplotlib: 3.8.2
cartopy: None
seaborn: 0.13.1
numbagg: 0.8.1
fsspec: 2023.12.2
cupy: None
pint: None
sparse: 0.15.1
flox: 0.9.7
numpy_groupies: 0.10.2
setuptools: 69.0.3
pip: 23.3.2
conda: 23.11.0
pytest: 7.4.4
mypy: None
IPython: 8.20.0
sphinx: None

dcherian (Collaborator) commented

You have numbagg installed, so this is some JIT overhead. Some background here: numbagg/numbagg#220.

You can avoid this by setting engine="flox" or engine="numpy".
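For example, a minimal sketch (assuming recent flox/xarray versions; xarray forwards extra keyword arguments such as engine to flox when it is installed, and flox.xarray.xarray_reduce exposes the same parameter):

import flox.xarray

# Request the pure-numpy "flox" engine so no numba/numbagg JIT is triggered
result = new_data.groupby("a").max("a", engine="flox").compute()

# Equivalent, calling flox directly instead of going through xarray's groupby
result = flox.xarray.xarray_reduce(new_data, "a", func="max", engine="flox").compute()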

That print statement is in the benchmarking code, so it shouldn't affect what you're seeing.

I deliberately chose not to allow selectively disabling numbagg. Would you prefer that option exist? Or would you prefer that, for now, flox prioritize another approach (probably flox over numbagg)?

josephnowak (Author) commented May 21, 2024

I have used Numbagg in other projects, and there I regularly see a first execution of only a couple of seconds, not a minute; but Flox probably uses many of its functions, which would explain the time. Thanks for the clarification.

Since changing the engine avoids the JIT overhead for the smaller cases I use in my tests (I did not know the engine parameter existed, so thank you), I think it's fine to keep the current behavior.

Regarding the printed arrays: I think the problem was caused by the Numbagg package; apparently release 0.7.0 left those prints in, and it was fixed later. I updated the package and restarted my Jupyter kernel, and they are no longer printed.
[screenshot: no arrays printed after updating numbagg]

dcherian (Collaborator) commented

> Flox is using many functions of it

No, I believe it compiles for all combinations of dtypes in one go. @max-sixty has more context on why the groupby aggregation functions take a while to compile.

Your point about "small" datasets is interesting. Perhaps the automatic choice should be to use numbagg only for "big enough" problems for some definition of big enough.
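One possible shape for that, as a hypothetical sketch (the function name and threshold are made up for illustration and are not flox's actual engine-selection logic):

import numpy as np

def choose_engine(arr: np.ndarray, threshold: int = 1_000_000) -> str:
    # Only pay numbagg's one-time JIT cost when the array is large enough
    # to amortize it; otherwise use the compilation-free "flox" engine.
    return "numbagg" if arr.size >= threshold else "flox"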

max-sixty (Contributor) commented

On the print statement — yes, apologies. FWIW, that was only the most current version for ~36 hours, IIRC. There's now a check to ensure this can't happen again.


On the compilation time — this is a flamegraph of the total time for your excellent example. Indeed, it's basically all compilation; the execution times are the small sliver at the very end of each block.

I get a total execution time of 18 seconds, with ~8 s of compilation for each block.

[flamegraph: total runtime dominated by compilation]

We do have benchmarks for these, but we haven't had them running on each commit! I could run them nightly so we have proper measurements.


There are two changes from numba which will make a big difference:

  • Dynamic compilation for parallel functions — so compilation won't need to happen for every type, only for the types that are requested ("Support parallel dynamic gufuncs", numba/numba#9377). This would be a big win across the board, cutting compilation times by ~8x for the groupbys, which have lots of combinations of types (since they're multiplicative).
  • Cached compilation for parallel functions — so compilation will happen once per machine, rather than once per session. This would basically eliminate this type of issue, where starting a new process in a loop requires repeatedly re-JIT-ing ("Cache causes Segmentation Faults when generated in parallel", numba/numba#4807).

Currently, neither of these is compatible with numba's parallel target. The first is being worked on upstream, but I'm not sure when to expect it; we probably shouldn't plan around them.
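For context, a small illustration of the caching side of this (a toy function, not numbagg's actual code): numba can persist compiled machine code across sessions for ordinary jitted functions, but per numba/numba#4807 that isn't yet reliable for the parallel gufuncs numbagg generates.

import numba

@numba.njit(cache=True)  # compiled once per machine; reused across sessions
def add_one(x):
    return x + 1

add_one(1)  # first call compiles (or loads the on-disk cache); later calls are fast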


Partitioning based on the size of the array may make sense. Of course, if you're doing computations over lots of small arrays, you probably want to pay the upfront cost anyway, but it would still be a reasonable heuristic for spotting this sort of workload.
