Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Grouping is significantly slower when adding auxiliary coordinates to the time dimension #9426

Closed
5 tasks done
tomvothecoder opened this issue Sep 3, 2024 · 3 comments · Fixed by #9429
Closed
5 tasks done
Labels
bug needs triage Issue that has not been reviewed by xarray team member

Comments

@tomvothecoder
Copy link
Contributor

What happened?

In the temporal APIs of the xCDAT package, we generate "labeled" time coordinates which allow grouping across multiple time components (e.g., "year", "month", etc.) since Xarray does not currently support this feature. These labeled time coordinates are added to the existing time dimension, then we use Xarray's .groupby() method to group on these "labeled" time coordinates.

However, I noticed .groupby() is much slower after adding these auxiliary coordinates to the time dimension. This performance slowdown also affects grouping on the original time dimension coordinates, which I don't think should happen.

In the MVCE, I generate a dummy dataset with a shape of (10000, 180, 360).

  • Grouping on time.month without auxiliary coords: 0.0119 seconds
  • Grouping on time.month with auxiliary coords: 0.4892 seconds
  • Grouping on auxiliary coords: 0.5712 seconds

Note, as the dataset size increases, the slowdown is much more prominent.

What did you expect to happen?

Xarray's groupby() should not slow down after adding auxiliary time coordinates to the time dimension (I think?).

Minimal Complete Verifiable Example

import timeit

# Date shape is (10000, 180, 360)
setup_code = """
import cftime
import numpy as np
import pandas as pd
import xarray as xr

# Define coordinates
lat = np.linspace(-90, 90, 180)
lon = np.linspace(-180, 180, 360)
time = pd.date_range("1850-01-01", periods=10000, freq="D")

# Generate random data
data = np.random.rand(len(time), len(lat), len(lon))

# Create dataset
ds = xr.Dataset(
    {"random_data": (["time", "lat", "lon"], data)},
    coords={"lat": lat, "lon": lon, "time": time},
)
"""

# Setup code to add auxiliary coords to time dimension.
setup_code2 = """
labeled_time = []
for coord in ds.time:
    labeled_time.append(cftime.datetime(coord.dt.year, coord.dt.month, 1))

ds.coords["year_month"] = xr.DataArray(data=labeled_time, dims="time")
"""

# 1. Group by month component (without auxiliary coords) -- very fast
#    Test case 1 execution time: 0.011903275270015001 seconds
# -----------------------------------------------------------------------------
test_case1 = """
ds_gb1 = ds.groupby("time.month")
"""
time_case1 = timeit.timeit(test_case1, setup=setup_code, globals=globals(), number=5)

print(f"Test case 1 execution time: {time_case1} seconds")


# 2. Group by month component (with auxiliary coords) -- very slow now
#    Test case 2 execution time: 0.48919129418209195 seconds
# -----------------------------------------------------------------------------
time_case2 = timeit.timeit(
    test_case1, setup=setup_code + setup_code2, globals=globals(), number=5
)
print(f"Test case 2 execution time: {time_case2} seconds")

# 3. Group by year_month coordinates -- very slow
#    Test case 3 execution time: 0.5711932387202978 seconds
# -----------------------------------------------------------------------------
test_case3 = """
ds_gb3 = ds.groupby("year_month")
"""
time_case3 = timeit.timeit(
    test_case3, setup=setup_code + setup_code2, globals=globals(), number=5
)

print(f"Test case 3 execution time: {time_case3} seconds")

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

Related to PR #689 in the xCDAT repo.

Our workaround was to replace the dimension coordinates with the auxiliary coordinates for grouping purposes, which sped up grouping significantly.

Environment

INSTALLED VERSIONS ------------------ commit: None python: 3.12.5 | packaged by conda-forge | (main, Aug 8 2024, 18:36:51) [GCC 12.4.0] python-bits: 64 OS: Linux OS-release: 5.14.21-150400.24.81_12.0.87-cray_shasta_c machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: None libnetcdf: None

xarray: 2024.7.0
pandas: 2.2.2
numpy: 2.1.0
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
zarr: None
cftime: 1.6.4
nc_time_axis: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 73.0.1
pip: 24.2
conda: None
pytest: None
mypy: None
IPython: 8.27.0
sphinx: None

@tomvothecoder tomvothecoder added bug needs triage Issue that has not been reviewed by xarray team member labels Sep 3, 2024
@dcherian
Copy link
Contributor

dcherian commented Sep 3, 2024

I can't reproduce the relative difference here. Also, why not just doresample(time="ME")

EDIT: welp, now I can.

@dcherian
Copy link
Contributor

dcherian commented Sep 3, 2024

As an aside, why not use .resample(time="ME")?

dcherian added a commit to dcherian/xarray that referenced this issue Sep 4, 2024
dcherian added a commit to dcherian/xarray that referenced this issue Sep 4, 2024
@dcherian
Copy link
Contributor

dcherian commented Sep 4, 2024

Turns out deep-copying cftime object is quite slow, and we don't need the deep-copy anyway

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug needs triage Issue that has not been reviewed by xarray team member
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants