Import datatree in xarray? #7418

Closed
wants to merge 28 commits into from
28 commits
7c6fa70
list datatree in public API
TomNicholas Jan 4, 2023
5ef43be
attempt to import datatree API on xarray import
TomNicholas Jan 4, 2023
d184764
incorporate datatree links into io docs on groups
TomNicholas Jan 4, 2023
d986df3
Merge branch 'main' into import_datatree
TomNicholas Jan 4, 2023
d2e8ec3
add Dataset.to_datatree() method
TomNicholas Jan 12, 2023
08ff5c4
Merge branch 'import_datatree' of https://github.com/TomNicholas/xarr…
TomNicholas Jan 12, 2023
1401ca5
Merge branch 'main' into import_datatree
TomNicholas Jan 25, 2023
b153152
Merge branch 'main' into import_datatree
TomNicholas Jan 27, 2023
c5b8d10
add test that DataTree class can be imported
TomNicholas Jan 31, 2023
62b5e27
add to every CI environment that also has flox
TomNicholas Jan 31, 2023
ffa53c4
also check we can import accessor
TomNicholas Feb 1, 2023
a8f752d
whatsnew
TomNicholas Feb 1, 2023
eed3a71
Merge branch 'import_datatree' of https://github.com/TomNicholas/xarr…
TomNicholas Feb 1, 2023
3d3c29f
Update to_node docstring
TomNicholas Feb 1, 2023
74fea3a
Merge branch 'main' into import_datatree
TomNicholas Feb 1, 2023
95d76e6
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 1, 2023
caafe90
test .to_datatree method
TomNicholas Feb 1, 2023
462e0b3
Merge branch 'import_datatree' of https://github.com/TomNicholas/xarr…
TomNicholas Feb 1, 2023
91c6ee1
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 1, 2023
bc6a538
fix datatree import
TomNicholas Feb 1, 2023
3baf79e
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 1, 2023
667d5cd
protect my import from the exacting ruff linter
TomNicholas Feb 1, 2023
dfe763b
Merge branch 'import_datatree' of https://github.com/TomNicholas/xarr…
TomNicholas Feb 1, 2023
d231055
try installing datatree from main
TomNicholas Feb 1, 2023
ae07dfd
Update xarray/__init__.py
TomNicholas Feb 1, 2023
6343104
also import accessor and open_datatree in top-level init
TomNicholas Feb 1, 2023
395a3ae
importorskip whole test file
TomNicholas Feb 1, 2023
7cf1d55
correct package name in wheels
TomNicholas Feb 1, 2023
2 changes: 2 additions & 0 deletions ci/install-upstream-wheels.sh
@@ -20,6 +20,7 @@ conda uninstall -y --force \
bottleneck \
sparse \
flox \
xarray-datatree \
h5netcdf \
xarray
# to limit the runtime of Upstream CI
@@ -47,5 +48,6 @@ python -m pip install \
git+https://github.com/intake/filesystem_spec \
git+https://github.com/SciTools/nc-time-axis \
git+https://github.com/xarray-contrib/flox \
git+https://github.com/xarray-contrib/xarray-datatree \
git+https://github.com/h5netcdf/h5netcdf
python -m pip install pytest-timeout
1 change: 1 addition & 0 deletions ci/requirements/all-but-dask.yml
@@ -41,4 +41,5 @@ dependencies:
- sparse
- toolz
- typing_extensions
- xarray-datatree
- zarr
1 change: 1 addition & 0 deletions ci/requirements/environment-py311.yml
@@ -45,4 +45,5 @@ dependencies:
# - sparse
- toolz
- typing_extensions
- xarray-datatree
- zarr
1 change: 1 addition & 0 deletions ci/requirements/environment-windows-py311.yml
@@ -41,4 +41,5 @@ dependencies:
# - sparse
- toolz
- typing_extensions
- xarray-datatree
- zarr
1 change: 1 addition & 0 deletions ci/requirements/environment-windows.yml
@@ -41,4 +41,5 @@ dependencies:
- sparse
- toolz
- typing_extensions
- xarray-datatree
- zarr
2 changes: 2 additions & 0 deletions ci/requirements/environment.yml
@@ -46,3 +46,5 @@ dependencies:
- toolz
- typing_extensions
- zarr
- pip:
- git+https://github.com/xarray-contrib/datatree
Comment on lines +49 to +50
Collaborator

is there a reason why we're installing from github here?

TomNicholas (Member, Author), Feb 1, 2023

Because I want to see if this commit to datatree fixes the mypy issue without releasing a whole new version of datatree just to check.

14 changes: 14 additions & 0 deletions doc/api.rst
@@ -1133,6 +1133,20 @@ used filetypes in the xarray universe.
backends.StoreBackendEntrypoint
backends.ZarrBackendEntrypoint

DataTree
========

Experimental API for handling nested groups of data.
Requires the `xarray-datatree package <https://github.com/xarray-contrib/datatree>`_ to be installed.
See the `datatree documentation <https://xarray-datatree.readthedocs.io/en/latest/>`_ for details.

.. autosummary::
:toctree: generated/

DataTree
open_datatree
register_datatree_accessor

Deprecated / Pending Deprecation
================================

48 changes: 45 additions & 3 deletions doc/user-guide/io.rst
@@ -156,6 +156,9 @@ to the original netCDF file, regardless if they exist in the original dataset.
Groups
~~~~~~

Single groups as datasets
.........................

NetCDF groups are not supported as part of the :py:class:`Dataset` data model.
Instead, groups can be loaded individually as Dataset objects.
To do so, pass a ``group`` keyword argument to the
@@ -228,10 +231,34 @@ Either of these groups can be loaded from the file as an independent :py:class:`
Data variables:
b int64 ...

.. note::
.. _io.netcdf_datatree_groups:

Multiple Groups as a DataTree
.............................

For native handling of multiple groups with xarray, including I/O, you might be interested in the experimental
`xarray-datatree <https://github.com/xarray-contrib/datatree>`_ package.
If installed, this package's API can be imported directly from xarray, i.e. ``from xarray import DataTree``.

Whilst netCDF groups can only be loaded individually as Dataset objects, a whole file of many nested groups can be loaded
as a single :py:class:`DataTree` object.
To open a whole netCDF file as a tree of groups use the :py:func:`open_datatree()` function.
To save a DataTree object as a netCDF file containing many groups, use the :py:meth:`DataTree.to_netcdf()` method.

.. _netcdf.group.warning:

.. warning::
``DataTree`` objects do not follow the exact same data model as netCDF files, which means that perfect round-tripping
Collaborator

Is that intentionally preformatted, or would it make sense to convert it to a link? (that's really minor, though)

is not always possible.

In particular in the netCDF data model dimensions are entities that can exist regardless of whether any variable possesses them.
This is in contrast to `xarray's data model <https://docs.xarray.dev/en/stable/user-guide/data-structures.html>`_
(and hence `datatree's data model <https://xarray-datatree.readthedocs.io/en/latest/data-structures.html>`_) in which the dimensions of a (Dataset/Tree)
object are simply the set of dimensions present across all variables in that dataset.

For native handling of multiple groups with xarray, including I/O, you might be interested in the experimental
`xarray-datatree <https://github.com/xarray-contrib/datatree>`_ package.
This means that if a netCDF file contains dimensions but no variables which possess those dimensions,
these dimensions will not be present when that file is opened as a DataTree object.
Saving this DataTree object to file will therefore not preserve these "unused" dimensions.
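
The data-model mismatch described in this warning can be sketched with a toy model in plain Python. This is not xarray or datatree code: the dictionary layout and the `derived_dims` helper are hypothetical stand-ins, assuming only what the warning states, namely that an opened object's dimensions are the set of dimensions present across its variables.

```python
def derived_dims(variables):
    """Return {dim name: size} collected from the variables alone (toy model)."""
    dims = {}
    for var_name, (dim_names, shape) in variables.items():
        for dim, size in zip(dim_names, shape):
            if dims.setdefault(dim, size) != size:
                raise ValueError(f"conflicting sizes for dimension {dim!r}")
    return dims


# Toy stand-in for a netCDF group: dimensions are declared independently
# of the variables, so "unused" exists although no variable refers to it.
netcdf_group = {
    "dimensions": {"x": 3, "unused": 5},
    "variables": {"a": (("x",), (3,))},
}

# Toy stand-in for opening the group: only variable-backed dims survive.
opened_dims = derived_dims(netcdf_group["variables"])
lost = sorted(set(netcdf_group["dimensions"]) - set(opened_dims))
print(opened_dims, lost)
```

In this model `"unused"` is dropped on opening, so writing the opened object back cannot restore it, which is the round-tripping gap the warning refers to.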


.. _io.encoding:
@@ -633,6 +660,21 @@ To read back a zarr dataset that has been created this way, we use the
ds_zarr = xr.open_zarr("path/to/directory.zarr")
ds_zarr

Groups
~~~~~~

Like for netCDF, zarr groups can either be opened as individual :py:class:`Dataset` objects using the ``group`` keyword argument to :py:func:`open_dataset`,
or alternatively nested groups in zarr stores can be represented by loading the store as a :py:class:`DataTree` object.
(The latter option requires that you have the `xarray-datatree <https://github.com/xarray-contrib/datatree>`_ package installed.)

To open a whole zarr store as a tree of groups use the :py:func:`open_datatree()` function.
To save a DataTree object as a zarr store containing many groups, use the :py:meth:`DataTree.to_zarr()` method.

.. note::
Note that perfect round-tripping should always be possible with a zarr store (:ref:`unlike for netCDF files<netcdf.group.warning>`),
as zarr does not support "unused" dimensions.


Cloud Storage Buckets
~~~~~~~~~~~~~~~~~~~~~

9 changes: 9 additions & 0 deletions doc/whats-new.rst
@@ -23,6 +23,15 @@ v2023.01.1 (unreleased)
New Features
~~~~~~~~~~~~

- Allow importing the prototype :py:class:`DataTree` class (as well as the accompanying :py:func:`open_datatree()` and :py:func:`register_datatree_accessor` functions).
Currently ``from xarray import DataTree`` disguises an import from a separate package ``xarray-contrib/xarray-datatree``.
Importing these features will raise an ``ImportError`` unless the datatree package is installed.
Full integration of the :py:class:`DataTree` class in xarray is planned in the future (see our development roadmap),
but for now is proceeding on a provisional basis, and as such the API is still experimental and subject to change without notice.
In the meantime, you are encouraged to try using these features, and please let us know about your experiences!
(:issue:`4118`, :pull:`7418`)
By `Tom Nicholas <https://github.com/TomNicholas>`_.
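
A minimal sketch of why a silent `try` / `except ImportError: pass` in a package's `__init__` still surfaces as an `ImportError` for users, as this entry describes. The module names `host_pkg_demo` and `optional_dep_that_is_not_installed` are hypothetical stand-ins for xarray and a missing datatree; nothing here is xarray code.

```python
import sys
import types

# Hypothetical host package standing in for xarray.
host = types.ModuleType("host_pkg_demo")

try:
    # Stands in for `from datatree import DataTree` on a machine
    # where the optional dependency is absent.
    from optional_dep_that_is_not_installed import DataTree

    host.DataTree = DataTree
except ImportError:
    pass  # the name is simply never bound on the host module

sys.modules["host_pkg_demo"] = host

try:
    from host_pkg_demo import DataTree  # noqa: F401

    outcome = "imported"
except ImportError:
    outcome = "ImportError"

print(outcome)
```

Because the name is never bound on the host module, `from host_pkg_demo import DataTree` fails at the call site, while a plain `import host_pkg_demo` keeps working.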


Breaking changes
~~~~~~~~~~~~~~~~
6 changes: 6 additions & 0 deletions xarray/__init__.py
@@ -52,6 +52,12 @@
# Disable minimum version checks on downstream libraries.
__version__ = "999"

try:
from datatree import DataTree, register_datatree_accessor, open_datatree # noqa
except ImportError:
pass


# A hardcoded __all__ variable is necessary to appease
# `mypy --strict` running in projects that import xarray.
__all__ = (
42 changes: 42 additions & 0 deletions xarray/core/dataarray.py
@@ -3656,6 +3656,48 @@ def reduce(
var = self.variable.reduce(func, dim, axis, keep_attrs, keepdims, **kwargs)
return self._replace_maybe_drop_dims(var)

def to_datatree(self, node_name: str | None = None, name: str | None = None):
"""
Convert this dataarray into a datatree.DataTree.

WARNING: The DataTree structure is considered experimental,
and the API is less solidified than for other xarray features.

The returned tree will only consist of a single node.
That node will contain a copy of the dataarray's data,
meaning including its coordinates, dimensions and attributes.

Requires the xarray-datatree package to be installed.
Find it at https://github.com/xarray-contrib/datatree.
Comment on lines +3663 to +3671
Collaborator

should this also be moved into a warning block?


Parameters
----------
node_name: str, optional
The name of the datatree node created.
name: str, optional
Name to substitute for this array's name.

Returns
-------
dt : DataTree
A single-node datatree object, containing the information from this dataarray.

See Also
--------
datatree.DataTree
"""

try:
from datatree import DataTree
except ImportError:
raise ImportError(
"Could not import the datatree package. "
"Find it at https://github.com/xarray-contrib/datatree"
)

ds = self.to_dataset(name=name)
return DataTree(data=ds, name=node_name)

def to_pandas(self) -> DataArray | pd.Series | pd.DataFrame:
"""Convert this array into a pandas object with the same shape.

39 changes: 39 additions & 0 deletions xarray/core/dataset.py
@@ -6116,6 +6116,45 @@ def to_array(

return DataArray._construct_direct(variable, coords, name, indexes)

def to_datatree(self, node_name: str | None = None):
"""
Convert this dataset into a datatree.DataTree.

.. warning:: The DataTree structure is considered experimental,
and the API is less solidified than for other xarray features.
Comment on lines +6123 to +6124
Collaborator

not sure if I just don't know enough about rst, but I wonder if it would be better to move the whole text into the block?

Suggested change
- .. warning:: The DataTree structure is considered experimental,
- and the API is less solidified than for other xarray features.
+ .. warning::
+     The DataTree structure is considered experimental, and the API
+     is less solidified than for other xarray features.


The returned tree will only consist of a single node.
That node will contain a copy of the dataset's data,
meaning all variables, coordinates, dimensions and attributes.

Requires the xarray-datatree package to be installed.
Find it at https://github.com/xarray-contrib/datatree.

Parameters
----------
node_name: str, optional
The name of the datatree node created.

Returns
-------
dt : DataTree
A single-node datatree object, containing the information from this dataset.

See Also
--------
datatree.DataTree
"""

try:
from datatree import DataTree
except ImportError:
raise ImportError(
"Could not import the datatree package. "
"Find it at https://github.com/xarray-contrib/datatree"
)

return DataTree(data=self, name=node_name)
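
Both `to_datatree` methods defer the datatree import until call time and re-raise with an install hint. One possible variation, not what this PR does, is to chain the new error to the original with `raise ... from err` so the underlying failure stays visible in the traceback. The helper name `require_optional` and the missing module name are hypothetical.

```python
import importlib


def require_optional(module_name, attr, hint):
    """Import `attr` from `module_name`, or raise ImportError with `hint`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as err:
        # `from err` keeps the original failure attached as __cause__.
        raise ImportError(
            f"Could not import the {module_name} package. {hint}"
        ) from err
    return getattr(module, attr)


# Usage with a stdlib module, so the sketch runs anywhere.
loads = require_optional("json", "loads", "It ships with Python.")

try:
    require_optional(
        "optional_dep_that_is_not_installed",
        "DataTree",
        "Find it at https://github.com/xarray-contrib/datatree",
    )
except ImportError as err:
    message = str(err)
    cause = err.__cause__
```
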

def _normalize_dim_order(
self, dim_order: Sequence[Hashable] | None = None
) -> dict[Hashable, int]:
5 changes: 5 additions & 0 deletions xarray/core/types.py
@@ -28,6 +28,11 @@
from xarray.core.indexes import Index
from xarray.core.variable import Variable

try:
from datatree import DataTree as T_DataTree
except ImportError:
T_DataTree = Any

try:
from dask.array import Array as DaskArray
except ImportError:
1 change: 1 addition & 0 deletions xarray/tests/__init__.py
@@ -82,6 +82,7 @@ def _importorskip(
has_pint, requires_pint = _importorskip("pint")
has_numexpr, requires_numexpr = _importorskip("numexpr")
has_flox, requires_flox = _importorskip("flox")
has_datatree, requires_datatree = _importorskip("datatree")


# some special cases
29 changes: 29 additions & 0 deletions xarray/tests/test_datatree.py
@@ -0,0 +1,29 @@
import pytest

import xarray.testing as xrt
from xarray import Dataset, DataTree

pytest.importorskip("datatree")

Collaborator

since the whole module depends on datatree, I'd call pytest.importorskip("datatree") somewhere at the top of the module:

Suggested change
pytest.importorskip("datatree")

then we don't need to decorate every test with requires_datatree

If we want to reuse requires_datatree, we can use:

Suggested change
pytestmark = [requires_datatree]
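
The module-level skip suggested here only helps if the guard runs before any import that needs the optional package. In the file as shown, `from xarray import Dataset, DataTree` sits above the `pytest.importorskip("datatree")` call, so when datatree is absent that import would presumably raise before the skip is reached. A stdlib-only analogue of the idea, using `unittest.SkipTest` rather than pytest itself; the `importorskip` helper below is a hypothetical stand-in, not pytest's implementation.

```python
import importlib
import unittest


def importorskip(module_name):
    """Return the imported module, or raise SkipTest if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise unittest.SkipTest(f"could not import {module_name!r}") from err


# The guard runs first; names that need the optional package come after it.
json = importorskip("json")  # stdlib module, always importable

try:
    importorskip("optional_dep_that_is_not_installed")
    skipped = False
except unittest.SkipTest:
    skipped = True
```
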


def test_import_datatree():
"""Just test importing datatree package from xarray-contrib repo"""

DataTree()


def test_to_datatree():

ds = Dataset({"a": ("x", [1, 2, 3])})
dt = ds.to_datatree(node_name="group1")

assert isinstance(dt, DataTree)
assert dt.name == "group1"
xrt.assert_identical(dt.to_dataset(), ds)

da = ds["a"]
dt = da.to_datatree(node_name="group1")

assert isinstance(dt, DataTree)
assert dt.name == "group1"
xrt.assert_identical(dt["a"], da)