Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Import datatree in xarray? #7418

Closed
wants to merge 28 commits into from
Closed

Conversation

TomNicholas
Copy link
Member

@TomNicholas TomNicholas commented Jan 4, 2023

I want datatree to live in xarray main, as right now it's in a separate package but imports many xarray internals.

This presents a few questions:

  1. At what stage is datatree "ready" to moved in here? At what stage should it become encouraged public API?
  2. What's a good way to slowly roll the feature out?
  3. How do I decrease the bus factor on datatree's code? Can I get some code reviews during the merging process? 🙏
  4. Should I make a new CI environment just for testing datatree stuff?

Today @jhamman and @keewis suggested for now I make it so that you can from xarray import DataTree, using the current xarray-datatree package as an optional dependency. That way I can create a smoother on-ramp, get some more users testing it, but without committing all the code into this repo yet.

@pydata/xarray what do you think? Any other thoughts about best practices when moving a good few thousand lines of code into xarray?

@TomNicholas TomNicholas added topic-DataTree Related to the implementation of a DataTree class dependencies Pull requests that update a dependency file labels Jan 4, 2023
@github-actions github-actions bot removed topic-DataTree Related to the implementation of a DataTree class dependencies Pull requests that update a dependency file labels Jan 4, 2023
Comment on lines 55 to 58
try:
from datatree import DataTree, open_datatree, register_datatree_accessor
except ImportError:
...
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is almost certainly a better way to make from xarray import DataTree work than this

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine as a temporary solution.

@TomNicholas TomNicholas added dependencies Pull requests that update a dependency file topic-DataTree Related to the implementation of a DataTree class labels Jan 4, 2023
@github-actions github-actions bot removed topic-DataTree Related to the implementation of a DataTree class dependencies Pull requests that update a dependency file labels Jan 4, 2023
@Illviljan
Copy link
Contributor

Add a py.typed file in datatree to fix the mypy error: https://github.com/pydata/xarray/blob/main/xarray/py.typed

@TomNicholas
Copy link
Member Author

There is also at least one bug in datatree that cannot be fixed without a (small) change to xarray, and having datatree as an optional import means I could fix it here.

@rabernat
Copy link
Contributor

rabernat commented Jan 5, 2023

  • At what stage is atatree "ready" to moved in here? At what stage should it become encouraged public API?

My opinion is that Datatree should move into Xarray now, ideally in a way that does not disrupt any existing user code, and that Datatree should become a first-class Xarray object (together with DataArray, and Dataset). Since it's a new feature, we don't necessarily have to be super conservative here. I think it is more than good enough / stable enough in its current state.

  • What's a good way to slowly roll the feature out?

Since Datatree sits above DataArray and Dataset, it should not interfere with any of our existing API. As long as test coverage is good, documentation is solid, and the code style matches the rest of Xarray, I think we can just bring it in.

  • How do I decrease the bus factor on datatree's code? Can I get some code reviews during the merging process? 🙏

I think that it is inevitable that you Tom will be the main owner of the Datatree code at the beginning (as @shoyer was of all of Xarray when he first released it). Over time, if people use it, some fraction of users will become maintainers, starting with the existing dev team.

  • Should I make a new CI environment just for testing datatree stuff?

Why? Are its dependencies different from Xarray?

@TomNicholas
Copy link
Member Author

Why? Are its dependencies different from Xarray?

No, datatree has no additional dependencies. I was just asking because if we went for the "import from second repository" plan we may want to test that the import works as part of our CI. Not a major issue though.

@rabernat
Copy link
Contributor

rabernat commented Jan 5, 2023

I personally favor just copying the code into Xarray and archiving the old repo.

@jhamman
Copy link
Member

jhamman commented Jan 5, 2023

I personally favor just copying the code into Xarray and archiving the old repo.

I also lean in this direction. At this point, I see little downside to making this change at this point. My suggestion to import xarray-datatree into xarray was meant low-lift compromise.

@benbovy
Copy link
Member

benbovy commented Jan 5, 2023

I don't have strong opinions for or against including datatree in Xarray. It indeed makes sense if it is using many Xarray internals and if there are many existing or potential applications for it. Additional load (CI) is fine if datatree doesn't bring any extra dependency and won't do so in the near future (which seems to be the case).

Datatree should become a first-class Xarray object

Since Datatree sits above DataArray and Dataset, it should not interfere with any of our existing API.

Would it mean that if someone wants to later add any feature "x" or "y" into Xarray, they just need implementing the feature for Dataset (and possibly DataArray) and it will be guaranteed to work with Datatree? (I guess so but I'm not familiar enough with Datatree to know it for sure).

Otherwise, if there is any extra implementation effort required to make feature "x" or "y" work with Datatree, then I'm concerned about the additional burden or obstacle for future contributors and maintainers. Or we could say that this is OK to leave datatree support and wait for someone to take care of it later, but I don't think it is ideal to have such non-synchronized state within Xarray itself.

@benbovy
Copy link
Member

benbovy commented Jan 5, 2023

Again, there is likely more good reasons merging the Datatree code with Xarray than not doing it, but IMHO such decision should be made very carefully. You certainly do know better than me what positive vs. negative impacts it would have here! I'm just speaking generally from my experience of having struggled while doing some heavy refactoring in Xarray recently :)

@TomNicholas
Copy link
Member Author

Would it mean that if someone wants to later add any feature "x" or "y" into Xarray, they just need implementing the feature for Dataset (and possibly DataArray) and it will be guaranteed to work with Datatree?

Basically yes, it would immediately work with Datatree. Datatree currently implements most dataset methods by literally copying them and their docstrings, and they work by mapping the method over every node in the tree. We could integrate Datatree in such a way that the additional developer effort to get a method on dataset working on Datatree would be negligible (think adding a single element with the method name to an internal list, or copy-pasting a docstring).

I don't think it is ideal to have such non-synchronized state within Xarray itself.

This is an argument for waiting before integrating.

I'm just speaking generally from my experience of having struggled while doing some heavy refactoring in Xarray recently :)

I appreciate the input @benbovy! I think the main difference between this effort and your (heroic) indexes effort is that Datatree doesn't touch any existing API.

I guess my main concern is that integrating prematurely into Xarray might give a false sense of stability - I don't want to later realize I should redesign Datatree, and have people be annoyed because they thought it was as stable as the rest of xarray.

@TomNicholas
Copy link
Member Author

Integrating upstream into xarray might also help with people trying to open their nested data formats as a datatree objects, because then we can immediately begin integrating with xarray's backend engines.

See for example this datatree issue asking about opening grib files as a datatree. It would be nice to be able to do

open_datatree("data.grib", engine="cfgrib")

@shoyer
Copy link
Member

shoyer commented Jan 11, 2023

I agree, datatree is an important data structure for Xarray. My preferred way to do this would be follow @rabernat's suggestion and to fork the code the existing repo into the Xarray main codebase.

My main concern is that we should carefully evaluate the datatree API to make sure we won't want to change it soon. Once we bring it into Xarray, there will be a higher expectation that the interface will remain stable.

@rabernat
Copy link
Contributor

we should carefully evaluate the datatree API to make sure we won't want to change it soon

I agree with this. We could use the PR process for this review.

@TomNicholas
Copy link
Member Author

I think this PR is ready now, it just fails mypy (@Illviljan I added a py.typed file to datatree but the xarray mypy CI is still not happy with it).

Once this is merged we can push on with implementing open_datatree (#7437 @jthielen) and I can take a crack at fixing xarray-contrib/datatree#146 in xarray in a follow-up PR.

@Illviljan
Copy link
Contributor

Hmm, I don't understand. Adding py.typed should be all that's needed, did that in flox and it worked great there: xarray-contrib/flox#92

Copy link
Collaborator

@keewis keewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suggestion is to follow Tom's plan to simply import datatree for 1-2 release cycles.

sounds good to me!

xarray/__init__.py Outdated Show resolved Hide resolved
@@ -52,6 +52,12 @@
# Disable minimum version checks on downstream libraries.
__version__ = "999"

try:
from datatree import DataTree # noqa
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
from datatree import DataTree # noqa
from datatree import DataTree, register_datatree_accessors # noqa: F401

I think this is why the docs build is failing. Also, not sure if the error code still works with ruff?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, not sure if the error code still works with ruff?

I actually found that if I don't add that error code then ruff replaces

try:
    from datatree import DataTree

with

try:
    pass

which obviously caused an ImportError. I thought that was surprising behavior for a linter too...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed this to

from datatree import DataTree, register_datatree_accessor, open_datatree  # noqa

now though

import xarray.testing as xrt
from xarray import Dataset
from xarray.tests import requires_datatree

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since the whole module depends on datatree, I'd call pytest.importorskip("datatree") somewhere at the top of the module:

Suggested change
pytest.importorskip("datatree")

then we don't need to decorate every test with requires_datatree

If we want to reuse requires_datatree, we can use:

Suggested change
pytestmark = [requires_datatree]

Comment on lines +6123 to +6124
.. warning:: The DataTree structure is considered experimental,
and the API is less solidified than for other xarray features.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if I just don't know enough about rst, but I wonder if it would be better to move the whole text into the block?

Suggested change
.. warning:: The DataTree structure is considered experimental,
and the API is less solidified than for other xarray features.
.. warning::
The DataTree structure is considered experimental, and the API
is less solidified than for other xarray features.

Comment on lines +3663 to +3671
WARNING: The DataTree structure is considered experimental,
and the API is less solidified than for other xarray features.

The returned tree will only consist of a single node.
That node will contain a copy of the dataarray's data,
meaning including its coordinates, dimensions and attributes.

Requires the xarray-datatree package to be installed.
Find it at https://github.com/xarray-contrib/datatree.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this also be moved into a warning block?

.. _netcdf.group.warning:

.. warning::
``DataTree`` objects do not follow the exact same data model as netCDF files, which means that perfect round-tripping
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that intentionally preformatted, or would it make sense to convert it to a link? (that's really minor, though)

Comment on lines +49 to +50
- pip:
- git+https://github.com/xarray-contrib/datatree
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason why we're installing from github here?

Copy link
Member Author

@TomNicholas TomNicholas Feb 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I want to see if this commit to datatree fixes the mypy issue without releasing a whole new version of datatree just to check.

@@ -20,6 +20,7 @@ conda uninstall -y --force \
bottleneck \
sparse \
flox \
datatree \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's probably fine as-is (and I'm always confused about the name), but should this be xarray-datatree, given that that's the package we are installing below?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably!

@TomNicholas
Copy link
Member Author

So I let this get stale, but today I met with @owenlittlejohns and @asteiker who are both very interested in helping get datatree merged into xarray upstream, which is great!

I would like to:

  1. forget about this PR, and just begin moving datatree upstream,
  2. use the merging to get good feedback on the design of datatree instead of just copy-pasting,
  3. use the merging as an opportunity to reduce the bus factor on datatree's code.

@owenlittlejohns especially seemed interested in actually understanding the guts of datatree.

It would be helpful if at least one other @pydata/xarray dev was interested and had time to review such PRs, so that I'm not just approving merging my own code. @jhamman ? @shoyer ? @keewis ? You guys have been particularly helpful with datatree so far.

@owenlittlejohns
Copy link
Contributor

@flamingbear, @eni-awowale and @lsterzinger - I just wanted to tag you here as a starting point for discussing this effort a bit more, as you all were interested in helping with this work.

@TomNicholas TomNicholas added the topic-DataTree Related to the implementation of a DataTree class label Dec 6, 2023
@TomNicholas
Copy link
Member Author

Thanks for coming to the meeting today @owenlittlejohns , @eni-awowale and @flamingbear!

We decided that opening hierarchical files into xarray is the first priority, so ideally integration of datatree into xarray should start with the backends. There is a prototype of this in #7437, but that's gone stale (cc @jthielen ?) so I think we should just use it as reference and forge ahead!

That would give us a temporary situation in which xarray.open_datatree(file) returned a datatree.DataTree object, as a stepping stone towards returning an xarray.DataTree object.

@TomNicholas
Copy link
Member Author

Closing this in favour of tracking a proper integration in #8572.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI Continuous Integration tools dependencies Pull requests that update a dependency file topic-DataTree Related to the implementation of a DataTree class topic-typing
Projects
Development

Successfully merging this pull request may close these issues.

8 participants