[API] Long-term transition away from pandas into xarray, or something that supports structured NDarrays instead #113

Open
adam2392 opened this issue Feb 10, 2023 · 0 comments

@adam2392
Collaborator

Problem Statement

We commonly deal with structured data where we need to know the labeling of the axes and of the different datasets. For example, when passing in a 2D numpy array, one would expect the columns to be features and the rows to be samples. However, a causal graph whose nodes are just integer indices is hard to read and interpret. Therefore, we typically use a pandas DataFrame instead, so we can attach a name to each node in the graph.

However, things get more complicated as we move towards general causal discovery, where we want to support multiple datasets. This is not easy with a numpy array, because there is an additional axis to keep track of and you have to remember the convention for what that axis means. This is not an issue for purely observational data: if you have multiple observational datasets, you would typically just concatenate them along the sample axis. For interventions and multi-environment learning, however, the datasets must stay distinguishable. For example, when you pass data to an interventional causal discovery algorithm, you want to index each dataset separately and know which intervention it corresponds to. There is no good way to do this with pandas. Multi-indexing is confusing to work with, imo. Moreover, pandas itself has moved away from multi-dimensional data structures: the N-dimensional Panel type was deprecated and later removed, and users were pointed to MultiIndex or xarray instead.

https://stackoverflow.com/questions/42876278/when-to-use-multiindexing-vs-xarray-in-pandas
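
To make the bookkeeping problem concrete, here is a minimal sketch of the MultiIndex workaround. The node names, the environment labels `"obs"` and `"do_x"`, and the `targets` dict are all made up for illustration; none of this is the package's actual API.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
nodes = ["x", "y", "z"]  # hypothetical node names

# One observational and one interventional dataset with different sample counts.
obs = pd.DataFrame(rng.normal(size=(5, 3)), columns=nodes)
do_x = pd.DataFrame(rng.normal(size=(4, 3)), columns=nodes)

# Stack both frames, adding an outer index level that names the environment.
stacked = pd.concat({"obs": obs, "do_x": do_x}, names=["env", "sample"])

# The intervention targets do not fit on the frame, so they travel separately
# and must be kept in sync with the "env" level by hand.
targets = {"obs": [], "do_x": ["x"]}

# Getting one dataset back means slicing the outer index level.
do_x_again = stacked.xs("do_x", level="env")
```

Every function that consumes `stacked` also has to receive `targets` and agree on what the `"env"` level means, which is the convention-tracking burden described above.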

Possible solutions

I don't think this is something we need to change right away. We can make do with multi-indexing or ad hoc APIs in the meantime, but for longer-term stability we might consider defining the internal dataset as an xarray object. We should still accept pandas and numpy input, but internally it would be converted to an xarray object, which is then used for causal discovery. This removes the need to pass around, e.g., a list of pandas DataFrames together with lists of intervention targets and names, or a list of numpy arrays together with lists of node names and intervention targets. Instead, we should strive to pass around a single xarray object, `data`, which is guaranteed to carry the relevant information.
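
As a rough sketch of what such an internal container could look like: the helper name `to_xarray_dataset`, the dimension names `env`/`sample`/`node`, the variable names, and the choice to pad shorter datasets with NaN are all assumptions made for illustration, not a proposed final design.

```python
import numpy as np
import pandas as pd
import xarray as xr


def to_xarray_dataset(frames, targets):
    """Hypothetical helper: pack per-environment DataFrames into one xr.Dataset.

    frames:  dict mapping environment name -> DataFrame (columns are node names)
    targets: dict mapping environment name -> list of intervened node names
    """
    envs = list(frames)
    nodes = list(frames[envs[0]].columns)
    n_max = max(len(df) for df in frames.values())

    # Environments may differ in sample count; pad the shorter ones with NaN.
    values = np.full((len(envs), n_max, len(nodes)), np.nan)
    for i, env in enumerate(envs):
        df = frames[env][nodes]  # enforce a common column order
        values[i, : len(df), :] = df.to_numpy()

    # Boolean mask marking which nodes were intervened on in each environment.
    is_target = np.array([[n in targets[env] for n in nodes] for env in envs])

    return xr.Dataset(
        data_vars={
            "values": (("env", "sample", "node"), values),
            "is_target": (("env", "node"), is_target),
        },
        coords={"env": envs, "sample": np.arange(n_max), "node": nodes},
    )


rng = np.random.default_rng(0)
nodes = ["x", "y", "z"]
frames = {
    "obs": pd.DataFrame(rng.normal(size=(5, 3)), columns=nodes),
    "do_x": pd.DataFrame(rng.normal(size=(4, 3)), columns=nodes),
}
ds = to_xarray_dataset(frames, targets={"obs": [], "do_x": ["x"]})

# Data and intervention targets are both selected by label from one object.
x_under_do_x = ds["values"].sel(env="do_x", node="x")
x_is_target = bool(ds["is_target"].sel(env="do_x", node="x"))
```

Here the node names, the environment labels, and the intervention targets all live on `ds`, so an algorithm would only need that one argument.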
