[API] Long-term transition away from pandas into xarray, or something that supports structured NDarrays instead #113

adam2392 · 2023-02-10T23:53:40Z

Problem Statement

We commonly deal with structured data where we need to know how the axes are labeled and which dataset each array belongs to. For example, when passing in a 2D numpy array, one would expect the columns to be features and the rows to be samples. However, a causal graph whose nodes are just integer indices is really hard to read and interpret. So we typically use a pandas DataFrame instead, which lets us attach a name to each node in the graph.

However, things get more complicated as we move towards general causal discovery, where we want to support multiple datasets. This is not easy with a numpy array, because there is an additional axis to keep track of, along with the convention for what it means. It is not an issue for purely observational data: if you have multiple observational datasets, you would typically just concatenate them along the sample axis. For interventions and multi-environment learning, though, it becomes complicated. For example, when you pass data to an interventional causal discovery algorithm, each dataset needs to be indexed separately. There is no good way to do this with pandas. Multi-indexing is super confusing imo. Moreover, pandas itself has apparently been moving away from supporting multi-dimensional analysis because it is so cumbersome.

https://stackoverflow.com/questions/42876278/when-to-use-multiindexing-vs-xarray-in-pandas
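
To make the bookkeeping problem concrete, here is a rough sketch of the multi-index workaround. The node names, the two toy datasets, and the `targets` dict are all made up for illustration and are not part of the package's API.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
nodes = ["x", "y", "z"]  # hypothetical node names

# Two hypothetical datasets: one observational, one with an intervention on "y".
obs = pd.DataFrame(rng.normal(size=(4, 3)), columns=nodes)
do_y = pd.DataFrame(rng.normal(size=(5, 3)), columns=nodes)

# Stack along the sample axis, using a MultiIndex to remember the environment.
stacked = pd.concat({"obs": obs, "do_y": do_y}, names=["environment", "sample"])

# The intervention targets do not fit in the DataFrame, so they end up in a
# separate structure that every caller has to keep in sync by hand.
targets = {"obs": [], "do_y": ["y"]}

# Pulling one environment back out requires a level-aware selection.
do_y_again = stacked.xs("do_y", level="environment")
```

The data and its metadata end up in two separate objects (`stacked` and `targets`), which is exactly the coupling that gets awkward as more environments are added.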

Possible solutions

I don't think this is something we need to change right away. We can get by with multi-indexes or janky APIs in the meantime, but for longer-term stability we might consider defining the internal dataset as an xarray object. We should still accept pandas DataFrames and numpy arrays as input, but internally they would be converted to xarray, which is then used for causal discovery. This eliminates the need to pass around e.g. lists of pandas DataFrames together with lists of intervention targets and names, or lists of numpy arrays together with lists of node names and lists of intervention targets. Rather, we should strive to pass around a single data object backed by xarray, which is guaranteed to carry the relevant information.
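
As a minimal sketch of what that internal conversion could look like: the helper name `to_xarray_dataset`, the variable names `X` and `intervened`, and the dimension names are all assumptions made for illustration, not an agreed design.

```python
import numpy as np
import pandas as pd
import xarray as xr


def to_xarray_dataset(datasets, targets):
    """Hypothetical helper: merge per-environment DataFrames into one Dataset.

    datasets: dict mapping environment name -> DataFrame (columns are nodes)
    targets:  dict mapping environment name -> list of intervened node names
    """
    envs = list(datasets)
    nodes = list(datasets[envs[0]].columns)
    # Concatenate all samples; remember which environment each row came from.
    values = np.concatenate([datasets[e][nodes].to_numpy() for e in envs], axis=0)
    env_labels = np.repeat(envs, [len(datasets[e]) for e in envs])
    # Boolean mask: was this node intervened on in this environment?
    intervened = np.array([[n in targets[e] for n in nodes] for e in envs])
    return xr.Dataset(
        data_vars={
            "X": (("sample", "node"), values),
            "intervened": (("environment", "node"), intervened),
        },
        coords={
            "node": nodes,
            "environment": envs,
            "sample_environment": ("sample", env_labels),
        },
    )


rng = np.random.default_rng(0)
nodes = ["x", "y", "z"]  # hypothetical node names
obs = pd.DataFrame(rng.normal(size=(4, 3)), columns=nodes)
do_y = pd.DataFrame(rng.normal(size=(5, 3)), columns=nodes)

ds = to_xarray_dataset({"obs": obs, "do_y": do_y}, {"obs": [], "do_y": ["y"]})

# Samples, node names, and intervention targets now travel together.
do_y_samples = ds["X"].where(ds["sample_environment"] == "do_y", drop=True)
y_was_target = bool(ds["intervened"].sel(environment="do_y", node="y"))
```

Flattening all samples onto one axis with an environment label lets environments have different sample counts. A dense (environment, sample, node) array would need padding, so that layout is only one option among several.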
