string coordinate in DataArray versus Dataset #9583

Open
jmccreight opened this issue Oct 4, 2024 · 2 comments
Labels
needs triage Issue that has not been reviewed by xarray team member

@jmccreight
Contributor

jmccreight commented Oct 4, 2024

What is your issue?

This SEEMS like a bug. I think it would take me quite a bit of tedious research on my own to tell if that's true, so I'm asking if this is a bug.

I have a netCDF file where I intentionally set the coordinates on the single variable via the metadata, per https://docs.xarray.dev/en/stable/user-guide/io.html#coordinates, so that I could ensure the file opens as a DataArray.

ncdump -h node_outflows.nc
netcdf node_outflows {
dimensions:
	time = UNLIMITED ; // (180 currently)
	node_coord = 1945 ;
	S12 = 12 ;
variables:
	float time(time) ;
		time:units = "days since 1970-01-01 00:00:00" ;
	int node_coord(node_coord) ;
	string node_maker_name(node_coord, S12) ;
	int64 node_maker_index(node_coord) ;
	int64 node_maker_id(node_coord) ;
	int64 to_graph_index(node_coord) ;
	double node_outflows(time, node_coord) ;
		node_outflows:_FillValue = 9.96920996838687e+36 ;
		node_outflows:desc = "The flows leaving each FlowGraph node in (vol/time)" ;
		node_outflows:dims = "nnodes" ;
		node_outflows:type = "float64" ;
		node_outflows:units = "cfs" ;
		node_outflows:var_category = "mass flux" ;
		node_outflows:coordinates = "node_maker_name node_maker_index node_maker_id to_graph_index" ;

// global attributes:
		:Description = "pywatershed output data" ;
		:process\ class = "FlowGraph" ;
}

While it opens fine as a DataArray, the node_maker_name is missing from the coordinates.

>>> da = xr.open_dataarray(tmp_path / f"{vv}.nc")
>>> da
<xarray.DataArray 'node_outflows' (time: 180, node_coord: 1945)> Size: 3MB
[350100 values with dtype=float64]
Coordinates:
  * time              (time) datetime64[ns] 1kB 1979-01-01 ... 1979-06-29
  * node_coord        (node_coord) int32 8kB 0 1 2 3 4 ... 1941 1942 1943 1944
    node_maker_index  (node_coord) int64 16kB ...
    node_maker_id     (node_coord) int64 16kB ...
    to_graph_index    (node_coord) int64 16kB ...
Attributes:
    desc:          The flows leaving each FlowGraph node in (vol/time)
    dims:          nnodes
    type:          float64
    units:         cfs
    var_category:  mass flux
>>> da.encoding['coordinates']
'node_maker_name node_maker_index node_maker_id to_graph_index'

But if I open as a Dataset, there's node_maker_name as a coordinate.

>>> ds = xr.open_dataset(tmp_path / f"{vv}.nc")
>>> ds
<xarray.Dataset> Size: 3MB
Dimensions:           (time: 180, node_coord: 1945, S12: 12)
Coordinates:
  * time              (time) datetime64[ns] 1kB 1979-01-01 ... 1979-06-29
  * node_coord        (node_coord) int32 8kB 0 1 2 3 4 ... 1941 1942 1943 1944
    node_maker_name   (node_coord, S12) <U1 93kB ...
    node_maker_index  (node_coord) int64 16kB ...
    node_maker_id     (node_coord) int64 16kB ...
    to_graph_index    (node_coord) int64 16kB ...
Dimensions without coordinates: S12
Data variables:
    node_outflows     (time, node_coord) float64 3MB ...
Attributes:
    Description:    pywatershed output data
    process class:  FlowGraph

This inconsistency seems unnecessary on the face of it. So I think it's a bug.

Thanks in advance!

jmccreight added the "needs triage" label Oct 4, 2024
@jmccreight
Contributor Author

jmccreight commented Oct 5, 2024

Maybe it's obvious now that I take a step back, but I'm going to guess that it's because "S12" is not in the dimensions of the variable, and that somehow matters more to DataArray than to Dataset.

If that's the case, the question is "how to put string coordinates on DataArrays".

I thought the answer was going to be the concat_characters=True option for open_dataarray. Not quite sure why that isn't working. Will try to look at the tests and make an MRE.

@keewis
Collaborator

keewis commented Oct 5, 2024

The first issue is the same as #9579: the data model of DataArray does not allow a coordinate to have dimensions that are not on the DataArray's variable (as you noticed). We've discussed extending it but have not made any progress yet, and either way I don't think that would really help you in this case.
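To make the data-model point concrete, here is a minimal in-memory sketch (not taken from the original file). The sizes, the `S4` dimension name, and the values are all made up for illustration. It shows a coordinate with an extra dimension being dropped when the variable is pulled out of the Dataset. The last few lines show one possible workaround, which is an assumption on my part and was not proposed in this thread: join the characters into one string per node and attach that as a 1-D coordinate.

```python
import numpy as np
import xarray as xr

# Hypothetical miniature of the file in this issue: 3 nodes, with names
# stored one character per element along an extra dimension ("S4" here).
chars = np.array([list("aaaa"), list("bbbb"), list("cccc")])  # shape (3, 4), dtype <U1

ds = xr.Dataset(
    data_vars={"node_outflows": (("time", "node_coord"), np.zeros((2, 3)))},
    coords={
        "node_coord": np.arange(3),
        "node_maker_name": (("node_coord", "S4"), chars),
        "node_maker_id": ("node_coord", np.array([10, 20, 30])),
    },
)

# Selecting the variable keeps only coordinates whose dimensions are a
# subset of the variable's own dimensions, so the 2-D name is dropped.
da = ds["node_outflows"]
print(sorted(da.coords))

# Possible workaround (an assumption, not from the thread): join the
# characters into one string per node and attach it as a 1-D coordinate.
joined = ["".join(row) for row in ds["node_maker_name"].values]
da_named = da.assign_coords(node_maker_name=("node_coord", joined))
print(da_named["node_maker_name"].values)
```

If the names are already one string per node after decoding, the join step is unnecessary and the coordinate should survive the selection on its own.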

concat_characters=True not working on open_dataarray is a different issue. Since open_dataarray is roughly equal to open_dataset + a getitem, could you check whether open_dataset with concat_characters=True works? If not, we might have to figure out why (set decode_cf=False and check the dtype; it should be "S1"). An MRE would be very helpful in getting to the bottom of this.
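The dtype check suggested above can be tried without a file. The following is only a sketch under stated assumptions: the variable names `name_bytes` and `name_str` and the dimension names `C4` and `U4` are hypothetical, and the data stand in for what `decode_cf=False` might return. It runs `xr.decode_cf` with `concat_characters=True` on two in-memory variables, one with dtype `S1` and one with dtype `<U1`, so the resulting dimensions can be compared. Exact behavior may differ between xarray versions.

```python
import numpy as np
import xarray as xr

# Hypothetical undecoded data: one name variable stored as single bytes
# ("S1"), another stored as 1-character str ("<U1").
rows = [list("aaaa"), list("bbbb"), list("cccc")]
raw = xr.Dataset(
    {
        "name_bytes": (("node_coord", "C4"), np.array(rows, dtype="S1")),
        "name_str": (("node_coord", "U4"), np.array(rows, dtype="U1")),
    }
)

# The check from the comment above: the dtype before decoding.
print(raw["name_bytes"].dtype, raw["name_str"].dtype)

# Apply CF decoding in memory and compare the dimensions afterwards.
decoded = xr.decode_cf(raw, concat_characters=True)
print(decoded["name_bytes"].dims)
print(decoded["name_str"].dims)
```

If the trailing dimension is merged only for the `S1` variable, the dtype reported with `decode_cf=False` for `node_maker_name` would be the first thing to compare against when building the MRE.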
