
internal data structure #106

Open
mathause opened this issue Oct 14, 2021 · 18 comments · May be fixed by #537

@mathause
Member

mathause commented Oct 14, 2021

NOTE: this issue is very much a draft at the moment.


We should give some thought to the internal data structure in mesmer. IMHO this is one of the more important but also difficult things to decide on. Generally, the idea is to move to an xarray-based data structure. However, it is unclear how we want to carry metadata around (e.g. model name, experiment, etc.).

Current status

Currently we use a triply nested dictionary of numpy arrays, something that roughly looks like:

arr = "array"
data = dict(model=dict(ssp=dict(tas=arr)))

Ideas

1. Keep as is

Pros/ Cons

  • No need to change anything
  • Data access is awkward (data["CESM"]["ssp585"]["tas"])
  • What if another layer is needed?
  • ...
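To make the awkwardness concrete: every traversal of the current structure needs three nested loops. A minimal sketch (the concrete keys "CESM"/"ssp585" are made up, following the example above):

```python
# triply nested dict of (placeholder) arrays, as in the current structure
arr = "array"
data = {"CESM": {"ssp585": {"tas": arr}}}

# any operation over all simulations needs three nested loops
for model, scens in data.items():
    for scen, variables in scens.items():
        for var, values in variables.items():
            print(model, scen, var, values)
```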

2. One huge DataArray

We could potentially create one enormous DataArray that encapsulates all necessary information as coordinates (and non-dimension coordinates). It could look something like this:

<xarray.DataArray (time, ens, gp, realisation)>
array([[1, 2, 3]])
Coordinates:
    model    (ens) <U5 'CESM' 'CESM' 'MIROC'
    exp      (ens) <U6 'ssp585' 'ssp126' 'ssp585'
  * time     (time) int32 1
Dimensions without coordinates: ens, gp, realisation
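Such an array could be constructed as follows (a minimal sketch with made-up sizes and values); selecting by a non-dimension coordinate then works e.g. via `where(..., drop=True)`:

```python
import numpy as np
import xarray as xr

# made-up example: 1 time step, 3 ensemble members, 2 gridpoints, 1 realisation
da = xr.DataArray(
    np.arange(6).reshape(1, 3, 2, 1),
    dims=("time", "ens", "gp", "realisation"),
    coords={
        "time": [1],
        "model": ("ens", ["CESM", "CESM", "MIROC"]),
        "exp": ("ens", ["ssp585", "ssp126", "ssp585"]),
    },
)

# select all ssp585 simulations via the non-dimension coordinate
ssp585 = da.where(da["exp"] == "ssp585", drop=True)
```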

Pros/ Cons

  • A well understood and well tested data structure
  • Can get unwieldy if too big
  • Difficult to decide where to draw the border
  • Need to be careful what becomes a dimension (e.g. not all models have all scenarios, so it cannot be a dimension)
  • Cannot have hist and projections in the same array
  • Cannot combine different models if they have a different grid

3. DataList

DataList is a data structure inspired by ESMValTool that I used extensively for my analysis of the CMIP6 data for the IPCC. It is a flat data container to store model data (e.g. xr.DataArray objects) and its metadata: a list of (data, metadata) tuples:

datalist = [
    (ds, meta),
    (ds, meta),
    ...
]

where ds is a data store (e.g. an xr.DataArray) and meta is a dict containing metadata, e.g. meta = {"model": "CESM", "exp": "ssp585", ...}. This allows us to (1) work with a convenient flat data structure (no nested loops), (2) store arbitrary data (e.g. xr.DataArray objects with different grids), and (3) carry around metadata without having to alter the data store.

DataList structure could be turned into a class, which would allow for a convenient interface. E.g. it could allow for things like

datalist.select(exp="ssp585")
datalist.map(compute_time_average, period=slice(1850, 1900))

An implementation could look like (outside from mesmer):

class DataList:

    def __init__(self, data: List[Any], meta: List[dict]):

        assert len(data) == len(meta)
        ...

    def select(self, **attributes):
        """Select specific data described by metadata."""

    def remove(self, **attributes):
        """Remove specific data described by metadata."""

    def map(self, func, pass_meta=False, **kwargs):
        """Loop over DataList and apply a function."""
        # could also accept function names, e.g.
        # map("sel", time=slice("1850", "1900"))

    def to_dataarray(self, along):
        """Concatenate data along a dimension and return as xr.DataArray."""

    def groupby(self, group):
        """Group over an attribute."""


def join(*datalists, on=("model", "exp", "ens"), check=True):
    """Align datalists (inner join)."""

Pros/ Cons

  • Flat & flexible data structure
  • Not implemented yet (but should be reasonably straightforward - I have most of the code)
  • Why two data structures (DataList and combined DataArrays)?

4. DataTree

WIP implementation of a tree-like hierarchical data structure for xarray, see https://github.com/TomNicholas/datatree

Pros/ Cons

  • could overcome the issues of (2)
  • not production ready
  • metadata must be encoded in the key

5. Any other ideas?

@leabeusch
Collaborator

@znicholls same comment as for #105 but probably even more important to get your input on this at some point!

@znicholls
Collaborator

I think it depends on the intended access pattern. In my head, the access pattern for MESMER is always (and if this assumption is wrong, ignore everything I've written):

  1. Start with data to which we want to calibrate (e.g. CMIP output, or some impact model output)
  2. Find the data we want to use as inputs (e.g. CMIP global-means having adjusted for volcanoes first, extreme rainfall data)
  3. Calibrate a model (e.g. regression, some other fancy stats stuff I have no idea about)
  4. Store the calibrated model parameters (how to do the storage well is a separate point)
  5. Use the calibrated model parameters to draw emulations

Given this, calibrating multiple models is an embarrassingly parallel problem (you can calibrate to multiple CMIP models at once; you don't need information from one calibration to inform another). Hence I would build the internal calibration data structure for the one-model/one-calibration case (so things like needing to store the model as a dimension disappear, and we can just use scenario/experiment as the dimension). That should help simplify the scope.

However, we will then need a second (and maybe third) 'outer-layer' structure which can handle the case where we want to calibrate multiple things at once. That outer layer will need to handle things like parallelisation, knowing which functions to call, how to pass parameters etc. Having multiple layers means that we separate the concerns of each layer, which should make implementing them easier.

The other outer layer we'd probably want is an output storage layer (i.e. some way to store the outputs from multiple calibrations yet still be able to search them). I suspect that would be fairly easy to build though, because the storage layer would only have one concern (so doing some slightly sneaky things is probably fine).

If all the above is correct, then for our internal layer I would use either a DataArray (and I would just let the user/outer-layer decide whether historical and scenario should be joint or treated separately) or DataList (having each experiment to use in calibration as one element in the list).

For the outer layer (i.e. the one that handles multiple model calibrations) then DataList or even just boring tuples or dictionaries should be fine as this outer layer won't be handling any big data (all it's doing is kicking off calibrations and making sure the results get saved in the right place).

For the storage layer I would go with DataList and then we can do some thin wrappers around to make it easy to search things as needed. Maybe a better option would be a class like below which allows you to search easily making the most of pandas 2D data handling whilst also making it hard to break the disconnect between the data and the metadata.

class MesmerOutputStore:
    def __init__(self):
        self._meta = ...  # load a 2D table which stores simple metadata info about the data
        self._calibrations = ...  # load from a netCDF file which actually has all the calibrations

    def get_calibration(self, **kwargs):
        """Get calibrations"""
        selected = self._meta.copy()
        # setting it up this way makes it super easy to then go and filter/search
        for k, v in kwargs.items():
            selected = selected[selected[k] == v]

        # storing metadata in the table means that xarray doesn't have to store the metadata
        # so we don't run into the multi-dimensional sparsity issue
        return self._calibrations.sel(id=selected["id"])

We've actually got this kind of wrapper in https://github.com/openscm/scmdata. It basically has a meta table (https://github.com/openscm/scmdata/blob/029707c57427608026b77f1e263947d8b2a06fac/src/scmdata/run.py#L951) and then a pandas dataframe (cause we only deal with timeseries data). Filtering can then be done really fast (https://github.com/openscm/scmdata/blob/029707c57427608026b77f1e263947d8b2a06fac/src/scmdata/run.py#L1074) because you only search meta before deciding what data to return. Storing the meta as a 2D table means that things don't explode even if you have super sparse data. We wouldn't want all of ScmRun here, but the filtering stuff could be really handy, and we just put a different data model with it.
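The meta-table idea can be sketched in a few lines (the column names and sizes are made up; the real scmdata implementation is linked above):

```python
import numpy as np
import pandas as pd
import xarray as xr

# 2D meta table: one row per stored calibration (hypothetical columns)
meta = pd.DataFrame(
    {
        "id": [0, 1, 2],
        "model": ["CESM", "CESM", "MIROC"],
        "exp": ["ssp585", "ssp126", "ssp585"],
    }
)

# the actual calibrations, indexed by the same id
calibrations = xr.DataArray(
    np.zeros((3, 4)), dims=("id", "param"), coords={"id": [0, 1, 2]}
)

# filter the (cheap) meta table first, then pull only the matching data
selected = meta
for k, v in {"model": "CESM"}.items():
    selected = selected[selected[k] == v]

subset = calibrations.sel(id=selected["id"].values)
```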

@yquilcaille
Collaborator

In my head, the access pattern for MESMER is always [...]

I agree with this access pattern, it is clear to me.

Given this, calibrating multiple models is an embarrassingly parallel problem (you can calibrate to multiple CMIP models at once, you don't need information from one calibration to inform another one). Hence I would build the internal calibration data structure for the one model/calibration case (so things like needing to store the model as the dimension disappear, and we can just use scenario/experiment as the dimension)

If by model you mean ESM, not statistical model, I would say that the data structure does not matter much: whether we have one huge DataArray, a DataList or a DataTree would not make a big difference for the calibration on one ESM or several ESMs. In my view, the default of MESMER should be one calibration per ESM, which means it would anyway need to reshape the data to calibrate on several ESMs at once (the different layers that you mention).
Though, for the outer layer, I would stick to DataList; tuples or dictionaries are not as handy as a DataList would be.

For the calibration, I slightly prefer the huge DataArray to the DataList; I consider the former easier to navigate and use than a list. In terms of RAM it can be handled: from my estimation it is about 10 GB if we keep only the gridpoint axis, not the maps. Besides, by not .load()-ing the netCDF directly into memory, I think it should be ok, at the cost of more I/O reading. However, for emulations, we cannot have everything in RAM at once.
On the emulation part, the current situation (as far as I know) is that the historical part is emulated for each scenario, even though it is the same. Over CMIP6, with 5 scenarios, sharing it would cut the amount to calculate by half. This point should be considered for the data structure and the emulation.

IF we go for a huge DataArray, I think we can simply create an empty array using np.empty(tuple_dim_sizes) and fill it in a loop, or use a masked array to deal with the _FillValue.
And yes, it would cause issues if ESMs have different grids... but issues would be raised at other points anyway, for instance for the geographical distance matrix during the training of the local variability.

IF we stick to DataLists for each layer, I would definitely like such a wrapper to gather the correct items more easily. And actually, such a structure may be closer to what scmdata does, so that is another plus.

@mathause
Member Author

mathause commented Oct 28, 2021

I refer to #109 (comment) from @znicholls (sorry, my comments are all over the place). I thought a bit more about how we can avoid concatenating hist + proj and having very sparse arrays, and also about how a DataList structure could work in mesmer. Assuming we have a DataList which looks like:

dl = DataList(
    (da, {"scen": "historical", "model": "CESM", ...}),
    (da, {"scen": "ssp126", "model": "CESM", ...}),
    (da, {"scen": "ssp585", "model": "CESM", ...}),
)

i.e. one entry per scenario and model, where da has dims time x gridpoint x member. Then it should be 'relatively elegant' to calculate anomalies and flatten the DataArray objects.

1. Calculate anomalies

def calculate_anomalies(dl: DataList, how="individual", period=slice("1850", "1900"), dim="time"):

    out = DataList()

    # loop all simulations
    for ds, meta in dl:

        # find the historical simulation
        hist = dl.select(model=meta["model"], scen="historical")

        if how == "individual":
            # given this makes an inner join it will work
            ds = ds - hist.sel({dim: period}).mean(dim)
        elif how == "all":
            ds = ds - hist.sel({dim: period}).mean()
        else:
            raise ValueError(f"how must be 'individual' or 'all', got {how!r}")

        out.append(ds, meta)

    return out

We loop over all elements in dl, select the corresponding historical simulation and subtract it. For how="individual" we rely on xarray to align ds and hist: the subtraction does an inner join - so we should get rid of all projections that do not have a corresponding historical simulation.

2. Flatten and concatenate the arrays

def expand_dims(ds, meta):

    return ds.expand_dims(scen=meta["scen"])

def stack_except(ds, dim):
    
    dim = [dim] if isinstance(dim, str) else dim

    dims = set(ds.dims) - set(dim)
        
    return ds.stack(stacked=dims)

dl = dl.map(expand_dims, pass_meta=True)
dl = dl.map(stack_except, dim="cell")
dl = dl.concat(along="stacked")

We add the scenario as a dimension to each da, then we stack all dimensions except "cell", and finally we concatenate along the new stacked dimension.
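On a tiny made-up array, the expand_dims + stack pattern works out as (sizes and names are illustrative only):

```python
import numpy as np
import xarray as xr

# made-up array with dims time x cell
da = xr.DataArray(np.arange(6).reshape(3, 2), dims=("time", "cell"))

# add the scenario (taken from the metadata) as a new dimension
da = da.expand_dims(scen=["ssp585"])

# stack everything except "cell" into one dimension
dims = [d for d in da.dims if d != "cell"]
stacked = da.stack(stacked=dims)
```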

@znicholls
Collaborator

Then it should be 'relatively elegant' to calculate anomalies and flatten the DataArray objects

I would say this looks absolutely beautiful! Very nice indeed and a great solution to a nasty problem (yes, maybe we find some reason to have to adjust things in future but I think the benefits of trying such a great solution now outweigh the risks of having to adjust in future).

@yquilcaille
Collaborator

i.e. one entry per scenario and model, where da has dims time x gridpoint x member.

Yes, aligning hist with projections is a clean solution to provide us with the common members for ESM x scenarios. However, MESMER may use two variables, eg tas and hfds. Then later in the code, when using the da, we must still check for common members to tas and hfds, for they may have different sets.
(@mathause, that is why I had introduced this function to select common members to tas, hfds and TXx.)

@mathause
Member Author

mathause commented Oct 28, 2021

I would say this looks absolutely beautiful!

Thanks 😊 I do still have a lot of open questions....

  • How far should we go? (e.g. in ScmRun you implement a lot of nice functionality, but this kind of replicates a Dataset.) We don't need to answer this right now - that can be developed*.
  • Is there a package we could rely on for this? (I don't know of any except maybe ScmRun.)
  • Is DataList a good name for the class? What else could it be?
  • Do we have a good name for the package?**
  • How do we store meta internally - dict or pd.DataFrame?
  • I would kind of like to develop datalist outside of mesmer, but I am not sure how good an idea this is as long as it is not stable.

However, MESMER may use two variables, eg tas and hfds. Then later in the code, when using the da, we must still check for common members to tas and hfds, for they may have different sets.

Yes this is indeed an issue. I see three options.

  1. Enforce symmetry on load time (i.e. make it the users responsibility).
  2. Add a deep option to datalist.align which also aligns the DataArray/ Datasets
  3. Still have one entry per member in the datalist. Then dl.align would take care of it (without any deep param)

IIRC @leabeusch would prefer to have one DataList for all variables. My preference would be to have one per variable. What I imagine is something of the sort:

tas = mesmer.io.cmip6_ng.load(variable="tas")
hfds = mesmer.io.cmip6_ng.load(variable="hfds")

tas, hfds = dl.align(tas, hfds)

*I think we could allow passing strings to DataList.map, which would call the underlying methods - that could be a good compromise. E.g.

dl.map("sel", time=slice("1850", "1900"))
# would (roughly) be turned into
for data, meta in dl:
    data = getattr(data, "sel")(time=slice("1850", "1900"))

**The best I can usually come up with is to name the package the same as the class which is no good, because then the following happens:

import datalist as dl
dl = dl.DataList(...)
# aahrg

B.t.w. thanks for all your input! That helps a lot.

@znicholls
Collaborator

My two cents on the questions

  • How far we should we go?

Maximise code reuse as much as possible. I'm very happy to do some refactoring in scmdata so that we can use the filtering capability here without picking up the rest of scmdata's assumptions.

  • Is there a package we could rely on for this? (I don't know of any except maybe ScmRun.)

The only one I know is scmdata or pyam (they both have a similar idea for filtering). I would go for scmdata as I have more control. Maybe someone else knows of other options for this sort of filtering wrapping though.

  • Is DataList a good name for the class? What else could it be?

I like it and if there are no other similar packages being worked on I think it's a good start. You could make clear that it's xarray specific by calling it something like DataArrayList or DatasetList?

  • Do we have a good name for the package?**

I think datalist is good. You can then do the following to avoid your namespace explosion

from datalist import DataList

dl = DataList(...)

A different option would be xarray_lists?

  • How do we store meta internally - dict or pd.DataFrame?

pd.MultiIndex - means we can use categories for the various bits of metadata which helps memory usage and gives a massive speed boost for filtering (and scmdata has done all the work of making multiindexes behave).
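A rough sketch of what that could look like (made-up metadata; the categorical dtypes keep memory down and speed up comparisons):

```python
import pandas as pd

# metadata as a MultiIndex built from categorical columns (made-up values)
df = pd.DataFrame(
    {"model": ["CESM", "CESM", "MIROC"], "exp": ["ssp585", "ssp126", "ssp585"]}
).astype("category")
meta = pd.MultiIndex.from_frame(df)

# filtering reduces to a boolean mask on one level
mask = meta.get_level_values("model") == "CESM"
```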

  • I would kind of like to develop datalist outside of mesmer but I am not sure how good this idea is as long as it is not stable.

Start within mesmer, then split out once we've got the interface stable enough.

one DataList for all variables

I would lean towards this (all data in one place). I would write align so you told it what dimension to ignore when doing the alignment (e.g. align("variable")) and then align can raise an error if the two datalists being aligned have more than one variable. Something like

# assume dl contains both tas and hfds
# this would work fine
dl.filter(variable="tas").align(dl.filter(variable="hfds"), "variable")
# this would raise an error because there is more than one variable in dl
dl.filter(variable="tas").align(dl, "variable")

ValueError("self must have only one variable in the dimension being ignored during the alignment")

@yquilcaille
Collaborator

yquilcaille commented Oct 29, 2021

How do we store meta internally - dict or pd.DataFrame?

Actually, why not use the attributes of the DataArray? That way we don't separate the data from its information, and whenever we need to align/concatenate/etc., we can simply adapt the attributes of the new variable accordingly.

IIRC @leabeusch would prefer to have one DataList for all variables

I prefer as well having all data in a single file. Intuitively, I was thinking about something like this:

data = mesmer.io.cmip6_ng.load(variable=["tas", "hfds"])  # later, precipitation, climate extremes and so on

# if the format is directly a dl/DataList, something like:
data = data.align(["tas", "hfds"])  # or any subset of the variables in data

And inside dl.align, the dl.filter proposed by @znicholls could be integrated.

@leabeusch
Collaborator

I haven't understood everything going on here from a code perspective yet, so in case some of my statements don't make any sense, this could really just be because of a stupid misunderstanding -> sorry in advance for that. 😅 but I'm confident with some more time / explanations I'll catch up again eventually. 😄 I'll try to add a few things nevertheless already at this stage:

  1. IIRC @leabeusch would prefer to have one DataList for all variables. My preference would be to have one per variable.

I just really want to keep the option for MESMER to go multivariate. Meaning: I'd like to be able to pass e.g., temperature and precip fields from a single ESM (but from various emission scenarios) at once to whatever forced response module (e.g., regression or sth fancier) & internal variability module (e.g., multivariate Gaussian after some transformations) I have available. Thus it seems weird to me to have those two variables in a different datalist. It would e.g., make more sense to have separate datalists per ESM rather than per variable to me. But maybe I have misunderstood what would need to be included in the same datalist to be able to achieve the functionality I describe?

  2. This internal data structure would be applied to both our ESM runs and our emulations, correct?
  3. If it really applies to emulations too: say we choose the datalist (which seems quite established by now ^^) -> how would you go about saving? Do you just save the individual dataarrays to netcdf? Would you somehow include the metadata too? E.g. as attributes of the dataarrays? Or just in the name of the file?

@mathause
Member Author

I had a chat with Sarah on her mesmer temperature/precipitation analysis. She currently organizes her data in one large xr.Dataset (see below). It can be practical to have everything in one place. However, she currently only looks at one simulation (?) and what I am not entirely sure about is how this will scale to several models, scenarios, etc.

<xarray.Dataset>
Dimensions:                         (time: 1932, lat: 72, lon: 144, month: 12)
Coordinates:
  * time                            (time) object 1920-01-16 12:00:00 ... 208...
  * lon                             (lon) float32 -178.8 -176.2 ... 176.2 178.8
  * lat                             (lat) float32 -88.75 -86.25 ... 86.25 88.75
  * month                           (month) int64 1 2 3 4 5 6 7 8 9 10 11 12
Data variables:
    pr                              (time, lat, lon) float32 ...
    tas                             (time, lat, lon) float32 ...
    lsm                             (lat, lon) float32 nan nan nan ... nan nan
    log_pr                          (time, lat, lon) float32 ...
    monthly_global_tas              (time) float32 266.0 266.2 ... 275.8 271.3
    tas_hp                          (time, lat, lon) float32 ...
    pr_hp                           (time, lat, lon) float32 ...
    log_pr_hp                       (time, lat, lon) float32 ...
    tas_res_variability             (time, lat, lon) float32 ...
    log_pr_res_variability          (time, lat, lon) float32 ...
    tas_autoreg_res_variability     (time, lat, lon) float32 ...
    log_pr_autoreg_res_variability  (time, lat, lon) float32 ...
    tas_generated_variability       (time, lat, lon) float32 ...
    log_pr_generated_variability    (time, lat, lon) float32 ...
    tas_autoregressed               (time, lat, lon) float32 ...
    log_pr_autoregressed            (time, lat, lon) float32 ...
    tas_autoreg_slope               (month, lat, lon) float32 ...
    tas_autoreg_offset              (month, lat, lon) float32 ...
    log_pr_autoreg_slope            (month, lat, lon) float32 ...
    log_pr_autoreg_offset           (month, lat, lon) float32 ...
Attributes:
    CDI:          Climate Data Interface version 1.9.10 (https://mpimet.mpg.d...
    Conventions:  CF-1.6
    history:      Tue Oct 19 15:25:28 2021: cdo remapcon2,output_grid_descrip...
    CDO:          Climate Data Operators version 1.9.10 (https://mpimet.mpg.d...

@znicholls
Collaborator

I am not entirely sure is how this will scale to several models, scenarios etc

Yes, particularly how it would handle the very annoying history/scenario split

@mathause
Member Author

But it's of course super convenient because you only need to schlep one data structure around.


An alternative to a DataList would be to (ab)use a pandas DataFrame - similar to geopandas - either by subclassing or via an accessor (see extending pandas). The DataFrame would then need a dedicated data column and could have an accessor to do fun stuff with it. But again, I am not sure how this would work out exactly.
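A sketch of the accessor variant (the `dl` accessor name and the `data` column are hypothetical; `register_dataframe_accessor` is the documented pandas extension hook):

```python
import pandas as pd


@pd.api.extensions.register_dataframe_accessor("dl")
class DataListAccessor:
    """Hypothetical accessor: metadata in columns, data objects in a 'data' column."""

    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    def select(self, **attributes):
        # filter rows whose metadata columns match all given attributes
        out = self._obj
        for key, value in attributes.items():
            out = out[out[key] == value]
        return out


df = pd.DataFrame(
    {"model": ["CESM", "MIROC"], "exp": ["ssp585", "ssp585"], "data": [object(), object()]}
)
subset = df.dl.select(model="CESM")
```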

@mathause
Member Author

mathause commented Jul 26, 2022

A pointer to cf_xarray and its accessor, which seems to wrap various methods and classes of xarray, as a potential inspiration for how to do this for a DataList (or xrDataList or whatever I called it):

https://github.com/xarray-contrib/cf-xarray/blob/main/cf_xarray/accessor.py

edit: forgot the link

@mathause mathause added this to the v1.0.0 milestone Jul 11, 2023
@mathause
Member Author

mathause commented Feb 20, 2024

Some updates:

  • DataTree is currently being ported into xarray. This effort has just started and is certainly still far off. When I tried it out I was quite confused how it's supposed to work, but maybe I just did not spend enough time with it (and there are probably some rough edges that need polishing)
  • DataList:
    • My sub-optimal implementation I used for the IPCC stuff can be found under IPCC-WG1/Chapter-11 -> datalist.py. Check also my ideas for improvements
    • I have started implementing a DataList at /net/cfc/landclim1/mathause/projects/mesmer/datalist

@znicholls
Collaborator

znicholls commented Feb 20, 2024

I had a play with data tree and really liked it. For the kind of data we have here, it seemed the right choice (at least for container, then you just build on top of it as needed). Looking at your links, putting some of the data list ideas onto data tree (or a class that wraps a data tree) seems the best choice to me...

filefinder looks sick by the way, can't wait to use that.

@mathause I am hoping to build some sort of CMIP data processing thing this year (we need it for MAGICC calibration). Would you be interested in collaborating or is finding time for this an ongoing dream that is never realised?

@mathause
Member Author

I had a play with data tree and really liked it. For the kind of data we have here, it seemed the right choice (at least for container, then you just build on top of it as needed). Looking at your links, putting some of the data list ideas onto data tree (or a class that wraps a data tree) seems the best choice to me...

@veni-vidi-vici-dormivi (Victoria) started playing with datatree and it indeed looks promising

filefinder looks sick by the way, can't wait to use that.

😊

@mathause I am hoping to build some sort of CMIP data processing thing this year (we need it for MAGICC calibration). Would you be interested in collaborating or is finding time for this an ongoing dream that is never realised?

Yes, please keep me in the loop! I assume you have heard of jbusecke/xMIP?

@znicholls
Collaborator

Yes, please keep me in the loop! I assume you have heard of jbusecke/xMIP?

Yep. We want to be able to download our own data though so our use of it depends a bit on how tightly coupled it is to the pangeo data archives.
