insert() a partial dataset -> auto add dimension ? #11

Open
robin-cls opened this issue Oct 16, 2024 · 0 comments
Comments


robin-cls commented Oct 16, 2024

Hi,

I have a use case where I need to insert a dataset that does not contain all of the variables in the collection. This works properly most of the time, but when the inserted dataset is missing some of the dimensions stored in the collection, I get the following error:

[screenshot: error traceback raised during insert()]

When the insertion tries to add the missing variables to the inserted dataset, it fails because zcollection.Dataset does not support adding a variable whose dimensions are not all known. Should we add dimension extension to zcollection.Dataset to solve this?

Code for reproducing the error

from __future__ import annotations

from typing import Iterator

import dask.distributed
import fsspec

import zcollection
import zcollection.tests.data

def create_dataset() -> zcollection.Dataset:
    """Create a dataset to record."""
    generator: Iterator[zcollection.Dataset] = \
        zcollection.tests.data.create_test_dataset_with_fillvalue()
    return next(generator)


zds: zcollection.Dataset | None = create_dataset()
assert zds is not None
zds.to_xarray()

fs: fsspec.AbstractFileSystem = fsspec.filesystem('memory')
cluster = dask.distributed.LocalCluster(processes=False)
client = dask.distributed.Client(cluster)

partition_handler = zcollection.partitioning.Date(('time', ), resolution='M')
collection: zcollection.Collection = zcollection.create_collection(
    'time', zds, partition_handler, '/my_collection', filesystem=fs)

collection.insert(zds.select_vars(['time']))

Workaround

For now, I preprocess the dataset by rebuilding it from scratch and adding carefully selected variables from the collection that carry the missing dimensions. I then drop these variables to recover the original dataset, now with its new dimensions declared. This is not satisfying for a non-delayed dataset, because it adds unnecessary memory usage by creating new arrays.
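The bookkeeping behind this workaround can be sketched with plain dictionaries. Note that `select_carriers` and the variable/dimension names below are hypothetical illustrations, not part of the zcollection API:

```python
def select_carriers(collection_vars, collection_dims, dataset_dims):
    """Pick one collection variable per dimension missing from the partial
    dataset, so the rebuilt dataset declares every dimension before insert().

    collection_vars: mapping of variable name -> tuple of dimension names.
    collection_dims: set of all dimension names known to the collection.
    dataset_dims: set of dimension names present in the partial dataset.
    """
    missing = set(collection_dims) - set(dataset_dims)
    carriers = {}
    for name, dims in collection_vars.items():
        covered = missing & set(dims)
        if covered:
            carriers[name] = dims
            missing -= covered
        if not missing:
            break
    return carriers


# Example: the partial dataset only knows 'num_lines'; 'num_pixels' must be
# carried in by a temporary variable such as 'var2', which is dropped after
# the dimensions are registered.
carriers = select_carriers(
    collection_vars={'var1': ('num_lines',),
                     'var2': ('num_lines', 'num_pixels')},
    collection_dims={'num_lines', 'num_pixels'},
    dataset_dims={'num_lines'},
)
# carriers == {'var2': ('num_lines', 'num_pixels')}
```

In the actual workaround, the selected carrier variables are copied into the rebuilt dataset (which is where the extra arrays, and hence the extra memory, come from) and dropped again after insertion.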
