Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

new zarr release / numcodecs #868

Closed
sofroniewn opened this issue Jan 11, 2020 · 11 comments · Fixed by #873
Closed

new zarr release / numcodecs #868

sofroniewn opened this issue Jan 11, 2020 · 11 comments · Fixed by #873
Labels
bug Something isn't working
Milestone

Comments

@sofroniewn
Copy link
Contributor

🐛 Bug

Looks like zarr's new release requires numcodecs > 0.6.4, but we pinned to exclude it see discussion #666. I think we need to resolve this ASAP and then make the 0.2.10 release (which also includes the #866 bug fix). Thoughts @tlambert03 @jni? Has the 0.6.4 numcodecs install problem been resolved? You can see our failing tests in #867.

@sofroniewn sofroniewn added the bug Something isn't working label Jan 11, 2020
@sofroniewn sofroniewn added this to the 0.2.0 milestone Jan 11, 2020
@tlambert03
Copy link
Contributor

that's funny... I was literally about to hit "submit" on a similar issue.
here is the commit where they bump the dependency, and they do it " to use ensure_text text handling functionality from it". I don't know zarr deeply enough to know how frequently that is likely to happen...

numcodecs 0.6.4 is still not working, (this actually was just checked in the dependabot fork today 😊)... so not sure what to do about that

@tlambert03
Copy link
Contributor

actually... looks like it was just the windows builds that broke with numcodecs... so maybe something has changed in the meantime?

@sofroniewn
Copy link
Contributor Author

ok - also if I remember zarr-developers/numcodecs#210 correctly the issue is just with pip install on our CI, so people in the wild should still be installing ok. Maybe then this can wait till tomorrow

@jni
Copy link
Member

jni commented Jan 11, 2020

pip install works fine on my machine. imho the fix is to stop depending on zarr. We can use dask.array.from_zarr instead of zarr.open, and it should work whenever the file has standard compression, if I understand correctly.

@jni
Copy link
Member

jni commented Jan 11, 2020

Ah. I just saw why we started using zarr directly: we can tell whether there is a dataset instead of an array. Guh. =)

@sofroniewn
Copy link
Contributor Author

sofroniewn commented Jan 11, 2020

hmm - relevant part of the codebase is here

if isinstance(zr, zarr.core.Array):

I agree that we should figure out a way to drop our dependency on both zarr and numcodecs, and just leverage dask, possibly with try catches to do everything, or say that additional functionality belongs in a plugin. We can still have zarr as one of our test dependencies (if that works with our CI) or just use dask functions for all our tests too. I'll think more about this tomorrow too

@sofroniewn
Copy link
Contributor Author

so i'm still really not sure how to deal with this one. On the one hand I like the idea of not explicitly depending on zarr and numcodecs and just leveraging fileIO options here in dask, but then there is no good way to deal with the pyramid case where we want a list of dask arrays out. We could try raising an issue on dask to see if they'd ever be interested in adding the functionality we have here to their from_zarr method, but it might be strange for them to return a list of dask arrays, or a dask array where each subarray along the first axis doesn't have the same shape.

I will also note that when I do pyramid loading from my laptop it worked fine, but when I did it from s3 it couldn't find the components properly. I need to look into that more, but it makes me think that we might not have a perfect solution even now.

@tlambert03
Copy link
Contributor

tlambert03 commented Jan 11, 2020

just played with it a little bit, and it looks like the main thing that fails when using dask.from_zarr on a zarr group is the fact that the .zarray file doesn't exist. Here's a function that works on both zarr single arrays and groups of arrays, without importing zarr:

def read_zarr_dataset(filename):
    if os.path.exists(os.path.join(filename, '.zarray')):
        # load zarr array
        image = da.from_zarr(filename)
        shape = image.shape
    elif os.path.exists(os.path.join(filename, '.zgroup')):
        # else load zarr all arrays inside file, useful for pyramid data
        image = []
        for subd in os.listdir(filename):
            full_subd = os.path.join(filename, subd)
            if os.path.exists(os.path.join(full_subd, '.zarray')):
                image.append(da.from_zarr(full_subd))
        shape = image[0].shape
    else:
        raise ValueError("Not a zarr dataset or group")
    return image, shape

(improvements that could be made to that function to provide dask with the proper component string)... and I think folder structure assumption is probably pretty safe since it's in the Zarr spec
thoughts?

@sofroniewn sofroniewn modified the milestones: 0.2, 0.2.10 Jan 11, 2020
@tlambert03
Copy link
Contributor

one last thought and then I have to sign off for the night...
regarding our CI strategy, I think we can still pip install zarr on CI, (making sure that windows has the required build tools) so we can at least test zarr even if it's not a strict dependency.

@sofroniewn
Copy link
Contributor Author

This looks great! I will play with now and try and break, but I think it is the way to go

@sofroniewn
Copy link
Contributor Author

This is really great - I don't think we need to use the dask component syntax. I had to add a sorted call to the os.listdir to make everything work there, but otherwise its great. I have the code ready to go in a PR that I can make right now if you're out for a bit, but feel free to then take over if you needed.

I'll add here that this solution won't now work for things like s3 links, but that didn't really work properly before either, so not sure we're in a worse place, and we can make a separate push around that remote stuff later

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants