.experimental.concat_on_disk fails to properly infer indptr dtype for the final object #1709

Open

jacobkimmel opened this issue Oct 10, 2024 · 3 comments

jacobkimmel commented Oct 10, 2024

Please make sure these conditions are met

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of anndata.
  • (optional) I have confirmed this bug exists on the master branch of anndata.

Report

See:

number_non_zero = sum(len(d.group["indices"]) for d in datasets)

.experimental.concat_on_disk is an awesome feature. Thanks for writing it!

I found in practice that the dtype inference for indptr in the final object can fail in curious ways. It seems to cast to int32 too aggressively, which raises an OverflowError when merging objects large enough to require int64.
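
For intuition, the decision reduces to something like the following sketch (illustrative only, not anndata's actual code): the final indptr entry equals the total nnz of the concatenated matrix, so the dtype has to be chosen from the summed nnz of all inputs, not from any single input's dtype.

import numpy as np

# Illustrative sketch only -- not anndata's implementation. The last
# indptr entry equals the total number of stored values, so the dtype
# must be wide enough for the combined nnz of all inputs.
def infer_indptr_dtype(nnz_per_dataset):
    total_nnz = sum(nnz_per_dataset)
    return np.int64 if total_nnz > np.iinfo(np.int32).max else np.int32

# Three inputs that each fit comfortably in int32 still need int64:
print(infer_indptr_dtype([2**30, 2**30, 2**30]))  # <class 'numpy.int64'>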

Honestly, I can't tell why the existing code doesn't work from first principles, but I can confirm that when I hardcoded int64 for the output object, everything completed successfully.

Code:

# adata_paths contains objects with >1e7 observations and >1e4 features,
# on average probably ~80-90% sparse
anndata.experimental.concat_on_disk(
    in_files=adata_paths,
    out_file=out_path,
    max_loaded_elems=int(1e10),
    axis=0,
    join="inner",
    label="concat_batch",
    keys=keys,
    index_unique="::",
)

Traceback:

  File "/efs/home/jacob/mambaforge/envs/scpy/lib/python3.10/site-packages/anndata/_core/sparse_dataset.py", line 499, in append
    raise OverflowError(
OverflowError: This array was written with a 32 bit intptr, but is now large enough to require 64 bit values. Please recreate the array with a 64 bit indptr.
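
For scale (back-of-the-envelope): at >1e7 observations, >1e4 features, and ~10-20% density, the combined nnz is on the order of 1e10, well past int32's max of 2,147,483,647. A quick up-front check along the lines of the source line quoted above can confirm this, assuming HDF5 inputs with X stored as a sparse group:

import h5py
import numpy as np

# Illustrative check (assumes .h5ad inputs with sparse X), mirroring the
# len(d.group["indices"]) logic quoted above: sum nnz across all inputs
# and compare against int32's range.
total_nnz = 0
for path in adata_paths:
    with h5py.File(path, "r") as f:
        total_nnz += f["X"]["indices"].shape[0]

print(total_nnz > np.iinfo(np.int32).max)  # True -> int64 indptr required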

Versions


IPython 8.28.0
anndata 0.11.0rc3.dev3+g8e9eb88.d20241010
session_info 1.0.0

asttokens NA
cython_runtime NA
dateutil 2.9.0.post0
decorator 5.1.1
executing 2.1.0
h5py 3.12.1
jedi 0.19.1
natsort 8.4.0
numpy 2.1.2
packaging 24.1
pandas 2.2.3
parso 0.8.4
prompt_toolkit 3.0.48
pure_eval 0.2.3
pygments 2.18.0
pytz 2024.2
scipy 1.14.1
six 1.16.0
stack_data 0.6.3
traitlets 5.14.3
wcwidth 0.2.13

Python 3.12.7 | packaged by conda-forge | (main, Oct 4 2024, 16:05:46) [GCC 13.3.0]
Linux-6.2.0-1018-aws-x86_64-with-glibc2.35

Session information updated at 2024-10-10 03:56

ilan-gold (Contributor) commented

> Honestly, I can't tell why the existing code doesn't work from first principles, but I can confirm that when I hardcoded int64 for the output object, everything completed successfully.

What do you mean by hardcoding? We do have a test for this feature, so I'm definitely curious what could be going on here. I see you're on an AWS machine, but if you could start a debugger and just look at that line with your example, it would be amazing.

jacobkimmel (Author) commented Oct 16, 2024 via email

ilan-gold (Contributor) commented

Thanks so much @jacobkimmel. I would be open to adding something to force this usage, but I'm hesitant since I don't understand what is going on. We have a settings object coming out which would make this super easy.
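
(Purely hypothetical sketch of the kind of override a settings object could enable; the option name below is made up, not a real anndata API:)

import anndata as ad

# Hypothetical, made-up setting name -- not a real anndata option -- just
# to illustrate how a settings object could force the wider dtype:
ad.settings.concat_indptr_dtype = "int64"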

BTW, so you don't think I'm stonewalling: the reason we don't do int64 by default isn't performance or anything (indptr is relatively small), but more that CUDA doesn't handle int64 at all :( So we try our best to keep it int32.
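
(Illustrative sketch of that constraint in plain scipy: downcasting to the 32-bit index arrays the GPU side wants is only possible while the total nnz fits in int32's range.)

import numpy as np
from scipy import sparse

# Illustrative only: GPU sparse routines want 32-bit index arrays, so an
# int64 indptr has to be downcast before transfer -- which is only safe
# while the total nnz fits in int32.
def downcast_indices(mat: sparse.csr_matrix) -> sparse.csr_matrix:
    if mat.nnz > np.iinfo(np.int32).max:
        raise OverflowError("nnz too large for 32-bit indices")
    mat.indptr = mat.indptr.astype(np.int32)
    mat.indices = mat.indices.astype(np.int32)
    return mat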

ilan-gold self-assigned this Oct 17, 2024