.experimental.concat_on_disk fails to properly infer indptr dtype for the final object #1709

Open

jacobkimmel opened this issue Oct 10, 2024 · 3 comments

jacobkimmel commented Oct 10, 2024

Please make sure these conditions are met

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of anndata.
  • (optional) I have confirmed this bug exists on the master branch of anndata.

Report

See:

number_non_zero = sum(len(d.group["indices"]) for d in datasets)

.experimental.concat_on_disk is an awesome feature. Thanks for writing it!

I found in practice that the dtype inference for indptr in the final object can fail in curious ways. It seems to cast to int32 too aggressively, which raises an OverflowError when merging objects large enough to require int64.
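
For intuition, the decision reduces to something like the following sketch (illustrative only, not anndata's actual code): the final indptr entry equals the total nnz of the concatenated matrix, so the dtype has to be chosen from the summed nnz of all inputs, not from any single input's dtype.

import numpy as np

# Illustrative sketch only -- not anndata's implementation. The last
# indptr entry equals the total number of stored values, so the dtype
# must be wide enough for the combined nnz of all inputs.
def infer_indptr_dtype(nnz_per_dataset):
    total_nnz = sum(nnz_per_dataset)
    return np.int64 if total_nnz > np.iinfo(np.int32).max else np.int32

# Three inputs that each fit comfortably in int32 still need int64:
print(infer_indptr_dtype([2**30, 2**30, 2**30]))  # <class 'numpy.int64'>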

Honestly, I can't tell why the existing code doesn't work from first principles, but I can confirm that when I hardcoded int64 for the output object, everything completed successfully.

Code:

# adata_paths contains objects with >1e7 observations and >1e4 features,
# on average probably ~80-90% sparse
anndata.experimental.concat_on_disk(
    in_files=adata_paths,
    out_file=out_path,
    max_loaded_elems=int(1e10),
    axis=0,
    join="inner",
    label="concat_batch",
    keys=keys,
    index_unique="::",
)

Traceback:

  File "/efs/home/jacob/mambaforge/envs/scpy/lib/python3.10/site-packages/anndata/_core/sparse_dataset.py", line 499, in append
    raise OverflowError(
OverflowError: This array was written with a 32 bit intptr, but is now large enough to require 64 bit values. Please recreate the array with a 64 bit indptr.
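
For scale (back-of-the-envelope): at >1e7 observations, >1e4 features, and ~10-20% density, the combined nnz is on the order of 1e10, well past int32's max of 2,147,483,647. A quick up-front check along the lines of the source line quoted above can confirm this, assuming HDF5 inputs with X stored as a sparse group:

import h5py
import numpy as np

# Illustrative check (assumes .h5ad inputs with sparse X), mirroring the
# len(d.group["indices"]) logic quoted above: sum nnz across all inputs
# and compare against int32's range.
total_nnz = 0
for path in adata_paths:
    with h5py.File(path, "r") as f:
        total_nnz += f["X"]["indices"].shape[0]

print(total_nnz > np.iinfo(np.int32).max)  # True -> int64 indptr required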

Versions


IPython 8.28.0
anndata 0.11.0rc3.dev3+g8e9eb88.d20241010
session_info 1.0.0

asttokens NA
cython_runtime NA
dateutil 2.9.0.post0
decorator 5.1.1
executing 2.1.0
h5py 3.12.1
jedi 0.19.1
natsort 8.4.0
numpy 2.1.2
packaging 24.1
pandas 2.2.3
parso 0.8.4
prompt_toolkit 3.0.48
pure_eval 0.2.3
pygments 2.18.0
pytz 2024.2
scipy 1.14.1
six 1.16.0
stack_data 0.6.3
traitlets 5.14.3
wcwidth 0.2.13

Python 3.12.7 | packaged by conda-forge | (main, Oct 4 2024, 16:05:46) [GCC 13.3.0]
Linux-6.2.0-1018-aws-x86_64-with-glibc2.35

Session information updated at 2024-10-10 03:56

ilan-gold (Contributor) commented

> Honestly, I can't tell why the existing code doesn't work from first principles, but I can confirm that when I hardcoded int64 for the output object, everything completed successfully.

What do you mean by hardcoding? We do have a test for this feature, so I'm definitely curious what could be going on here. I see you're on an AWS machine, but if you could start a debugger and just look at that line with your example, it would be amazing.

jacobkimmel (Author) commented Oct 16, 2024 via email

ilan-gold (Contributor) commented

Thanks so much @jacobkimmel. I would be open to adding something to force this usage, but I'm hesitant since I don't understand what is going on. We have a settings object coming out which would make this super easy.
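
(Purely hypothetical sketch of the kind of override a settings object could enable; the option name below is made up, not a real anndata API:)

import anndata as ad

# Hypothetical, made-up setting name -- not a real anndata option -- just
# to illustrate how a settings object could force the wider dtype:
ad.settings.concat_indptr_dtype = "int64"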

BTW, so you don't think I'm stonewalling: the reason we don't do int64 by default isn't performance or anything (indptr is relatively small), but more that CUDA doesn't handle int64 at all :( So we try our best to keep it int32.
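
(Illustrative sketch of that constraint in plain scipy: downcasting to the 32-bit index arrays the GPU side wants is only possible while the total nnz fits in int32's range.)

import numpy as np
from scipy import sparse

# Illustrative only: GPU sparse routines want 32-bit index arrays, so an
# int64 indptr has to be downcast before transfer -- which is only safe
# while the total nnz fits in int32.
def downcast_indices(mat: sparse.csr_matrix) -> sparse.csr_matrix:
    if mat.nnz > np.iinfo(np.int32).max:
        raise OverflowError("nnz too large for 32-bit indices")
    mat.indptr = mat.indptr.astype(np.int32)
    mat.indices = mat.indices.astype(np.int32)
    return mat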

ilan-gold self-assigned this Oct 17, 2024