[BUG] Avoid allocating and using size_input vector while computing output col sizes when lists are present. #16985

Open · mhaseeb123 opened this issue Oct 3, 2024 · 0 comments
Labels: 0 - Backlog (In queue waiting for assignment), bug (Something isn't working), cuIO (cuIO issue), libcudf (Affects libcudf (C++/CUDA) code)

mhaseeb123 commented Oct 3, 2024

Describe the bug
We currently allocate a temporary vector called size_input with num_keys = input_cols.size() × max_depth × total_number_of_pages elements when computing output column sizes for tables containing lists. For ultra-wide, deeply nested tables this leads to OOM once num_keys becomes too large (e.g. 25k input columns × 5 max depth × 25k total pages = 3.125B keys). Note that even if a larger GPU avoids the OOM, the loop that iterates over num_keys will certainly produce a runtime error or, worse, a silent logical error downstream.
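For scale, here is a small standalone calculation using the figures above (the 8-byte element size is an assumption for illustration, not necessarily the actual size_input element type):

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <limits>

int main()
{
  // Illustrative figures from the example above: 25k input columns, max depth 5, 25k pages.
  std::size_t const num_input_cols = 25'000;
  std::size_t const max_depth      = 5;
  std::size_t const total_pages    = 25'000;

  std::size_t const num_keys = num_input_cols * max_depth * total_pages;  // 3,125,000,000

  std::cout << "num_keys                 = " << num_keys << '\n'
            << "fits in a 32-bit index?  = " << std::boolalpha
            << (num_keys <= static_cast<std::size_t>(std::numeric_limits<std::int32_t>::max()))
            << '\n'
            // Assuming 8-byte elements, the temporary vector alone would need ~25 GB.
            << "scratch bytes (8 B/elem) = " << num_keys * sizeof(std::size_t) << '\n';
}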

To avoid this, we should update the sizes and PageNestingInfo.page_start_value fields using cuda::atomic_ref whenever num_keys is > 2B (to avoid breaking this loop).
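As a generic, self-contained illustration of the technique (not libcudf code; assumes libcu++'s cuda::atomic_ref and nvcc's --extended-lambda flag), accumulating into a small output array with atomics instead of materializing a num_keys-sized scratch vector looks roughly like this:

#include <cuda/atomic>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/for_each.h>
#include <thrust/host_vector.h>
#include <thrust/iterator/counting_iterator.h>

#include <cstddef>
#include <cstdio>

int main()
{
  std::size_t const num_keys  = 1'000'000;  // flat key space (cols × depth × pages in the issue)
  std::size_t const num_slots = 8;          // output slots, i.e. (col_idx * max_depth) + depth_idx

  // Only the small output accumulator lives on the device; no num_keys-sized scratch vector.
  thrust::device_vector<std::size_t> sizes(num_slots, 0);

  thrust::for_each(thrust::device,
                   thrust::make_counting_iterator<std::size_t>(0),
                   thrust::make_counting_iterator<std::size_t>(num_keys),
                   [sizes = sizes.data().get(), num_slots] __device__(std::size_t index) {
                     // Many keys map to the same output slot, so the update must be atomic.
                     cuda::atomic_ref<std::size_t, cuda::thread_scope_device> ref{
                       sizes[index % num_slots]};
                     ref.fetch_add(std::size_t{1});
                   });

  thrust::host_vector<std::size_t> const h = sizes;
  for (auto const v : h) { std::printf("%zu\n", v); }  // each slot holds num_keys / num_slots
}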

Steps/Code to reproduce bug
On any RDS machine:

import cudf
import os
import rmm
import pyarrow.parquet as pq
from io import StringIO  # needed for the JSONL buffer below
from numba import cuda

cuda.select_device(0)
rmm.mr.set_current_device_resource(rmm.mr.CudaAsyncMemoryResource())
os.environ["KVIKIO_COMPAT_MODE"] = "on"

if __name__ == "__main__":
    df = cudf.read_parquet(
        "/datasets/gkimball/spark_json/20241001/part-00000-505e98e9-a5c8-4720-8bb4-d6cc96625744-c000.snappy.parquet"
    )
    print("cudf read input parquet")

    buf = StringIO(df["columnC"].str.cat(sep="\n", na_rep="{}"))
    print("made JSONL buffer")

    df = cudf.read_json(buf, lines=True)
    print("cudf read JSONL buffer using pandas engine")

    df = cudf.DataFrame(
        {"AENBHHGIABBBDDGOEI": df["AENBHHGIABBBDDGOEI"]}
    ).to_arrow()

    pq.write_table(df, "/home/coder/transcoded/entire_problematic_arrow.pq")
    print("pyarrow wrote parquet")

    _ = cudf.read_parquet("/home/coder/transcoded/entire_problematic_arrow.pq")
    print("cudf read parquet") <-- OOMs while allocating size_input

Expected behavior
We should not OOM. Here is an unrefined alternative loop that updates the sizes vector without allocating size_input. The reduction_keys iterator would need to be redefined, and PageNestingInfo.page_start_value updated correspondingly for each page of each input column.

      thrust::for_each(rmm::exec_policy(_stream),
                       thrust::make_counting_iterator<size_t>(0),
                       thrust::make_counting_iterator<size_t>(num_keys),
                       [input_cols = d_cols_info.data(),
                        pages      = subpass.pages.device_begin(),
                        max_depth  = max_depth,
                        num_pages  = subpass.pages.size(),
                        sizes      = sizes.d_begin()] __device__(size_t index) {
                         // Decompose the flat key into (column, depth, page) coordinates.
                         auto const indices = reduction_indices{index, max_depth, num_pages};
                         auto const& page   = pages[indices.page_idx];
                         // Several pages can contribute to the same column/depth slot,
                         // so the accumulation has to be atomic.
                         cuda::atomic_ref<size_t, cuda::thread_scope_device> sizes_ref{
                           sizes[(indices.col_idx * max_depth) + indices.depth_idx]};
                         if (page.src_col_schema == input_cols[indices.col_idx].schema_idx and
                             not(page.flags & PAGEINFO_FLAGS_DICTIONARY) and
                             indices.depth_idx < input_cols[indices.col_idx].nesting_depth) {
                           sizes_ref.fetch_add(page.nesting[indices.depth_idx].batch_size);
                         }
                       });
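For context, reduction_indices above just decomposes the flat key into page/depth/column coordinates. One possible shape of that helper, as a sketch that assumes the page index varies fastest, then depth, then column (the actual decomposition in the reader may differ):

// Sketch only: decompose a flat key in [0, num_keys) into (col, depth, page) coordinates,
// assuming page varies fastest, then depth, then column.
struct reduction_indices {
  size_t const page_idx;
  cudf::size_type const depth_idx;
  cudf::size_type const col_idx;

  __device__ reduction_indices(size_t index, cudf::size_type max_depth, size_t num_pages)
    : page_idx(index % num_pages),
      depth_idx(static_cast<cudf::size_type>((index / num_pages) % max_depth)),
      col_idx(static_cast<cudf::size_type>(index / (max_depth * num_pages)))
  {
  }
};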

Environment details
cudf branch-24.12, on RDS machine dgx-05, running dev-container cuda12.5-conda.

Additional context
N/A
