[BUG] Avoid allocating and using size_input vector while computing output col sizes when lists are present. #16985

Open · mhaseeb123 opened this issue Oct 3, 2024 · 0 comments
Labels: 0 - Backlog (In queue waiting for assignment), bug (Something isn't working), cuIO (cuIO issue), libcudf (Affects libcudf (C++/CUDA) code)

mhaseeb123 commented Oct 3, 2024

Describe the bug
We currently allocate a temporary vector called size_input with num_keys = input_cols.size() × max_depth × total_number_of_pages elements when computing output column sizes for tables containing lists. For ultra-wide, deeply nested tables this leads to OOM once num_keys becomes too large (e.g. 25k input columns × 5 max depth × 25k total pages = 3.125B keys). Note that even if a larger GPU avoids the OOM, the loop that iterates over num_keys will certainly produce a runtime error or, worse, a silent logical error downstream.
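For scale, here is a small standalone calculation using the figures above (the 8-byte element size is an assumption for illustration, not necessarily the actual size_input element type):

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <limits>

int main()
{
  // Illustrative figures from the example above: 25k input columns, max depth 5, 25k pages.
  std::size_t const num_input_cols = 25'000;
  std::size_t const max_depth      = 5;
  std::size_t const total_pages    = 25'000;

  std::size_t const num_keys = num_input_cols * max_depth * total_pages;  // 3,125,000,000

  std::cout << "num_keys                 = " << num_keys << '\n'
            << "fits in a 32-bit index?  = " << std::boolalpha
            << (num_keys <= static_cast<std::size_t>(std::numeric_limits<std::int32_t>::max()))
            << '\n'
            // Assuming 8-byte elements, the temporary vector alone would need ~25 GB.
            << "scratch bytes (8 B/elem) = " << num_keys * sizeof(std::size_t) << '\n';
}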

To avoid this, we should update the sizes and PageNestingInfo.page_start_value fields using cuda::atomic_ref whenever num_keys is > 2B (to avoid breaking this loop).
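As a generic, self-contained illustration of the technique (not libcudf code; assumes libcu++'s cuda::atomic_ref and nvcc's --extended-lambda flag), accumulating into a small output array with atomics instead of materializing a num_keys-sized scratch vector looks roughly like this:

#include <cuda/atomic>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/for_each.h>
#include <thrust/host_vector.h>
#include <thrust/iterator/counting_iterator.h>

#include <cstddef>
#include <cstdio>

int main()
{
  std::size_t const num_keys  = 1'000'000;  // flat key space (cols × depth × pages in the issue)
  std::size_t const num_slots = 8;          // output slots, i.e. (col_idx * max_depth) + depth_idx

  // Only the small output accumulator lives on the device; no num_keys-sized scratch vector.
  thrust::device_vector<std::size_t> sizes(num_slots, 0);

  thrust::for_each(thrust::device,
                   thrust::make_counting_iterator<std::size_t>(0),
                   thrust::make_counting_iterator<std::size_t>(num_keys),
                   [sizes = sizes.data().get(), num_slots] __device__(std::size_t index) {
                     // Many keys map to the same output slot, so the update must be atomic.
                     cuda::atomic_ref<std::size_t, cuda::thread_scope_device> ref{
                       sizes[index % num_slots]};
                     ref.fetch_add(std::size_t{1});
                   });

  thrust::host_vector<std::size_t> const h = sizes;
  for (auto const v : h) { std::printf("%zu\n", v); }  // each slot holds num_keys / num_slots
}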

Steps/Code to reproduce bug
On any RDS machine:

import cudf
import os
import rmm
import pyarrow.parquet as pq
from io import StringIO  # needed for the JSONL buffer below
from numba import cuda

cuda.select_device(0)
rmm.mr.set_current_device_resource(rmm.mr.CudaAsyncMemoryResource())
os.environ["KVIKIO_COMPAT_MODE"] = "on"

if __name__ == "__main__":
    df = cudf.read_parquet(
        "/datasets/gkimball/spark_json/20241001/part-00000-505e98e9-a5c8-4720-8bb4-d6cc96625744-c000.snappy.parquet"
    )
    print("cudf read input parquet")

    buf = StringIO(df["columnC"].str.cat(sep="\n", na_rep="{}"))
    print("made JSONL buffer")

    df = cudf.read_json(buf, lines=True)
    print("cudf read JSONL buffer using pandas engine")

    df = cudf.DataFrame(
        {"AENBHHGIABBBDDGOEI": df["AENBHHGIABBBDDGOEI"]}
    ).to_arrow()

    pq.write_table(df, "/home/coder/transcoded/entire_problematic_arrow.pq")
    print("pyarrow wrote parquet")

    _ = cudf.read_parquet("/home/coder/transcoded/entire_problematic_arrow.pq")
    print("cudf read parquet") <-- OOMs while allocating size_input

Expected behavior
We should not OOM. Here is an unrefined alternative loop that updates the sizes vector without allocating size_input. The reduction_keys iterator would need to be redefined, and PageNestingInfo.page_start_value updated correspondingly for each page of each input column.

      thrust::for_each(rmm::exec_policy(_stream),
                       thrust::make_counting_iterator<size_t>(0),
                       thrust::make_counting_iterator<size_t>(num_keys),
                       [input_cols = d_cols_info.data(),
                        pages      = subpass.pages.device_begin(),
                        max_depth  = max_depth,
                        num_pages  = subpass.pages.size(),
                        sizes      = sizes.d_begin()] __device__(size_t index) {
                         // Decompose the flat key into (column, depth, page) coordinates.
                         auto const indices = reduction_indices{index, max_depth, num_pages};
                         auto const& page   = pages[indices.page_idx];
                         // Several pages can contribute to the same column/depth slot,
                         // so the accumulation has to be atomic.
                         cuda::atomic_ref<size_t, cuda::thread_scope_device> sizes_ref{
                           sizes[(indices.col_idx * max_depth) + indices.depth_idx]};
                         if (page.src_col_schema == input_cols[indices.col_idx].schema_idx and
                             not(page.flags & PAGEINFO_FLAGS_DICTIONARY) and
                             indices.depth_idx < input_cols[indices.col_idx].nesting_depth) {
                           sizes_ref.fetch_add(page.nesting[indices.depth_idx].batch_size);
                         }
                       });
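For context, reduction_indices above just decomposes the flat key into page/depth/column coordinates. One possible shape of that helper, as a sketch that assumes the page index varies fastest, then depth, then column (the actual decomposition in the reader may differ):

// Sketch only: decompose a flat key in [0, num_keys) into (col, depth, page) coordinates,
// assuming page varies fastest, then depth, then column.
struct reduction_indices {
  size_t const page_idx;
  cudf::size_type const depth_idx;
  cudf::size_type const col_idx;

  __device__ reduction_indices(size_t index, cudf::size_type max_depth, size_t num_pages)
    : page_idx(index % num_pages),
      depth_idx(static_cast<cudf::size_type>((index / num_pages) % max_depth)),
      col_idx(static_cast<cudf::size_type>(index / (max_depth * num_pages)))
  {
  }
};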

Environment details
cudf branch-24.12, on RDS machine dgx-05, running dev-container cuda12.5-conda.

Additional context
N/A
