[FEA]: CUB large input support #50

jrhemstad · 2023-04-21T17:17:40Z

As a lower-level interface, CUB should optimize for flexibility and performance. Consequently, CUB will not guarantee that large inputs work by default. However, it should let users specify their desired offset type.

This means CUB should not perform any dynamic dispatch based on the input size. Instead, users should have a way to statically specify the offset type. In previous discussions we favored making the type of `num_items` a template parameter and inferring the offset type from it.
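To make the proposal concrete, here is a minimal host-side sketch of inferring the offset type from the type of `num_items` at compile time. The names `offset_for_t` and `sum_n` are hypothetical and only illustrate the idea; this is not CUB's actual implementation, and the 32-bit/64-bit selection rule is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>
#include <vector>

// Hypothetical trait (an assumption, not CUB's code): pick an unsigned offset
// type from the type of num_items. Anything up to 4 bytes wide maps to a
// 32-bit offset; wider types map to a 64-bit offset.
template <typename NumItemsT>
struct offset_for
{
  static_assert(std::is_integral<NumItemsT>::value, "num_items must be an integer type");
  using type =
    typename std::conditional<(sizeof(NumItemsT) <= 4), std::uint32_t, std::uint64_t>::type;
};

template <typename NumItemsT>
using offset_for_t = typename offset_for<NumItemsT>::type;

// Hypothetical stand-in for a device algorithm. The offset type is fixed at
// compile time by the caller's choice of NumItemsT; there is no runtime
// dispatch on the value of num_items.
template <typename InputIt, typename NumItemsT>
std::uint64_t sum_n(InputIt first, NumItemsT num_items)
{
  using offset_t = offset_for_t<NumItemsT>;
  std::uint64_t total = 0;
  for (offset_t i = 0; i < static_cast<offset_t>(num_items); ++i)
  {
    total += static_cast<std::uint64_t>(first[i]);
  }
  return total;
}

// The selection happens entirely in the type system.
static_assert(std::is_same<offset_for_t<int>, std::uint32_t>::value, "int selects a 32-bit offset");
static_assert(std::is_same<offset_for_t<std::int64_t>, std::uint64_t>::value,
              "int64_t selects a 64-bit offset");
```

Under this scheme, a caller who passes `num_items` as `int` pays only for 32-bit index arithmetic, while a caller who passes `std::int64_t` opts in to 64-bit offsets explicitly.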

Design-related research

Testing large number of items

Add tests for large number of items & large number of segments to DeviceSegmentedRadixSort #2139
Add tests for large number of items & large number of segments to DeviceSegmentedSort #2140
Add tests for large number of items in DeviceScan::*ByKey

Enable large `num_items` in CUB algorithms that are sensitive to the choice of `offset_t`

Clean up interim testing infrastructure

Switch tests for large number of items to use Device* interface in DeviceSelect
Switch tests for large number of items to use Device* interface in DeviceScan

Documentation

We want to be explicit about supported and tested offset types in the Dispatch interface (see Port thrust::merge[_by_key] to CUB #1817 (comment))

elstehle · 2024-02-21T10:20:11Z

legend for offset type: ✅ considered done | 🟡 considered lower priority | 🟠 considered higher priority, as it prevents usage for larger-than-INT_MAX number of items | ⏳ in progress

legend for testing columns: ✅ considered done | 🟡 to be done | 🟠 needs to support wider offset types first

| algorithm | offset type | tests larger than `INT_MAX` | tests close to `[U]INT_MAX` |
| --- | --- | --- | --- |
| `device_adjacent_difference.cuh` | ✅ `choose_offset_t` | ✅ 2^33, sanity check, iterators | 🟡 |
| `device_copy.cuh` | 🟡 `num_ranges`: `uint32_t`<br>🟡 `buffer sizes`: `iterator_traits<SizeIteratorT>::value_type` | 🟡 | 🟡 |
| `device_for.cuh` | 🟡 `NumItemsT`: `ForEachN`, `ForEachCopyN`, `Bulk`<br>`difference_type`: `ForEach`, `ForEachCopy` | 🟡 | 🟡 |
| `device_histogram.cuh` | 🟡 dynamic dispatch: `int` if `(num_rows * row_stride_bytes) < INT_MAX`; `OffsetT` otherwise | 🟡 | 🟡 |
| `device_memcpy.cuh` | 🟡 `num_ranges`: `uint32_t`<br>🟡 `buffer sizes`: `iterator_traits<SizeIteratorT>::value_type` | 🟡 | 🟡 |
| `device_merge_sort.cuh` | 🟡 `NumItemsT` | ✅ extensive check | ✅ extensive check |
| `device_partition.cuh` | ⏳ `int`: `Flagged`, `If`<br>🟠 `int`: `ThreeWayPartition` | 🟠 | 🟠 |
| `device_radix_sort.cuh` | ✅ `choose_offset_t` | ✅ extensive check | ✅ extensive check |
| `device_reduce.cuh` | ✅ `choose_offset_t`: `Reduce`, `Sum`, `Min`, `Max`, `ReduceByKey`, `TransformReduce`<br>⚠️ (note) `int`: `ArgMin`, `ArgMax` | ✅ sanity, 2^{30,31,33} | ✅ sanity, 2^32-1 |
| `device_run_length_encode.cuh` | 🟠 `int` | 🟠 | 🟠 |
| `device_scan.cuh` | ✅ `choose_offset_t`: `DeviceScan`<br>⏳ `choose_offset_t`: `DeviceScanByKey` | 🟠 | 🟠 |
| `device_segmented_radix_sort.cuh` | 🟠 `num_items` & `num_segments`: `int` | 🟠 | 🟠 |
| `device_segmented_reduce.cuh` | 🟡 `common_iterator_value_t({Begin,End}OffsetIteratorT)`: `Reduce`, `Sum`, `Min`, `Max`<br>⚠️ (note) `int`: `ArgMin`, `ArgMax`<br>`num_segments`: `int` | ✅ sanity, rnd [2^31; 2^33] | 🟡 |
| `device_segmented_sort.cuh` | 🟠 `num_items` & `num_segments`: `int` | 🟠 | 🟠 |
| `device_select.cuh` | ✅ `choose_offset_t`: `UniqueByKey`<br>⏳ `int`: `Flagged`, `If`, `Unique` | | |
| `device_spmv.cuh` | 🟠 `int` | 🟠 | 🟠 |
| `device_merge.cuh` | 🟠 `int` | 🟡 | 🟡 |
| `device_transform.cuh` | 🟠 `int` | 🟡 | 🟡 |

jrhemstad mentioned this issue Apr 21, 2023

[EPIC] Universal 64-bit index type support in Thrust/CUB algorithms #47

Open

jrhemstad changed the title ~~Determine and finalize design for large input support in CUB~~ CUB large input support Apr 21, 2023

miscco added the `feature request` and `cub` labels Jul 12, 2023

miscco changed the title ~~CUB large input support~~ [FEA]: CUB large input support Jul 12, 2023

This was referenced Feb 21, 2024

Identify CUB algorithms that do/do not support large inputs #1408

Closed

Refactor thrust::[stable_]partition[_copy] to use cub::DevicePartition #1435

Merged

This was referenced Apr 3, 2024

Add support for large num_items to device_partition.cuh #1437

Open

Add tests for large number of items to DeviceSelect #1584

Closed

This was referenced Apr 30, 2024

Tensor.nonzero fails on GPU for tensors containing more than INT_MAX elements pytorch/pytorch#51871

Open

Add support for large num_items to device_select.cuh #1422

Open

elstehle mentioned this issue May 21, 2024

Port thrust::merge to CUB #1763

Closed

13 tasks

elstehle mentioned this issue May 30, 2024

Gather benchmark results for each CUB algorithm using different offset types #1787

Open

elstehle mentioned this issue Jun 10, 2024

Adds tests for large number of items in cub::DeviceScan #1830

Merged

2 tasks

elstehle mentioned this issue Jul 9, 2024

Port thrust::merge[_by_key] to CUB #1817

Merged

9 tasks

This was referenced Aug 12, 2024

[DRAFT]: Experimental: Streaming DeviceSelect #2205

Draft

Describe potential solutions for algorithm implementations and their pros/cons #1771

Closed

fbusato mentioned this issue Aug 16, 2024

Add CUB tests for segmented sort/radix sort with 64-bit num. items and segments #2254

Merged
