Investigate if empty cluster handling could be simplified in tdigest aggregating #16901

jihoonson · 2024-09-24T21:15:11Z

libcudf has two APIs for tdigest groupby aggregation: cudf::tdigest::detail::group_tdigest() and cudf::tdigest::detail::group_merge_tdigest(). The former takes numeric values as input, and the latter takes tdigest columns. The numeric value column is a regular column, so it can contain nulls. A tdigest column, however, cannot contain nulls. Instead, it holds an empty cluster for a group when every input value used to compute that group's tdigest was null.

To handle nulls, we currently use a workaround based on explicit stubs. When all values in a group are null, we insert a stub as a placeholder for the empty cluster that will be created later. After the core computation is done, these stubs are removed before the result is returned. This workaround adds complexity to the implementation, and it might also add unnecessary overhead for handling empty clusters during the computation.
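To make the question concrete, here is a minimal host-side sketch of the two strategies. It is a toy model, not libcudf's implementation: every name in it (Centroid, GroupedDigests, build_with_stubs, build_direct, cluster_size) is hypothetical, each non-null value simply becomes a weight-1 centroid with no compression, and the stub is assumed to be a zero-weight centroid.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical host-side model; none of these names exist in libcudf.
struct Centroid {
  double mean;
  double weight;
};

// One tdigest per group, stored as a flat centroid list plus group offsets.
// Group g owns centroids [offsets[g], offsets[g + 1]).
struct GroupedDigests {
  std::vector<Centroid> centroids;
  std::vector<std::size_t> offsets{0};
};

using Group = std::vector<std::optional<double>>;

// Number of centroids in group g; 0 means an empty cluster.
std::size_t cluster_size(GroupedDigests const& d, std::size_t g) {
  assert(g + 1 < d.offsets.size());
  return d.offsets[g + 1] - d.offsets[g];
}

// Stub-based variant: an all-null group first receives a zero-weight
// placeholder centroid, which is stripped again in a second pass.
GroupedDigests build_with_stubs(std::vector<Group> const& groups) {
  GroupedDigests staged;
  for (auto const& group : groups) {
    std::size_t valid = 0;
    for (auto const& v : group) {
      if (v.has_value()) {
        staged.centroids.push_back({*v, 1.0});
        ++valid;
      }
    }
    if (valid == 0) { staged.centroids.push_back({0.0, 0.0}); }  // stub (assumed shape)
    staged.offsets.push_back(staged.centroids.size());
  }

  // Second pass: drop the stubs and rebuild the offsets.
  GroupedDigests result;
  for (std::size_t g = 0; g + 1 < staged.offsets.size(); ++g) {
    for (std::size_t i = staged.offsets[g]; i < staged.offsets[g + 1]; ++i) {
      if (staged.centroids[i].weight > 0.0) { result.centroids.push_back(staged.centroids[i]); }
    }
    result.offsets.push_back(result.centroids.size());
  }
  return result;
}

// Direct variant: an all-null group just contributes an empty offset range.
GroupedDigests build_direct(std::vector<Group> const& groups) {
  GroupedDigests result;
  for (auto const& group : groups) {
    for (auto const& v : group) {
      if (v.has_value()) { result.centroids.push_back({*v, 1.0}); }
    }
    result.offsets.push_back(result.centroids.size());
  }
  return result;
}
```

In this toy model both variants yield the same offsets and centroids, and the stub variant only adds an extra pass. Whether the same holds for the real GPU kernels is exactly what needs to be investigated, since the core computation there may rely on every group having at least one entry.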

We should investigate whether this workaround is actually necessary, and remove it if possible.
