Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate if empty cluster handling could be simplified in tdigest aggregating #16901

Open
jihoonson opened this issue Sep 24, 2024 · 0 comments

Comments

@jihoonson
Copy link
Contributor

There are two APIs in libcudf for tdigest groupby aggregation, cudf::tdigest::detail::group_tdigest() and cudf::tdigest::detail::group_merge_tdigest(). The former takes the input as numeric values, and the latter takes tdigest columns. The numeric value column can contain nulls as it is a regular column. However, the tdigest column cannot contain nulls. Instead, it can contain an empty cluster for a group if all input values in the group to compute a tdigest were null.

To handle nulls, we are currently using a workaround based on explicit stubs. When all values are null in a group, we put a stub as a placeholder for an empty cluster to be created later. After the core computation is done, these stubs are removed before the result is returned. This workaround not only is certainly adding complexity to the implementation, but also might be adding some unnecessary overhead to handle empty clusters during the computation.

We should investigate if this workaround is absolutely necessary, and remove it if possible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant