There are two APIs in libcudf for tdigest groupby aggregation: `cudf::tdigest::detail::group_tdigest()` and `cudf::tdigest::detail::group_merge_tdigest()`. The former takes numeric values as input, and the latter takes tdigest columns. The numeric value column can contain nulls, as it is a regular column. The tdigest column, however, cannot contain nulls. Instead, it contains an empty cluster for a group if all of the input values used to compute that group's tdigest were null.
To handle nulls, we currently use a workaround based on explicit stubs. When all values in a group are null, we insert a stub as a placeholder for the empty cluster that will be created later. After the core computation is done, these stubs are removed before the result is returned. This workaround certainly adds complexity to the implementation, and it might also add unnecessary overhead for handling empty clusters during the computation.
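To make the stub pattern concrete, here is a minimal host-side sketch in plain C++. It is not the libcudf implementation: the `Centroid` struct, the `is_stub` flag, and the functions `build_with_stubs()` and `remove_stubs()` are all hypothetical names made up for illustration, and the real code runs on the GPU over column data rather than `std::vector`. The sketch only shows the two phases described above: a placeholder is inserted for each group that has no valid values, and the placeholders are stripped afterwards so that the group ends up with an empty cluster.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one tdigest centroid; not a libcudf type.
struct Centroid {
  double mean;
  double weight;
  bool is_stub;  // true only for the placeholder of a group with no valid values
};

// Toy per-group digest. An empty vector plays the role of an "empty cluster".
using Digest = std::vector<Centroid>;

// Phase 1: emit one unit-weight centroid per valid (non-null) value, then put
// a stub into every group that received no centroid at all.
std::vector<Digest> build_with_stubs(const std::vector<double>& values,
                                     const std::vector<bool>& valid,
                                     const std::vector<std::size_t>& group_ids,
                                     std::size_t num_groups)
{
  assert(values.size() == valid.size());
  assert(values.size() == group_ids.size());

  std::vector<Digest> groups(num_groups);
  for (std::size_t i = 0; i < values.size(); ++i) {
    assert(group_ids[i] < num_groups);
    if (valid[i]) { groups[group_ids[i]].push_back(Centroid{values[i], 1.0, false}); }
  }
  for (auto& g : groups) {
    // Assumption: a stub carries zero weight so it cannot affect any result.
    if (g.empty()) { g.push_back(Centroid{0.0, 0.0, true}); }
  }
  return groups;
}

// Phase 2: strip the stubs so that all-null groups end up as empty clusters.
void remove_stubs(std::vector<Digest>& groups)
{
  for (auto& g : groups) {
    g.erase(std::remove_if(g.begin(), g.end(), [](const Centroid& c) { return c.is_stub; }),
            g.end());
  }
}
```

In this toy version, every step between the two phases would have to know how to treat a stub, which is the kind of extra complexity and possible overhead this issue is concerned with.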
We should investigate whether this workaround is actually necessary, and remove it if possible.