GH-44010: [C++] Add arrow::RecordBatch::MakeStatisticsArray()
#44252
auto enumerate_statistics =
    [&](std::function<Status(int nth_statistics, bool start_new_column,
                             std::optional<int32_t> nth_column, const char* key,
                             const std::shared_ptr<DataType>& type,
                             const ArrayStatistics::ValueType& value)>
            yield) {
      int nth_statistics = 0;
      RETURN_NOT_OK(yield(nth_statistics++, true, std::nullopt,
                          ARROW_STATISTICS_KEY_ROW_COUNT_EXACT, int64(),
                          ArrayStatistics::ValueType{num_rows_}));

      int num_fields = schema_->num_fields();
      for (int nth_column = 0; nth_column < num_fields; ++nth_column) {
        auto statistics = column(nth_column)->statistics();
        if (!statistics) {
          continue;
        }

        bool start_new_column = true;
        if (statistics->null_count.has_value()) {
          RETURN_NOT_OK(yield(
              nth_statistics++, start_new_column, std::optional<int32_t>(nth_column),
              ARROW_STATISTICS_KEY_NULL_COUNT_EXACT, int64(),
              ArrayStatistics::ValueType{statistics->null_count.value()}));
          start_new_column = false;
        }

        if (statistics->distinct_count.has_value()) {
          RETURN_NOT_OK(yield(
              nth_statistics++, start_new_column, std::optional<int32_t>(nth_column),
              ARROW_STATISTICS_KEY_DISTINCT_COUNT_EXACT, int64(),
              ArrayStatistics::ValueType{statistics->distinct_count.value()}));
          start_new_column = false;
        }

        if (statistics->min.has_value()) {
          if (statistics->is_min_exact) {
            RETURN_NOT_OK(yield(nth_statistics++, start_new_column,
                                std::optional<int32_t>(nth_column),
                                ARROW_STATISTICS_KEY_MIN_VALUE_EXACT,
                                statistics->MinArrowType(), statistics->min.value()));
          } else {
            RETURN_NOT_OK(yield(nth_statistics++, start_new_column,
                                std::optional<int32_t>(nth_column),
                                ARROW_STATISTICS_KEY_MIN_VALUE_APPROXIMATE,
                                statistics->MinArrowType(), statistics->min.value()));
          }
          start_new_column = false;
        }

        if (statistics->max.has_value()) {
          if (statistics->is_max_exact) {
            RETURN_NOT_OK(yield(nth_statistics++, start_new_column,
                                std::optional<int32_t>(nth_column),
                                ARROW_STATISTICS_KEY_MAX_VALUE_EXACT,
                                statistics->MaxArrowType(), statistics->max.value()));
          } else {
            RETURN_NOT_OK(yield(nth_statistics++, start_new_column,
                                std::optional<int32_t>(nth_column),
                                ARROW_STATISTICS_KEY_MAX_VALUE_APPROXIMATE,
                                statistics->MaxArrowType(), statistics->max.value()));
          }
          start_new_column = false;
        }
      }
      return Status::OK();
    };
We may want to extract this as an internal function.
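One possible shape for that extraction, as a minimal sketch only: the enumeration becomes a free function that takes the row count, the per-column statistics, and a `yield` callback. Everything here is a hypothetical stand-in so the example compiles without Arrow: `StatValue`, `ColumnStats`, `EnumeratedStatistic`, and `EnumerateStatistics` are not Arrow names, the callback returns `bool` instead of `Status`, the `type` argument is dropped, and the key strings are illustrative placeholders for the `ARROW_STATISTICS_KEY_*` macros.

```cpp
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-in for ArrayStatistics::ValueType.
using StatValue = std::variant<bool, int64_t, uint64_t, double, std::string>;

// Hypothetical stand-in for the per-column statistics read by the lambda.
struct ColumnStats {
  std::optional<int64_t> null_count;
  std::optional<int64_t> distinct_count;
  std::optional<StatValue> min;
  bool is_min_exact = false;
  std::optional<StatValue> max;
  bool is_max_exact = false;
};

// One enumerated statistic; mirrors the arguments passed to yield().
struct EnumeratedStatistic {
  int nth_statistics;
  bool start_new_column;
  std::optional<int32_t> nth_column;  // empty for record-batch-level statistics
  std::string key;
  StatValue value;
};

// Returns false to stop the enumeration early (Status in the real code).
using Yield = std::function<bool(const EnumeratedStatistic&)>;

bool EnumerateStatistics(int64_t num_rows,
                         const std::vector<std::optional<ColumnStats>>& columns,
                         const Yield& yield) {
  int nth_statistics = 0;
  // The row count belongs to the whole batch, so it has no column index.
  if (!yield({nth_statistics++, true, std::nullopt, "ARROW:row_count:exact",
              StatValue{num_rows}})) {
    return false;
  }
  const auto num_columns = static_cast<int32_t>(columns.size());
  for (int32_t nth_column = 0; nth_column < num_columns; ++nth_column) {
    const auto& stats = columns[nth_column];
    if (!stats) continue;  // columns without statistics are skipped
    bool start_new_column = true;
    // Emits one statistic and clears the "first statistic of column" flag.
    auto emit = [&](const char* key, const StatValue& value) {
      bool ok = yield({nth_statistics++, start_new_column, nth_column, key, value});
      start_new_column = false;
      return ok;
    };
    if (stats->null_count &&
        !emit("ARROW:null_count:exact", StatValue{*stats->null_count})) {
      return false;
    }
    if (stats->distinct_count &&
        !emit("ARROW:distinct_count:exact", StatValue{*stats->distinct_count})) {
      return false;
    }
    if (stats->min && !emit(stats->is_min_exact ? "ARROW:min_value:exact"
                                                : "ARROW:min_value:approximate",
                            *stats->min)) {
      return false;
    }
    if (stats->max && !emit(stats->is_max_exact ? "ARROW:max_value:exact"
                                                : "ARROW:max_value:approximate",
                            *stats->max)) {
      return false;
    }
  }
  return true;
}
```

With the enumeration outside `RecordBatch`, the same walk could be reused by any caller that supplies its own `yield`, for example one pass to count entries and a second pass to append values.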
Force-pushed from 903e3f4 to 92afc83.
It's a convenience function that converts the `arrow::ArrayStatistics` in an `arrow::RecordBatch` to an `arrow::Array` for the Arrow C data interface.
Force-pushed from 92afc83 to b194430.
@pitrou @ianmcook What do you think about this? The statistics schema https://github.com/apache/arrow/pull/43553/files#diff-f3758fb6986ea8d24bb2e13c2feb625b68bbd6b93b3fbafd3e2a03dcdc7ba263R86-R95 is compact, but it may be complex to build because it uses many nested types.
Rationale for this change
The statistics schema for the Arrow C data interface (GH-43553) is complex because it uses nested types (struct, map and union), so a reusable implementation that builds the statistics array is useful.
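To make the struct/map/union nesting concrete, here is a plain C++ model of that shape, offered as a rough sketch rather than the Arrow schema itself. All names (`UnionValue`, `StatisticsMap`, `StatisticsRow`, `FlatStatistic`, `NestStatistics`) are hypothetical, the union is reduced to two alternatives, and the keys are placeholder strings. It groups a flat list of statistics into one row per column, which is the kind of bookkeeping a reusable builder would hide.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Union layer: a statistic value can have one of several types.
using UnionValue = std::variant<int64_t, double>;

// Map layer: ordered key/value entries for one column.
using StatisticsMap = std::vector<std::pair<std::string, UnionValue>>;

// Struct layer: one row per column that has statistics.
struct StatisticsRow {
  std::optional<int32_t> column;  // empty means "the whole record batch"
  StatisticsMap statistics;
};

// A single statistic before nesting.
struct FlatStatistic {
  std::optional<int32_t> column;
  std::string key;
  UnionValue value;
};

// Groups consecutive statistics that share a column into one row.
std::vector<StatisticsRow> NestStatistics(const std::vector<FlatStatistic>& flat) {
  std::vector<StatisticsRow> rows;
  for (const auto& statistic : flat) {
    if (rows.empty() || rows.back().column != statistic.column) {
      rows.push_back(StatisticsRow{statistic.column, {}});
    }
    rows.back().statistics.emplace_back(statistic.key, statistic.value);
  }
  return rows;
}
```

Even in this simplified model, every value passes through three layers before it lands in a row; with real Arrow builders each layer has its own builder and offsets to keep in sync.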
What changes are included in this PR?
`arrow::RecordBatch::MakeStatisticsArray()` is a convenience function that converts the `arrow::ArrayStatistics` in an `arrow::RecordBatch` to an `arrow::Array` for the Arrow C data interface.
Are these changes tested?
Yes.
Are there any user-facing changes?
Yes.
`arrow::ArrayStatistics` to `arrow::Array` for the Arrow C data interface #44010