GH-17211: refresh history for new compute fn infra
This commit includes changes to register a new compute function without
the burden of a long development history.

The change to cpp/src/arrow/CMakeLists.txt includes scalar_hash.cc in
compilation as it is used by the new Hash64 function defined in
api_scalar.[h,cc].

The change to cpp/src/arrow/compute/kernels/CMakeLists.txt includes
scalar_hash_test.cc in compilation for tests and it also adds a new
benchmark binary that is implemented by scalar_hash_benchmark.cc.

The registry files are updated to register the kernel implementations in
scalar_hash.cc with the function definitions in api_scalar.[h,cc].

Finally, docs/source/cpp/compute.rst adds documentation for the Hash64
function.

Issue: GH-17211
Issue: ARROW-8991
drin committed Jun 28, 2024
1 parent abf0832 commit 88d0a77
Showing 7 changed files with 47 additions and 0 deletions.
1 change: 1 addition & 0 deletions cpp/src/arrow/CMakeLists.txt
@@ -764,6 +764,7 @@ if(ARROW_COMPUTE)
compute/kernels/scalar_arithmetic.cc
compute/kernels/scalar_boolean.cc
compute/kernels/scalar_compare.cc
compute/kernels/scalar_hash.cc
compute/kernels/scalar_if_else.cc
compute/kernels/scalar_nested.cc
compute/kernels/scalar_random.cc
6 changes: 6 additions & 0 deletions cpp/src/arrow/compute/api_scalar.cc
@@ -927,6 +927,12 @@ Result<Datum> MapLookup(const Datum& arg, MapLookupOptions options, ExecContext*
  return CallFunction("map_lookup", {arg}, &options, ctx);
}

// ----------------------------------------------------------------------
// Hash functions
Result<Datum> Hash64(const Datum& input_array, ExecContext* ctx) {
  return CallFunction("hash_64", {input_array}, ctx);
}

// ----------------------------------------------------------------------

} // namespace compute
16 changes: 16 additions & 0 deletions cpp/src/arrow/compute/api_scalar.h
@@ -1718,5 +1718,21 @@ ARROW_EXPORT Result<Datum> NanosecondsBetween(const Datum& left, const Datum& ri
/// \note API not yet finalized
ARROW_EXPORT Result<Datum> MapLookup(const Datum& map, MapLookupOptions options,
ExecContext* ctx = NULLPTR);

/// \brief Construct a hash value for each row of the input.
///
/// The result is a UInt64Array with the same length as the input. Each element
/// of the output is a hash computed from the corresponding row of the input. If
/// the input is a nested array, every "attribute" or "field" of the input that
/// belongs to the same "row" contributes to a single uint64_t hash for that
/// row. At the moment, this function does not take options, though these may be
/// added in the future.
///
/// \param[in] input_array input data to hash
/// \param[in] ctx function execution context, optional
/// \return elementwise hash values
ARROW_EXPORT
Result<Datum> Hash64(const Datum& input_array, ExecContext* ctx = NULLPTR);

} // namespace compute
} // namespace arrow
2 changes: 2 additions & 0 deletions cpp/src/arrow/compute/kernels/CMakeLists.txt
@@ -73,6 +73,7 @@ add_arrow_compute_test(scalar_utility_test
scalar_random_test.cc
scalar_set_lookup_test.cc
scalar_validity_test.cc
scalar_hash_test.cc
EXTRA_LINK_LIBS
arrow_compute_kernels_testing)

@@ -87,6 +88,7 @@ add_arrow_benchmark(scalar_round_benchmark PREFIX "arrow-compute")
add_arrow_benchmark(scalar_set_lookup_benchmark PREFIX "arrow-compute")
add_arrow_benchmark(scalar_string_benchmark PREFIX "arrow-compute")
add_arrow_benchmark(scalar_temporal_benchmark PREFIX "arrow-compute")
add_arrow_benchmark(scalar_hash_benchmark PREFIX "arrow-compute")

# ----------------------------------------------------------------------
# Vector kernels
1 change: 1 addition & 0 deletions cpp/src/arrow/compute/registry.cc
@@ -299,6 +299,7 @@ static std::unique_ptr<FunctionRegistry> CreateBuiltInRegistry() {
RegisterScalarArithmetic(registry.get());
RegisterScalarBoolean(registry.get());
RegisterScalarComparison(registry.get());
RegisterScalarHash(registry.get());
RegisterScalarIfElse(registry.get());
RegisterScalarNested(registry.get());
RegisterScalarRandom(registry.get()); // Nullary
1 change: 1 addition & 0 deletions cpp/src/arrow/compute/registry_internal.h
@@ -30,6 +30,7 @@ void RegisterScalarBoolean(FunctionRegistry* registry);
void RegisterScalarCast(FunctionRegistry* registry);
void RegisterDictionaryDecode(FunctionRegistry* registry);
void RegisterScalarComparison(FunctionRegistry* registry);
void RegisterScalarHash(FunctionRegistry* registry);
void RegisterScalarIfElse(FunctionRegistry* registry);
void RegisterScalarNested(FunctionRegistry* registry);
void RegisterScalarRandom(FunctionRegistry* registry); // Nullary
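
The registration flow touched by registry.cc and registry_internal.h (a per-file hook such as RegisterScalarHash adds kernels to the registry, and a thin wrapper such as Hash64 dispatches by name through CallFunction) can be sketched with a toy registry. This is only an illustration under stated assumptions: ToyRegistry, Kernel, Column, RegisterToyHash and ToyHash64 are hypothetical names, not Arrow classes or functions, and std::hash stands in for the real kernel in scalar_hash.cc.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for Arrow's registry and kernel types; none of
// these are the real Arrow classes.
using Column = std::vector<std::int64_t>;
using Kernel = std::function<std::vector<std::uint64_t>(const Column&)>;

class ToyRegistry {
 public:
  // Associate a function name with its kernel; duplicate names are rejected.
  void AddFunction(const std::string& name, Kernel kernel) {
    if (!kernels_.emplace(name, std::move(kernel)).second) {
      throw std::invalid_argument("already registered: " + name);
    }
  }

  // Look the kernel up by name and run it on the input.
  std::vector<std::uint64_t> Call(const std::string& name,
                                  const Column& input) const {
    auto it = kernels_.find(name);
    if (it == kernels_.end()) {
      throw std::out_of_range("no such function: " + name);
    }
    return it->second(input);
  }

 private:
  std::map<std::string, Kernel> kernels_;
};

// Plays the role of a per-file registration hook such as RegisterScalarHash.
void RegisterToyHash(ToyRegistry* registry) {
  registry->AddFunction("hash_64", [](const Column& input) {
    std::vector<std::uint64_t> out;
    out.reserve(input.size());
    // std::hash is a placeholder for the actual hashing kernel.
    for (std::int64_t v : input) out.push_back(std::hash<std::int64_t>{}(v));
    return out;
  });
}

// Plays the role of a convenience wrapper that dispatches by function name.
std::vector<std::uint64_t> ToyHash64(const ToyRegistry& registry,
                                     const Column& input) {
  return registry.Call("hash_64", input);
}
```

In this sketch the wrapper and the kernel are coupled only through the string "hash_64", which is why the commit has to touch both the API files and the registry files.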
20 changes: 20 additions & 0 deletions docs/source/cpp/compute.rst
@@ -1200,6 +1200,26 @@ Containment tests
* \(8) Output is true iff :member:`MatchSubstringOptions::pattern`
matches the corresponding input element at any position.


Hash Functions
~~~~~~~~~~~~~~

Not to be confused with the "group by" functions, hash functions produce an array of
hash values with one element per element of the input. Currently, these functions take a
single array as input.

+-----------------------+-------+-----------------------------------+-------------+---------------+-------+
| Function name         | Arity | Input types                       | Output type | Options class | Notes |
+=======================+=======+===================================+=============+===============+=======+
| hash_64               | Unary | Any                               | UInt64      |               | \(1)  |
+-----------------------+-------+-----------------------------------+-------------+---------------+-------+

* \(1) The hashing algorithm is "xxHash-like", making some minor trade-offs in favor of
  performance. Arrays containing nested types are recursively walked and flattened, so
  that the fields or attributes corresponding to the same row are hashed and combined
  into a single hash value.


Categorizations
~~~~~~~~~~~~~~~

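
The row-wise behaviour described in note (1) of the compute.rst hunk (nested input is walked recursively, and all fields of a row fold into one 64-bit value) can be illustrated with a small self-contained sketch. Everything here is an assumption made for illustration: ToyColumn, HashBytes, HashInto and ToyHashRows are hypothetical names, every leaf is assumed to hold at least num_rows values, and FNV-1a is swapped in for the "xxHash-like" algorithm, whose details are not shown in this commit.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical toy model of a nested column: either a leaf holding one string
// value per row, or a struct-like node holding child columns.
struct ToyColumn {
  std::vector<std::string> values;  // used when `children` is empty
  std::vector<ToyColumn> children;  // used for nested (struct-like) columns
};

constexpr std::uint64_t kFnvOffsetBasis = 14695981039346656037ULL;
constexpr std::uint64_t kFnvPrime = 1099511628211ULL;

// FNV-1a over the bytes of one value, continuing from `seed` so that several
// fields can be chained. The length is mixed in first so that field
// boundaries influence the result. This is a stand-in hash, not xxHash.
std::uint64_t HashBytes(const std::string& bytes, std::uint64_t seed) {
  std::uint64_t h = seed;
  h ^= static_cast<std::uint64_t>(bytes.size());
  h *= kFnvPrime;
  for (unsigned char c : bytes) {
    h ^= c;
    h *= kFnvPrime;
  }
  return h;
}

// Recursively walk the column and fold every leaf into the per-row hashes.
// Assumes each leaf has at least row_hashes->size() values.
void HashInto(const ToyColumn& column, std::vector<std::uint64_t>* row_hashes) {
  if (column.children.empty()) {
    for (std::size_t row = 0; row < row_hashes->size(); ++row) {
      (*row_hashes)[row] = HashBytes(column.values[row], (*row_hashes)[row]);
    }
    return;
  }
  for (const ToyColumn& child : column.children) HashInto(child, row_hashes);
}

// Produce one uint64 hash per row, however deeply the input is nested.
std::vector<std::uint64_t> ToyHashRows(const ToyColumn& column,
                                       std::size_t num_rows) {
  std::vector<std::uint64_t> row_hashes(num_rows, kFnvOffsetBasis);
  HashInto(column, &row_hashes);
  return row_hashes;
}
```

Because the walk flattens the nesting, a struct of (ids, struct of (names)) hashes the same as a flat struct of (ids, names) in this sketch. Whether the real kernel has the same property is not stated in the commit.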
