
[C++] Request to move ChunkResolver to public API #34535

Closed
anjakefala opened this issue Mar 11, 2023 · 21 comments

@anjakefala
Collaborator

Describe the enhancement requested

ChunkResolver.Resolve is used by ChunkedArray.GetScalar() to identify which chunk, and which index into that chunk, correspond to a given index into the whole array. GetScalar then does additional work to wrap the value in a Result<std::shared_ptr<Scalar>>. The creation and return of a shared_ptr results in some performance overhead, which makes a difference for performance-sensitive applications.

If someone could use ChunkResolver to learn the indices, they could then instead access the data directly. This would provide a more efficient way of indexing into a ChunkedArray, should an application benefit from it.

Would it be possible to move ChunkResolver to public API?

Component(s)

C++

@westonpace
Member

If it's going to be a public API then I think it would be good to have some tests for it. Otherwise, this seems like a reasonable utility to make available.

@kou kou changed the title Request to move ChunkResolver to public API [C++] Request to move ChunkResolver to public API Mar 15, 2023
@SChakravorti21
Contributor

SChakravorti21 commented Feb 5, 2024

Hey @anjakefala and @westonpace! Is this something that would still be reasonable to work on? I'd like to work on this if so, and my plan would be:

  • Check unit test coverage and expand as necessary to handle edge cases
  • Move ChunkResolver out of the internal namespace
    • Refactor current usages of the class in Arrow C++ as necessary
  • Update the documentation
    • Update the API/reference docs as necessary
    • Add a section, perhaps in the User Guide or Examples, on how to use this class

@anjakefala
Collaborator Author

Oh thanks for letting me know @SChakravorti21!

It absolutely would be. @felipecrv has recently been putting love into the ChunkResolver documentation, so that step might actually be shorter: #39815.

Please give us a poke here when you have a PR open. =)

@felipecrv
Contributor

@SChakravorti21 Since my changes to the class interface, I've had an idea for something you could maybe add.

I added

  inline ChunkLocation ResolveWithChunkIndexHint(int64_t index,
                                                 int64_t cached_chunk_index) const {

I was planning to add a ResolveWithHint(int64_t index, ChunkLocation hint) -- simpler name and less error-prone. Its definition is:

  inline ChunkLocation ResolveWithHint(int64_t index, ChunkLocation hint) const {
    assert(hint.chunk_index < static_cast<int64_t>(offsets_.size()));
    const auto chunk_index =
        ResolveChunkIndex</*StoreCachedChunk=*/false>(index, hint.chunk_index);
    return {chunk_index, index - offsets_[chunk_index]};
  }

@SChakravorti21
Contributor

@felipecrv Sounds like a reasonable change, I agree that it looks easier to use!

Will let you folks know when I've opened a PR 👍🏼

@SChakravorti21
Contributor

Hey folks, just wanted to give a heads up that I've created a draft PR at #40226 for the work on this so far.

I have a question: I noticed that there is also a ChunkedArrayResolver class. This seems to build on top of ChunkResolver and provides the added benefit of returning the actual chunk Array instead of just its location. I feel like this is convenient for many use-cases and was wondering if we should try to fold ChunkedArrayResolver's behavior into ChunkResolver somehow. Do you have any opinions or concerns about this?

@felipecrv
Contributor

felipecrv commented Feb 26, 2024

...if we should try to fold ChunkedArrayResolver's behavior into ChunkResolver somehow. Do you have any opinions or concerns about this?

Hmm, it's better to keep ChunkResolver like this for now because it's meant as a low-level utility that you use in kernels. For instance, when sorting a chunked array, we don't want to keep copying that same set of shared_ptrs over and over.

That, and the fact that you would have to template the functions on all the types that are used to represent a sequence of array chunks.

@SChakravorti21
Contributor

For instance, when sorting a chunked array, we don't want to keep copying that same set of shared_ptrs over and over.

Sorry if I wasn't clear - the ChunkedArrayResolver returns raw pointers to Arrays, not shared pointers, so it avoids the atomic reference counting overhead. Do you feel like it would be acceptable to do it this way in ChunkResolver? I was thinking we could do something like this (pseudo-code):

/// BEGIN: Copied from `chunked_internal.h`

template <typename ArrayType>
struct ResolvedChunk {
  using ViewType = GetViewType<typename ArrayType::TypeClass>;
  using LogicalValueType = typename ViewType::T;

  const ArrayType* array;
  const int64_t index;

  ResolvedChunk(const ArrayType* array, int64_t index) : array(array), index(index) {}

  bool IsNull() const { return array->IsNull(index); }

  LogicalValueType Value() const { return ViewType::LogicalValue(array->GetView(index)); }
};

template <>
struct ResolvedChunk<Array> {
  const Array* array;
  const int64_t index;

  ResolvedChunk(const Array* array, int64_t index) : array(array), index(index) {}

  bool IsNull() const { return array->IsNull(index); }
};

/// END: Copied from `chunked_internal.h`

struct ChunkLocation {
  int64_t chunk_index;
  int64_t index_in_chunk;
};

template <typename ArrayType = Array>
class ChunkResolver {
 public:
  inline ResolvedChunk<ArrayType> Resolve(int64_t index) const;
  inline ChunkLocation            ResolveLocation(int64_t index) const;
  inline ChunkLocation            ResolveLocationWithHint(ChunkLocation hint) const;

 private:
  std::vector<int64_t> offsets_;
  mutable std::atomic<int64_t> cached_chunk_;

  // This contains `shared_ptr`s to simplify usage for some applications (to avoid
  // having to hold onto the original Arrays), but we would only return raw pointers
  // to avoid the reference counting overhead of copying `shared_ptr`s.
  std::vector<std::shared_ptr<Array>> chunks_;
};

I agree that maintaining the performance of ChunkResolver is absolutely critical, and I think this solution avoids adding any unnecessary overhead for people who choose to use ResolveLocation and ResolveLocationWithHint.

That and the fact that you would have to type the functions based on all the types that are used to represent an array of array chunks.

I'm sorry, I'm not quite sure I understood this part. Could you clarify?

@felipecrv
Contributor

felipecrv commented Feb 26, 2024

That, and the fact that you would have to template the functions on all the types that are used to represent a sequence of array chunks.

I'm sorry, I'm not quite sure I understood this part. Could you clarify?

The resolution depends on what chunked thing you're targeting:

  explicit ChunkResolver(const ArrayVector& chunks);
  explicit ChunkResolver(const std::vector<const Array*>& chunks);
  explicit ChunkResolver(const RecordBatchVector& batches);

This is the problem (code from your proposal):

template <typename ArrayType = Array>
class ChunkResolver {

Now it becomes impossible to have a non-inline function [1] in the ChunkResolver because the entire class is a template. That's why these abstractions are separate. Care was taken to allow ChunkResolver to work for any type by making it deal with chunks in terms of lengths and offsets.

[1] I have a plan to add batched/vectorized versions of ChunkResolver::Resolve with most of the code living in the chunk_resolver.cc

@SChakravorti21
Contributor

SChakravorti21 commented Feb 26, 2024

Edit: on second read, I think your concern might be more targeted towards compile times rather than performance, is that right? In that case, we can have the Resolve method be templated rather than the entire class:

class ChunkResolver {
 public:
  template <typename ArrayType = Array>
  inline ResolvedChunk<ArrayType> Resolve(int64_t index) const;
  inline ChunkLocation            ResolveLocation(int64_t index) const;
  inline ChunkLocation            ResolveLocationWithHint(ChunkLocation hint) const;

 private:
  std::vector<int64_t> offsets_;
  mutable std::atomic<int64_t> cached_chunk_;
  std::vector<std::shared_ptr<Array>> chunks_;
};

This way we can add other methods in the future with definitions in chunk_resolver.cc.

@felipecrv
Contributor

towards compile times

Not primarily; binary size is the main concern. As it is now, we don't have to inline the ChunkResolver constructors, for instance.

In that case, we can have the Resolve method be templated rather than the entire class

You can get crashes if the user passes the wrong type parameter to the method call. And in your solution, you're adding std::vector<std::shared_ptr<Array>> chunks_; to the list of members.

There is a way here based on composition that you will probably like. We can bring most of chunked_internal.h to here (or a different header that includes chunk_resolver.h) but improve on the design as it will now become public.

I recommend doing that in a separate PR, as chunked_internal.h is much more dependent on other internal stuff.

template <typename ArrayType = std::shared_ptr<Array>>
class ArrayChunkResolver {
 public:
  using ArrayPtr = /* template thingie that gives you the raw-est pointer type from `ArrayType` */;

 private:
  ChunkResolver resolver_;  // owns a vector of chunk offsets and assumes chunk sizes are immutable
  std::vector<ArrayType>& chunks_;  // the resolver doesn't own the vector of chunks

 public:
  // ctor, move-ctor, and move-assign.

  inline ArrayPtr Resolve(int64_t index) const;
  inline ChunkLocation ResolveLocation(int64_t index) const;  // delegate to resolver_
  inline ChunkLocation ResolveLocationWithHint(ChunkLocation hint) const;  // delegate to resolver_
};

@felipecrv
Contributor

@SChakravorti21 scratch that. I had a better idea (even simpler) and will be pushing a PR soon. I have the code ready on my machine.

@felipecrv
Contributor

First step: #40281

In a following step, I might add a template param to class ChunkedArrayResolver, but it would be to accept Array, ArrayData, RecordBatch... and not value-type specific arrays like Int32Array, BoolArray, StringArray... these were the types that were being passed to ChunkedArrayResolver::Resolve and shouldn't really be. That's what the PR linked above changes.

@SChakravorti21
Contributor

Hey @felipecrv! Sorry I have been unresponsive the last few days; I was busy with work and couldn't find much free time outside of it.

Everything you have said makes sense to me. I took a look at #40281 and the changes there look great to me.

We can bring most of chunked_internal.h to here (or a different header that includes chunk_resolver.h) but improve on the design as it will now become public.

So just to clarify, is the plan to move both ChunkResolver and ChunkedArrayResolver into the public namespace? I think this would make sense - people can use ChunkResolver if they want low-level access to chunk locations, or they can use ChunkedArrayResolver if they want easy random access to values by logical index.

The ChunkedArrayResolver changes look really nice for applications that need to apply some row-wise business logic to ChunkedArrays (which is the use-case I was targeting):

std::shared_ptr<arrow::Table> table;

arrow::ChunkedArrayResolver resolver_a(table->GetColumnByName("a"));
arrow::ChunkedArrayResolver resolver_b(table->GetColumnByName("b"));

for (std::int64_t i = 0; i < table->num_rows(); ++i)
{
  std::int64_t a = resolver_a.Resolve(i).Value<arrow::Int64Type>();
  std::string_view b = resolver_b.Resolve(i).Value<arrow::StringType>();
  do_business_logic(a, b);
}

@felipecrv
Contributor

felipecrv commented Mar 1, 2024

So just to clarify, is the plan to move both ChunkResolver and ChunkedArrayResolver into the public namespace?

Not my plan, but after you mentioned you wanted to unify ChunkResolver with ChunkedArrayResolver, I took a look at how it's designed.

If it were to go public, that would be done in another issue/PR pair, so for now it's better to focus just on making ChunkResolver public.

About the code:

for (int i = 0; i < table->num_rows(); ++i)
{
  std::int64_t a = resolver_a.Resolve(i).Value<arrow::Int64Type>();
  std::string_view b = resolver_b.Resolve(i).Value<arrow::StringType>();
  do_business_logic(a, b);
}

It's important to delay the re-construction of row-by-row values as much as possible to preserve the benefits of columnar layouts. So there is a danger [1] in exposing too many APIs that work value-by-value instead of array-by-array, even though array-by-array is not how most people are used to thinking about programming.

The loop above is performing sequential access, so using the chunk resolver is not the best solution regarding locality of the memory accesses [2]. It's better to keep one ChunkLocation per column and increment them all at once.

[1] Apache Arrow, as a columnar data library, is built to keep most computation on top of the columnar representation. You're of course allowed to do whatever you need to solve your application problem, but Arrow itself exposing row-by-row APIs would send the wrong message.
[2] ChunkResolver and ChunkedArrayResolver are used internally in Arrow mainly in functions that perform random access: Take, Sort, Rank...

@SChakravorti21
Contributor

If it were to go public, that would be done in another issue/PR pair, so for now it's better to focus just on making ChunkResolver public.

Gotcha. Yes, I agree, we can focus on just ChunkResolver for now!

[1] Apache Arrow, as a columnar data library, is built to keep most computation on top of the columnar representation. You're of course allowed to do whatever you need to solve your application problem, but Arrow itself exposing row-by-row APIs would send the wrong message.

I agree 100% that we don't want to send the wrong message, and that people should always try to frame their logic in terms of vectorized operations. That said, there are practical use-cases where there is no way of getting around row-major processing of the data.

I've been thinking about this and came up with an alternative way that may be better, and would be interested to hear your thoughts on it (pseudocode):

for (auto maybe_batch : arrow::TableBatchReader(*table))
{
  std::shared_ptr<arrow::RecordBatch> batch = maybe_batch.ValueOrDie();

  // User decides whether they want a safe or unsafe cast
  std::shared_ptr<arrow::Int64Array> a = std::dynamic_pointer_cast<arrow::Int64Array>(batch->GetColumnByName("a"));
  std::shared_ptr<arrow::StringArray> b = std::dynamic_pointer_cast<arrow::StringArray>(batch->GetColumnByName("b"));

  for (std::int64_t i = 0; i < batch->num_rows(); ++i) {
    do_business_logic(a->Value(i), b->GetView(i));
  }
}

This is still sequential but (I think) avoids a lot of unnecessary overhead. In that case, it might be good enough to make ChunkResolver/ChunkedArrayResolver public only for the sake of use-cases that require random access into ChunkedArrays. We can also document that these classes should never be used for pure sequential access, and something like above should be preferred instead. How does that sound?

@felipecrv
Contributor

felipecrv commented Mar 4, 2024

it might be good enough to make ChunkResolver/ChunkedArrayResolver public

There is a case for making ChunkResolver public already. Please create a separate issue to talk about ChunkedArrayResolver.

Without knowing what do_business_logic is, I can't say it's impossible to process column by column (and I don't mean with vector instructions, just memory locality effects alone being beneficial). To use columnar representation well, the business logic has to be aware of the columnar representation.

But my main problem with your code (and let me be more direct this time with what I mean by "random access") is that you're using Resolve/Value (which is O(log(num_chunks))) on every iteration when you could be incrementing each ChunkLocation in O(1) without having to rely on the caching in Resolve to make it "O(1) most of the time" + overhead.

To that end, we can add these two to ChunkResolver:

  /// \pre loc.chunk_index >= 0
  /// \pre loc.index_in_chunk is assumed valid if chunk_index is not the last one
  inline bool Valid(ChunkLocation loc) const {
    const int64_t last_chunk_index = static_cast<int64_t>(offsets_.size()) - 2;
    return loc.chunk_index < last_chunk_index ||
           (loc.chunk_index == last_chunk_index &&
            offsets_[last_chunk_index] + loc.index_in_chunk <
                offsets_[last_chunk_index + 1]);
  }

  /// \pre Valid(loc)
  inline ChunkLocation Next(ChunkLocation loc) const {
    const int64_t next_index_in_chunk = loc.index_in_chunk + 1;
    return (offsets_[loc.chunk_index] + next_index_in_chunk < offsets_[loc.chunk_index + 1])
               ? ChunkLocation{loc.chunk_index, next_index_in_chunk}
               : ChunkLocation{loc.chunk_index + 1, 0};
  }

Then your loops can be:

ChunkResolver resolver(batches);
for (ChunkLocation loc{}; resolver.Valid(loc); loc = resolver.Next(loc)) {
  // re-use loc for all the typed columns since they are split on the same offsets
}

@SChakravorti21
Contributor

There is a case for making ChunkResolver public already. Please create a separate issue to talk about ChunkedArrayResolver.

For sure, I can make a separate issue for ChunkedArrayResolver. My bad for mixing these two discussions.

But my main problem with your code (and let me be more direct this time with what I mean by "random access") is that you're using Resolve/Value (which is O(log(num_chunks))) on every iteration when you could be incrementing each ChunkLocation in O(1) without having to rely on the caching in Resolve to make it "O(1) most of the time" + overhead.

That makes sense, I didn't fully understand what you meant previously.

I think the API additions you're suggesting make sense, but I'm confused how someone would use them to iterate over multiple columns simultaneously. Is there such a thing as a "typed ChunkedArray"? Otherwise how would we expect someone to access the values inside this loop:

ChunkResolver resolver(batches);

for (ChunkLocation loc{}; resolver.Valid(loc); loc = resolver.Next(loc)) {
  // what is the most efficient way to access the values for each column here?
}

The benefit of just iterating over the batches themselves is that we only perform the cast from untyped Array to typed array (Int64Array, StringArray, etc.) once per column per batch. This is cheaper if someone prefers to use dynamic_cast for safety vs. doing a cast once per datapoint.

@felipecrv
Contributor

for (ChunkLocation loc{}; resolver.Valid(loc); loc = resolver.Next(loc)) {
    // what is the most efficient way to access the values for each column here?

    // I'm not sure this is the most efficient, but it's certainly better than
    // using ChunkedArrayResolver for every array.
    // Don't forget the null checks (missing here).
    // `a_chunks`/`b_chunks` are the chunk vectors of the two columns.
    int64_t a = checked_cast<const Int64Array&>(*a_chunks[loc.chunk_index])
                    .Value(loc.index_in_chunk);
    std::string_view b = checked_cast<const StringArray&>(*b_chunks[loc.chunk_index])
                             .GetView(loc.index_in_chunk);
    process_row_oh_no(a, b);
}

I've been thinking about this and came up with an alternative way that may be better, and would be interested to hear your thoughts on it (pseudocode):

Your approach works even better and doesn't need the ChunkResolver. What I've been trying to say is that you should never need the binary-search that ChunkResolver (and ChunkedArrayResolver) provides if you're looping SEQUENTIALLY over chunked arrays.

We can also document that these classes should never be used for pure sequential access, and something like above should be preferred instead. How does that sound?

Well... documenting won't deter people from using it. See how much we had to discuss here before this nuance could be understood. The better plan is to implement the things that need random access inside Arrow (sort, ranking, take, filter, joins...) and have users use those instead of reaching for random access directly.

felipecrv added a commit that referenced this issue May 15, 2024
…ts (#41561)

### Rationale for this change

I want `ResolveMany` to support me in the implementation of `Take` that doesn't `Concatenate` all the chunks from a `ChunkedArray` `values` parameter.

### What changes are included in this PR?

 - Implementation of `ChunkResolver::ResolveMany()`
 - Addition of missing unit tests for `ChunkResolver`

### Are these changes tested?

Yes. By new unit tests.

### Are there any user-facing changes?

No. `ChunkResolver` is an internal API at the moment (see #34535 for future plans).
* GitHub Issue: #41560

Authored-by: Felipe Oliveira Carvalho <[email protected]>
Signed-off-by: Felipe Oliveira Carvalho <[email protected]>
vibhatha pushed a commit to vibhatha/arrow that referenced this issue May 25, 2024
…it tests (apache#41561)

anjakefala added a commit to anjakefala/arrow that referenced this issue Oct 9, 2024
anjakefala added a commit to anjakefala/arrow that referenced this issue Oct 9, 2024
felipecrv pushed a commit that referenced this issue Oct 21, 2024
### Rationale for this change

Adopting #40226. 

The creation and return of a shared_ptr does result in some performance overhead, which makes a difference for a performance-sensitive application.

If someone could use ChunkResolver to learn the indices, they could then instead access the data directly. 

### What changes are included in this PR?

- [X] Updates to documentation (thanks to @SChakravorti21)
- [X] Moving `ChunkResolver` to public API, and updating all references to it in the code

### Are these changes tested?

There seemed to be comprehensive tests already: https://github.com/apache/arrow/blob/main/cpp/src/arrow/chunked_array_test.cc#L324. If an edge case is missing, I'd be happy to add it.

### Are there any user-facing changes?

`ChunkResolver` and `TypedChunkLocation` are now in the public API.
* GitHub Issue: #34535

Lead-authored-by: Anja Kefala <[email protected]>
Co-authored-by: anjakefala <[email protected]>
Co-authored-by: Bryce Mecum <[email protected]>
Co-authored-by: SChakravorti21 <[email protected]>
Co-authored-by: SChakravorti21 <[email protected]>
Co-authored-by: Jacob Wujciak-Jens <[email protected]>
Signed-off-by: Felipe Oliveira Carvalho <[email protected]>
@felipecrv felipecrv added this to the 19.0.0 milestone Oct 21, 2024
@felipecrv
Contributor

Issue resolved by pull request #44357

@anjakefala anjakefala modified the milestones: 19.0.0, 18.0.0 Oct 22, 2024
@raulcd raulcd modified the milestones: 18.0.0, 19.0.0 Oct 22, 2024
@raulcd
Member

raulcd commented Oct 22, 2024

Sorry, moving this to 19.0.0 because we might not do more RCs for 18.0.0. I've added the backport-candidate label in case we do a new RC, but it doesn't seem we will.
