
[Docs][HTTP] Clarify when to use batch-at-a-time vs. one-shot approach for receiving data #40613

Open
Tracked by #40465
ianmcook opened this issue Mar 17, 2024 · 1 comment
ianmcook commented Mar 17, 2024

Describe the usage question you have. Please include as many useful details as possible.

Among the simple HTTP GET client examples in arrow-experiments/http/get_simple:

  • Some iterate over the record batches as they stream in from the server (i.e. "streaming" approach).
  • Some just make a single function call that collects the full data (i.e. "one-shot" approach).

For example:

  • The Python client example shows how to iterate over the batches by calling reader.read_next_batch(), whereas it could simply have called reader.read_all().
  • The Ruby client example goes for the simpler all-at-once approach, whereas it could have used a batch-at-a-time approach like in this example.

For many use cases, it makes no difference which approach is used, and we should just prioritize whatever is syntactically simplest.

But for some use cases, the batch-at-a-time approach will be preferred or needed for specific reasons, such as:

  • The receiver wants to start processing batches before the final batch is received.
  • The receiver wants to stream the received data to a sink without accumulating it in memory.

We should clarify this in the Arrow-over-HTTP conventions doc, and wherever possible we should provide examples showing both approaches.

Component(s)

Documentation

@felipecrv (Contributor) commented

It depends on what the client is going to do with the data. I think showing that data can be loaded batch by batch is more interesting. It's a streaming protocol after all.

A client performing some kind of aggregation can go through the batches updating its state and finalizing when all batches are read (e.g. calculating some weighted average on the stream).

People naturally tend to write code that buffers everything, so having examples that show batch-by-batch processing is possible only helps.
