GH-26818: [C++][Python] Preserve order when writing dataset multi-threaded #44470

EnricoMi · 2024-10-18T11:07:54Z

Rationale for this change

The order of rows in a dataset might be important for users and should be preserved when writing to a filesystem. With multi-threaded writes, the order is currently not guaranteed.

What changes are included in this PR?

Preserving the row order of a dataset requires two things: the SourceNode must use ImplicitOrdering, which gives each exec batch an index, and the ConsumingSinkNode must sequence exec batches, that is, consume them in the order of their index.
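To illustrate the sequencing idea, here is a minimal pure-Python sketch, not the Arrow implementation. The names `Sequencer` and `write_in_order` are hypothetical and exist only for this example. It assumes each batch carries a unique, gap-free index starting at 0, and that the sink is fed from a single thread.

```python
import heapq
import random


class Sequencer:
    """Hypothetical helper: releases batches strictly in index order.

    Batches that arrive early are buffered until every batch with a
    smaller index has been released.
    """

    def __init__(self):
        self._next_index = 0  # index of the next batch allowed through
        self._pending = []    # min-heap of (index, batch)

    def push(self, index, batch):
        """Accept one batch and return the batches that are now in order."""
        heapq.heappush(self._pending, (index, batch))
        ready = []
        while self._pending and self._pending[0][0] == self._next_index:
            ready.append(heapq.heappop(self._pending)[1])
            self._next_index += 1
        return ready


def write_in_order(indexed_batches):
    """Feed (index, batch) pairs in arrival order; return the written order."""
    sequencer = Sequencer()
    written = []
    for index, batch in indexed_batches:
        written.extend(sequencer.push(index, batch))
    return written


# Simulate batches arriving out of order, as they might from worker threads.
batches = [(i, f"batch-{i}") for i in range(6)]
arrival = batches[:]
random.Random(0).shuffle(arrival)
result = write_in_order(arrival)
```

The trade-off in this sketch is memory: a batch that arrives early is held until all of its predecessors have arrived.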

User-facing changes:

Add option preserve_order to FileSystemDatasetWriteOptions

Dev-facing changes:

Add option ordering to SourceNodeOptions
Add option implicit_ordering to ScanNodeOptions

The default behaviour is unchanged.

Are these changes tested?

Unit tests have been added.

Are there any user-facing changes?

Users can set FileSystemDatasetWriteOptions.preserve_order = true (C++) or call pyarrow.dataset.write_dataset(..., preserve_order=True) (Python).

GitHub Issue: [C++][Dataset] Preserve order when writing dataset #26818


github-actions · 2024-10-18T11:08:27Z

⚠️ GitHub issue #26818 has been automatically assigned in GitHub to PR creator.

EnricoMi added 2 commits October 16, 2024 08:35

wip

55375a1

Make scan ordering implicit and consumer sink sequence batches to pre…

928c2aa

…serve order

EnricoMi requested a review from westonpace as a code owner October 18, 2024 11:07

github-actions bot added the Component: C++, Component: Python, and awaiting review labels Oct 18, 2024

EnricoMi mentioned this pull request Oct 18, 2024

[C++][Dataset] Preserve order when writing dataset #26818

Open

EnricoMi added 2 commits October 18, 2024 13:14

Fix submodule HEAD

22faea4

Fix linting, improved comments

df98801

EnricoMi force-pushed the preserve-order-2 branch from 36243c3 to 9ca8c76 Compare October 18, 2024 16:58

Make TestFileSystemDataset test produce out of order data

28cb588

EnricoMi force-pushed the preserve-order-2 branch from 9ca8c76 to 28cb588 Compare October 18, 2024 17:02
