GH-26818: [C++][Python] Preserve order when writing dataset multi-threaded #44470
+219
−14
Rationale for this change
The order of rows in a dataset might be important for users and should be preserved when writing to a filesystem. With multi-threaded writes, the order is currently not guaranteed.
What changes are included in this PR?
Preserving the dataset order of rows requires the `SourceNode` to use `ImplicitOrdering` (this gives exec batches an index), and the `ConsumingSinkNode` to sequence exec batches (preserve order of batches by their index).
User-facing changes:
- Add `preserve_order` to `FileSystemDatasetWriteOptions`
Dev-facing changes:
- Add `ordering` to `SourceNodeOptions`
- Add `implicit_ordering` to `ScanNodeOptions`
Default behaviour is current behaviour.
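The sequencing idea described above (each batch carries an index, and the sink emits batches in index order even when worker threads finish out of order) can be illustrated with a small standalone sketch. This is not Arrow's implementation: `SequencingSink`, `process`, and `write_in_order` are hypothetical names, and plain Python lists stand in for exec batches.

```python
import random
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class SequencingSink:
    """Hypothetical sink that emits batches strictly in index order."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = {}    # index -> batch that arrived ahead of its turn
        self._next_index = 0  # index of the next batch allowed to be written
        self.written = []     # stands in for the output file

    def consume(self, index, batch):
        with self._lock:
            self._pending[index] = batch
            # Flush every batch that is now contiguous with what was written.
            while self._next_index in self._pending:
                self.written.append(self._pending.pop(self._next_index))
                self._next_index += 1


def process(sink, index, batch):
    # Simulate uneven per-thread work so batches arrive out of order.
    time.sleep(random.uniform(0, 0.005))
    sink.consume(index, batch)


def write_in_order(batches, workers=4):
    sink = SequencingSink()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # enumerate() plays the role of the implicit ordering:
        # it tags each batch with its position in the source.
        for index, batch in enumerate(batches):
            pool.submit(process, sink, index, batch)
    # Leaving the with-block waits for all submitted tasks to finish.
    return sink.written
```

Calling `write_in_order(batches)` returns the batches in their source order regardless of which worker finished first. The cost is that early batches are buffered until their predecessors arrive, which is one plausible reason to keep such behaviour opt-in.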
Are these changes tested?
Unit tests have been added.
Are there any user-facing changes?
Users can set `FileSystemDatasetWriteOptions.preserve_order = true` (C++) / `pyarrow.dataset.write_dataset(..., preserve_order=True)` (Python).