
PARQUET-2430: Add parquet joiner v2 #1335

Merged: 67 commits into apache:master on Sep 19, 2024

Conversation

@MaxNevermind (Contributor) commented Apr 28, 2024

This is a simplified version of the originally proposed joiner functionality; see the description of the original idea and the simplified design below.

Original design

See related original PR: [WIP][Proposal] PARQUET-2430: Add parquet joiner

The ParquetJoiner feature is similar to the ParquetRewriter class. ParquetRewriter allows stitching files with the same schema into a single file, while ParquetJoiner should enable stitching files with different schemas into a single file. That is possible when: 1) the number of rows in the main and extra files is the same, and 2) the ordering of rows in the main and extra files is the same. The main benefit of ParquetJoiner is performance: for cases where you join/stitch terabytes or petabytes of data, this seemingly simple low-level API can be up to 10x more resource efficient.
Implementation details

ParquetJoiner lets you specify the main input Parquet file and extra input Parquet files. ParquetJoiner copies the main input as binary data and writes the extra input files with row groups adjusted to the main input. If the main input is much larger than the extra inputs, a lot of resources are saved by handling the main input as binary.
Use-case examples

A very large Parquet-based dataset (dozens or hundreds of fields, terabytes of data daily, petabytes of historical partitions). The task is to modify a column or add a new column to it for all the historical data. That is trivial using Spark, but taking into consideration the sheer scale of the dataset, it would take a lot of resources.
Side notes

Note that this class of problems could in theory be solved by storing the main input and extra inputs in HMS/Iceberg bucketed tables and using a view that joins those tables on the fly into the final version, but in practice there is often a requirement to merge the Parquet files and have a single Parquet source in the file system.
Use-case implementation details using Apache Spark

You can use Apache Spark to perform the join with ParquetJoiner. Read the large main input and prepare the right side of the join so that each file on the left has a corresponding file on the right and the right side preserves the record ordering of the left side; the left and right inputs then have the same number of files, the same number of records in corresponding files, and the same ordering of records in each file pair. Then run ParquetJoiner in parallel for each file pair to perform the join. Example code that uses this new feature: https://gist.github.com/MaxNevermind/0feaaf380520ca34c2637027ef349a7d.
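For illustration only (the gist above is the authoritative example), a hypothetical plain-Java driver that fans out one join per file pair; joinFilePair is a placeholder for the RewriteOptions/ParquetRewriter invocation sketched under the simplified design further down:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch: run one join per file pair in parallel.
// Assumes leftFiles.get(i) and rightFiles.get(i) hold the same number of
// records in the same order, as ParquetJoiner requires.
public class ParallelJoinDriver {

  static void joinAll(List<Path> leftFiles, List<Path> rightFiles, Path outDir)
      throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(8);
    for (int i = 0; i < leftFiles.size(); i++) {
      Path left = leftFiles.get(i);   // main input, mostly copied as binary
      Path right = rightFiles.get(i); // extra columns, same row ordering
      Path out = new Path(outDir, left.getName());
      pool.submit(() -> joinFilePair(left, right, out));
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
  }

  // Placeholder: see the RewriteOptions/ParquetRewriter sketch below.
  static void joinFilePair(Path left, Path right, Path out) {}
}
```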

A simplified design (this PR)

  • has only one list of inputFilesToJoin instead of a List<List<>> as in the original PR
  • inputFilesToJoin is expected to have the same row-group ordering as inputFiles; the number of files in inputFiles and inputFilesToJoin does not necessarily have to be the same, but the ordering of row groups and the row count of paired row groups must be the same
  • joinColumnsOverwrite is used if inputFilesToJoin is expected to overwrite columns in inputFiles
  • all the capabilities available for inputFiles, such as pruning, nullification, and binary copy, should now be available for inputFilesToJoin too (see the usage sketch after this list)
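For illustration, a minimal usage sketch under those constraints. The builder calls for inputFilesToJoin and joinColumnsOverwrite mirror the option names in this PR description; the exact method names and signatures are assumptions and should be checked against the merged RewriteOptions:

```java
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.rewrite.ParquetRewriter;
import org.apache.parquet.hadoop.rewrite.RewriteOptions;

public class JoinSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    RewriteOptions options = new RewriteOptions.Builder(
            conf,
            Arrays.asList(new Path("main-input.parquet")), // inputFiles: copied as binary where possible
            new Path("joined-output.parquet"))
        // The two calls below use the option names from this PR's
        // description; they are assumptions, not verified signatures.
        .addInputFilesToJoin(Arrays.asList(new Path("extra-columns.parquet")))
        .joinColumnsOverwrite(true) // extra input wins on shared column names
        .build();

    ParquetRewriter rewriter = new ParquetRewriter(options);
    rewriter.processBlocks();
    rewriter.close();
  }
}
```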

Post-PR action points

maxim_konstantinov added 29 commits January 28, 2024 14:22
@MaxNevermind (Contributor, Author) commented Apr 28, 2024

@wgtmac @ConeyLiu

This PR is the outcome of the simplification I mentioned in a comment a couple of weeks ago: #1273 (comment)
I've limited the set of capabilities; see this PR's description.
I've tried different ideas, and they all ended up with too complex an implementation, so I decided to finalize at least something with as simple an implementation as possible.
The PR is not yet polished; I just wanted to do a quick overview of the new approach. If it looks good, I will polish it.

@MaxNevermind (Contributor, Author)

@wgtmac

I started to work on the tests, but I can't figure out the current approach to ParquetRewriter testing from the already existing tests. The full list of features I see:

  • data validity after merging
  • single / multiple files merging
  • column nullification
  • column pruning
  • column encryption
  • codec preservation
  • bloom filter preservation
  • page index verification
  • metadata (CREATED_BY_KEY) preservation

I'm used to an approach where features are unit tested independently and sequentially. But looking at the existing ParquetRewriter tests, I can see that some tests check multiple things in the same test, and I'm not able to figure out the system behind mixing features into a single test.

So how should I approach it: should I target covering all the features in one or two big tests, or in multiple tests while trying to cover each feature in at least one of them?

@wgtmac (Member) commented Aug 20, 2024

Good question. ParquetRewriter was created by the refactoring work that consolidated ColumnEncryptor, ColumnMasker, ColumnPruner and CompressionConverter. You can see the individual unit tests in ColumnEncryptorTest, ColumnMaskerTest, ColumnPrunerTest and CompressionConverterTest respectively. So the main goal of ParquetRewriterTest is to cover the combination of these features. I think we mainly need test cases covering the new join feature and proving that it does not break when other features are turned on.

@MaxNevermind (Contributor, Author) commented Aug 31, 2024

@wgtmac @ConeyLiu
Can you check out the test changes? I created a single big test for the new functionality. The documentation is still in progress.

Some clarifications.
I had to rewrite some existing validation methods to accommodate joined columns. In a couple of places I used anonymous nested functions and a local nested class to localize a method's tightly related logic into a single block. Let me know if that is too weird / too functional.

Some bugs.
I found what looks like a bug in the current version of ParquetRewriter. I will probably file an issue.
When you try to nullify and encrypt different columns, it fails. There is a related test, but it nullifies and encrypts the same column, which doesn't reproduce the bug. The bug can be reproduced by changing a single line in the testNullifyAndEncryptColumn() method: maskColumns.put("DocId", MaskMode.NULLIFY); to maskColumns.put("Links.Forward", MaskMode.NULLIFY);. The reason for the failure, as I understand it, is that during nullification we create a single-column schema, MessageType newSchema = newSchema(schema, descriptor), and later use our main writer's encryptor with that schema, but that encryptor expects our final target schema, not a single-column schema.
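For reference, the single-line change described above, as it would appear inside the existing test:

```java
// Inside ParquetRewriterTest.testNullifyAndEncryptColumn():
// existing line nullifies and encrypts the same column, which masks the bug
maskColumns.put("DocId", MaskMode.NULLIFY);

// changed line: nullify a column different from the encrypted one
// to reproduce the failure
maskColumns.put("Links.Forward", MaskMode.NULLIFY);
```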

@wgtmac (Member) left a comment

Sorry for the delay. The current test cases are already very complicated, so the refactoring work on the validation methods makes sense to me.

@wgtmac (Member) commented Sep 5, 2024

> @wgtmac @ConeyLiu Can you check out the test changes? I created a single big test for the new functionality. The documentation is still in progress.

It's reasonable on my side, as long as the new feature is covered.

> Some bugs. I found what looks like a bug in the current version of ParquetRewriter. I will probably file an issue. When you try to nullify and encrypt different columns, it fails. [...]

It would be great if you have time to create a PR to fix this. Thanks!

@MaxNevermind (Contributor, Author)

@wgtmac
I addressed the small issues you found and added / polished the documentation; check it out.

@wgtmac (Member) commented Sep 12, 2024

> @wgtmac I addressed the small issues you found and added / polished the documentation; check it out.

LGTM. Thanks!

@MaxNevermind (Contributor, Author)

@wgtmac
Is there anything else I'm expected to do in this PR?

@wgtmac (Member) commented Sep 18, 2024

I'm not sure if @ConeyLiu wants to take another look.

BTW, could you fix the PR title and description? It is no longer a WIP.

@ConeyLiu (Contributor)

+1, I have no further comments. Thanks for the great work!

@MaxNevermind MaxNevermind changed the title [WIP][Proposal] PARQUET-2430: Add parquet joiner v2 [Proposal] PARQUET-2430: Add parquet joiner v2 Sep 19, 2024
@MaxNevermind MaxNevermind changed the title [Proposal] PARQUET-2430: Add parquet joiner v2 PARQUET-2430: Add parquet joiner v2 Sep 19, 2024
@wgtmac wgtmac merged commit 08a4e7e into apache:master Sep 19, 2024
9 checks passed
@wgtmac (Member) commented Sep 19, 2024

Thanks @MaxNevermind and @ConeyLiu!

@wgtmac wgtmac added this to the 1.15.0 milestone Sep 30, 2024