GH-36939: [C++][Parquet] Direct put of BooleanArray is incorrect when called several times #36972
This line also seems problematic if this method is called multiple times in a row with boolean arrays (not sure whether the actual code does this, though).
#36972 (comment)
You're right. This would not produce a bug as long as PutArrow is not mixed with PutImpl, but it will make Boolean values take up more space than expected.
You mentioned this below: using sink_.length() to store the actual values seems wrong.
Shouldn't this be n_valid?
@emkornfield Parquet encoding stores only the "valid" values. The invalid values are marked in the rep-levels and def-levels.
I was referring to the entire value passed into sink_.UnsafeAdvance.
This seems like it was incorrect even when all values are present, because we are advancing the byte buffer by the number of values (in this case, the number of bits) and not the number of bytes. So we would be over-advancing in both cases?
(i.e. we seem to be advancing by more bytes than are being reserved in both cases)
Yes, the logic here is confusing, but it is correct after my change. When k boolean values are input:
1. sink_.Reserve will only reserve enough bytes for k bits.
2. sink_.UnsafeAdvance will advance k bytes.
3. However, when used, sink_.length() will only be regarded as a number of bits.

So (2) has a bug, but it works here... I'd like to fix the bug first, and take time to optimize the code later.
In that case we need to be reserving k bytes and not k/8. I think this wasn't caught sooner because columnwriter appears to have a specialization for Boolean values that bypasses this method (e.g. I don't think anything but encoder tests will fail if you put throw exception(....) in this method in general).
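To make the accounting discussed in this thread concrete, here is a minimal sketch. It is not the Arrow code: ToySink, BytesForBits, and OverAdvance are hypothetical names, and ToySink only models the reserved-versus-advanced bookkeeping, not a real buffer. It compares reserving enough bytes for k bits with reserving k bytes, while advancing by k in both cases.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a byte sink. It only tracks two counters and is
// NOT Arrow's builder class; it exists to illustrate the bookkeeping.
struct ToySink {
  int64_t capacity = 0;  // bytes that have been reserved so far
  int64_t length = 0;    // bytes the sink believes are in use

  // Ensure room for `additional_bytes` beyond the current length.
  void Reserve(int64_t additional_bytes) {
    if (length + additional_bytes > capacity) {
      capacity = length + additional_bytes;
    }
  }

  // Advance without any bounds check, as an "unsafe" advance would.
  void UnsafeAdvance(int64_t bytes) { length += bytes; }
};

// Bytes needed to hold `bits` bit-packed boolean values.
inline int64_t BytesForBits(int64_t bits) { return (bits + 7) / 8; }

// Simulates `num_puts` calls that each add k boolean values.
// reserve_k_bytes == false: reserve only enough bytes for k bits.
// reserve_k_bytes == true:  reserve k bytes, matching the advance of k.
// Returns by how many bytes length ends up beyond capacity (0 if none).
inline int64_t OverAdvance(int64_t k, int num_puts, bool reserve_k_bytes) {
  ToySink sink;
  for (int i = 0; i < num_puts; ++i) {
    sink.Reserve(reserve_k_bytes ? k : BytesForBits(k));
    sink.UnsafeAdvance(k);
  }
  return sink.length > sink.capacity ? sink.length - sink.capacity : 0;
}
```

In this toy model, a bit-sized reservation followed by an advance of k leaves the length past the reserved capacity, while reserving k bytes keeps the two in step across repeated calls. Whether the real encoder should reserve more or advance less is the design question the thread leaves open.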