Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-36939: [C++][Parquet] Direct put of BooleanArray is incorrect when called several times #36972

Merged
merged 13 commits into from
Aug 10, 2023
6 changes: 5 additions & 1 deletion cpp/src/parquet/encoding.cc
Original file line number Diff line number Diff line change
Expand Up @@ -340,6 +340,9 @@ class PlainEncoder<BooleanType> : public EncoderImpl, virtual public BooleanEnco
throw ParquetException("direct put to boolean from " + values.type()->ToString() +
" not supported");
}
// Cannot Put arrow array when PlainEncoder<BooleanType>::PutImpl has remaining writes
// in `bit_writer_`.
DCHECK_EQ(0, bit_writer_.bytes_written());
mapleFU marked this conversation as resolved.
Show resolved Hide resolved

const auto& data = checked_cast<const ::arrow::BooleanArray&>(values);
if (data.null_count() == 0) {
Expand All @@ -354,6 +357,7 @@ class PlainEncoder<BooleanType> : public EncoderImpl, virtual public BooleanEnco
sink_.length(), n_valid);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this line also seems problematic if this method is called multiple times with in a row with boolean arrays (not sure actual code does this though).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#36972 (comment)
You're right. This would not produce bug if PutArrow is not mixed with PutImpl, but will make Boolean leaves a larger space than expected

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mentioned this below, using sink_.length() to store the actual values seems wrong.


for (int64_t i = 0; i < data.length(); i++) {
// Only valid boolean data will call `writer.Next`.
if (data.IsValid(i)) {
if (data.Value(i)) {
writer.Set();
Expand All @@ -365,7 +369,7 @@ class PlainEncoder<BooleanType> : public EncoderImpl, virtual public BooleanEnco
}
writer.Finish();
}
sink_.UnsafeAdvance(data.length());
sink_.UnsafeAdvance(data.length() - data.null_count());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this be n_valid?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emkornfield Parquet encoding stores "valid" value. The invalid value will be marked in rep-levels and def-levels

Copy link
Contributor

@emkornfield emkornfield Aug 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was referring to the entire value passed into sink_.UnsafeAdvance

This seems like it was incorrect even for all present values because we are advancing the Byte buffer by number of values (in this case these would be number of bits) and not number bytes. So we would be overadvancing in both cases?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(i.e. we seem to be advancing by more bytes then are being reserved in both cases)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the logic here is confusing, but it's right after my change. When input k boolean values.

  1. sink_.Reserve will only reserve bytes for bits (k).
  2. sink_.UnsafeAdvance will advance k bytes.

However, when used, sink_.length() will only be regarded as bits. So (2) has a bug, but it works here...

I'd like to fix the bug first, and take time to optimize the code later.

Copy link
Contributor

@emkornfield emkornfield Aug 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case we need to be reserving k bytes and not k/8. I think this wasn't caught sooner because columnwriter appears to have a specialization for Boolean values that bypasses this method (e.g. I don't think anything but encoder tests will fail if you put throw exception(....) in this method in general.

}

private:
Expand Down
34 changes: 26 additions & 8 deletions cpp/src/parquet/encoding_test.cc
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@
#include "arrow/array.h"
#include "arrow/array/builder_binary.h"
#include "arrow/array/builder_dict.h"
#include "arrow/array/concatenate.h"
#include "arrow/compute/cast.h"
#include "arrow/testing/gtest_util.h"
#include "arrow/testing/random.h"
Expand Down Expand Up @@ -580,7 +581,7 @@ TEST(PlainEncodingAdHoc, ArrowBinaryDirectPut) {
decoder->SetData(num_values, buf->data(), static_cast<int>(buf->size()));

typename EncodingTraits<ByteArrayType>::Accumulator acc;
acc.builder.reset(new ::arrow::StringBuilder);
acc.builder = std::make_unique<::arrow::StringBuilder>();
ASSERT_EQ(num_values,
decoder->DecodeArrow(static_cast<int>(values->length()),
static_cast<int>(values->null_count()),
Expand Down Expand Up @@ -641,27 +642,38 @@ class EncodingAdHocTyped : public ::testing::Test {

static std::shared_ptr<::arrow::DataType> arrow_type();

void Plain(int seed) {
auto values = GetValues(seed);
void Plain(int seed, int round = 1) {
pitrou marked this conversation as resolved.
Show resolved Hide resolved
auto random_array = GetValues(seed);
auto encoder = MakeTypedEncoder<ParquetType>(
Encoding::PLAIN, /*use_dictionary=*/false, column_descr());
auto decoder = MakeTypedDecoder<ParquetType>(Encoding::PLAIN, column_descr());

ASSERT_NO_THROW(encoder->Put(*values));
for (int i = 0; i < round; ++i) {
ASSERT_NO_THROW(encoder->Put(*random_array));
}
std::shared_ptr<::arrow::Array> values;
if (round == 1) {
values = random_array;
} else {
::arrow::ArrayVector arrays;
arrays.resize(round, random_array);
pitrou marked this conversation as resolved.
Show resolved Hide resolved
EXPECT_OK_AND_ASSIGN(values,
::arrow::Concatenate(arrays, ::arrow::default_memory_pool()));
}
auto buf = encoder->FlushValues();

int num_values = static_cast<int>(values->length() - values->null_count());
decoder->SetData(num_values, buf->data(), static_cast<int>(buf->size()));
decoder->SetData(static_cast<int>(values->length()), buf->data(),
static_cast<int>(buf->size()));

BuilderType acc(arrow_type(), ::arrow::default_memory_pool());
ASSERT_EQ(num_values,
ASSERT_EQ(static_cast<int>(values->length() - values->null_count()),
decoder->DecodeArrow(static_cast<int>(values->length()),
static_cast<int>(values->null_count()),
values->null_bitmap_data(), values->offset(), &acc));

std::shared_ptr<::arrow::Array> result;
ASSERT_OK(acc.Finish(&result));
ASSERT_EQ(50, result->length());
ASSERT_EQ(values->length(), result->length());
::arrow::AssertArraysEqual(*values, *result, /*verbose=*/true);
}

Expand Down Expand Up @@ -882,6 +894,12 @@ TYPED_TEST(EncodingAdHocTyped, PlainArrowDirectPut) {
}
}

TYPED_TEST(EncodingAdHocTyped, PlainArrowDirectPutMultiRound) {
for (auto seed : {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}) {
pitrou marked this conversation as resolved.
Show resolved Hide resolved
mapleFU marked this conversation as resolved.
Show resolved Hide resolved
this->Plain(seed, /*round=*/5);
}
}

TYPED_TEST(EncodingAdHocTyped, ByteStreamSplitArrowDirectPut) {
for (auto seed : {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}) {
this->ByteStreamSplit(seed);
Expand Down
Loading