Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Proof-of-concept Parquet GEOMETRY logical type implementation #43196

Closed
wants to merge 29 commits into from

Conversation

paleolimbot
Copy link
Member

Rationale for this change

In apache/parquet-format#240 a GEOMETRY logical type for Parquet is proposed with a proof-of-concept Java implementation ( apache/parquet-java#1379 ). This is a PR to explore what an implementation would look like in C++.

What changes are included in this PR?

Work in progress!

Are these changes tested?

They will be!

Are there any user-facing changes?

Yes! (And will eventually be documented)

@paleolimbot paleolimbot force-pushed the parquet-geometry branch 3 times, most recently from aa5424d to b34b1c2 Compare August 12, 2024 19:06
cpp/src/parquet/types.h Show resolved Hide resolved
cpp/src/parquet/types.cc Outdated Show resolved Hide resolved
cpp/src/parquet/types.cc Show resolved Hide resolved
@@ -24,6 +24,7 @@
#include <string>
#include <utility>

#include "parquet/geometry_util.h"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use forward declaration? This is a public header.

}
};

class WKBBuffer {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we move them into arrow/util instead of parquet only?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After thinking twice, I think we can keep this in the parquet module. But it should be renamed to geometry_util_internal.h to avoid making it public.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Aug 17, 2024
@jiayuasu
Copy link
Member

@paleolimbot Hey Dewey, I know you have been busy at other projects. We really want to speed up the process of the Parquet Geometry PR since the proposal and Java POC is already done. Do you have an estimated timeline for the C++ POC implementation?

If you have too much on your plate, we will be happy to take over or help with the C++ POC.

@paleolimbot
Copy link
Member Author

Sorry for the slow reply, I've been out of office! You are of course welcome to continue on top of this PR...I'm back in office Thursday and can definitely review. I think the main remaining portion is the Arrow IO and serde of the statistics.

@jiayuasu
Copy link
Member

jiayuasu commented Sep 3, 2024

@paleolimbot Thanks for the message. We will soon create a PR against your branch!

cpp/src/parquet/types.h Show resolved Hide resolved
crs = type.GEOMETRY.crs;
}

LogicalType::GeometryEdges::edges edges = LogicalType::GeometryEdges::UNKNOWN;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we throw when edges is UNKNOWN? It is a required field by design.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I was emulating how enum parameters are handled in the date/time types here (but I may not have been!)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed in fb134d3

}

LogicalType::GeometryEncoding::geometry_encoding encoding =
LogicalType::GeometryEncoding::UNKNOWN;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto, this is a required field by design.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed in fb134d3

@@ -135,6 +186,7 @@ class PARQUET_EXPORT EncodedStatistics {
bool has_max = false;
bool has_null_count = false;
bool has_distinct_count = false;
bool has_geometry_statistics = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know other fields already have has_xxx flags before C++11. Should we use std::optional<EncodedGeometryStatistics> to make it simpler?

Copy link
Member Author

@paleolimbot paleolimbot Sep 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My sense would be to handle changing that pattern with a dedicated PR but I have no opinions either way 🙂

return std::static_pointer_cast<TypedStatistics<DType>>(Statistics::Make(
descr, encoded_min, encoded_max, num_values, null_count, distinct_count,
has_min_max, has_null_count, has_distinct_count, pool));
int64_t distinct_count, const EncodedGeometryStatistics& geometry_statistics,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately this will break downstream users as it is a public API.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would adding the two new arguments to the end of the argument list (with default values) make this a non-breaking change?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to make it ABI-compatible or source-compatible? If ABI compatibility is required we must add a new override version of this function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think API-compatible is sufficient. Please correct me if I'm wrong @pitrou

bounder_.ReadGeometryTypes(other.bounder_.WkbTypes());
}

void Update(const ByteArray* values, int64_t num_values, int64_t null_count) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might need a UpdateSpaced variant which deals with inputs with validity bits, like what TypeStatistics::UpdateSpaced does.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added UpdateSpaced in fb134d3, will add tests for it later.

}
};

class WKBBuffer {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After thinking twice, I think we can keep this in the parquet module. But it should be renamed to geometry_util_internal.h to avoid making it public.

class WKBGenericSequenceBounder {
public:
WKBGenericSequenceBounder()
: xy_(chunk_),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to keep 8 variants?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is me being fancy and trying to avoid branching 🙂 . A runtime switch is probably fine for a POC and it's possibly worth benchmarking before introducing this complexity.

@wgtmac
Copy link
Member

wgtmac commented Sep 5, 2024

Thanks @paleolimbot for initiating the PoC! I've left some inline comments on this PR.

Thanks @jiayuasu and @Kontinuation to continue the effort! Let's work together to make it to the finish line.

@paleolimbot
Copy link
Member Author

Closing in favour of #43977 !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants