PARQUET-2471: Add geometry logical type #240

Open: wants to merge 30 commits into base: master
Changes from 28 commits

Commits:
5c9e110
WIP: Add geometry logical type
wgtmac May 10, 2024
5ef28cd
address various comments
wgtmac May 25, 2024
ecd8cc2
add file level geo stats
wgtmac May 27, 2024
d81dacb
address feedback:
wgtmac May 31, 2024
80f4051
change naming and remove controversial items
wgtmac Jun 13, 2024
0db6d9f
address feedback
wgtmac Jun 16, 2024
e817af4
fix typo
wgtmac Jun 16, 2024
f78f7bd
use WKB type code
wgtmac Jun 19, 2024
1aaaca8
Update covering and geometry type protocol based on comments (#2)
zhangfengcdt Aug 7, 2024
ee5b2df
Add the new suggestion according to the meeting with Snowflake (#3)
jiayuasu Aug 15, 2024
19cc081
change metadata to string type and rewording WKB description
wgtmac Aug 20, 2024
16c5868
add example for crs
wgtmac Aug 21, 2024
56a65de
reword crs
wgtmac Aug 21, 2024
f28b282
clarify WKB
wgtmac Aug 22, 2024
5127702
clarify coverings
wgtmac Aug 24, 2024
298ab64
Update the suggestion for bbox stats (#4)
jiayuasu Sep 11, 2024
41c6394
Update src/main/thrift/parquet.thrift
wgtmac Sep 20, 2024
d86abe4
Update src/main/thrift/parquet.thrift
wgtmac Sep 20, 2024
c7a4f4c
Update src/main/thrift/parquet.thrift
wgtmac Sep 20, 2024
f20f685
Update src/main/thrift/parquet.thrift
wgtmac Sep 20, 2024
dbf9d54
address feedback about edges and wkb
wgtmac Sep 20, 2024
b4296aa
add geoparquet column metadata back
wgtmac Sep 27, 2024
9bcea6e
Update the spec according to the new feedback (#5)
jiayuasu Oct 4, 2024
99f0403
Update src/main/thrift/parquet.thrift
wgtmac Oct 12, 2024
dbb78cf
Update src/main/thrift/parquet.thrift
wgtmac Oct 12, 2024
25df0ff
add description to LogicalTypes.md
wgtmac Oct 13, 2024
d349727
add explanation for Z & M values
wgtmac Oct 13, 2024
9ea6559
move geo stats to ColumnMetaData
wgtmac Oct 16, 2024
011de45
Update src/main/thrift/parquet.thrift
wgtmac Oct 17, 2024
6425a3c
fix typo
wgtmac Oct 17, 2024
182 changes: 182 additions & 0 deletions LogicalTypes.md
@@ -767,6 +767,188 @@ optional group my_map (MAP_KEY_VALUE) {
}
```

## Geospatial Types

### GEOMETRY

`GEOMETRY` is used for geometry features from [OGC – Simple feature access][simple-feature-access].
See [Geospatial Notes](#geospatial-notes).

The type has three type parameters:
- `encoding`: A required enum value specifying the annotated physical type and
encoding for the `GEOMETRY` type. See [Geometry Encoding](#geometry-encoding).
- `edges`: A required enum value specifying how edges of elements of the
`GEOMETRY` type are interpreted, i.e. whether the interpolation between points
along an edge represents a straight Cartesian line or the shortest line on
the sphere. See [Edges](#edges).
- `crs`: An optional string value for CRS (coordinate reference system), which
is a mapping of how coordinates refer to precise locations on earth.
See [Coordinate Reference System](#coordinate-reference-system).

The sort order used for `GEOMETRY` is undefined. When writing data, no min/max
statistics should be saved for this type and if such non-compliant statistics
are found during reading, they must be ignored. Instead, [GeometryStatistics](#geometry-statistics)
is introduced for `GEOMETRY` type.

#### Geometry Encoding

Physical type and encoding for the `GEOMETRY` type. Supported values:
- `WKB`: `GEOMETRY` type with `WKB` encoding can only be used to annotate the
`BYTE_ARRAY` primitive type. See [WKB](#well-known-binary-wkb).

Note that geometry encoding is required for `GEOMETRY` type. In order to correctly
interpret geometry data, writer implementations SHOULD always set this field, and
reader implementations SHOULD fail for an unknown geometry encoding value.
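
As a rough illustration of the reader-side rule above, the sketch below rejects an unknown encoding value instead of guessing. The enum mirrors the `GeometryEncoding` Thrift enum in this change (`WKB = 0`); the function name `parse_geometry_encoding` is hypothetical and not part of any Parquet library.

```python
from enum import IntEnum


class GeometryEncoding(IntEnum):
    """Mirrors the GeometryEncoding Thrift enum proposed in this change."""
    WKB = 0


def parse_geometry_encoding(raw_value: int) -> GeometryEncoding:
    """Hypothetical reader helper: fail loudly on an unknown encoding value."""
    try:
        return GeometryEncoding(raw_value)
    except ValueError:
        raise ValueError(f"unknown geometry encoding value: {raw_value}") from None
```

Failing early keeps a reader from misinterpreting bytes written with an encoding that was added after the reader was built.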

##### Well-known binary (WKB)

Geometries are stored in their well-known binary (WKB) representation; see [Geospatial Notes](#geospatial-notes).

This type follows the same definitions as GeoParquet for [WKB][geoparquet-wkb]
and [coordinate axis order][coordinate-axis-order]:
- Geometries SHOULD be encoded as ISO WKB supporting XY, XYZ, XYM, XYZM. Supported
standard geometry types: Point, LineString, Polygon, MultiPoint, MultiLineString,
MultiPolygon, and GeometryCollection.
- Coordinate axis order is always (x, y) where x is easting or longitude, and
y is northing or latitude. This ordering explicitly overrides the axis order
as specified in the CRS following the [GeoPackage specification][geopackage-spec].

This is the preferred encoding for maximum portability.

[geoparquet-wkb]: https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L92
[coordinate-axis-order]: https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L155
[geopackage-spec]: https://www.geopackage.org/spec130/#gpb_spec
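
To make the byte layout concrete, here is a minimal sketch that encodes a single point as little-endian ISO WKB using only the Python standard library. The helper name `encode_point_wkb` is an assumption for illustration; a real writer would normally rely on a geometry library rather than hand-packing bytes.

```python
import struct


def encode_point_wkb(x, y, z=None, m=None):
    """Encode one point as little-endian ISO WKB (illustrative helper).

    Layout: 1 byte order flag, a uint32 geometry type code, then the
    coordinates as doubles in (x, y[, z][, m]) order.
    """
    type_code = 1  # Point
    coords = [x, y]
    if z is not None:
        type_code += 1000  # ISO WKB offset for a Z dimension
        coords.append(z)
    if m is not None:
        type_code += 2000  # ISO WKB offset for an M dimension
        coords.append(m)
    header = struct.pack("<BI", 1, type_code)  # 1 = little-endian byte order
    return header + struct.pack(f"<{len(coords)}d", *coords)
```

For example, `encode_point_wkb(1.0, 2.0)` yields 21 bytes: a 5-byte header followed by two 8-byte doubles, with x (easting or longitude) first.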

#### Edges

Interpretation for edges of elements of `GEOMETRY` type. In other words, it
specifies how a point between two vertices should be interpolated in its XY
dimensions. Supported values and corresponding interpolation approaches are:
- `PLANAR`: a Cartesian line connecting the two vertices.
- `SPHERICAL`: a shortest spherical arc between the longitude and latitude
represented by the two vertices.

This value applies to all non-point geometry objects and is independent of the
[Coordinate Reference System](#coordinate-reference-system).

Because most systems currently assume planar edges and do not support spherical
edges, `PLANAR` should be used as the default value.

Note that edges is required for `GEOMETRY` type. In order to correctly
interpret geometry data, writer implementations SHOULD always set this field,
and reader implementations SHOULD fail for an unknown edges value.
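
The difference between the two interpolation approaches can be sketched by computing the midpoint of one edge both ways. This is only an illustration under simplifying assumptions: coordinates are (longitude, latitude) in degrees on a unit sphere, and both function names are hypothetical.

```python
import math


def planar_midpoint(p, q):
    """Midpoint of a Cartesian line between two (x, y) vertices."""
    return ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)


def _to_unit_vector(lon_deg, lat_deg):
    lon, lat = math.radians(lon_deg), math.radians(lat_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))


def spherical_midpoint(p, q):
    """Midpoint of the shortest spherical arc between two (lon, lat) vertices."""
    a, b = _to_unit_vector(*p), _to_unit_vector(*q)
    sx, sy, sz = (u + v for u, v in zip(a, b))
    norm = math.sqrt(sx * sx + sy * sy + sz * sz)
    if norm < 1e-12:
        raise ValueError("antipodal vertices have no unique shortest arc")
    z = max(-1.0, min(1.0, sz / norm))  # clamp against rounding error
    return (math.degrees(math.atan2(sy, sx)), math.degrees(math.asin(z)))
```

For an edge from (0, 60) to (90, 60), the planar midpoint stays at latitude 60, while the spherical midpoint bulges toward the pole to a latitude between 67 and 68 degrees. This is why a reader cannot safely treat `SPHERICAL` edges as `PLANAR`.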

#### Coordinate Reference System

A CRS (coordinate reference system) is a mapping of how coordinates refer to
precise locations on earth. A CRS is specified by a key-value entry in the
`key_value_metadata` field of `FileMetaData` whose key is a short name of
the CRS and whose value is the CRS representation. An additional entry in the
`key_value_metadata` field, whose key is the same short name with the suffix
".type" appended, is required to describe the encoding of this CRS representation.

For example, if a geometry column (e.g., "geom1") uses the CRS "OGC:CRS84", the
writer may write two entries to `key_value_metadata` field of `FileMetaData` as
below, and set the `crs` field of the `GEOMETRY` type to "geom1_crs":
```
"geom1_crs": a UTF-8 encoded PROJJSON representation of OGC:CRS84
"geom1_crs.type": "PROJJSON"
```

The PROJJSON representation of OGC:CRS84 can be seen at [OGC:CRS84][ogc-crs84].
Multiple geometry columns can refer to the same CRS metadata field
(e.g., "geom1_crs") if they share the same CRS.

[ogc-crs84]: https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#ogccrs84-details
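
The key-value convention described above might be handled along the following lines. This sketch models `key_value_metadata` as a plain Python dict and uses a placeholder in place of a real PROJJSON document; both helper names are hypothetical.

```python
import json


def crs_metadata_entries(crs_key, projjson):
    """Build the two key_value_metadata entries for one CRS (illustrative)."""
    return {
        crs_key: json.dumps(projjson),   # the CRS representation
        f"{crs_key}.type": "PROJJSON",   # the encoding of that representation
    }


def resolve_crs(crs_field, key_value_metadata):
    """Look up the CRS referenced by the `crs` field of a GEOMETRY column.

    Returns None when the optional `crs` field is unset, otherwise a
    (crs_type, crs_text) pair. Raises KeyError if an entry is missing.
    """
    if crs_field is None:
        return None
    crs_text = key_value_metadata[crs_field]
    crs_type = key_value_metadata[f"{crs_field}.type"]
    return crs_type, crs_text


# Placeholder only: NOT a complete PROJJSON document for OGC:CRS84.
placeholder_projjson = {"name": "placeholder for the OGC:CRS84 PROJJSON"}
metadata = crs_metadata_entries("geom1_crs", placeholder_projjson)
```

Because columns refer to the CRS by key, two geometry columns that share a CRS can both set `crs` to "geom1_crs" without duplicating the representation.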

#### Geometry Statistics

`GeometryStatistics` is a struct to store geometry statistics of a column chunk
of `GEOMETRY` type. It is an optional field of `ColumnMetaData` and contains
[Bounding Box](#bounding-box) and [Geometry Types](#geometry-types).

##### Bounding Box

A geometry has at least two coordinate dimensions: X and Y for 2D coordinates
of each point. A geometry can optionally have Z and / or M values associated
with each point in the geometry.

The Z values introduce the third dimension coordinate. Usually they are used
to indicate the height, or elevation.

M values are an opportunity for a geometry to express a fourth dimension as
a coordinate value. These values can be used as a linear reference value
(e.g., highway milepost value), a timestamp, or some other value as defined
by the CRS.

The bounding box is defined by the Thrift struct below as a min/max pair of
coordinate values for each axis. Note that X and Y values are always present;
Z and M are omitted for 2D geometries.

```thrift
struct BoundingBox {
/** Min X value when edges = PLANAR, westmost value if edges = SPHERICAL */
1: required double xmin;
/** Max X value when edges = PLANAR, eastmost value if edges = SPHERICAL */
2: required double xmax;
/** Min Y value when edges = PLANAR, southmost value if edges = SPHERICAL */
3: required double ymin;
/** Max Y value when edges = PLANAR, northmost value if edges = SPHERICAL */
4: required double ymax;
/** Min Z value if the axis exists */
5: optional double zmin;
/** Max Z value if the axis exists */
6: optional double zmax;
/** Min M value if the axis exists */
7: optional double mmin;
/** Max M value if the axis exists */
8: optional double mmax;
}
```

The meaning of each value depends on the `Edges` attribute of the `GEOMETRY` type:
- If Edges is `PLANAR`, the values are literally the actual min/max value from each axis.
- If Edges is `SPHERICAL`, the values for X and Y are `[westmost, eastmost, southmost, northmost]`,
with necessary min/max values for Z and M if needed.
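
For the `PLANAR` case, the statistics can be sketched as a plain per-axis min/max over all coordinates in a column chunk. The dataclass below mirrors the field names of the `BoundingBox` struct; the function `planar_bounding_box` and its input shape (tuples of `(x, y, z, m)` with `None` for absent axes) are assumptions for illustration. The `SPHERICAL` case is not covered here.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple

Coordinate = Tuple[float, float, Optional[float], Optional[float]]


@dataclass
class BoundingBox:
    """Mirrors the fields of the BoundingBox Thrift struct."""
    xmin: float
    xmax: float
    ymin: float
    ymax: float
    zmin: Optional[float] = None
    zmax: Optional[float] = None
    mmin: Optional[float] = None
    mmax: Optional[float] = None


def planar_bounding_box(points: Iterable[Coordinate]) -> BoundingBox:
    """Per-axis min/max for PLANAR edges; Z and M stay unset when absent."""
    xs, ys, zs, ms = [], [], [], []
    for x, y, z, m in points:
        xs.append(x)
        ys.append(y)
        if z is not None:
            zs.append(z)
        if m is not None:
            ms.append(m)
    if not xs:
        raise ValueError("cannot compute a bounding box without coordinates")
    return BoundingBox(
        min(xs), max(xs), min(ys), max(ys),
        min(zs) if zs else None, max(zs) if zs else None,
        min(ms) if ms else None, max(ms) if ms else None,
    )
```

A reader could use such a box to skip a column chunk whose extent does not intersect a query window.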

##### Geometry Types

A list of geometry types from all geometries in the `GEOMETRY` column, or an
empty list if they are not known.

This is borrowed from [geometry_types of GeoParquet][geometry-types],
except that the values in the list are [WKB (ISO-variant) integer codes][wkb-integer-code].
The table below shows the most common geometry types and their codes:

| Type | XY | XYZ | XYM | XYZM |
| :----------------- | :--- | :--- | :--- | :--- |
| Point | 0001 | 1001 | 2001 | 3001 |
| LineString | 0002 | 1002 | 2002 | 3002 |
| Polygon | 0003 | 1003 | 2003 | 3003 |
| MultiPoint | 0004 | 1004 | 2004 | 3004 |
| MultiLineString | 0005 | 1005 | 2005 | 3005 |
| MultiPolygon | 0006 | 1006 | 2006 | 3006 |
| GeometryCollection | 0007 | 1007 | 2007 | 3007 |

In addition, the following rules are applied:
- A list of multiple values indicates that multiple geometry types are present (e.g. `[0003, 0006]`).
- An empty array explicitly signals that the geometry types are not known.
- The geometry types in the list must be unique (e.g. `[0001, 0001]` is not valid).

[geometry-types]: https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L159
[wkb-integer-code]: https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary
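
The codes in the table follow a simple pattern: the thousands digit selects the dimension and the remainder selects the geometry type. The sketch below decodes a code and checks the uniqueness rule; the function names are hypothetical, and codes are written as plain integers (for example `3` and `3006`) because leading zeros are not valid integer literals in Python.

```python
GEOMETRY_NAMES = {
    1: "Point", 2: "LineString", 3: "Polygon", 4: "MultiPoint",
    5: "MultiLineString", 6: "MultiPolygon", 7: "GeometryCollection",
}
DIMENSIONS = {0: "XY", 1: "XYZ", 2: "XYM", 3: "XYZM"}


def describe_geometry_type(code):
    """Split an ISO WKB integer code into (geometry name, dimension)."""
    dimension, base = divmod(code, 1000)
    if dimension not in DIMENSIONS or base not in GEOMETRY_NAMES:
        raise ValueError(f"unsupported geometry type code: {code}")
    return GEOMETRY_NAMES[base], DIMENSIONS[dimension]


def validate_geometry_types(codes):
    """Check the uniqueness rule and decode every code in the list.

    An empty list is valid and means the geometry types are not known.
    """
    if len(set(codes)) != len(codes):
        raise ValueError("geometry type codes must be unique")
    return [describe_geometry_type(code) for code in codes]
```

With this helper, `validate_geometry_types([3, 6])` describes a column that mixes polygons and multipolygons, both in XY.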

#### Geospatial Notes

The Geometry class hierarchy and its WKT and WKB serializations (ISO supporting
XY, XYZ, XYM, XYZM) are defined by [OpenGIS Implementation Specification for
Geographic information – Simple feature access – Part 1: Common architecture](
https://portal.ogc.org/files/?artifact_id=25355), from [OGC (Open Geospatial
Consortium)](https://www.ogc.org/standard/sfa/).

The version of the OGC standard first used here is 1.2.1, but future versions
may also be used if the WKB representation remains wire-compatible.

## UNKNOWN (always null)

Sometimes, when discovering the schema of existing data, values are always null
70 changes: 70 additions & 0 deletions src/main/thrift/parquet.thrift
@@ -237,6 +237,37 @@ struct SizeStatistics {
3: optional list<i64> definition_level_histogram;
}

/**
* Bounding box of geometries in the representation of min/max value pair of
* coordinates from each axis.
*/
struct BoundingBox {
/** Min X value when edges = PLANAR, westmost value if edges = SPHERICAL */
1: required double xmin;

Contributor: Small nit (maybe not necessary), but consider using x_min, x_max, etc. I'd need to review the file to see whether there is any prior art for consistency. I guess the values here are consistent with the GeoParquet spec?

Member: Yes ( https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#bbox ), and there is some prior art in other spatial libraries, too (e.g., https://github.com/libgeos/geos/blob/3.12/capi/geos_c.h.in#L1661-L1664 ). x_max is certainly fine as well!

/** Max X value when edges = PLANAR, eastmost value if edges = SPHERICAL */
2: required double xmax;
/** Min Y value when edges = PLANAR, southmost value if edges = SPHERICAL */
3: required double ymin;
/** Max Y value when edges = PLANAR, northmost value if edges = SPHERICAL */
4: required double ymax;
/** Min Z value if the axis exists */
5: optional double zmin;
/** Max Z value if the axis exists */
6: optional double zmax;
/** Min M value if the axis exists */
7: optional double mmin;
/** Max M value if the axis exists */
8: optional double mmax;
}

/** Statistics specific to GEOMETRY logical type */
struct GeometryStatistics {
/** A bounding box of geometries */
1: optional BoundingBox bbox;
/** Geometry type codes of all geometries, or an empty list if not known */
2: optional list<i32> geometry_types;
}

/**
* Statistics per row group and per page
* All fields are optional.
@@ -380,6 +411,40 @@ struct JsonType {
struct BsonType {
}

/** Physical type and encoding for the geometry type */
enum GeometryEncoding {
/**
* Allowed for physical type: BYTE_ARRAY.
*
* Well-known binary (WKB) representations of geometries.
*/
WKB = 0;
}

/** Interpretation for edges of elements of a GEOMETRY type */
enum Edges {
PLANAR = 0;
SPHERICAL = 1;
}

/**
* GEOMETRY logical type annotation (added in 2.11.0)
*
* GeometryEncoding and Edges are required. In order to correctly interpret
* geometry data, writer implementations SHOULD always set them, and reader
* implementations SHOULD fail for unknown values.
*
* CRS is optional. Once CRS is set, it MUST be a key to an entry in the
* `key_value_metadata` field of `FileMetaData`.
*
* See LogicalTypes.md for detail.
*/
struct GeometryType {
1: required GeometryEncoding encoding;
2: required Edges edges;
3: optional string crs;
}

/**
* LogicalType annotations to replace ConvertedType.
*
@@ -410,6 +475,7 @@ union LogicalType {
13: BsonType BSON // use ConvertedType BSON
14: UUIDType UUID // no compatible ConvertedType
15: Float16Type FLOAT16 // no compatible ConvertedType
16: GeometryType GEOMETRY // no compatible ConvertedType
}

/**
@@ -850,6 +916,9 @@ struct ColumnMetaData {
* filter pushdown.
*/
16: optional SizeStatistics size_statistics;

/** Optional statistics specific to GEOMETRY logical type */
17: optional GeometryStatistics geometry_stats;
}

struct EncryptionWithFooterKey {
@@ -980,6 +1049,7 @@ union ColumnOrder {
* ENUM - unsigned byte-wise comparison
* LIST - undefined
* MAP - undefined
* GEOMETRY - undefined
*
* In the absence of logical types, the sort order is determined by the physical type:
* BOOLEAN - false, true