From 8a4b9744994dcca6ef903ddca6afe5c4958b7805 Mon Sep 17 00:00:00 2001
From: Sagar Lakshmipathy <18vidhyasagar@gmail.com>
Date: Tue, 14 Nov 2023 13:16:41 -0800
Subject: [PATCH] Doc updates from delta source tests (#211)

* added notes in glue and athena docs related to hudi 0.14.0 requirement
* fixed grammatical errors
* fixed typos
* update applicable docs to use hive style partitioning
---
 website/docs/athena.md                   |  9 +++++-
 website/docs/features-and-limitations.md |  4 ++-
 website/docs/glue-catalog.md             | 36 +++++++++++++++++++++++-
 website/docs/spark.md                    | 12 +++++++-
 4 files changed, 57 insertions(+), 4 deletions(-)

diff --git a/website/docs/athena.md b/website/docs/athena.md
index a4fd0367..9a9f5ca4 100644
--- a/website/docs/athena.md
+++ b/website/docs/athena.md
@@ -13,4 +13,11 @@ you can create the table either by:
 * Or maintain the tables in Glue Data Catalog
 
 For an end to end tutorial that walks through S3, Glue Data Catalog and Athena to query a OneTable synced table,
-you can refer to the OneTable [Glue Data Catalog Guide](/docs/glue-catalog).
\ No newline at end of file
+you can refer to the OneTable [Glue Data Catalog Guide](/docs/glue-catalog).
+
+:::danger LIMITATION for Hudi target format:
+To validate the Hudi targetFormat table results, you need to ensure that the query engine that you're using
+supports Hudi version 0.14.0 as mentioned [here](/docs/features-and-limitations#hudi).
+Currently, Athena [only supports 0.12.2](https://docs.aws.amazon.com/athena/latest/ug/querying-hudi.html)
+in Athena engine version 3, so querying Hudi targetFormat tables from Athena will not work.
+:::
\ No newline at end of file
diff --git a/website/docs/features-and-limitations.md b/website/docs/features-and-limitations.md
index 9d32d0fa..6fd6aeb0 100644
--- a/website/docs/features-and-limitations.md
+++ b/website/docs/features-and-limitations.md
@@ -25,7 +25,9 @@ This sync provides users with the following:
 - Only Copy-on-Write or Read-Optimized views of tables are currently supported. This means that only the underlying parquet files are synced but log files from Hudi and [delete vectors](https://docs.delta.io/latest/delta-deletion-vectors.html#:~:text=Deletion%20vectors%20indicate%20changes%20to,is%20run%20on%20the%20table.) from Delta and Iceberg are not captured by the sync.
 
 ### Hudi
-- Hudi 0.14.0 is required when reading a Hudi target table. Users will also need to enable the metadata table (`hoodie.metadata.enable=true`) when reading the data.
+- Hudi 0.14.0 is required when reading a Hudi target table. Users will also need to enable
+  - the metadata table (`hoodie.metadata.enable=true`) and
+  - hive style partitioning (`hoodie.datasource.write.hive_style_partitioning=true`) wherever applicable when reading the data.
 - Be sure to enable `parquet.avro.write-old-list-structure=false` for proper compatibility with lists when syncing from Hudi to Iceberg.
 - When using Hudi as the source for an Iceberg target, you may require field IDs set in the parquet schema. To enable that, follow the instructions [here](https://github.com/onetable-io/onetable/tree/main/hudi-support/extensions).
 
diff --git a/website/docs/glue-catalog.md b/website/docs/glue-catalog.md
index cc635c59..159eadf1 100644
--- a/website/docs/glue-catalog.md
+++ b/website/docs/glue-catalog.md
@@ -149,14 +149,48 @@ From your terminal, run the glue crawler.
 Once the crawler succeeds, you’ll be able to query this Iceberg table from Athena, EMR and/or Redshift query engines.
+
+:::danger LIMITATION for Hudi target format:
+To validate the Hudi targetFormat table results, you need to ensure that the query engine that you're using
+supports Hudi version 0.14.0 as mentioned [here](/docs/features-and-limitations#hudi).
+:::
+
 ### Validating the results
-After the crawler runs successfully, you can inspect the catalogued tables in Glue
+After the crawler runs successfully, you can inspect the catalogued tables in Glue and also query the table in Amazon Athena like below:
 
 ```sql
 SELECT * FROM onetable_synced_db.<table_name>;
 ```
+
+### Validating the results
+After the crawler runs successfully, you can inspect the catalogued tables in Glue
+and also query the table in Amazon Athena like below:
+
+```sql
+SELECT * FROM onetable_synced_db.<table_name>;
+```
+
 ## Conclusion
 In this guide we saw how to,
 1. sync a source table to create metadata for the desired target table formats using OneTable
diff --git a/website/docs/spark.md b/website/docs/spark.md
index 8185ec9b..7490c92a 100644
--- a/website/docs/spark.md
+++ b/website/docs/spark.md
@@ -28,8 +28,18 @@ values={[
 
 * For Hudi, refer the [Spark Guide](https://hudi.apache.org/docs/quick-start-guide#spark-shellsql) page
 
+:::danger LIMITATION for Hudi target format:
+To validate the Hudi targetFormat table results, you need to ensure that you're using Hudi version 0.14.0 as mentioned [here](/docs/features-and-limitations#hudi).
+:::
+
 ```python md title="python"
-df = spark.read.format("hudi").load("/path/to/source/data")
+
+hudi_options = {
+    "hoodie.metadata.enable": "true",
+    "hoodie.datasource.write.hive_style_partitioning": "true",
+}
+
+df = spark.read.format("hudi").options(**hudi_options).load("/path/to/source/data")
 ```
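
For readers trying out the settings this patch documents, below is a minimal end-to-end sketch of the Spark read path that the spark.md change describes. It is not part of the patch: the Spark 3.4 / Hudi 0.14.0 bundle coordinate, the application name, and the table path are assumptions added only to make the snippet self-contained.

```python
# Illustrative sketch only (not part of the patch). Assumes Spark 3.4 with the
# Hudi 0.14.0 Spark bundle; the bundle coordinate and table path are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("read-onetable-hudi-target")
    # Hudi 0.14.0 is required to read the synced Hudi target table
    .config("spark.jars.packages", "org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.0")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# The two options the docs call out: enable the metadata table and
# hive style partitioning wherever applicable.
hudi_options = {
    "hoodie.metadata.enable": "true",
    "hoodie.datasource.write.hive_style_partitioning": "true",
}

df = (
    spark.read.format("hudi")
    .options(**hudi_options)
    .load("s3://your-bucket/path/to/synced/table")  # placeholder path
)
df.show(10)
```

Even with these options set, querying the Hudi target table through Athena remains blocked by the engine-version limitation called out in the athena.md note above.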