Doc updates from delta source tests (#211)

* added notes in glue and athena docs related to hudi 0.14.0 requirement
* fixed grammatical errors
* fixed typos
* updated applicable docs to use hive style partitioning
apache · Nov 14, 2023 · commit 8a4b974 · 1 parent 146741d
Showing 4 changed files with 57 additions and 4 deletions.
diff --git a/website/docs/athena.md b/website/docs/athena.md
@@ -13,4 +13,11 @@ you can create the table either by:
 * Or maintain the tables in Glue Data Catalog
 
 For an end to end tutorial that walks through S3, Glue Data Catalog and Athena to query a OneTable synced table,
-you can refer to the OneTable [Glue Data Catalog Guide](/docs/glue-catalog).
+you can refer to the OneTable [Glue Data Catalog Guide](/docs/glue-catalog).
+
+:::danger LIMITATION for Hudi target format:
+To validate the Hudi targetFormat table results, you need to ensure that the query engine that you're using
+supports Hudi version 0.14.0 as mentioned [here](/docs/features-and-limitations#hudi). 
+Currently, Athena [only supports 0.12.2](https://docs.aws.amazon.com/athena/latest/ug/querying-hudi.html) 
+in Athena engine version 3, so querying Hudi targetFormat tables from Athena will not work. 
+:::
diff --git a/website/docs/features-and-limitations.md b/website/docs/features-and-limitations.md
@@ -25,7 +25,9 @@ This sync provides users with the following:
 - Only Copy-on-Write or Read-Optimized views of tables are currently supported. This means that only the underlying parquet files are synced but log files from Hudi and [delete vectors](https://docs.delta.io/latest/delta-deletion-vectors.html#:~:text=Deletion%20vectors%20indicate%20changes%20to,is%20run%20on%20the%20table.) from Delta and Iceberg are not captured by the sync.
 
 ### Hudi
-- Hudi 0.14.0 is required when reading a Hudi target table. Users will also need to enable the metadata table (`hoodie.metadata.enable=true`) when reading the data.
+- Hudi 0.14.0 is required when reading a Hudi target table. Users will also need to enable 
+  - the metadata table (`hoodie.metadata.enable=true`) and 
+  - hive style partitioning (`hoodie.datasource.write.hive_style_partitioning=true`) wherever applicable when reading the data.
 - Be sure to enable `parquet.avro.write-old-list-structure=false` for proper compatibility with lists when syncing from Hudi to Iceberg.
 - When using Hudi as the source for an Iceberg target, you may require field IDs set in the parquet schema. To enable that, follow the instructions [here](https://github.com/onetable-io/onetable/tree/main/hudi-support/extensions).
 

diff --git a/website/docs/glue-catalog.md b/website/docs/glue-catalog.md
@@ -149,14 +149,48 @@ From your terminal, run the glue crawler.
 Once the crawler succeeds, you’ll be able to query this Iceberg table from Athena,
 EMR and/or Redshift query engines.
 
+<Tabs
+groupId="table-format"
+defaultValue="hudi"
+values={[
+{ label: 'targetFormat: HUDI', value: 'hudi', },
+{ label: 'targetFormat: DELTA', value: 'delta', },
+{ label: 'targetFormat: ICEBERG', value: 'iceberg', },
+]}
+>
+
+<TabItem value="hudi">
+
+:::danger LIMITATION for Hudi target format:
+To validate the Hudi targetFormat table results, you need to ensure that the query engine that you're using
+supports Hudi version 0.14.0 as mentioned [here](/docs/features-and-limitations#hudi)
+:::
+
+</TabItem>
+<TabItem value="delta">
+
 ### Validating the results
-After the crawler runs successfully, you can inspect the catalogued tables in Glue 
+After the crawler runs successfully, you can inspect the catalogued tables in Glue
 and also query the table in Amazon Athena like below:
 
 ```sql
 SELECT * FROM onetable_synced_db.<table_name>;
 ```
 
+</TabItem>
+<TabItem value="iceberg">
+
+### Validating the results
+After the crawler runs successfully, you can inspect the catalogued tables in Glue
+and also query the table in Amazon Athena like below:
+
+```sql
+SELECT * FROM onetable_synced_db.<table_name>;
+```
+
+</TabItem>
+</Tabs>
+
 ## Conclusion
 In this guide we saw how to,
 1. sync a source table to create metadata for the desired target table formats using OneTable

diff --git a/website/docs/spark.md b/website/docs/spark.md
@@ -28,8 +28,18 @@ values={[
 
 * For Hudi, refer the [Spark Guide](https://hudi.apache.org/docs/quick-start-guide#spark-shellsql) page
 
+:::danger LIMITATION for Hudi target format:
+To validate the Hudi targetFormat table results, you need to ensure that you're using Hudi version 0.14.0 as mentioned [here](/docs/features-and-limitations#hudi)
+:::
+
 ```python md title="python"
-df = spark.read.format("hudi").load("/path/to/source/data")
+
+hudi_options = {
+    "hoodie.metadata.enable": "true",
+    "hoodie.datasource.write.hive_style_partitioning": "true",
+}
+
+df = spark.read.format("hudi").options(**hudi_options).load("/path/to/source/data")
 ```
 
 </TabItem>
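
The version requirement these doc changes describe (Hudi 0.14.0 for reading a Hudi target table, with Athena engine version 3 at 0.12.2) can be sketched as a small pre-flight check. This is only an illustrative sketch: `supports_hudi_target` and `hudi_read_options` are hypothetical helper names, not part of OneTable or Hudi, and the check assumes that any version at or above 0.14.0 is acceptable, which the docs themselves do not state.

```python
from typing import Dict, Tuple

# Version named in features-and-limitations.md for reading Hudi target tables.
REQUIRED_HUDI_VERSION = "0.14.0"


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '0.14.0' into a comparable tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def supports_hudi_target(engine_hudi_version: str,
                         required: str = REQUIRED_HUDI_VERSION) -> bool:
    """Hypothetical check: is the engine's Hudi version at least the required one?

    Assumption: versions newer than 0.14.0 are treated as compatible.
    """
    return parse_version(engine_hudi_version) >= parse_version(required)


def hudi_read_options() -> Dict[str, str]:
    """Return the two options this commit adds to the Spark read example."""
    return {
        "hoodie.metadata.enable": "true",
        "hoodie.datasource.write.hive_style_partitioning": "true",
    }


if __name__ == "__main__":
    # Athena engine version 3 is documented as supporting Hudi 0.12.2.
    print(supports_hudi_target("0.12.2"))
    print(supports_hudi_target("0.14.0"))
```

The dictionary from `hudi_read_options()` could be passed to `spark.read.format("hudi").options(**...)` in the same way as the `hudi_options` dictionary in the spark.md change.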