Commit

Doc updates from delta source tests (#211)
* added notes in glue and athena docs related to hudi 0.14.0 requirement

* fixed grammatical errors

* fixed typos

* update applicable docs to use hive style partitioning
sagarlakshmipathy authored Nov 14, 2023
1 parent 146741d commit 8a4b974
Showing 4 changed files with 57 additions and 4 deletions.
9 changes: 8 additions & 1 deletion website/docs/athena.md
@@ -13,4 +13,11 @@ you can create the table either by:
* Or maintain the tables in Glue Data Catalog

For an end-to-end tutorial that walks through S3, Glue Data Catalog and Athena to query a OneTable synced table,
you can refer to the OneTable [Glue Data Catalog Guide](/docs/glue-catalog).

:::danger LIMITATION for Hudi target format:
To validate the Hudi targetFormat table results, you need to ensure that the query engine that you're using
supports Hudi version 0.14.0 as mentioned [here](/docs/features-and-limitations#hudi).
Currently, Athena [only supports 0.12.2](https://docs.aws.amazon.com/athena/latest/ug/querying-hudi.html)
in Athena engine version 3, so querying Hudi targetFormat tables from Athena will not work.
:::
4 changes: 3 additions & 1 deletion website/docs/features-and-limitations.md
@@ -25,7 +25,9 @@ This sync provides users with the following:
- Only Copy-on-Write or Read-Optimized views of tables are currently supported. This means that only the underlying Parquet files are synced; log files from Hudi and [deletion vectors](https://docs.delta.io/latest/delta-deletion-vectors.html#:~:text=Deletion%20vectors%20indicate%20changes%20to,is%20run%20on%20the%20table.) from Delta and Iceberg are not captured by the sync.

### Hudi
- Hudi 0.14.0 is required when reading a Hudi target table. Users will also need to enable the metadata table (`hoodie.metadata.enable=true`) when reading the data.
- Hudi 0.14.0 is required when reading a Hudi target table. Users will also need to enable the following, wherever applicable, when reading the data:
  - the metadata table (`hoodie.metadata.enable=true`)
  - Hive-style partitioning (`hoodie.datasource.write.hive_style_partitioning=true`)
- Be sure to enable `parquet.avro.write-old-list-structure=false` for proper compatibility with lists when syncing from Hudi to Iceberg.
- When using Hudi as the source for an Iceberg target, you may require field IDs set in the parquet schema. To enable that, follow the instructions [here](https://github.com/onetable-io/onetable/tree/main/hudi-support/extensions).
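
As a rough illustration of the Hudi read requirements listed above, the sketch below builds the reader options and refuses to proceed when the engine's Hudi version is older than 0.14.0. The helper names (`parse_version`, `hudi_read_options`) are hypothetical and not part of OneTable or Hudi; only the two option keys and the 0.14.0 minimum are taken from the notes above. It also assumes plain dotted numeric version strings such as `0.14.0`.

```python
# Hypothetical helpers for illustration only; not part of OneTable or Hudi.
# The option keys and the 0.14.0 minimum come from the notes above.
MIN_HUDI_VERSION = (0, 14, 0)


def parse_version(version: str) -> tuple:
    """Turn a dotted numeric version such as '0.14.0' into (0, 14, 0)."""
    return tuple(int(part) for part in version.split("."))


def hudi_read_options(engine_hudi_version: str, hive_style_partitioning: bool = True) -> dict:
    """Return reader options for a Hudi target table, or raise if the engine is too old."""
    if parse_version(engine_hudi_version) < MIN_HUDI_VERSION:
        raise ValueError(
            f"Hudi {engine_hudi_version} is older than 0.14.0; "
            "reading the Hudi target table is not expected to work"
        )
    options = {"hoodie.metadata.enable": "true"}
    # Enable hive style partitioning only where it applies to the table.
    if hive_style_partitioning:
        options["hoodie.datasource.write.hive_style_partitioning"] = "true"
    return options
```

The returned dictionary could then be passed to a Spark reader, for example `spark.read.format("hudi").options(**hudi_read_options("0.14.0"))`, assuming a Spark session with a Hudi 0.14.0 bundle on the classpath.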

36 changes: 35 additions & 1 deletion website/docs/glue-catalog.md
@@ -149,14 +149,48 @@ From your terminal, run the glue crawler.
Once the crawler succeeds, you’ll be able to query this Iceberg table from Athena,
EMR and/or Redshift query engines.

<Tabs
groupId="table-format"
defaultValue="hudi"
values={[
{ label: 'targetFormat: HUDI', value: 'hudi', },
{ label: 'targetFormat: DELTA', value: 'delta', },
{ label: 'targetFormat: ICEBERG', value: 'iceberg', },
]}
>

<TabItem value="hudi">

:::danger LIMITATION for Hudi target format:
To validate the Hudi targetFormat table results, you need to ensure that the query engine that you're using
supports Hudi version 0.14.0 as mentioned [here](/docs/features-and-limitations#hudi).
:::

</TabItem>
<TabItem value="delta">

### Validating the results
After the crawler runs successfully, you can inspect the catalogued tables in Glue
and also query the table in Amazon Athena like below:

```sql
SELECT * FROM onetable_synced_db.<table_name>;
```

</TabItem>
<TabItem value="iceberg">

### Validating the results
After the crawler runs successfully, you can inspect the catalogued tables in Glue
and also query the table in Amazon Athena like below:

```sql
SELECT * FROM onetable_synced_db.<table_name>;
```

</TabItem>
</Tabs>

## Conclusion
In this guide, we saw how to:
1. sync a source table to create metadata for the desired target table formats using OneTable
12 changes: 11 additions & 1 deletion website/docs/spark.md
@@ -28,8 +28,18 @@ values={[

* For Hudi, refer to the [Spark Guide](https://hudi.apache.org/docs/quick-start-guide#spark-shellsql) page

:::danger LIMITATION for Hudi target format:
To validate the Hudi targetFormat table results, you need to ensure that you're using Hudi version 0.14.0 as mentioned [here](/docs/features-and-limitations#hudi).
:::

```python md title="python"
df = spark.read.format("hudi").load("/path/to/source/data")

hudi_options = {
    "hoodie.metadata.enable": "true",
    "hoodie.datasource.write.hive_style_partitioning": "true",
}

df = spark.read.format("hudi").options(**hudi_options).load("/path/to/source/data")
```

</TabItem>