feat: Direct iceberg table reading #5880

Draft
wants to merge 1 commit into
base: main

Conversation

devinrsmith
Member

This adds support for reading Iceberg tables directly from a specific metadata file, without needing a catalog (although a catalog may be present).

At a minimum, this should be a very useful tool for debugging Iceberg issues. In some cases, it may be the best way to read Iceberg data, as a catalog may not be supported. For example, the ClickHouse Iceberg integration uses direct access without catalog support (no table name, no namespace, etc.):

$ aws s3 ls --recursive --no-sign-request s3://datasets-documentation/ookla/iceberg/
2024-01-22 07:46:40          0 ookla/iceberg/
2024-01-22 08:48:38  611058150 ookla/iceberg/data/7XNeNQ/year_month_year=2019/20240122_164644_00156_m96dt-a29b72df-0432-46db-8194-7cb911f08800.parquet
2024-01-22 08:47:14  756200550 ookla/iceberg/data/87-7xw/year_month_year=2020/20240122_164644_00156_m96dt-63c79f23-dd64-4a7e-890b-700b722b5d03.parquet
2024-01-22 08:47:07  767259012 ookla/iceberg/data/Mmyt8A/year_month_year=2021/20240122_164644_00156_m96dt-bb51598a-6f86-4d19-9717-bcef163a4f05.parquet
2024-01-22 08:47:01  781589111 ookla/iceberg/data/X9Wyog/year_month_year=2022/20240122_164644_00156_m96dt-d27f1b84-bb50-4205-b323-990a32e18ff6.parquet
2024-01-22 08:47:01  836014231 ookla/iceberg/data/wRhLaA/year_month_year=2023/20240122_164644_00156_m96dt-e5ce13ef-f40c-4834-b032-39e403121d3c.parquet
2024-01-22 08:25:30       1968 ookla/iceberg/metadata/00000-6bfbd5a5-c431-4a41-98c8-12328da25947.metadata.json
2024-01-22 08:49:55       3107 ookla/iceberg/metadata/00001-ad43ea5c-fd93-474c-93eb-2e8400c925aa.metadata.json
2024-01-22 08:49:53       8347 ookla/iceberg/metadata/a3a81488-f4ec-42ad-9819-54527e7f6385-m0.avro
2024-01-22 08:49:54       4280 ookla/iceberg/metadata/snap-8326954415243093563-1-a3a81488-f4ec-42ad-9819-54527e7f6385.avro

SELECT
  *
FROM
  iceberg('https://datasets-documentation.s3.eu-west-3.amazonaws.com/ookla/iceberg/')

https://clickhouse.com/blog/exploring-global-internet-speeds-with-apache-iceberg-clickhouse
https://clickhouse.com/docs/en/sql-reference/table-functions/iceberg

With this PR, the equivalent in Deephaven would be:

from deephaven.experimental import iceberg

ookla = iceberg.read_static_table(
    "s3://datasets-documentation/ookla/iceberg/metadata/00001-ad43ea5c-fd93-474c-93eb-2e8400c925aa.metadata.json"
)

(For ease of use, this slightly more verbose version works out of the box without relying on implicit AWS credentials:

from deephaven.experimental import iceberg, s3
from datetime import timedelta

ookla = iceberg.read_static_table(
    "s3://datasets-documentation/ookla/iceberg/metadata/00001-ad43ea5c-fd93-474c-93eb-2e8400c925aa.metadata.json",
    instructions=iceberg.IcebergInstructions(
        data_instructions=s3.S3Instructions(
            region_name="eu-west-3",
            anonymous_access=True,
            read_timeout=timedelta(seconds=10),
        )
    ),
)

)
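For context on what "a specific metadata file" carries, here is a minimal sketch of pulling the current snapshot's manifest list out of a metadata JSON document. It is not how this PR does it (the PR goes through the Iceberg Java library); the field names `current-snapshot-id`, `snapshots`, `snapshot-id`, and `manifest-list` come from the Iceberg table spec, while the function name and the sample document are made up for illustration.

```python
import json
from typing import Optional


def current_manifest_list(metadata_json: str) -> Optional[str]:
    """Return the manifest-list location of the current snapshot, if any.

    Hypothetical helper for illustration only; not part of this PR.
    """
    metadata = json.loads(metadata_json)
    current_id = metadata.get("current-snapshot-id")
    # A missing id (or -1, which some writers use) means there is no current snapshot.
    if current_id is None or current_id == -1:
        return None
    for snapshot in metadata.get("snapshots", []):
        if snapshot.get("snapshot-id") == current_id:
            # Format-version 1 tables may omit "manifest-list"; then this returns None.
            return snapshot.get("manifest-list")
    return None


# Illustrative sample only; these are placeholder values, not the ookla table's metadata.
sample = json.dumps({
    "format-version": 2,
    "location": "s3://example-bucket/tbl",
    "current-snapshot-id": 2,
    "snapshots": [
        {"snapshot-id": 1, "manifest-list": "s3://example-bucket/tbl/metadata/snap-1.avro"},
        {"snapshot-id": 2, "manifest-list": "s3://example-bucket/tbl/metadata/snap-2.avro"},
    ],
})
print(current_manifest_list(sample))
```

Everything needed to find the data files hangs off that one JSON document, which is why a catalog is not strictly required for a static read.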

There's potential to extend this support to point to the root of the table location (as ClickHouse supports) instead of a specific metadata file, i.e., s3://datasets-documentation/ookla/iceberg/, but that would take some additional logic.
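As a rough idea of what that additional logic might look like, here is a sketch that picks the highest-numbered metadata file from a listing of object keys. It assumes the `<version>-<uuid>.metadata.json` naming seen in the listing above; the function name is hypothetical, and this is not something the PR implements. Other layouts (for example, Hadoop-style tables that use `vN.metadata.json` with a version hint file) would need different handling, and the highest-numbered file is not guaranteed to be the one a catalog would consider current.

```python
import re
from typing import Iterable, Optional

# Assumed naming convention: <version>-<uuid>.metadata.json under some prefix.
_METADATA_RE = re.compile(r"(?:^|/)(\d+)-[^/]+\.metadata\.json$")


def latest_metadata_file(keys: Iterable[str]) -> Optional[str]:
    """Return the key with the highest numeric version prefix, or None if none match.

    Hypothetical helper for illustration only.
    """
    best_key = None
    best_version = -1
    for key in keys:
        match = _METADATA_RE.search(key)
        if match is None:
            continue  # data files, manifests, and manifest lists are skipped
        version = int(match.group(1))
        if version > best_version:
            best_version = version
            best_key = key
    return best_key


# Keys taken from the `aws s3 ls` listing in the description.
keys = [
    "ookla/iceberg/metadata/00000-6bfbd5a5-c431-4a41-98c8-12328da25947.metadata.json",
    "ookla/iceberg/metadata/00001-ad43ea5c-fd93-474c-93eb-2e8400c925aa.metadata.json",
    "ookla/iceberg/metadata/a3a81488-f4ec-42ad-9819-54527e7f6385-m0.avro",
    "ookla/iceberg/metadata/snap-8326954415243093563-1-a3a81488-f4ec-42ad-9819-54527e7f6385.avro",
]
print(latest_metadata_file(keys))
```

For the listing above this selects the `00001-...` file, the same one passed to `read_static_table` in the examples.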

@devinrsmith devinrsmith added this to the 0.36.0 milestone Aug 1, 2024
@devinrsmith devinrsmith self-assigned this Aug 1, 2024
@devinrsmith
Member Author

This is missing documentation, as I want to make sure there's some agreement on the interfaces before proceeding.

@devinrsmith devinrsmith changed the title from "Direct iceberg table reading" to "feat: Direct iceberg table reading" Aug 1, 2024
Contributor

@lbooker42 lbooker42 left a comment

This is very interesting and would be a great and fast way to load static Iceberg tables.

@NotNull final Schema schema,
@NotNull final org.apache.iceberg.Table table,
@NotNull final IcebergInstructions userInstructions) {
return TableTools.newTable(tableDefinition(schema, table, userInstructions, -1));
Contributor

This works for static tables, but for refreshing tables I believe we'll need to return a PartitionAwareSourceTable with zero partitions (or the IcebergTable equivalent) in order to populate data with discovered files.

Comment on lines +63 to +64
// final HadoopTables tables = new HadoopTables(hadoopConf);
// final org.apache.iceberg.Table table = tables.load(uri);
Contributor

Leftover debugging code?

Suggested change
// final HadoopTables tables = new HadoopTables(hadoopConf);
// final org.apache.iceberg.Table table = tables.load(uri);

IcebergInstructions instructions,
Map<String, String> properties,
Configuration hadoopConf) {
// final HadoopTables tables = new HadoopTables(hadoopConf);
Contributor

Leftover debug code...

@@ -0,0 +1,29 @@
//
Contributor

These adapters are interesting. In #5754 we are creating data instruction objects (S3Instructions etc.) from the properties maps. This is doing the inverse, right?
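To illustrate the two directions being discussed, here is a small sketch of a settings object converted to and from an Iceberg-style properties map. `S3Settings` is a hypothetical stand-in, not Deephaven's `S3Instructions`, and the field-to-key mapping is an assumption about what such an adapter might do rather than what either PR implements. The property keys themselves (`client.region`, `s3.endpoint`, `s3.access-key-id`, `s3.secret-access-key`) are ones Iceberg's AWS integration recognizes.

```python
from dataclasses import dataclass
from typing import Dict, Mapping, Optional


@dataclass(frozen=True)
class S3Settings:
    """Hypothetical stand-in for a data-instructions object (illustration only)."""
    region_name: Optional[str] = None
    endpoint_override: Optional[str] = None
    access_key_id: Optional[str] = None
    secret_access_key: Optional[str] = None


# Assumed mapping between settings fields and Iceberg property keys.
_FIELD_TO_PROPERTY = {
    "region_name": "client.region",
    "endpoint_override": "s3.endpoint",
    "access_key_id": "s3.access-key-id",
    "secret_access_key": "s3.secret-access-key",
}


def to_properties(settings: S3Settings) -> Dict[str, str]:
    """Instructions -> properties map; unset fields are omitted."""
    return {
        prop: getattr(settings, field)
        for field, prop in _FIELD_TO_PROPERTY.items()
        if getattr(settings, field) is not None
    }


def from_properties(properties: Mapping[str, str]) -> S3Settings:
    """Properties map -> instructions; unrecognized keys are ignored."""
    return S3Settings(**{
        field: properties[prop]
        for field, prop in _FIELD_TO_PROPERTY.items()
        if prop in properties
    })


settings = S3Settings(region_name="eu-west-3")
print(to_properties(settings))
print(from_properties(to_properties(settings)) == settings)
```

Keeping a single mapping table for both directions is one way to stop the two conversions from drifting apart.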

@devinrsmith
Member Author

This is partially related to #5868, at least in that it refactors the TableDefinition logic and exposes it to end users for the static entrypoints.

3 participants