[FEAT]: Daft support for Azure storage for Unity Catalog daft.read_deltalake #3025

Open
wants to merge 4 commits into
base: main

anilmenon14

This PR makes two changes:

  1. Credential vending for Azure in the UnityCatalog class method load_table:
    After the temporary credentials are retrieved from the Unity Catalog credential vending API endpoint, azure_user_delegation_sas.get('sas_token') is now passed as 'sas_token' to the AzureConfig inside the IOConfig object. DeltaLake can then use this token downstream.
    The code that handles the existing, working S3 functionality is unchanged; it is now wrapped in conditional logic keyed on the storage system involved (i.e. S3, ADLS, or GCS).
  2. Handling the Azure storage system in DeltaLakeScanOperator:
    Conditional logic ensures that the S3 handling logic does not negatively affect daft.read_deltalake() calls against other storage systems (e.g. Azure or GCS).
  • For S3, the logic continues to work as before.
  • For other storage systems (in this case, Azure in particular), deltalake_sdk_io_config is left unchanged and passed down to DeltaLake without modification.
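
As a rough illustration of the first change, the scheme-based selection of credentials might look like the sketch below. The dataclasses are hypothetical stand-ins for Daft's IOConfig and AzureConfig so that the snippet runs without Daft installed, and the function name and parameters are assumptions, not the code in this PR.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass
class AzureConfigSketch:
    """Hypothetical stand-in for daft.io.AzureConfig (only the field used here)."""
    sas_token: Optional[str] = None


@dataclass
class IOConfigSketch:
    """Hypothetical stand-in for daft.io.IOConfig (only the field used here)."""
    azure: Optional[AzureConfigSketch] = None


# URL schemes treated as Azure storage, as in the diff under review.
AZURE_SCHEMES = {"az", "abfs", "abfss"}


def io_config_from_vended_credentials(
    storage_location: str,
    azure_user_delegation_sas: Optional[dict],
) -> Optional[IOConfigSketch]:
    """Build an IO config from vended credentials, keyed on the URL scheme."""
    scheme = urlparse(storage_location).scheme
    if scheme in AZURE_SCHEMES:
        # Tolerate a missing credential block; the token is then simply None.
        sas = azure_user_delegation_sas or {}
        return IOConfigSketch(azure=AzureConfigSketch(sas_token=sas.get("sas_token")))
    # S3 keeps its existing handling and GCS is not implemented yet;
    # both are left out of this sketch.
    return None
```

Returning None for the other schemes is only a simplification for the sketch; in the PR the S3 branch keeps its existing behaviour.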

Placeholder conditional blocks for Azure and GCS have been added so that any special handling of deltalake_sdk_io_config can be implemented there in the future.
Note: GCS support has not been added yet. Once I have a better understanding of credential vending for GCS, I can add it in a future PR.
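
The branching described for DeltaLakeScanOperator could be sketched roughly as follows. This is a minimal illustration, not the operator's actual code: the function name, the adjust_for_s3 hook, and the exact scheme sets are assumptions introduced here.

```python
from typing import Any, Callable, Dict
from urllib.parse import urlparse

# Scheme groupings are assumptions made for this sketch.
S3_SCHEMES = {"s3", "s3a"}
AZURE_SCHEMES = {"az", "abfs", "abfss"}
GCS_SCHEMES = {"gs", "gcs"}


def prepare_sdk_io_config(
    table_uri: str,
    sdk_io_config: Dict[str, Any],
    adjust_for_s3: Callable[[Dict[str, Any]], Dict[str, Any]],
) -> Dict[str, Any]:
    """Apply S3-specific handling only to S3 URIs; pass everything else through."""
    scheme = urlparse(table_uri).scheme
    if scheme in S3_SCHEMES:
        # Existing S3 behaviour, represented here by a caller-supplied hook.
        return adjust_for_s3(sdk_io_config)
    if scheme in AZURE_SCHEMES:
        # Placeholder branch: no Azure-specific changes are needed today.
        return sdk_io_config
    if scheme in GCS_SCHEMES:
        # Placeholder branch: GCS support is not implemented yet.
        return sdk_io_config
    return sdk_io_config
```

Keeping the Azure and GCS branches explicit, even though they currently do nothing, marks where storage-specific handling would go later.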


codspeed-hq bot commented Oct 10, 2024

CodSpeed Performance Report

Merging #3025 will not alter performance

Comparing anilmenon14:unity-azure-support (27bba26) with main (73ff3f3)

Summary

✅ 17 untouched benchmarks

@anilmenon14 anilmenon14 changed the title Daft support for Azure storage for Unity Catalog daft.read_deltalake [FEAT]: Daft support for Azure storage for Unity Catalog daft.read_deltalake Oct 10, 2024
@github-actions github-actions bot added the enhancement New feature or request label Oct 10, 2024
Contributor

@jaychia jaychia left a comment

Looking pretty good overall! One question about the azure temp credentials stuff.

I think this PR could be cleaned up by adding a little IOConfig factory class, perhaps in a new daft.io.config_factories.* module, with two factories, FromEnv and FromUnity, that handle the deduplication of the if scheme == "..." logic for us.

But I would think that's not in scope here, we can do that as a follow-on cleanup.
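
One possible shape for that follow-on cleanup is sketched below. Everything here is hypothetical: FromEnv and FromUnity are only names floated in this comment, IOConfigSketch stands in for Daft's IOConfig, and reading AZURE_STORAGE_SAS_TOKEN from the environment is an assumption made for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional
from urllib.parse import urlparse

AZURE_SCHEMES = {"az", "abfs", "abfss"}


@dataclass
class IOConfigSketch:
    """Hypothetical, flattened stand-in for daft.io.IOConfig."""
    azure_sas_token: Optional[str] = None


def _is_azure(uri: str) -> bool:
    # The scheme check lives in one place instead of being repeated per call site.
    return urlparse(uri).scheme in AZURE_SCHEMES


class FromEnv:
    """Builds a config from environment variables (variable name is an assumption)."""

    def __init__(self, environ: Optional[Mapping[str, str]] = None) -> None:
        self._environ = os.environ if environ is None else environ

    def build(self, uri: str) -> IOConfigSketch:
        if _is_azure(uri):
            return IOConfigSketch(azure_sas_token=self._environ.get("AZURE_STORAGE_SAS_TOKEN"))
        return IOConfigSketch()


class FromUnity:
    """Builds a config from credentials vended by Unity Catalog."""

    def __init__(self, azure_user_delegation_sas: Optional[dict]) -> None:
        self._sas = azure_user_delegation_sas or {}

    def build(self, uri: str) -> IOConfigSketch:
        if _is_azure(uri):
            return IOConfigSketch(azure_sas_token=self._sas.get("sas_token"))
        return IOConfigSketch()
```

Both factories expose the same build(uri) method, so callers would not need to know where the credentials came from.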

    pass
elif scheme == "az" or scheme == "abfs" or scheme == "abfss":
    io_config = IOConfig(
        azure=AzureConfig(sas_token=temp_table_credentials.azure_user_delegation_sas.get("sas_token"))
Contributor

Great, do we know what version of the unity catalog SDK this requires? I feel like this might have been added in a later version and might need us to regenerate the SDK. Have you tested this yet?

Author

Hey @jaychia, I think this should be fine with the version that currently ships. I tested on Daft 0.3.4, which includes unitycatalog==0.1.1, and on this SDK version the sas_token is returned successfully.
I tested this internally by walking through the function execution flow in a virtual environment created with Daft 0.3.4 and no other upgrades, and it worked well for me. I was able to use the vended credentials in subsequent calls by instantiating a DeltaLakeScanOperator object from them.
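
For reference, the user-facing path being exercised is roughly the following, based on Daft's documented Unity Catalog usage. The helper name is hypothetical, and the endpoint, token, and table name are placeholders to be supplied by the caller; this is a sketch, not the exact test that was run.

```python
def read_unity_table(endpoint: str, token: str, table_name: str):
    """Load a Unity Catalog table and read it as a Daft DataFrame."""
    # Imported lazily so this helper can be defined without Daft installed.
    import daft
    from daft.unity_catalog import UnityCatalog

    unity = UnityCatalog(endpoint=endpoint, token=token)
    # load_table is where the temporary credentials are vended.
    table = unity.load_table(table_name)
    return daft.read_deltalake(table)
```

With this PR, a table whose storage location uses an az, abfs, or abfss scheme would carry the vended SAS token through to the Delta Lake read.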

Contributor

Amazing

Contributor

jaychia commented Oct 10, 2024

@kevinzwang could you handle follow-up here WRT the SDK generation and getting this PR in? This looks really good already thanks @anilmenon14 !!

Labels
enhancement New feature or request

2 participants