Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Use DH S3Instructions to build Iceberg AWS clients #6113

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

devinrsmith
Copy link
Member

@devinrsmith devinrsmith commented Sep 23, 2024

This provides a way for users who are responsible for providing AWS / S3 credentials to specify it in a way where Deephaven can own the S3 client building logic for the Iceberg Catalog in additional to our own data access layer.

Note, this does not deprecate DataInstructionsProviderPlugin, as there may be cases where the user is not responsible for providing these credentials, and it is instead provided via the catalog after catalog authorization. See #6191

@devinrsmith devinrsmith added this to the 0.37.0 milestone Sep 23, 2024
@devinrsmith devinrsmith self-assigned this Sep 23, 2024
@@ -15,9 +16,11 @@
import static org.apache.iceberg.aws.s3.S3FileIOProperties.ENDPOINT;
import static org.apache.iceberg.aws.s3.S3FileIOProperties.SECRET_ACCESS_KEY;

@Tag("testcontainers")
@Deprecated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we had to deprecate these?
They might still give us coverage on a lot of smaller cases that Larry tested.
Is it deprecated in a sense that new tests should not be added?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's mostly that I think we should not add new tests here, and work to migrating them as mentioned in IcebergToolsTest. @lbooker42 has so far owned this layer, but it should be relatively easy to migrate to db_resource, and then we get the benefit of:

  1. No separate container (s3) needed (+ no need to upload files)
  2. Ne need to have custom catalog IcebergTestCatalog
  3. On disk JDBC catalog + on disk warehouse

I see db_resource as mainly a way to test out how well we can interoperate with Iceberg that has been written via different processes (pyiceberg, spark, etc).

For more thorough testing (once we have our own writing support) we should be able to extend SqliteCatalogBase (or create further specialized tests that look similar to it), which can work with any warehouse - currently the same logic is tested out via local disk, minio, and localstack.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, these new catalog testing code is the more comprehensive way for all future tests. Once we can migrate these tests so we don't lose any coverage, we should remove these as well as the IcebergTestCatalog class.
cc: @lbooker42

@Override
public S3Client s3() {
checkInit();
return S3ClientFactory.getSyncClient(instructions);
}

@Override
public GlueClient glue() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the glue, kms and dynamodb clients creation covered under any tests?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No; you could argue we should prefer to throw unsupported for DynamoDB / KMS, but it was easy enough construct those clients if some Iceberg catalog happens to use them (I'm not sure under what circumstances, but it is part of that API...).

I don't know if we've done any Glue testing. https://docs.localstack.cloud/user-guide/aws/glue/ might be possible with the "pro" version, not sure what that entails. https://docs.localstack.cloud/user-guide/aws/glue/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, the implementation does seem straightforward and the internal methods are being touched in the S3 code, so I think that part should be okay.

The only thing I am slightly worried about is preloading in case of kms and dynamo clients.
You have a comment stating glue client is preloaded, is it possible to verify the same for kms and dynamo too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done some manual testing of the Glue Catalog with the DeephavenAwsClientFactory, and it works. It was a good exercise though, because I realized that there is a new S3Client built for ever Table load (at least for the Glue catalog), so we can't immediately invoke the cleanup Runnable. After a quick change, it worked.

The only caller I see of kms() is LakeFormationAwsClientFactory - it's possible that some private Catalogs use it, but I'm not sure. It seems like this factory has something do do with Role ARN, and it may not actually be in use anymore since I think Iceberg has expanded how configurable AWS is since the creation of it.

The only caller of dynamodb() is DynamoDbCatalog; it may be possible to test it out with Localstack since it looks to be freely available, but it's just a matter of how much time do we want to spend trying? https://docs.localstack.cloud/user-guide/aws/dynamodb/

Copy link
Contributor

@malhotrashivam malhotrashivam Oct 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for checking into this.
At this point, maybe just add a TODO comment for kms and dynamodb, for testing it in future?

py/server/deephaven/experimental/iceberg.py Outdated Show resolved Hide resolved
@Override
public S3Client s3() {
checkInit();
return S3ClientFactory.getSyncClient(instructions);
}

@Override
public GlueClient glue() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, the implementation does seem straightforward and the internal methods are being touched in the S3 code, so I think that part should be okay.

The only thing I am slightly worried about is preloading in case of kms and dynamo clients.
You have a comment stating glue client is preloaded, is it possible to verify the same for kms and dynamo too?

@@ -15,9 +16,11 @@
import static org.apache.iceberg.aws.s3.S3FileIOProperties.ENDPOINT;
import static org.apache.iceberg.aws.s3.S3FileIOProperties.SECRET_ACCESS_KEY;

@Tag("testcontainers")
@Deprecated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, these new catalog testing code is the more comprehensive way for all future tests. Once we can migrate these tests so we don't lose any coverage, we should remove these as well as the IcebergTestCatalog class.
cc: @lbooker42

Comment on lines +64 to +65
throw new IllegalStateException(
"This S3Iinstructions were already cleaned up; please ensure that the returned Runnable from addToProperties is not invoked until the Catalog is no longer in use.");
Copy link
Contributor

@malhotrashivam malhotrashivam Oct 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is so weird that spotless let this one pass.
Also, typo in S3Instructions.

Comment on lines +146 to +147
CLEANER.register(adapter.catalog(), cleanup);
return adapter;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I had to do something similar for the S3Request objects, Ryan suggested using CleanupReferenceProcessor. You can check that too, if that has any advantages.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants