Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-43535: [C++] Support the AWS S3 SSE-C encryption #43601

Open
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

ripplehang
Copy link

@ripplehang ripplehang commented Aug 7, 2024

Rationale for this change

server-side encryption with customer-provided keys is an important security feature for aws s3, it's useful when user want to manager the encryption key themselves, say, they don't want the data to be exposed to the aws system admin, and ensure the object is safe even the ACCESS_KEY and SECRET_KEY is somehow leaked.
Some comparison of S3 encryption options :
https://www.linkedin.com/pulse/delusion-s3-encryption-benefits-ravi-ivaturi/

What changes are included in this PR?

  1. Add the sse_customer_key member for S3Options to support server-side encryption with customer-provided keys (SSE-C keys).

    • The sse_customer_key was expected to be 256 bits (32 bytes) according to aws doc
    • The sse_customer_key was expected to be the raw key rather than base64 encoded value, arrow would calculate the base64 and MD5 on the fly.
    • By default the sse_customer_key is empty, and when the sse_customer_key is empty, there is no impact on the existing workflow. When the sse_customer_key is configured, it would require the aws sdk version to newer than 1.9.201.
  2. Add the tls_ca_file_path, tls_ca_dir_path and tls_verify_certificates members for S3Options.

    • the tls_ca_file_path, tls_ca_dir_path member for S3Options would override the value configured by arrow::fs::FileSystemGlobalOptions.
    • for s3, according to aws sdk doc, the tls_ca_file_path and tls_ca_dir_path only take effect in Linux, in order to support connect to the the storage server like minio with self-signed certificates on non-linux platform, we expose the tls_verify_certificates.
  3. Refine the unit test to start the minio server with self-signed certificate on linux platform, so the unit test could cover the https case on linux, and http case on non-linux platform.

Are these changes tested?

Yes

Are there any user-facing changes?

yes, the new member sse_customer_key tls_ca_file_path, tls_ca_dir_path and tls_verify_certificates added for S3Options

Copy link

github-actions bot commented Aug 7, 2024

⚠️ GitHub issue #43535 has been automatically assigned in GitHub to PR creator.

@ripplehang
Copy link
Author

@pitrou @mapleFU @felipecrv can you help to review the PR, Thanks

@pitrou
Copy link
Member

pitrou commented Aug 22, 2024

I took a quick look and here are some assorted thoughts:

  • there are no unit tests
  • only the encryption key needs to be on S3Options, not the MD5 and algorithm string (which can be computed on the fly)

@pitrou
Copy link
Member

pitrou commented Aug 22, 2024

Also, for the other readers wondering, I've found a useful (though possibly outdated?) comparison of S3 encryption options here:
https://www.linkedin.com/pulse/delusion-s3-encryption-benefits-ravi-ivaturi/

@ripplehang
Copy link
Author

I took a quick look and here are some assorted thoughts:

@pitrou Thanks for your review, for #1, i would try to add ut for the change, for #2, yes, actually ONLY the Key is needed, that's why i only expose one SetSSECKey function, and three Get function, and private the new added datamember.
do you suggest to only add one datamember for S3Options? but if there is only one memeber, we may needs to calculate everywhere?

@ripplehang
Copy link
Author

@pitrou I've submitted another commit to only expose the ssec-key in S3Options, and then calculate the MD5 every time on the fly, can you help to review whether there are other comments? I would try to add the ut if there is no interface change required, Thanks

@ripplehang ripplehang force-pushed the support_ssec_for_aws_s3 branch 3 times, most recently from bb2e024 to 539fa7c Compare August 27, 2024 06:00
@ripplehang
Copy link
Author

@pitrou Can you help to review the PR again?Thanks

@ripplehang
Copy link
Author

@kou @mapleFU can you help to review the change, Thanks

Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take a careful around this weekend

/// \param expect_input_key_size, default 32
/// \return true if the decode and calculate MD5 success, otherwise return false
ARROW_EXPORT
bool CalculateSSECKeyMD5(const std::string& base64_encoded_key, std::string& md5_result,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind change to Result<std::string>?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I would try to refine the return value for the function.

Comment on lines 278 to 280
// Decode the Base64-encoded key to get the raw binary key
Aws::Utils::ByteBuffer rawKey =
Aws::Utils::HashingUtils::Base64Decode(base64_encoded_key);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how do this handles error if base64_encoded_key is not valid?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From https://github.com/aws/aws-sdk-cpp/blob/ac2da09e6930e3988d1289717e2df5d4b7408f17/src/aws-cpp-sdk-core/source/utils/base64/Base64.cpp#L91-L121, seems this aws util API didn't validate the input properly,but directly calculate the output size and allocate the buffer, so i add the check to ensure every character is the valid character here.
Meanwhile, I also add the related Unit test for the CalculateSSECKeyMD5 to test the different input value.


// the key needs to be // 256 bits(32 bytes)according to
// https://docs.aws.amazon.com/AmazonS3/latest/userguide/ServerSideEncryptionCustomerKeys.html#specifying-s3-c-encryption
if (rawKey.GetLength() != expect_input_key_size) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When would this not equal to 256?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the expect_input_key_size is exposed via S3Options, so it's possible for this value to be set as any length.

@github-actions github-actions bot added awaiting committer review Awaiting committer review awaiting review Awaiting review and removed awaiting review Awaiting review awaiting committer review Awaiting committer review labels Aug 30, 2024
@ripplehang ripplehang force-pushed the support_ssec_for_aws_s3 branch 3 times, most recently from de05328 to 8649388 Compare August 30, 2024 12:24
@kou kou changed the title GH-43535:[C++] support the AWS S3 SSE-C encryption GH-43535: [C++] support the AWS S3 SSE-C encryption Aug 31, 2024
Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you fix style by pre-commit run --show-diff-on-failure --color=always --all?

Comment on lines 58 to 61
ASSERT_FALSE(CalculateSSECKeyMD5("").ok()); // invalid base64
ASSERT_FALSE(CalculateSSECKeyMD5("%^H").ok()); // invalid base64
ASSERT_FALSE(CalculateSSECKeyMD5("INVALID").ok()); // invalid base64
ASSERT_FALSE(CalculateSSECKeyMD5("MTIzNDU2Nzg5").ok()); // invalid, the input key size not match
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use ASSERT_RAISES_WITH_MESSAGES() instead?

cpp/src/arrow/filesystem/filesystem_test.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/filesystem_test.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/s3fs.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/s3fs.cc Outdated Show resolved Hide resolved
@@ -443,6 +443,7 @@ class TestS3FS : public S3TestMixin {
// Most tests will create buckets
options_.allow_bucket_creation = true;
options_.allow_bucket_deletion = true;
options_.sse_customer_key = "1WH9aTJ0+Tn0NLbTMHZn9aCW3Li3ViAdBsoIldPCREw=";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tend to cover the workflow with sse_customer_key nonempty.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we always set sse_custom_key, we can't test no sse_custom_key case.
Can we add a test that uses sse_custom_key instead?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, and ideally the SSE-C encryption test would write a file with the encryption key, and check that trying to read it with another key (or with no key at all) fails.

cpp/src/arrow/filesystem/util_internal.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/util_internal.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/util_internal.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/util_internal.cc Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting changes Awaiting changes awaiting review Awaiting review and removed awaiting review Awaiting review labels Aug 31, 2024
@ripplehang
Copy link
Author

@kou @pitrou Can you help to review the PR again, Thanks a lot.

Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

cpp/src/arrow/filesystem/s3_internal.h Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/s3_internal.h Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/s3_internal.h Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/s3_internal.h Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/s3_test_util.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/s3fs.h Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/s3fs_test.cc Outdated Show resolved Hide resolved
@kou kou changed the title GH-43535: [C++] support the AWS S3 SSE-C encryption GH-43535: [C++] Support the AWS S3 SSE-C encryption Oct 16, 2024
@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Oct 16, 2024
@kou
Copy link
Member

kou commented Oct 16, 2024

Could you update the PR description to reflect the current implementation?

@ripplehang
Copy link
Author

ripplehang commented Oct 17, 2024

@kou do you think we can merge the PR now. Suppose those requested changes from @pitrou should also be resolved.

@kou
Copy link
Member

kou commented Oct 17, 2024

I'll wait for re-review from pitrou in this week.
I'll merge this in the next week if nobody reviews this.

class TestS3FSHTTP : public TestS3FS {
public:
void SetUp() override {
enable_tls_if_supported_ = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this necessary? This looks like a weird workaround, can there at least be an explanation in comments?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've temporarily disabled this workaround here: a69961b

We'll see how CI goes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems it can occur sporadically even in non-stress tests:
https://github.com/ursacomputing/crossbow/actions/runs/11388191999/job/31684559075#step:7:3791

[ RUN      ] TestS3FS.OpenOutputStreamMetadata
/arrow/cpp/src/arrow/filesystem/s3fs_test.cc:593: Failure
Failed
'output->Close()' failed with IOError: When completing multiple part upload for key 'mdfile1' in bucket 'bucket': AWS Error NO_SUCH_UPLOAD during CompleteMultipartUpload operation: The specified multipart upload does not exist. The upload ID may be invalid, or the upload may have been aborted or completed.
/arrow/cpp/src/arrow/filesystem/s3fs.cc:1830  CleanupIfFailed(FinishPartUploadAfterFlush())
[  FAILED  ] TestS3FS.OpenOutputStreamMetadata (215 ms)

@kou Do you know if we were seeing this error before switching to HTTPS for Minio-based tests?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this may have to do with forcing "localhost" for TLS certificate check purposes: in some cases, a port number might be reused?

What we should perhaps do is:

  1. switch all tests to HTTP except the SSE customer key tests
  2. use "localhost" only for HTTPS tests, otherwise use GetListenAddress() without a host argument.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pitrou Fine, i've done the related change in the latest commit. please help to review again.

@pitrou
Copy link
Member

pitrou commented Oct 17, 2024

@github-actions crossbow submit -g cpp

Copy link

Revision: a69961b

Submitted crossbow builds: ursacomputing/crossbow @ actions-cc413ca29c

Task Status
example-cpp-minimal-build-static GitHub Actions
example-cpp-minimal-build-static-system-dependency GitHub Actions
example-cpp-tutorial GitHub Actions
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-cuda-cpp-ubuntu-20.04-cuda-11.2.2 GitHub Actions
test-cuda-cpp-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-gcc-13-bundled GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions
test-ubuntu-24.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-24.04-cpp-thread-sanitizer GitHub Actions

@ripplehang
Copy link
Author

@pitrou is it possible to help review the change again, Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants