ccmlib/common: improve `aws_bucket_ls` #473

fruch · 2023-07-05T22:03:14Z

Support disabling public access for some buckets that we don't have public access yet
Automaticlly sort by date the returned file list, since the users of this function only get the key names, and loses the modify date and parsing the date out of the name isn't trivial

tchaikov · 2023-07-06T03:15:20Z

ccmlib/common.py

    bucket_object = s3_url.replace('https://s3.amazonaws.com/', '').split('/')
    prefix = '/'.join(bucket_object[1:])
-    s3_conn = Session().client(service_name='s3', config=Config(signature_version=UNSIGNED))
+    config = Config(signature_version=UNSIGNED) if public else None


the related document of boto3 does not explain what a None config implies. see https://boto3.amazonaws.com/v1/documentation/api/latest/reference/core/session.html#boto3.session.Session.client . so i headed over to the source code of boto3, which led me to botocore. see https://github.com/boto/botocore/blob/db0aed3654ec7a5a1c64936d17a724ba41b2e926/botocore/session.py#L902 and https://github.com/boto/botocore/blob/db0aed3654ec7a5a1c64936d17a724ba41b2e926/botocore/session.py#L917-L922 . where i found that default_client_config is used if config is not specified or None. but set_default_client_config() is never called in this case. as Session() creates an instance of Session whose _client_config is None. see https://github.com/boto/botocore/blob/db0aed3654ec7a5a1c64936d17a724ba41b2e926/botocore/session.py#L151 .

so i am confused. what do you mean by

Support disabling public access for some buckets that
we don't have public access yet

if we don't have the public access to a certain bucket. shall we just give up or get authorized for accessing it?

i went ahead reading https://github.com/boto/botocore/blob/develop/botocore/session.py#L939-L957 . which led me to the credential resolver at https://github.com/boto/botocore/blob/develop/botocore/credentials.py#L65. so i guess we rely on the environmental variables of AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY set by scylla-pkg, am i right?

if that's true. could you please add a comment to explain the dependency of these assumption? because, to developers like me who is not familiar with boto3, it would be difficult to reason the connection between a None config and the environmental variables.

Config(signature_version=UNSIGNED) means you'll use the API without using credentials at all
https://dev.to/aws-builders/access-s3-public-data-without-credentials-4f06

passing None, should fallback to defaults, as if you didn't passed that parameter, and after test, it's behaved as I would expect, I didn't read all the boto3/botocore to verify it... I've tested it.

I've removed it completely it's not really was needed

no, i am not questioning your change. i am just trying to find the official document to prove it.

tchaikov · 2023-07-06T04:08:44Z

ccmlib/common.py

-            files_in_bucket.append(obj['Key'].replace(prefix + "/", ''))
+        files_in_bucket.extend(page['Contents'])
+
+    files_in_bucket = [f['Key'].replace(prefix + "/", '') for f in sorted(files_in_bucket, key=get_last_modified)]


i'd recommend to avoid assigning different type of data to the same variable. after this change, files_in_bucket is used in two places:

to hold the objects listed with given prefix: it is a list of dict

to hold their keys after being normalized: it is a list of str

this would add the cognitive load of human readers. i'd suggest use something like

for page in pages: for obj in page.get('Contents', []): files_in_bucket.append(obj['Key'].replace(prefix + "/", '')) return sorted(files_in_bucket, key=get_last_modified)

[offtopic], i am wondering if we can restructure the code using Collections. like

bucket, prefix = s3_url.replace('https://s3.amazonaws.com/', '').split('/', 1) s3 = boto3.resource('s3') objects_in_bucket = (obj.key.replace(prefix + "/", '') for s3.Bucket(bucket).objects.all()) return sorted(objects_in_bucket, key=lambda obj: obj.last_modified)

where we can have some improvements:

construct the prefix when parsing the s3_url

reuse the builtin support of resource management

do not paginate the list_object_v2 manually

use iterator instead of materializing the list when possible

access the attribute of enumerated object instead of accessing it like a dict

see https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#list-objects-in-an-amazon-s3-bucket and https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/object/last_modified.html#last-modified

Automaticlly sort by date the returned file list, since the users of this function only get the key names, and loses the modify date and parsing the date out of the name isn't trivial

tchaikov · 2023-07-06T10:12:22Z

ccmlib/common.py

+    s3_resource = Session().resource(service_name='s3', config=Config(signature_version=UNSIGNED))
+    bucket = s3_resource.Bucket(bucket_object[0])
+
+    files_in_bucket = bucket.objects.filter(Prefix=prefix)


ahh, i missed the prefix part in my proposed change.

tchaikov

nice, i will create a follow-up PR to implement the change of

bucket, prefix = s3_url.replace('https://s3.amazonaws.com/', '').split('/', 1)

fruch added the no-backport-needed label Jul 5, 2023

fruch requested review from tchaikov and juliayakovlev July 5, 2023 22:03

tchaikov reviewed Jul 6, 2023

View reviewed changes

fruch force-pushed the improve_aws_bucket_ls branch from 55a180b to 1a2d634 Compare July 6, 2023 08:44

ccmlib/common: improve aws_bucket_ls

32856e2

Automaticlly sort by date the returned file list, since the users of this function only get the key names, and loses the modify date and parsing the date out of the name isn't trivial

fruch force-pushed the improve_aws_bucket_ls branch from 1a2d634 to 32856e2 Compare July 6, 2023 09:51

tchaikov reviewed Jul 6, 2023

View reviewed changes

tchaikov approved these changes Jul 6, 2023

View reviewed changes

fruch merged commit ad16596 into scylladb:next Jul 9, 2023
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ccmlib/common: improve `aws_bucket_ls` #473

ccmlib/common: improve `aws_bucket_ls` #473

fruch commented Jul 5, 2023

tchaikov Jul 6, 2023

fruch Jul 6, 2023

fruch Jul 6, 2023

tchaikov Jul 6, 2023

tchaikov Jul 6, 2023

tchaikov Jul 6, 2023

tchaikov left a comment

ccmlib/common: improve aws_bucket_ls #473

ccmlib/common: improve aws_bucket_ls #473

Conversation

fruch commented Jul 5, 2023

tchaikov Jul 6, 2023

Choose a reason for hiding this comment

fruch Jul 6, 2023

Choose a reason for hiding this comment

fruch Jul 6, 2023

Choose a reason for hiding this comment

tchaikov Jul 6, 2023

Choose a reason for hiding this comment

tchaikov Jul 6, 2023

Choose a reason for hiding this comment

tchaikov Jul 6, 2023

Choose a reason for hiding this comment

tchaikov left a comment

Choose a reason for hiding this comment

ccmlib/common: improve `aws_bucket_ls` #473

ccmlib/common: improve `aws_bucket_ls` #473