ccmlib/common: improve aws_bucket_ls
#473
Conversation
ccmlib/common.py (outdated diff)

```python
bucket_object = s3_url.replace('https://s3.amazonaws.com/', '').split('/')
prefix = '/'.join(bucket_object[1:])
s3_conn = Session().client(service_name='s3', config=Config(signature_version=UNSIGNED))
config = Config(signature_version=UNSIGNED) if public else None
```
The boto3 documentation does not explain what a `None` config implies; see https://boto3.amazonaws.com/v1/documentation/api/latest/reference/core/session.html#boto3.session.Session.client . So I headed over to the source code of boto3, which led me to botocore; see https://github.com/boto/botocore/blob/db0aed3654ec7a5a1c64936d17a724ba41b2e926/botocore/session.py#L902 and https://github.com/boto/botocore/blob/db0aed3654ec7a5a1c64936d17a724ba41b2e926/botocore/session.py#L917-L922 , where I found that `default_client_config` is used if `config` is not specified or is `None`. But `set_default_client_config()` is never called in this case, since `Session()` creates an instance of `Session` whose `_client_config` is `None`; see https://github.com/boto/botocore/blob/db0aed3654ec7a5a1c64936d17a724ba41b2e926/botocore/session.py#L151 .
So I am confused. What do you mean by

> Support disabling `public` access for some buckets that we don't have public access yet

If we don't have public access to a certain bucket, shall we just give up, or get authorized to access it?
I went ahead and read https://github.com/boto/botocore/blob/develop/botocore/session.py#L939-L957 , which led me to the credential resolver at https://github.com/boto/botocore/blob/develop/botocore/credentials.py#L65 . So I guess we rely on the environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` set by scylla-pkg, am I right?

If that's true, could you please add a comment explaining this assumption? To developers like me who are not familiar with boto3, it is difficult to reason about the connection between a `None` config and the environment variables.
`Config(signature_version=UNSIGNED)` means you'll use the API without using credentials at all:
https://dev.to/aws-builders/access-s3-public-data-without-credentials-4f06

Passing `None` should fall back to the defaults, as if you hadn't passed that parameter at all. After testing, it behaved as I would expect. I didn't read all of boto3/botocore to verify it... I've tested it.
I've removed it completely; it wasn't really needed.
No, I am not questioning your change. I am just trying to find the official documentation that confirms it.
ccmlib/common.py (outdated diff)

```python
files_in_bucket.append(obj['Key'].replace(prefix + "/", ''))
files_in_bucket.extend(page['Contents'])

files_in_bucket = [f['Key'].replace(prefix + "/", '') for f in sorted(files_in_bucket, key=get_last_modified)]
```
I'd recommend avoiding assigning different types of data to the same variable. After this change, `files_in_bucket` is used in two ways:

- to hold the objects listed with the given prefix: a list of dicts
- to hold their keys after normalization: a list of strings

This adds to the cognitive load for human readers. I'd suggest something like:

```python
objects_in_bucket = []
for page in pages:
    objects_in_bucket.extend(page.get('Contents', []))
# sort before stripping the prefix, while the objects still carry LastModified
return [obj['Key'].replace(prefix + "/", '')
        for obj in sorted(objects_in_bucket, key=get_last_modified)]
```
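To make the suggested shape concrete, here is a self-contained run of the same pattern. The `pages` literal and `get_last_modified` are hand-written stand-ins mirroring the `list_objects_v2` paginator's response shape (`'Contents'` → `'Key'`, `'LastModified'`), since the real ones are not shown in this diff:

```python
# Stand-in for the pages a list_objects_v2 paginator would yield.
pages = [
    {'Contents': [
        {'Key': 'relocatable/b.tar.gz', 'LastModified': 2},
        {'Key': 'relocatable/a.tar.gz', 'LastModified': 1},
    ]},
    {},  # a page without 'Contents' is handled by .get('Contents', [])
]
prefix = 'relocatable'

def get_last_modified(obj):
    # stand-in for the PR's sort key: order objects by modification time
    return obj['LastModified']

# Collect the raw object dicts first, sort them while LastModified is still
# available, and only then normalize the keys into a separate variable.
objects_in_bucket = []
for page in pages:
    objects_in_bucket.extend(page.get('Contents', []))
files_in_bucket = [obj['Key'].replace(prefix + "/", '')
                   for obj in sorted(objects_in_bucket, key=get_last_modified)]
print(files_in_bucket)  # oldest first: ['a.tar.gz', 'b.tar.gz']
```

Keeping the dicts and the normalized keys in differently named variables is the whole point: each name has exactly one type for its entire lifetime.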
[offtopic] I am wondering if we can restructure the code using resource collections, like:

```python
bucket, prefix = s3_url.replace('https://s3.amazonaws.com/', '').split('/', 1)
s3 = boto3.resource('s3')
objects_in_bucket = sorted(s3.Bucket(bucket).objects.all(), key=lambda obj: obj.last_modified)
return [obj.key.replace(prefix + "/", '') for obj in objects_in_bucket]
```

where we get some improvements:

- construct the prefix while parsing the `s3_url`
- reuse the built-in resource management
- no manual pagination of `list_objects_v2`
- use an iterator instead of materializing the list where possible
- access attributes of the enumerated objects instead of dict lookups

see https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#list-objects-in-an-amazon-s3-bucket and https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/object/last_modified.html#last-modified
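A runnable sketch of that restructuring as a whole function, under the assumptions in this thread (anonymous access via `UNSIGNED`, and a server-side `Prefix` filter). The `parse_s3_url` helper name is mine, not from the PR:

```python
def parse_s3_url(s3_url):
    # Hypothetical helper: maxsplit=1 keeps slashes inside the prefix intact,
    # so bucket and prefix come out of one parse.
    bucket, prefix = s3_url.replace('https://s3.amazonaws.com/', '').split('/', 1)
    return bucket, prefix

def aws_bucket_ls(s3_url):
    # boto3 imported lazily here so the parsing helper above stays usable
    # without the dependency installed.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    bucket_name, prefix = parse_s3_url(s3_url)
    s3 = boto3.resource('s3', config=Config(signature_version=UNSIGNED))
    # .filter(Prefix=...) pushes the prefix match to S3, and the collection
    # paginates list_objects_v2 under the hood.
    objects = s3.Bucket(bucket_name).objects.filter(Prefix=prefix)
    # Sort by modification time first, then strip the prefix from the keys.
    return [obj.key.replace(prefix + "/", '')
            for obj in sorted(objects, key=lambda obj: obj.last_modified)]
```

Note that `sorted()` still materializes the collection; the iterator-style win applies only up to the point where a total order is needed.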
force-pushed from 55a180b to 1a2d634: "Automatically sort the returned file list by date, since the users of this function only get the key names and lose the modification date, and parsing the date out of the name isn't trivial"

force-pushed from 1a2d634 to 32856e2
ccmlib/common.py

```python
s3_resource = Session().resource(service_name='s3', config=Config(signature_version=UNSIGNED))
bucket = s3_resource.Bucket(bucket_object[0])

files_in_bucket = bucket.objects.filter(Prefix=prefix)
```
Ahh, I missed the prefix part in my proposed change.
Nice, I will create a follow-up PR to implement this change:

```python
bucket, prefix = s3_url.replace('https://s3.amazonaws.com/', '').split('/', 1)
```
Support disabling `public` access for some buckets that we don't have public access to yet.

Automatically sort the returned file list by date, since the users of this function only get the key names and lose the modification date, and parsing the date out of the name isn't trivial.