Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ccmlib/common: improve aws_bucket_ls #473

Merged
merged 1 commit into from
Jul 9, 2023

Conversation

fruch
Copy link
Contributor

@fruch fruch commented Jul 5, 2023

  • Support disabling public access for some buckets that we don't have public access yet

  • Automaticlly sort by date the returned file list, since the users of this function only get the key names, and loses the modify date and parsing the date out of the name isn't trivial

ccmlib/common.py Outdated
bucket_object = s3_url.replace('https://s3.amazonaws.com/', '').split('/')
prefix = '/'.join(bucket_object[1:])
s3_conn = Session().client(service_name='s3', config=Config(signature_version=UNSIGNED))
config = Config(signature_version=UNSIGNED) if public else None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the related document of boto3 does not explain what a None config implies. see https://boto3.amazonaws.com/v1/documentation/api/latest/reference/core/session.html#boto3.session.Session.client . so i headed over to the source code of boto3, which led me to botocore. see https://github.com/boto/botocore/blob/db0aed3654ec7a5a1c64936d17a724ba41b2e926/botocore/session.py#L902 and https://github.com/boto/botocore/blob/db0aed3654ec7a5a1c64936d17a724ba41b2e926/botocore/session.py#L917-L922 . where i found that default_client_config is used if config is not specified or None. but set_default_client_config() is never called in this case. as Session() creates an instance of Session whose _client_config is None. see https://github.com/boto/botocore/blob/db0aed3654ec7a5a1c64936d17a724ba41b2e926/botocore/session.py#L151 .

so i am confused. what do you mean by

Support disabling public access for some buckets that
we don't have public access yet

if we don't have the public access to a certain bucket. shall we just give up or get authorized for accessing it?

i went ahead reading https://github.com/boto/botocore/blob/develop/botocore/session.py#L939-L957 . which led me to the credential resolver at https://github.com/boto/botocore/blob/develop/botocore/credentials.py#L65. so i guess we rely on the environmental variables of AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY set by scylla-pkg, am i right?

if that's true. could you please add a comment to explain the dependency of these assumption? because, to developers like me who is not familiar with boto3, it would be difficult to reason the connection between a None config and the environmental variables.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Config(signature_version=UNSIGNED) means you'll use the API without using credentials at all
https://dev.to/aws-builders/access-s3-public-data-without-credentials-4f06

passing None, should fallback to defaults, as if you didn't passed that parameter, and after test, it's behaved as I would expect, I didn't read all the boto3/botocore to verify it... I've tested it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed it completely it's not really was needed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, i am not questioning your change. i am just trying to find the official document to prove it.

ccmlib/common.py Outdated
files_in_bucket.append(obj['Key'].replace(prefix + "/", ''))
files_in_bucket.extend(page['Contents'])

files_in_bucket = [f['Key'].replace(prefix + "/", '') for f in sorted(files_in_bucket, key=get_last_modified)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'd recommend to avoid assigning different type of data to the same variable. after this change, files_in_bucket is used in two places:

  1. to hold the objects listed with given prefix: it is a list of dict
  2. to hold their keys after being normalized: it is a list of str

this would add the cognitive load of human readers. i'd suggest use something like

for page in pages:
    for obj in page.get('Contents', []):
        files_in_bucket.append(obj['Key'].replace(prefix + "/", ''))
return sorted(files_in_bucket, key=get_last_modified)

[offtopic], i am wondering if we can restructure the code using Collections. like

bucket, prefix = s3_url.replace('https://s3.amazonaws.com/', '').split('/', 1)
s3 = boto3.resource('s3')
objects_in_bucket = (obj.key.replace(prefix + "/", '') for s3.Bucket(bucket).objects.all())
return sorted(objects_in_bucket, key=lambda obj: obj.last_modified)

where we can have some improvements:

  1. construct the prefix when parsing the s3_url
  2. reuse the builtin support of resource management
  3. do not paginate the list_object_v2 manually
  4. use iterator instead of materializing the list when possible
  5. access the attribute of enumerated object instead of accessing it like a dict

see https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#list-objects-in-an-amazon-s3-bucket and https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/object/last_modified.html#last-modified

Automaticlly sort by date the returned file list, since the users
of this function only get the key names, and loses the modify date
and parsing the date out of the name isn't trivial
s3_resource = Session().resource(service_name='s3', config=Config(signature_version=UNSIGNED))
bucket = s3_resource.Bucket(bucket_object[0])

files_in_bucket = bucket.objects.filter(Prefix=prefix)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh, i missed the prefix part in my proposed change.

Copy link
Contributor

@tchaikov tchaikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, i will create a follow-up PR to implement the change of

bucket, prefix = s3_url.replace('https://s3.amazonaws.com/', '').split('/', 1)

@fruch fruch merged commit ad16596 into scylladb:next Jul 9, 2023
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants