
Allow passing a size-hint to s3.download_fileobj #3083

Open
ronkorving opened this issue Dec 2, 2021 · 7 comments
Labels
feature-request This issue requests a feature. p3 This is a minor priority issue s3

Comments

@ronkorving

ronkorving commented Dec 2, 2021

Is your feature request related to a problem? Please describe.

When calling download_fileobj, transfer_future.meta.size is (unsurprisingly) None, which triggers a head_object request whose only purpose is to fetch the size of the object. You can see this here: https://github.com/boto/s3transfer/blob/ccb71ddd89149a4bc5a45a2fcd5e42aafba3f0ea/s3transfer/download.py#L339-L348

In cases where you're calling download_fileobj on many files following a list operation, which already provides object sizes in its response, this seems rather wasteful and measurably hurts overall latency. This is especially visible when dealing with many small objects.

I only looked into download_fileobj, but I can imagine the same applies to download_file, and possibly other scenarios. I understand that me using unmanaged downloads would avoid this problem altogether, but I'm not really asking for workarounds.

Describe the solution you'd like

If download_fileobj accepted a "size hint", either via TransferConfig or, perhaps better, via ExtraArgs, that we could supply from a preceding list operation, that head request could be avoided, latency would drop, and cost (Lambda execution time and S3 requests) would decrease.

@ronkorving ronkorving added feature-request This issue requests a feature. needs-triage This issue or PR still needs to be triaged. labels Dec 2, 2021
@stobrien89 stobrien89 added s3 and removed needs-triage This issue or PR still needs to be triaged. labels Dec 2, 2021
@stobrien89
Contributor

Hi @ronkorving,

Thanks for the feature request! I think this is reasonable. We'll leave this open to track interest for the time being, so if anyone is interested, please leave a reaction on the original post.

@panthony

panthony commented Dec 8, 2021

I had the same problem and I worked around it like this:

from boto3.s3.transfer import S3Transfer, BaseSubscriber, S3TransferRetriesExceededError, RetriesExceededError

def download_file(transfer: S3Transfer, bucket_name: str, key_name: str, key_size: int, download_path: str):
    """
    This is a workaround to provide the key size to the transfer routine to avoid a HEAD
    request for every file we download.

    See https://github.com/boto/boto3/issues/3083

    This is an override of the method 'download_file' from S3Transfer.
    """
    class ProvideKeySize(BaseSubscriber):
        def on_queued(self, future, **kwargs):
            future.meta.provide_transfer_size(key_size)

    future = transfer._manager.download(bucket_name, key_name, download_path, None, [ProvideKeySize()])
    try:
        future.result()
    # This is for backwards compatibility where when retries are
    # exceeded we need to throw the same error from boto3 instead of
    # s3transfer's built in RetriesExceededError as current users are
    # catching the boto3 one instead of the s3transfer exception to do
    # their own retries.
    except S3TransferRetriesExceededError as e:
        raise RetriesExceededError(e.last_exception)

I'm re-defining download_file and reaching into the private attribute _manager, but this is better than performing millions of unnecessary HEAD requests.

I believe accepting subscribers as a parameter to download_file could solve this issue (and the pattern could be documented as an example) without modifying the code too much.

@ronkorving
Author

@panthony I commend you for having found this workaround :)

I hope AWS sees it as another confirmation that there's a real need to be addressed here.

@stobrien89
Contributor

Hi all,

I reviewed this with the team today and they agreed that this seems reasonable and is likely something we'll implement if this gets significant traction.

@aBurmeseDev aBurmeseDev added the p3 This is a minor priority issue label Nov 4, 2022
@pencil

pencil commented May 9, 2024

I just ran into this as well, downloading a long list of files from S3. I was surprised to find that there always was a HEAD request before the GET request in our service logs. I guess this explains it?

I'm not sure why the extra HEAD request is necessary at all, since the GET response already returns the object's size in its Content-Length header, which can be read before streaming the body.

@daveisfera

Is there a way to download using just a GET, so there's no HEAD call beforehand? The extra request makes the download take longer and increases the number of S3 operations.

@daveisfera

Dug a bit more and noticed that this behavior could be removed (and I'd argue should become the default)
