
Allow passing a size-hint to s3.download_fileobj #3083

Open
ronkorving opened this issue Dec 2, 2021 · 7 comments
Labels
feature-request This issue requests a feature. p3 This is a minor priority issue s3

Comments

@ronkorving

ronkorving commented Dec 2, 2021

Is your feature request related to a problem? Please describe.

When calling download_fileobj, transfer_future.meta.size is (unsurprisingly) None, which triggers a head_object request whose only purpose is to fetch the size of the object. You can see this here: https://github.com/boto/s3transfer/blob/ccb71ddd89149a4bc5a45a2fcd5e42aafba3f0ea/s3transfer/download.py#L339-L348

In cases where you're calling download_fileobj on many files following a list operation, which already provides object sizes in its response, this seems rather wasteful and measurably hurts overall latency. This is especially visible when dealing with many small objects.

I only looked into download_fileobj, but I can imagine the same applies to download_file, and possibly other scenarios. I understand that me using unmanaged downloads would avoid this problem altogether, but I'm not really asking for workarounds.

Describe the solution you'd like

If download_fileobj accepted a "size hint", either via TransferConfig or, perhaps better, via ExtraArgs, that we could supply from a preceding list operation, that head request could be avoided, latency would drop, and cost (Lambda execution time and S3 requests) would decrease.

@ronkorving ronkorving added feature-request This issue requests a feature. needs-triage This issue or PR still needs to be triaged. labels Dec 2, 2021
@stobrien89 stobrien89 added s3 and removed needs-triage This issue or PR still needs to be triaged. labels Dec 2, 2021
@stobrien89
Contributor

Hi @ronkorving,

Thanks for the feature request! I think this is reasonable. We'll leave this open to track interest for the time being, so if anyone is interested, please leave a reaction on the original post.

@panthony

panthony commented Dec 8, 2021

I had the same problem and I worked around it like this:

from boto3.s3.transfer import S3Transfer, BaseSubscriber, S3TransferRetriesExceededError, RetriesExceededError

def download_file(transfer: S3Transfer, bucket_name: str, key_name: str, key_size: int, download_path: str):
    """
    This is a workaround to provide the key size to the transfer routine to avoid a HEAD
    request for every file we download.

    See https://github.com/boto/boto3/issues/3083

    This is an override of the method 'download_file' from S3Transfer.
    """
    class ProvideKeySize(BaseSubscriber):
        def on_queued(self, future, **kwargs):
            future.meta.provide_transfer_size(key_size)

    future = transfer._manager.download(bucket_name, key_name, download_path, None, [ProvideKeySize()])
    try:
        future.result()
    # This is for backwards compatibility where when retries are
    # exceeded we need to throw the same error from boto3 instead of
    # s3transfer's built in RetriesExceededError as current users are
    # catching the boto3 one instead of the s3transfer exception to do
    # their own retries.
    except S3TransferRetriesExceededError as e:
        raise RetriesExceededError(e.last_exception)

I'm re-defining download_file and reaching into the private attribute _manager, but this is better than performing millions of unnecessary HEAD requests.

I believe accepting subscribers as a parameter to download_file could solve this issue (and the pattern could be documented as an example) without modifying the code too much.

@ronkorving
Author

@panthony I commend you for having found this workaround :)

I hope AWS sees it as another confirmation that there's a real need to be addressed here.

@stobrien89
Contributor

Hi all,

I reviewed this with the team today and they agreed that this seems reasonable and is likely something we'll implement if this gets significant traction.

@aBurmeseDev aBurmeseDev added the p3 This is a minor priority issue label Nov 4, 2022
@pencil

pencil commented May 9, 2024

I just ran into this as well, downloading a long list of files from S3. I was surprised to find that there always was a HEAD request before the GET request in our service logs. I guess this explains it?

I'm not sure why the extra HEAD request is necessary at all, since the GET response already returns the object's size in its Content-Length header, which can be read before streaming the body.

@daveisfera

Is there a way to download using just a GET, so there's no HEAD call beforehand? The extra request makes the download take longer and increases the number of S3 operations.

@daveisfera

Dug a bit more and noticed that this behavior could be removed (and I'd argue should become the default)
