Eliminate HEAD requests during downloads, especially for faster transfers of small files#363
Eliminate HEAD requests during downloads, especially for faster transfers of small files#363crowecawcaw wants to merge 5 commits intoboto:developfrom
Conversation
aemous
left a comment
There was a problem hiding this comment.
I left 1 nit. Note that I am not the primary reviewer for this PR, and we are still pending the primary review.
|
Looks like downloads may be failing when the object has a size of 0 bytes. The error I'm seeing is Can you confirm whether you are able to reproduce this, and look into resolving it? InvalidRange occurs when the range we are requesting has no overlap with the object itself. For 0 byte objects, this will always be the case if we are requesting the first 0-x bytes, since there are no bytes at all. The only feasible patch I can imagine here would be a try-except clause that handles InvalidRange by performing a non-ranged GET, which may be undesirable. If you have a better alternative for resolving this that doesn't involve needing an extra request, that would be preferred. |
|
I see the same issue. I added an integ test, fixed the unit test to mirror the same behavior, then fixed the issue. Instead of sending a second Get request for an object we know is empty, it just returns the empty content. |
|
@crowecawcaw Can you push a revision that satisfies the 'Lint code' GH Action? In case you cannot see the Action details, I pasted it here: |
Signed-off-by: Stephen Crowe <6042774+crowecawcaw@users.noreply.github.com>
Signed-off-by: Stephen Crowe <6042774+crowecawcaw@users.noreply.github.com>
b1fe8b8 to
5635ee3
Compare
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## develop #363 +/- ##
===========================================
+ Coverage 80.41% 81.19% +0.77%
===========================================
Files 16 16
Lines 3013 3106 +93
===========================================
+ Hits 2423 2522 +99
+ Misses 590 584 -6 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Address reviewer feedback that the HEAD request is needed by default to enable full-object checksum validation on downloads. Only skip the HEAD when the caller has opted out via botocore's `response_checksum_validation = when_required` config. The prior HEAD + size-based GET logic is restored as the default path. Also: - Add TestDownloadWhenRequiredChecksumValidation covering the HEAD-less path for non-empty and 0-byte objects (InvalidRange handling) - Fix ruff import ordering / formatting in time-batch-download.py flagged by the Lint CI action
aemous
left a comment
There was a problem hiding this comment.
Loogs good overall. Left a few questions/comments.
| @@ -0,0 +1,5 @@ | |||
| { | |||
| "type": "feature", | |||
There was a problem hiding this comment.
Can we regenerate the changelog entry to use the enhancement type? Feature results in a minor version bump, and we try to restrict it to extremely large changes.
Also, I recommend s3 as a category instead.
And can we update the description to mention S3 up-front? e.g. Skip HEAD requests during S3 downloads when ...
|
|
||
| def __call__(self): | ||
| # Always check if we have a task and response first | ||
| if not self._task: |
There was a problem hiding this comment.
What cases do we expect the task to be None? Or is this check just for defensive programming purposes?
| @@ -0,0 +1,62 @@ | |||
| #!/usr/bin/env python | |||
There was a problem hiding this comment.
Has this script been used to produce benchmarking results that summarize the latency improvement this PR gives?
|
|
||
| def _validator(self): | ||
| # Bind the unbound method; validator doesn't touch instance state. | ||
| return GetObjectTask._validate_content_range.__get__( |
There was a problem hiding this comment.
Is there any good way to test this functionality without accessing private functions/variables of another class? For example, maybe using the public interface that indirectly calls these private functions to indirectly test this underlying behavior?
This PR optimizes downloads through the TransferManager by removing the upfront HEAD request. Previously, every download issued a HEAD request to determine object size before starting the GET request. This
change eliminates that extra round-trip by extracting metadata from the first GET response instead.
For small files, the download time is dominated by the request latency, so eliminating one of the two requests results in a ~50% download time reduction. For large files, the effect is less noticeable because the download time is dominated by the the transfer time and because there are multiple chunks to download. In both cases, we save the cost of the HEAD request.
What Changed
• Removed HEAD requests: Downloads now start immediately with a ranged GET request for the first chunk
• Dynamic size detection: Extract object size and ETag from the first GET response headers (ContentRange or ContentLength)
• Dynamic chunk scheduling: After the first chunk completes, schedule additional chunks only if the object is larger than the chunk size
• Simplified code flow: Consolidated download logic into a single path instead of branching on size upfront
• ETag consistency for all chunks: When an ETag is available (either pre-provided via a subscriber, e.g. from a prior HEAD request by the AWS CLI, or extracted from the first GET response), all subsequent ranged GET requests include an
IfMatchheader with that ETag. This ensures S3 rejects the request if the object changes mid-download. The first GET request also includesIfMatchif an ETag was pre-provided before the download started.Testing
Unit, functional, and integ tests pass. I also added a new script to benchmark downloading many small files. For downloading 1000 1kB files on my laptop, the total duration dropped 41% from 15.0s to 8.9.
Backward Compatibility
External API unchanged. All download methods have the same signatures.
Flow Diagrams
Before (with HEAD request)
flowchart TD A[HEAD Request] --> C{Size < 8MB?} C -->|Yes| D[GET Request] C -->|No| E[Multiple GET Requests] D --> F[Complete] E --> FAfter (no HEAD request)
flowchart TD A[GET First Chunk] --> B{Size < 8MB?} B -->|Yes| C[Complete] B -->|No| D[GET Remaining Chunks] D --> CBy submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.