
@bwalsh bwalsh commented Jan 20, 2026

See #18

Engineering note — Optimal Chunk Size Calculation for Multipart Uploads

OLD:

optimalChunkSize determines the ideal chunk/part size for multipart upload based on file size.
The chunk size (also known as "message size" or "part size") affects upload performance and
must comply with S3 constraints.

Calculation logic:

  • For files ≤ 512 MB: Returns 32 MB chunks for optimal performance
  • For files > 512 MB: Calculates fileSize/maxMultipartParts, with minimum of 5 MB
  • Enforces minimum of 5 MB (S3 requirement for all parts except the last)
  • Rounds up to nearest MB for alignment

This results in:

  • Files ≤ 512 MB: 32 MB chunks
  • Files > 512 MB and ≤ ~49 GB: 5 MB chunks (minimum enforced)
    The ~49 GB threshold (10,000 parts × 5 MB) is where files exceed S3's
    10,000 part limit when using the minimum chunk size
  • Files > ~49 GB: Dynamically calculated to stay under 10,000 parts

Examples:

  • 100 MB file → 32 MB chunks (4 parts)
  • 1 GB file → 5 MB chunks (~205 parts)
  • 10 GB file → 5 MB chunks (~2,048 parts)
  • 50 GB file → 6 MB chunks (~8,534 parts)
  • 100 GB file → 11 MB chunks (~9,310 parts)
  • 1 TB file → 105 MB chunks (~9,987 parts)
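
The OLD rules above can be sketched in Go. The constant names here (minChunkSize, defaultChunkSize, maxMultipartParts) are illustrative, not the actual identifiers in the codebase:

```go
package main

import "fmt"

const (
	MB                = int64(1) << 20
	minChunkSize      = 5 * MB  // S3 minimum part size (all parts except the last)
	defaultChunkSize  = 32 * MB // fixed chunk for files up to 512 MB
	maxMultipartParts = 10000   // S3 limit on parts per multipart upload
)

// oldOptimalChunkSize reproduces the OLD calculation described above.
func oldOptimalChunkSize(fileSize int64) int64 {
	if fileSize <= 512*MB {
		return defaultChunkSize
	}
	chunk := fileSize / maxMultipartParts
	if chunk < minChunkSize {
		chunk = minChunkSize
	}
	// Round up to the nearest MB for alignment.
	if chunk%MB != 0 {
		chunk = (chunk/MB + 1) * MB
	}
	return chunk
}

func main() {
	for _, size := range []int64{100 * MB, 1024 * MB, 51200 * MB, 1048576 * MB} {
		fmt.Printf("%d MB file -> %d MB chunks\n", size/MB, oldOptimalChunkSize(size)/MB)
	}
}
```

Running this reproduces the example table: 100 MB → 32 MB, 1 GB → 5 MB, 50 GB → 6 MB, 1 TB → 105 MB.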

NEW:

go test ./client/upload -run '^TestOptimalChunkSize$' -v
go test ./client/common -run '^TestGetLfsCustomTransferInt$' -v

Purpose

  • Validate OptimalChunkSize behavior and return values (chunk size and number of parts) across thresholds, boundaries and scaled ranges.

Key behavior to assert

  1. Input type and units: sizes are int64 bytes; tests should use common.MB / common.GB constants.
  2. Parts calculation: parts = ceil(fileSize / chunk); fileSize == 0 returns parts == 0.
  3. Scaling: scaled ranges are linear, rounded down to the nearest MB and clamped to range.
  4. Minimum chunk clamp: result is at least 1 MB.
  5. Boundary semantics: the implementation uses <= and some ranges start at X + 1; include exact, -1 byte, and +1 byte checks.
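
Points 2 and 4 can be pinned down with integer arithmetic; partsFor below is a hypothetical helper for illustration, not a function in the codebase:

```go
package main

import "fmt"

const MB = int64(1) << 20

// partsFor computes ceil(fileSize / chunk) without floating point;
// a zero-byte file yields zero parts, matching point 2 above.
func partsFor(fileSize, chunk int64) int64 {
	if fileSize <= 0 {
		return 0
	}
	return (fileSize + chunk - 1) / chunk
}

func main() {
	fmt.Println(partsFor(0, 10*MB))        // 0
	fmt.Println(partsFor(100*MB, 100*MB))  // 1
	fmt.Println(partsFor(100*MB+1, 10*MB)) // 11
}
```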

Parameterized test cases (file size ⇒ expected chunk ⇒ expected parts)

  1. 0 bytes

    • chunk: 1 MB (fallback)
    • parts: 0
  2. 1 MB

    • chunk: 1 MB (<= 100 MB)
    • parts: 1
  3. 100 MB

    • chunk: 100 MB (<= 100 MB)
    • parts: 1
  4. 100 MB + 1 B

    • chunk: 10 MB (> 100 MB - <= 1 GB)
    • parts: ceil((100 MB + 1 B) / 10 MB) = 11
  5. 500 MB

    • chunk: 10 MB
    • parts: 50
  6. 1 GB (1024 MB)

    • chunk: 10 MB (<= 1 GB)
    • parts: ceil(1024 / 10) = 103
  7. 1 GB + 1 B

    • chunk: 25 MB (start of 1 GB - 10 GB scaled range)
    • parts: ceil((1024 MB + 1 B) / 25 MB) = 41
  8. 5 GB (5120 MB)

    • chunk: linear between 25 MB and 128 MB → ≈ 70 MB (rounded down)
    • parts: ceil(5120 / 70) = 74
  9. 10 GB (10240 MB)

    • chunk: 128 MB (end of 1 GB - 10 GB scaled range)
    • parts: 80
  10. 10 GB + 1 B

    • chunk: 256 MB (> 10 GB - <= 100 GB fixed)
    • parts: ceil((10240 MB + 1 B) / 256 MB) = 41
  11. 50 GB (51200 MB)

    • chunk: 256 MB
    • parts: 200
  12. 100 GB (102400 MB)

    • chunk: 256 MB
    • parts: 400
  13. 100 GB + 1 B

    • chunk: 512 MB (start of > 100 GB scaled range)
    • parts: ceil((102400 MB + 1 B) / 512 MB) = 201
  14. 500 GB (512000 MB)

    • chunk: linear between 512 MB and 1024 MB → ≈ 739 MB (rounded down)
    • parts: ceil(512000 / 739) = 693
  15. 1 TB (1024 GB = 1,048,576 MB); note: use project units consistently

    • chunk: 1024 MB (max of scaled range)
    • parts: 1,048,576 / 1024 = 1024
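
Taken together, the cases above imply a tiered algorithm. The sketch below is reverse-engineered from the expected values, not copied from the implementation; in particular the 1000 GB upper anchor of the > 100 GB scaled range is an assumption chosen to reproduce the 500 GB ⇒ 739 MB case:

```go
package main

import "fmt"

const (
	MB = int64(1) << 20
	GB = int64(1) << 30
)

// scaleMB interpolates linearly between loMB and hiMB over [start, end],
// floors to a whole MB, and clamps the result to the range.
func scaleMB(loMB, hiMB, start, end, size int64) int64 {
	mb := loMB + (hiMB-loMB)*(size-start)/(end-start)
	if mb < loMB {
		mb = loMB
	}
	if mb > hiMB {
		mb = hiMB
	}
	return mb * MB
}

// optimalChunkSize sketches the NEW tiering implied by the test cases above.
func optimalChunkSize(fileSize int64) int64 {
	switch {
	case fileSize <= 0:
		return 1 * MB // fallback
	case fileSize < 1*MB:
		return 1 * MB // minimum chunk clamp
	case fileSize <= 100*MB:
		return fileSize // single PUT: whole file
	case fileSize <= 1*GB:
		return 10 * MB
	case fileSize <= 10*GB:
		return scaleMB(25, 128, 1*GB, 10*GB, fileSize)
	case fileSize <= 100*GB:
		return 256 * MB
	default:
		return scaleMB(512, 1024, 100*GB, 1000*GB, fileSize) // 1000 GB anchor is assumed
	}
}

func main() {
	fmt.Println(optimalChunkSize(5*GB) / MB)    // 70
	fmt.Println(optimalChunkSize(500*GB) / MB)  // 739
	fmt.Println(optimalChunkSize(1024*GB) / MB) // 1024 (clamped)
}
```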

Test design notes (concise)

  1. Use table-driven subtests in client/upload/utils_test.go. Include fields: name, fileSize int64, wantChunk int64, wantParts int64.
  2. For scaled cases assert: MB alignment, clamped to min/max, and exact wantParts. Use integer arithmetic for parts.
  3. Add explicit boundary triples for each threshold: exact, -1 byte, +1 byte.
  4. Include negative and zero cases to verify fallback behavior.
  5. Keep tests deterministic and fast (no external deps).

Execution

  • Run from repo root: go test ./client/upload -v
  • Run single test: go test ./client/upload -run '^TestOptimalChunkSize$' -v

Copilot AI left a comment

Pull request overview

This PR increases the minimum chunk size for multipart uploads from 5 MB to 10 MB, addressing issue #18 in the data-client repository.

Changes:

  • Updated MinChunkSize constant from 5 MB to 10 MB


Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.



bwalsh and others added 3 commits January 21, 2026 14:38
Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 12 comments.



Comment on lines 100 to 105
func init() {
    v, err := GetLfsCustomTransferInt("lfs.customtransfer.drs.multipart-min-chunk-size", 10)
    if err != nil {
        MinChunkSize = int64(10) * MB
        return
    }
Copilot AI Jan 22, 2026

The init function silently ignores errors from GetLfsCustomTransferInt and falls back to the default value without any logging or warning. This means if a user has misconfigured their git config (e.g., set an invalid value), they won't know that their configuration is being ignored. Consider adding a log statement when the error occurs so users are aware their configuration is not being applied.
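
A minimal self-contained sketch of the suggested fix. GetLfsCustomTransferInt is stubbed here to always fail so the warning path is visible; in the real codebase it reads the value from git config:

```go
package main

import (
	"errors"
	"log"
)

const MB = int64(1) << 20

var MinChunkSize int64

// Stub standing in for common.GetLfsCustomTransferInt; it simulates
// a misconfigured git config value.
func GetLfsCustomTransferInt(key string, defaultValue int64) (int64, error) {
	return defaultValue, errors.New("invalid int value for " + key)
}

// initMinChunkSize mirrors the init function under review, but warns
// instead of silently ignoring a configuration error.
func initMinChunkSize() {
	v, err := GetLfsCustomTransferInt("lfs.customtransfer.drs.multipart-min-chunk-size", 10)
	if err != nil {
		log.Printf("warning: %v; falling back to default min chunk size", err)
		MinChunkSize = int64(10) * MB
		return
	}
	MinChunkSize = int64(v) * MB
}

func main() {
	initMinChunkSize()
	log.Printf("MinChunkSize = %d bytes", MinChunkSize)
}
```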

name string
fileSize int64
wantChunkSize int64
wantChunks int64
Copilot AI Jan 22, 2026

The variable name wantChunks is misleading. Based on the calculation (fileSize + chunkSize - 1) / chunkSize, this represents the number of parts in the multipart upload, not chunks. Consider renaming to wantParts to match the terminology used in the multipart upload code and be consistent with the PR description which refers to "number of parts".

MaxConcurrentUploads = 10
MaxRetries = 5
MinChunkSize = 5 * 1024 * 1024
// MinChunkSize = 10 * 1024 * 1024
Copilot AI Jan 22, 2026

The old constant MinChunkSize is commented out but not removed. This commented-out code should be removed entirely to maintain code cleanliness, especially since MinChunkSize has been moved to a variable declaration below.

Suggested change
// MinChunkSize = 10 * 1024 * 1024


parsed, err := strconv.ParseInt(value, 10, 64)
if err != nil {
    return defaultValue, fmt.Errorf("invalid int value for %s: >%q<", key, value)
Copilot AI Jan 22, 2026

The error message format uses angle brackets around the quoted value (e.g., >%q<), which is unusual and inconsistent with standard Go error formatting. Consider using a more conventional format like %q without the angle brackets, or use a clearer message format such as invalid int value for %s: %q.

Suggested change
return defaultValue, fmt.Errorf("invalid int value for %s: >%q<", key, value)
return defaultValue, fmt.Errorf("invalid int value for %s: %q", key, value)

Comment on lines 67 to 68
chunks := (tt.fileSize + chunkSize - 1) / chunkSize
if chunks != tt.wantChunks {
Copilot AI Jan 22, 2026

The variable name chunks is misleading. This represents the number of parts, not chunks. The calculation (fileSize + chunkSize - 1) / chunkSize is computing how many parts the file will be divided into. Consider renaming to parts for consistency with the multipart upload terminology used throughout the codebase.

Comment on lines 9 to 73
func TestOptimalChunkSize(t *testing.T) {
    tests := []struct {
        name          string
        fileSize      int64
        wantChunkSize int64
        wantChunks    int64
    }{
        {
            name:          "100MB",
            fileSize:      100 * common.MB,
            wantChunkSize: 100 * common.MB,
            wantChunks:    1,
        },
        {
            name:          "1GB",
            fileSize:      1 * common.GB,
            wantChunkSize: 10 * common.MB,
            wantChunks:    103,
        },
        {
            name:          "5GB",
            fileSize:      5 * common.GB,
            wantChunkSize: 70 * common.MB,
            wantChunks:    74,
        },
        {
            name:          "10GB",
            fileSize:      10 * common.GB,
            wantChunkSize: 128 * common.MB,
            wantChunks:    80,
        },
        {
            name:          "50GB",
            fileSize:      50 * common.GB,
            wantChunkSize: 256 * common.MB,
            wantChunks:    200,
        },
        {
            name:          "100GB",
            fileSize:      100 * common.GB,
            wantChunkSize: 256 * common.MB,
            wantChunks:    400,
        },
        {
            name:          "1TB",
            fileSize:      1 * common.TB,
            wantChunkSize: 1 * common.GB,
            wantChunks:    1024,
        },
    }

    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            chunkSize := OptimalChunkSize(tt.fileSize)
            if chunkSize != tt.wantChunkSize {
                t.Fatalf("chunk size = %d, want %d", chunkSize, tt.wantChunkSize)
            }

            chunks := (tt.fileSize + chunkSize - 1) / chunkSize
            if chunks != tt.wantChunks {
                t.Fatalf("chunks = %d, want %d", chunks, tt.wantChunks)
            }
        })
    }
}
Copilot AI Jan 22, 2026

The test coverage is incomplete compared to the documented requirements in the PR description. The documentation specifies 15 test cases including boundary cases (exact value, -1 byte, +1 byte), zero bytes, negative values, and specific thresholds. However, the implementation only includes 7 test cases and is missing critical boundary tests such as: 0 bytes, 1 MB, 100 MB + 1 byte, 1 GB + 1 byte, 10 GB + 1 byte, 100 GB + 1 byte, 500 MB, and 500 GB. These boundary cases are important to verify the correct behavior of the switch statement and scaling logic.
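
One way to cover the boundary triples this comment asks for is to generate them programmatically. A sketch, with boundaryTriples as a hypothetical helper name:

```go
package main

import "fmt"

const (
	MB = int64(1) << 20
	GB = int64(1) << 30
)

// boundaryTriples expands each threshold into exact, -1 byte, and
// +1 byte sizes, so every switch boundary gets an explicit case.
func boundaryTriples(thresholds ...int64) []int64 {
	var sizes []int64
	for _, t := range thresholds {
		sizes = append(sizes, t-1, t, t+1)
	}
	return sizes
}

func main() {
	// Thresholds of the new tiering: 100 MB, 1 GB, 10 GB, 100 GB.
	for _, s := range boundaryTriples(100*MB, 1*GB, 10*GB, 100*GB) {
		fmt.Println(s)
	}
}
```

Each generated size would become a subtest input alongside the hand-picked zero, negative, and scaled-range cases.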

}

MinChunkSize = int64(v) * MB

Copilot AI Jan 22, 2026

The init function has an unnecessary blank line at the end before the closing brace. This should be removed for consistency with Go style guidelines.

Suggested change

switch {
case fileSize <= 100*common.MB:
// Single PUT: return whole file size

Copilot AI Jan 22, 2026

The OptimalChunkSize function has an unnecessary blank line between the comment and the return statement in the first case. This blank line should be removed for consistency with the other cases in the switch statement.

Suggested change

Comment on lines +114 to +117
cmd := exec.Command("git", "config", "--get", "--default", defaultText, key)
output, err := cmd.Output()
if err != nil {
    return defaultValue, fmt.Errorf("error reading git config %s: %v", key, err)
Copilot AI Jan 22, 2026

The git config command includes the --default flag on line 114, but git config --default was only introduced in Git 2.18 (June 2018). While this is likely not an issue for most modern environments, the code should verify that this feature is available or document the minimum required Git version. Additionally, the error handling silently falls back to the default value when git config fails, which could mask configuration issues. Consider logging a warning when the git config command fails so users know their configuration is not being applied.

Comment on lines 31 to 37
## NEW


```bash
go test ./client/upload -run '^TestOptimalChunkSize$' -v

```
Copilot AI Jan 22, 2026

The documentation structure is confusing. The "NEW" section (line 31) doesn't describe the new algorithm or behavior - it only contains test commands and test case specifications. The documentation should include a clear description of the new chunk size calculation algorithm (similar to the "OLD" section) before diving into testing details. Consider adding a subsection that describes the new behavior: <= 100 MB uses fileSize, > 100 MB - <= 1 GB uses 10 MB, > 1 GB - <= 10 GB scales 25-128 MB, > 10 GB - <= 100 GB uses 256 MB, and > 100 GB scales 512-1024 MB.

bwalsh commented Jan 22, 2026

@copilot open a new pull request to apply changes based on the comments in this thread


Copilot AI commented Jan 22, 2026

@bwalsh I've opened a new pull request, #21, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot AI and others added 4 commits January 22, 2026 21:33
Co-authored-by: bwalsh <47808+bwalsh@users.noreply.github.com>
Co-authored-by: bwalsh <47808+bwalsh@users.noreply.github.com>
Co-authored-by: bwalsh <47808+bwalsh@users.noreply.github.com>
Address PR #19 review comments: naming, test coverage, and documentation
bwalsh and others added 6 commits January 26, 2026 13:49
- Added nil check for req.Progress in multipart.go to prevent panic
- Removed duplicate progress event in singleFile.go (Finalize already emits)
- Fixed concurrent BytesSoFar calculation using atomic counter for monotonicity

Co-authored-by: bwalsh <47808+bwalsh@users.noreply.github.com>
…progress tracking

Fix nil pointer dereferences and duplicate progress events in upload progress tracking
@bwalsh bwalsh merged commit b5432fe into develop Jan 27, 2026
@bwalsh bwalsh deleted the fix/increase-multipart-chunk-size branch January 27, 2026 05:33