
@rvagg (Collaborator) commented Feb 6, 2026

Sits on top of #544, which has the synapse-core side of this.


Implement store->pull->commit flow for efficient multi-copy storage replication.

Split operations API on StorageContext:

  • store(): upload data to SP, wait for parking confirmation
  • presignForCommit(): pre-sign EIP-712 extraData for pull + commit reuse
  • pull(): request SP-to-SP transfer from another provider
  • commit(): add pieces on-chain with optional pre-signed extraData
  • getPieceUrl(): get retrieval URL for SP-to-SP pulls

StorageManager.upload() orchestration:

  • Default 2 copies (primary + endorsed secondary)
  • Single-provider: store->commit flow
  • Multi-copy: store on primary, presign, pull to secondaries, commit all
  • Auto-retry failed secondaries with provider exclusion (up to 5 attempts)
  • Pre-signing avoids redundant wallet prompts across providers

Callback refinements:

  • Remove redundant onUploadComplete (use onStored instead)
  • onStored(providerId, pieceCid) - after data parked on provider
  • onPieceAdded(providerId, pieceCid) - after on-chain submission
  • onPieceConfirmed(providerId, pieceCid, pieceId) - after confirmation

Type clarity:

  • Rename UploadOptions.metadata -> pieceMetadata (piece-level)
  • Rename CommitOptions.pieces[].metadata -> pieceMetadata
  • Dataset-level metadata remains in CreateContextOptions.metadata
  • New: StoreError, CommitError for clear failure semantics
  • New: CopyResult, FailedCopy for multi-copy transparency
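
Roughly how these split operations compose for a multi-copy upload. This is a simplified sketch: the `StorageContext` shape here is an illustrative stand-in, not the exact SDK interface.

```ts
// Hypothetical, simplified StorageContext for illustration only.
interface StorageContext {
  providerId: number
  store(data: Uint8Array): Promise<string>            // upload + wait for parking; returns pieceCid
  presignForCommit(pieceCid: string): Promise<string> // pre-signed EIP-712 extraData
  getPieceUrl(pieceCid: string): string               // retrieval URL for SP-to-SP pulls
  pull(url: string, extraData: string): Promise<void> // request SP-to-SP transfer
  commit(pieceCid: string, extraData: string): Promise<void>
}

async function replicate(primary: StorageContext, secondaries: StorageContext[], data: Uint8Array) {
  const pieceCid = await primary.store(data)                  // client uploads once, to primary only
  const extraData = await primary.presignForCommit(pieceCid)  // one wallet prompt, reused everywhere
  const url = primary.getPieceUrl(pieceCid)
  for (const s of secondaries) {
    await s.pull(url, extraData)                              // SP-to-SP, no client bandwidth
  }
  // commit on all providers with the same pre-signed extraData
  await Promise.all([primary, ...secondaries].map((c) => c.commit(pieceCid, extraData)))
  return pieceCid
}
```

The point of the pre-sign step is that the expensive, interactive operation (the wallet prompt) happens once, while pull and commit can be fanned out per provider.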

Implements #494

@rvagg rvagg requested a review from hugomrdias as a code owner February 6, 2026 14:06
@github-project-automation github-project-automation bot moved this to 📌 Triage in FOC Feb 6, 2026
@cloudflare-workers-and-pages bot commented Feb 6, 2026

Deploying with Cloudflare Workers: synapse-dev deployment successful (commit 29ac8ad, updated Feb 10 2026, 11:55 PM UTC).

@rvagg (Collaborator, Author) commented Feb 6, 2026

Docs lint is failing; this still needs a big docs addition, but that can come a little later as we work through review here.

Here are some notes I built up about failure modes and handling:

Multi-Copy Upload: Failure Handling

Philosophy

  1. Store failure = hard fail: If we can't store data anywhere, throw immediately
  2. All commits fail = hard fail: If no provider commits successfully, throw CommitError
  3. Partial commit failure = record and return: Record failed providers in failures[] (with role), return result with successful copies[]
  4. Secondary failure = best-effort: Retry with replacement SPs, then commit whatever succeeded
  5. Never throw away successful work: If data is committed on any provider, the user gets a result -- not an exception
  6. Explicit providers = no retry: User specified providers, respect their choice
  7. Batch semantics: All pieces must succeed on a provider, or that provider is failed
  8. Transparency over exceptions: failures[] tells the user what went wrong; copies[] tells them what worked

Partial Success Over Atomicity

When a user requests N copies and we can only achieve fewer, we commit what we have rather than throwing everything away:

  • Best-effort exhaustion: For auto-selected providers, we retry up to 5 secondaries before giving up
  • Upload work is expensive: Throwing discards successful uploads; parked pieces get GC'd by the SP
  • No information loss: throwing after partial success destroys information about what did succeed
  • Result inspection is the contract: result.copies.length < count tells the user they got fewer copies; result.failures tells them why

Failure Modes by Stage

The multi-copy upload has a sequential pipeline: select → store → pull → commit.

Stage 0: Provider Selection (before any upload)

Provider selection uses a tiered approach with ping validation at each step:

| Priority | Selection Strategy | When Used |
| --- | --- | --- |
| 1 | Existing data set with endorsed provider | Primary selection, has stored before |
| 2 | New data set with endorsed provider | Primary selection, fresh start |
| 3 | Existing data set with non-endorsed provider | Fallback if no endorsed available |
| 4 | New data set with non-endorsed provider | Final fallback |

Ping validation: Before selecting any provider, we ping their PDP endpoint. If ping fails, we try the next provider in the current tier before falling to the next tier.

| What happens | Behavior |
| --- | --- |
| Provider ping succeeds | Use this provider |
| Provider ping fails | Try next provider in tier, warn to console |
| All providers in tier fail ping | Move to next tier |
| All tiers exhausted (providers remain but unreachable) | Throw error: "All N providers failed health check" |
| No providers remain after filtering | Throw error for primary, break loop for secondaries |

Key distinction:

  • For primary selection (first context), exhaustion = error (can't proceed)
  • For secondary selection (subsequent contexts), exhaustion = get fewer copies (proceed with what we have)
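
A rough sketch of this tiered selection with ping validation. The tier contents, the `ping` callback, and the `selectProvider` name are illustrative assumptions, not the SDK's actual internals.

```ts
// Illustrative provider shape; the real SDK's provider info is richer.
interface Provider { id: number; endorsed: boolean }

async function selectProvider(
  tiers: Provider[][],                        // ordered highest-priority first, per the table above
  ping: (p: Provider) => Promise<boolean>,    // health check against the PDP endpoint
  role: 'primary' | 'secondary'
): Promise<Provider | null> {
  let sawAny = false
  for (const tier of tiers) {
    for (const p of tier) {
      sawAny = true
      if (await ping(p)) return p             // first reachable provider in the highest tier wins
      console.warn(`provider ${p.id} failed health check, trying next`)
    }
    // all providers in this tier failed ping: fall through to the next tier
  }
  if (role === 'primary') {
    // primary exhaustion is fatal; secondary exhaustion just means fewer copies
    throw new Error(sawAny ? 'All providers failed health check' : 'No providers available')
  }
  return null
}
```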

Stage 1: Store (upload data to primary SP)

Store has two sub-stages:

| Sub-stage | What happens | Data state | Behavior |
| --- | --- | --- | --- |
| 1a: Upload | HTTP upload stream succeeds | Data on SP (parked) | Continue to 1b |
| 1a: Upload | HTTP upload stream fails (network, timeout) | No data on SP | Throw StoreError |
| 1b: Confirm | Polling for "parked" status succeeds | Data on SP (parked) | Continue to pull |
| 1b: Confirm | Polling for "parked" status times out | Unknown (may or may not exist) | Throw StoreError |

Store failure is unambiguous from the SDK's perspective: either we have confirmed parked data, or we don't. The user can safely retry.

Note: If 1b times out, data might exist on the SP but we can't confirm it. The SP will eventually GC parked pieces that aren't committed.
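
The two sub-stages could look roughly like this. The `storeWithConfirm` name, polling parameters, and callback shapes are assumptions for illustration, not the SDK's real internals.

```ts
// Thrown for both sub-stage failures; mirrors the StoreError described in this PR.
class StoreError extends Error { name = 'StoreError' }

async function storeWithConfirm(
  upload: () => Promise<string>,                            // 1a: HTTP upload, resolves to pieceCid
  checkStatus: (cid: string) => Promise<'parked' | 'pending'>,
  { attempts = 10, intervalMs = 1000 } = {}                 // illustrative polling defaults
): Promise<string> {
  // 1a: if the upload stream fails, no data exists on the SP -- safe to retry
  const pieceCid = await upload().catch((err) => {
    throw new StoreError(`upload failed: ${err}`)
  })
  // 1b: poll until the SP reports "parked", or give up
  for (let i = 0; i < attempts; i++) {
    if ((await checkStatus(pieceCid)) === 'parked') return pieceCid
    await new Promise((r) => setTimeout(r, intervalMs))
  }
  // data state unknown: it may exist but unconfirmed; the SP will GC it if never committed
  throw new StoreError('timed out waiting for parked confirmation')
}
```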

Stage 2: Pull (SP-to-SP fetch to secondaries)

| What happens | Data on secondary? | On-chain? | Behaviour |
| --- | --- | --- | --- |
| Pull succeeds | Yes (parked) | No | Continue to commit |
| Pull fails (auto-selected) | No | No | Retry with next provider (up to 5 attempts) |
| Pull fails (explicit provider) | No | No | Record in failures[], no retry |
| All secondary attempts exhausted | No | No | Proceed to commit with primary only |

Pull failure is recoverable: data is still on the primary, no on-chain state exists yet. Retrying pull is cheap (SP-to-SP, no client bandwidth).

Stage 3: Commit (addPieces on-chain transaction)

| What happens | Data on SP? | On-chain? | Behaviour |
| --- | --- | --- | --- |
| All commits succeed | Yes | Yes | Build result with all copies |
| Primary commit succeeds, secondary fails | Yes | Primary: yes | Record secondary in failures[] |
| Primary commit fails, secondary succeeds | Yes | Secondary: yes | Record primary in failures[] with role: 'primary', return with secondary in copies[] |
| Primary commit fails, secondary also fails | Yes (parked) | No | Throw CommitError -- nothing on-chain, safe to retry |
| Secondary commit fails | Yes (parked) | No | Record in failures[] -- data on SP, will be GC'd |

Behaviour Matrix

| Scenario | Behaviour |
| --- | --- |
| Primary store fails | Throw StoreError -- nothing happened |
| Primary commit fails, secondary succeeds | Record primary in failures[] with role: 'primary', return result |
| All commits fail | Throw CommitError -- nothing on-chain |
| Secondary pull fails (auto-selected) | Retry with next provider (up to 5 attempts) |
| Secondary pull fails (explicit) | Record in failures[], no retry |
| All secondary attempts exhausted | Commit primary only, record failures |
| Secondary commit fails | Record in failures[] -- data on SP, will be GC'd |
| Failover creates new dataset | Mark isNewDataSet: true in CopyResult |
| copies.length < count | Partial success -- user should inspect failures[] |

Error Types

```ts
/** Primary store failed - no data stored anywhere, safe to retry */
class StoreError extends Error {
  name = 'StoreError'
}

/** All commits failed - data stored on SP(s) but nothing on-chain, safe to retry */
class CommitError extends Error {
  name = 'CommitError'
}

// Partial commit failures appear in result.failures[] with role: 'primary' or 'secondary'
// Only throws CommitError when ALL providers fail to commit
```

What Users Must Check

Users should always inspect result.failures, not just check that upload() didn't throw:

```ts
// If ALL commits fail, upload() throws CommitError
// If at least one succeeds, we get a result:
const result = await synapse.storage.upload(data, { count: 3 })

// Check if endorsed provider (primary) failed
const primaryFailed = result.failures.find(f => f.role === 'primary')
if (primaryFailed) {
  console.warn(`Endorsed provider ${primaryFailed.providerId} failed: ${primaryFailed.error}`)
  // Data is only on non-endorsed secondaries
}

// Check if we got all requested copies
if (result.copies.length < 3) {
  console.warn(`Only ${result.copies.length}/3 copies succeeded`)
  for (const failure of result.failures) {
    console.warn(`  Provider ${failure.providerId} (${failure.role}): ${failure.error}`)
  }
}

// Every copy in copies[] is committed on-chain
for (const copy of result.copies) {
  console.log(`Provider ${copy.providerId}, dataset ${copy.dataSetId}, piece ${copy.pieceId}`)
}
```

Auto-Retry Logic

When user calls upload(data, { count: 2 }) without explicit providerIds or dataSetIds:

  1. Select primary (endorsed preferred)
  2. Store on primary
  3. Select secondary candidate from pool (excluding primary)
  4. Pull to secondary
  5. If pull fails:
    • Mark secondary as failed
    • Select next secondary from pool
    • Retry pull (data already on primary)
    • Repeat until: success OR exhausted pool OR hit MAX_SECONDARY_ATTEMPTS (5)
  6. If no secondary succeeded → proceed to commit with primary only
  7. Commit on all successful providers
  8. Return result with copies[] and failures[]

When user specifies providerIds or dataSetIds: no auto-retry, failures recorded in failures[].
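
The retry loop in steps 3-6 could be sketched roughly as below. The pool handling and the `acquireSecondary` name are illustrative; only the `MAX_SECONDARY_ATTEMPTS` constant mirrors the description above.

```ts
const MAX_SECONDARY_ATTEMPTS = 5

// Try candidates from the pool until one pull succeeds, recording each failure.
// Returns the successful provider, or null if the pool or attempt budget is exhausted.
async function acquireSecondary<P>(
  pool: P[],                                      // candidates, primary already excluded
  pullTo: (p: P) => Promise<void>,                // trigger SP-to-SP pull to this candidate
  failures: { provider: P; error: string }[]
): Promise<P | null> {
  for (let attempt = 0; attempt < MAX_SECONDARY_ATTEMPTS && pool.length > 0; attempt++) {
    const candidate = pool.shift()!
    try {
      await pullTo(candidate)                     // data already on primary; retry is cheap
      return candidate
    } catch (err) {
      failures.push({ provider: candidate, error: String(err) })
    }
  }
  return null                                     // exhausted: caller commits with primary only
}
```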

Design Decision: Primary Commit Failure Handling

Current implementation commits on all providers in parallel via Promise.allSettled(). If primary commit fails but secondary commit succeeds, we record the primary failure and return with the secondary in copies[].

Endorsed providers are selected as primary because they're curated for reliability. If primary (endorsed) fails but secondary (non-endorsed) succeeds, the user ends up with data only on non-endorsed providers. This may not meet product requirements of having one copy on an endorsed provider.

```ts
// Check if endorsed provider failed
const primaryFailed = result.failures.some(f => f.role === 'primary')
if (primaryFailed) {
  // Handle: retry, alert, or treat as error depending on requirements
}
```
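
A sketch of that parallel commit and classification: commit on every provider via Promise.allSettled(), record partial failures, and throw only when all commits fail. The `Target` shape and `commitAll` name are illustrative stand-ins.

```ts
// Mirrors the CommitError described in this PR.
class CommitError extends Error { name = 'CommitError' }

// Illustrative commit target; the real SDK carries more state per provider.
interface Target { providerId: number; role: 'primary' | 'secondary'; commit(): Promise<void> }

async function commitAll(targets: Target[]) {
  const settled = await Promise.allSettled(targets.map((t) => t.commit()))
  const copies: number[] = []
  const failures: { providerId: number; role: string; error: string }[] = []
  settled.forEach((r, i) => {
    const t = targets[i]
    if (r.status === 'fulfilled') copies.push(t.providerId)
    else failures.push({ providerId: t.providerId, role: t.role, error: String(r.reason) })
  })
  if (copies.length === 0) {
    throw new CommitError('all commits failed')   // nothing on-chain, safe to retry
  }
  return { copies, failures }                     // partial success: caller inspects failures[]
}
```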

@timfong888 commented Feb 6, 2026

I noticed this:

> Store failure = hard fail: If we can't store data anywhere, throw immediately

What is the test for the availability of an Endorsed Provider in the case where we have more than one? If the first store fails, is there a retry?

Under retry:

> Select primary (endorsed preferred)
> Store on primary

If we have 2 Endorsed, and the store on primary operation fails, do we retry the other endorsed?

@rvagg (Collaborator, Author) commented Feb 8, 2026

@timfong888 I've clarified the post above with more detail:

  • Now says: Store failure = hard fail: If we can't store data anywhere, throw immediately.
  • There's now also a "Stage 0" that details how we select a provider
  • I updated "Stage 1" with details about the failure modes that can happen there too, because there are nuanced ways it can go wrong.

@rvagg (Collaborator, Author) commented Feb 9, 2026

Docs updated to pass lint, additional tests added to address some gaps.

@rvagg rvagg mentioned this pull request Feb 9, 2026
@BigLep BigLep linked an issue Feb 9, 2026 that may be closed by this pull request
5 tasks
@BigLep BigLep moved this from 📌 Triage to 🔎 Awaiting review in FOC Feb 9, 2026
@timfong888 commented

I am not clear on this:

> All providers in tier fail ping: Move to next tier

My understanding is that if no Endorsed SP succeeds, the operation is a failure, because if there is no Endorsed and we only have Approved, that has a lower durability guarantee.

@timfong888 commented

> Key distinction:
>
> For primary selection (first context), exhaustion = error (can't proceed)
> For secondary selection (subsequent contexts), exhaustion = get fewer copies (proceed with what we have)

The above seems right. If primary selection exhausts, it's an error, not a move to the next tier, right?

@timfong888 commented

Question: If the endorsed provider passed ping during selection but then fails during store() (HTTP upload or parking confirmation), StoreError is thrown immediately. There doesn't appear to be an attempt to try another endorsed provider. If there is, great, but I'm checking.

@timfong888 commented

> All commits failed - data stored on SP(s) but nothing on-chain, safe to retry
> parked pieces get GC'd by the SP

What happens if GC runs before a retry?

@rvagg rvagg force-pushed the rvagg/pull-upload-flow branch from eb878ac to 29ac8ad Compare February 10, 2026 23:49
@rvagg (Collaborator, Author) commented Feb 11, 2026

@timfong888:

On the tier question: yes, the current code does fall back to approved-only if no endorsed provider passes the health check. A requireEndorsed option is something I wrote down as on the table for the future, but right now the priority is "data gets stored" over "only endorsed". If that's a problem we should talk about it, but I think for launch it's the right trade-off, since endorsed providers failing the health check would be an unusual situation. Maybe a hard failure is a better signal for us, though.

> There doesn't appear to be an attempt to try another endorsed provider

Not right now. Couple of reasons:

  1. Scope, this is where I'm drawing the line for the first iteration. First pass, best effort, fail clearly.
  2. It's hard because streams can only be consumed once. If the user gives us raw bytes or a File we could restart, but for a plain stream we can't, and the DX gets complicated fast (do we silently re-send 1GiB? what about streams that can't restart?). Better to throw and let the user decide until we work through the DX of it and see if it's worth the complexity.

> What happens if GC before retry?

Curio GCs unreferenced pieces after 24 hours, so there's a comfortable window for retries for the commit phase.

@timfong888 commented

Okay. So it randomizes across the Endorsed SPs for ping if there's no existing context.

As long as they are good, and an endorsed provider stores and commits successfully, we are good. That's a fair assumption.


Development

Successfully merging this pull request may close these issues:

  • GA DURABILITY: Multi-copy upload via SP-to-SP pull