
feat(deploy): direct-HTTPS peer relay for streamed deploy_component #146

Closed
kriszyp wants to merge 10 commits into main from feat/deploy-component-peer-relay


Conversation

kriszyp (Member) commented May 14, 2026

Stacked on core PRs #530/#531/#536 — those need to land first; the bumped core submodule pointer here points at the slice-3a branch.

Summary

Final slice of HarperFast/harper#524. Replicating a streamed `deploy_component` no longer goes through the WebSocket `sendOperation` path — which wraps the whole operation (including the payload Buffer) into a single WS frame and so can't carry payloads beyond Node's 2 GB cap. Instead, the origin opens a direct HTTPS connection to each peer's operations API and streams the same `multipart/form-data` body the CLI used.

How it works

When core's `deployComponent` has staged the payload to a temp file (see HarperFast/harper#536), `replicateOperation` detects the streaming-deploy case and dispatches to `relayDeployToNode` instead of `sendOperationToNode`:

  1. Mint a token via the existing replication WS. `create_authentication_tokens` runs against the peer's auth context — the peer signs a token tied to whatever user authenticated that WS connection. No new identity, no new trust relationship to manage.
  2. Open HTTPS to the peer's operations API. TLS verification reuses the per-node `verify_tls`/`rejectUnauthorized` posture (the same flag `setNode` already honors). The operations API port is read from `OPERATIONSAPI_NETWORK_SECUREPORT`/`OPERATIONSAPI_NETWORK_PORT` (default 9925) and can be overridden per node.
  3. Stream multipart deploy. Same wire format the CLI sends (#530); fields go first, then the file part read fresh from the staged temp file per relay attempt. `Transfer-Encoding: chunked` so we never need to know the size upfront. The peer processes it as a normal local deploy with `replicated: false` so it doesn't re-fan-out.
  4. Per-peer aggregation. `Promise.allSettled` over the relay calls; per-peer `{ node, status, ... }` records are returned in `response.replicated` exactly like the WS path's existing shape. One peer failing doesn't abort the others — the per-peer-status semantics we landed on for #524.
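
The dispatch in step 0 and the aggregation in step 4 can be sketched as follows. This is a minimal illustration, not the code in `replicator.ts`: the type names, the `replicateToPeers` helper, the injected `relayDeploy`/`sendOverWs` functions, and the `'success'` status label are all assumptions made for the example. Only the gate condition and the `'failed'` status come from this description.

```typescript
// Hypothetical shapes; the real types in replicator.ts are not shown in this PR.
type DeployRequest = { operation: string; _stagedPayloadPath?: unknown; [key: string]: unknown };
type PeerNode = { name: string };
type PeerResult = { node: string; status: 'success' | 'failed'; message?: string };
type SendFn = (node: PeerNode, req: DeployRequest) => Promise<unknown>;

// The gate described for replicateOperation: only a deploy_component whose
// payload was staged to a temp file takes the HTTPS relay path.
function isStreamedDeploy(req: DeployRequest): boolean {
	return req.operation === 'deploy_component' && typeof req._stagedPayloadPath === 'string';
}

async function replicateToPeers(
	req: DeployRequest,
	peers: PeerNode[],
	relayDeploy: SendFn, // stand-in for relayDeployToNode
	sendOverWs: SendFn // stand-in for sendOperationToNode
): Promise<PeerResult[]> {
	const send = isStreamedDeploy(req) ? relayDeploy : sendOverWs;
	// allSettled waits for every peer, so one rejection cannot abort the rest.
	const settled = await Promise.allSettled(peers.map((peer) => send(peer, req)));
	return settled.map((outcome, i): PeerResult =>
		outcome.status === 'fulfilled'
			? { node: peers[i].name, status: 'success' }
			: { node: peers[i].name, status: 'failed', message: String(outcome.reason) }
	);
}
```

Passing the two transports in as parameters keeps the sketch self-contained; the actual branch dynamic-imports `./deployRelay.ts` instead.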

Where to look

  • `replication/deployRelay.ts` — the relay function. Notable design points:

    • `RelayDeps` is an injection seam (`mintToken`) so unit tests can stub the token mint without mocking ESM modules. The production default dynamic-imports `sendOperationToNode` from `replicator.ts` so consumers of this module that don't actually relay (most of them) don't pay the replicator's full transitive cost.
    • `buildForwardableFields` strips CLI/internal fields (`_stagedPayloadPath`, `progress`, `hdb_user`, `payload`) and forces `replicated: false` on the forwarded request. Without that override, each peer would fan the deploy back out to every other node.
    • `resolveOperationsApiUrl` honors `node.operationsApiUrl` first (tests/proxies), then `node.port`, then config. The TLS check (`rejectUnauthorized`) mirrors how `setNode` reads the per-node `verify_tls` flag.
  • `replication/replicator.ts` — small branch in `replicateOperation`. The new path is gated on `req.operation === 'deploy_component' && typeof req._stagedPayloadPath === 'string'` and the dynamic import of `./deployRelay.ts` keeps the relay code out of every callsite that doesn't need it.

  • `unitTests/replication/deployRelay.test.mjs` — `node --test` unit tests. Spins up a local HTTP server playing the role of a peer, stubs `mintToken`, and verifies wire format end-to-end. Covers: happy path (multipart streamed correctly, JSON response parsed, peer cannot re-replicate), 5xx response surfaced as `{ status: 'failed' }` with status code, token-mint failure (no HTTPS attempt made), missing staged payload.
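
For orientation, the two pure helpers named above might look roughly like this. It is a sketch under assumptions: the stripped field names and the resolution order are taken from this description, while the function signatures, the `NodeConfig` shape, its `hostname` field, and the `configuredPort` parameter are hypothetical.

```typescript
// Field names come from the PR description; everything else is illustrative.
const STRIPPED_FIELDS = ['_stagedPayloadPath', 'progress', 'hdb_user', 'payload'];

function buildForwardableFields(req: Record<string, unknown>): Record<string, unknown> {
	const forwarded: Record<string, unknown> = {};
	for (const [key, value] of Object.entries(req)) {
		if (!STRIPPED_FIELDS.includes(key)) forwarded[key] = value;
	}
	// Set this last so a caller-supplied `replicated: true` cannot survive.
	forwarded['replicated'] = false;
	return forwarded;
}

// Hypothetical node shape: only operationsApiUrl and port are named in the PR.
type NodeConfig = { hostname: string; operationsApiUrl?: string; port?: number };

// Resolution order per the PR: explicit URL, then per-node port, then config.
function resolveOperationsApiUrl(node: NodeConfig, configuredPort = 9925): string {
	if (node.operationsApiUrl) return node.operationsApiUrl;
	return `https://${node.hostname}:${node.port ?? configuredPort}/`;
}
```

Keeping both helpers free of I/O is what lets a unit test check the forwarded fields and target URL without a running peer.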

What's deliberately deferred (called out as follow-ups in the commit message)

  • Transient-error retries per peer — for now this fails fast on a peer error; the per-peer-retry-on-transport-errors policy we agreed on for #524 is a small follow-up once we've seen real failure modes.
  • SSE re-emission of per-peer events on the origin's response stream — the SSE channel exists in core via #531; the relay can subscribe to peer SSE responses and forward `replicate` events to the CLI as a follow-up.
  • `restart_service` relay — small payload, the WS path is fine.

Test plan

  • `node --test unitTests/replication/deployRelay.test.mjs` — 4/4 pass.
  • `npm run lint:required` — 0 errors.
  • `npm run test:integration` — not exercised; the integration harness for replicated deploys is heavy and not adapted to the new relay yet. A multi-node integration test is a sensible follow-up.
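
The token-mint failure case in the unit tests relies on the `RelayDeps` injection seam. A simplified sketch of that idea is shown below. The PR only names `RelayDeps` and `mintToken`; the signatures, the extra `postDeploy` dependency (a stand-in for the real HTTPS request), and the result shape beyond `status: 'failed'` are assumptions for illustration.

```typescript
// Hypothetical seam: the PR injects only mintToken; postDeploy is added here
// so the sketch can run without a network.
type RelayDeps = {
	mintToken: (nodeName: string) => Promise<string>;
	postDeploy: (nodeName: string, token: string) => Promise<number>; // resolves to the HTTP status
};
type RelayResult = { node: string; status: 'success' | 'failed'; statusCode?: number; message?: string };

async function relayDeployToNode(nodeName: string, deps: RelayDeps): Promise<RelayResult> {
	let token: string;
	try {
		token = await deps.mintToken(nodeName);
	} catch (error) {
		// Mint failed: report it without ever attempting the HTTPS request.
		return { node: nodeName, status: 'failed', message: String(error) };
	}
	const statusCode = await deps.postDeploy(nodeName, token);
	return statusCode >= 200 && statusCode < 300
		? { node: nodeName, status: 'success', statusCode }
		: { node: nodeName, status: 'failed', statusCode };
}
```

A test can then pass a `mintToken` stub that rejects and assert that `postDeploy` was never called.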

🤖 Generated with Claude Code

Replicating a `deploy_component` with a streamed payload no longer goes
through the WebSocket sendOperation path — that wraps the entire
operation (including the payload Buffer) in a single WS frame, which
can't carry payloads beyond Node's 2 GB cap.

When core's deployComponent has staged the payload to a temp file (see
the corresponding core change at feat/deploy-component-payload-staging),
replicateOperation now relays to each peer over direct HTTPS:

 1. Mint a short-lived operation token by calling
    `create_authentication_tokens` over the existing replication WS
    connection — the peer signs a token tied to whatever user authed
    that WS connection, so the auth model matches existing replication.
 2. Open HTTPS to the peer's operations API (port 9925 by default,
    overridable per node) and stream a multipart/form-data deploy
    request with the staged payload as the file part. Peer processes
    it as a normal local deploy with `replicated: false` so it doesn't
    re-fan-out.
 3. Per-peer success/failure is aggregated with Promise.allSettled —
    one peer failing doesn't abort the others, matching the per-peer-
    status semantics agreed for HarperFast/harper#524.
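
As an illustration of step 2, a streamed multipart/form-data body (fields first, then the file part) can be produced with an async generator. This is a sketch, not the relay's actual encoder: the `multipartBody` name, its parameters, and the `application/octet-stream` content type are assumptions, and field values are not escaped here.

```typescript
// Yields a multipart/form-data body: text fields first, then one file part.
async function* multipartBody(
	boundary: string,
	fields: Record<string, string>,
	file: { name: string; filename: string; chunks: AsyncIterable<Uint8Array> }
): AsyncGenerator<Uint8Array> {
	const encoder = new TextEncoder();
	for (const [name, value] of Object.entries(fields)) {
		yield encoder.encode(
			`--${boundary}\r\nContent-Disposition: form-data; name="${name}"\r\n\r\n${value}\r\n`
		);
	}
	yield encoder.encode(
		`--${boundary}\r\nContent-Disposition: form-data; name="${file.name}"; filename="${file.filename}"\r\n` +
			`Content-Type: application/octet-stream\r\n\r\n`
	);
	// Forward file chunks as they arrive so the payload is never buffered whole.
	yield* file.chunks;
	yield encoder.encode(`\r\n--${boundary}--\r\n`);
}
```

In Node, a stream from `fs.createReadStream` is an async iterable and can serve as `chunks`, and `Readable.from(multipartBody(...))` can be piped into the request. With no `Content-Length` header set, Node sends the request body with chunked transfer encoding.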

Out of scope for this slice:
 - Transient-error retries per peer (basic fail-fast; can refine after
   we see real-world failure modes)
 - SSE re-emission of per-peer events on the origin's response stream
   (the SSE channel is in core via #531; the relay can plug into it as
   a follow-up)
 - restart_service relay (small payload, WS path is fine)

Bumps the core submodule pointer to pick up the payload staging
required for this to work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kriszyp requested a review from a team as a code owner May 14, 2026 15:26
hostname: target.hostname,
port: target.port || (target.protocol === 'https:' ? 443 : 80),
method: 'POST',
path: '/',
Review comment (Contributor):

The path is hardcoded to '/' instead of using target.pathname. When node.operationsApiUrl is set to a URL with a non-root path (e.g. http://proxy:9925/ops/), the request silently goes to / instead — the intended path prefix is dropped entirely.

The comment on resolveOperationsApiUrl explicitly says this field is "used by tests and by deployments that put the ops API behind a proxy", so path-based proxies are a supported scenario.

Suggested change
- path: '/',
+ path: target.pathname,
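
To make the effect of the suggestion concrete, here is a small sketch of deriving request options from a resolved URL. The `requestOptionsFor` helper and `RequestTarget` type are hypothetical names for this example. Appending `target.search` goes beyond the suggestion above and is shown only because `http.request` expects the query string to be part of `path`.

```typescript
type RequestTarget = { protocol: string; hostname: string; port: number; method: 'POST'; path: string };

function requestOptionsFor(operationsApiUrl: string): RequestTarget {
	const target = new URL(operationsApiUrl);
	const defaultPort = target.protocol === 'https:' ? 443 : 80;
	return {
		protocol: target.protocol,
		hostname: target.hostname,
		// URL.port is '' when the URL uses the scheme's default port.
		port: target.port ? Number(target.port) : defaultPort,
		method: 'POST',
		// pathname is always at least '/', so root URLs behave as before.
		path: target.pathname + target.search,
	};
}
```

With this, `http://proxy:9925/ops/` keeps its `/ops/` prefix, while a bare `https://peer.example` still resolves to path `/`.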

claude (Bot, Contributor) commented May 14, 2026

Reviewed; no blockers found.

Kris Zyp and others added 9 commits May 19, 2026 07:55
Cascades the fix from #531 (a320af514): drain the IncomingMessage when
useSse=true but server returns non-SSE content (e.g. 401 auth failure).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…error stringify fixes

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kriszyp marked this pull request as draft May 20, 2026 16:11

kriszyp (Member, Author) commented May 20, 2026

Pausing this PR — converting to draft.

This work is being rebuilt on top of the design in #641: a replicated hdb_deployment system table that owns the deploy lifecycle, audit trail, and (because Harper's BLOB_CHUNK replication already supports chunked, back-pressured blob transfer) the payload delivery channel.

In the new design:

  • The ProgressEmitter from this PR becomes one of two subscribers — the second being a DeploymentRecorder that writes the persistent record. SSE remains the live channel for the CLI; get_deployment becomes content-negotiated to serve the same stream to Studio.
  • _stagedPayloadPath (#536) and the direct-HTTPS peer relay (harper-pro#146) both go away — peers read the payload directly from the replicated blob attribute on the row.

This work resumes as part of Slice B in #641 once Slice A (table + blob-backed multipart receive) lands.

— Claude

kriszyp (Member, Author) commented May 23, 2026

Closing in favor of the deployment-tracking redesign tracked in HarperFast/harper#641, now in flight as Slices A (HarperFast/harper#655, merged), B1 (HarperFast/harper#657, merged), and B2 (HarperFast/harper#758).

This PR's direct-HTTPS peer relay was working around the single-WebSocket-frame size cap on replicated deploy_component payloads. The redesign removes that need: payloads now ship to peers via the existing BLOB_CHUNK streamed-blob replication channel attached to the hdb_deployment.payload_blob attribute. Slice B2 in harper strips req.payload before replicateOperation and switches peers to read from the replicated row's blob. The harper-pro-side change therefore becomes a small simplification (dropping the peer-relay branch from replicateOperation's deploy_component path) rather than a new transport.

The harper-pro-side simplification will land alongside HarperFast/harper#758 as a separate small PR in this repo.

🤖 Generated by Claude

kriszyp closed this May 23, 2026