continuum transfunctioner: TSS peer mesh with keygen, signing, resharing#796

Open
marcopeereboom wants to merge 124 commits into main from marco_tss

@marcopeereboom marcopeereboom commented Dec 10, 2025

Implements the continuum TSS service end-to-end: a peer mesh network that runs threshold ECDSA/EdDSA ceremonies (keygen, signing, resharing) over encrypted RPC transport.

Architecture
Peer mesh — TCP transport with X25519 ECDH key agreement and NaCl secretbox encryption. Peers discover each other via DNS seeding with forward verification, maintain connections through gossip and liveness pings, and track idle/stale peers via TTL-based eviction. The mesh targets PeersWanted total connections (inbound + outbound) and fills gaps each maintenance cycle by dialing shuffled candidates.

Ceremony lifecycle — Coordinator election picks the peer with the lowest key hash. The elected coordinator dispatches ceremonies (keygen/sign/reshare) to participants, who execute TSS rounds and exchange messages over the encrypted mesh. Ceremony state is context-scoped with proper cancellation propagation. Results are persisted to a NaCl-encrypted key store (HKDF-derived storage key).
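
As a rough illustration of the election rule above, the sketch below picks the peer whose key hash sorts lowest. It assumes SHA-256 over the raw public key bytes; `electCoordinator` is a hypothetical name, and the real election.go may hash different inputs (for example, mixing in a ceremony ID).

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// electCoordinator returns the index of the peer whose SHA-256 key hash is
// lexicographically lowest, or -1 for an empty set. Hashing only the raw key
// bytes is an assumption made for this sketch.
func electCoordinator(pubKeys [][]byte) int {
	best := -1
	var bestHash [sha256.Size]byte
	for i, k := range pubKeys {
		h := sha256.Sum256(k)
		if best == -1 || bytes.Compare(h[:], bestHash[:]) < 0 {
			best, bestHash = i, h
		}
	}
	return best
}

func main() {
	peers := [][]byte{[]byte("peer-a"), []byte("peer-b"), []byte("peer-c")}
	fmt.Println("coordinator index:", electCoordinator(peers))
}
```

Because the result depends only on the set of keys, every peer that sees the same membership reaches the same answer without extra messages.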

TSS integration — Uses hemilabs/x tss-lib v3 channel-free round functions. Each ceremony is a loop over explicit round calls with message collection gated on committee membership. Resharing supports overlapping old/new committees. Wire format uses package-prefixed type discriminators for 32 message types (21 ECDSA + 11 EdDSA).

What's included
Core service (service/continuum/):

continuum.go — server lifecycle, peer tracking, gossip, maintenance, session management
protocol.go — RPC envelope format, handshake, message routing with hash verification
tss.go — TSS ceremony abstraction (tssImpl), Paillier precompute, key store with NaCl encryption
tss_round.go — round-function ceremony drivers for keygen/sign/reshare
tss_rpc.go — ceremony RPC message types and handlers
tss_wire.go — JSON wire format: marshal/unmarshal with type discriminators
ceremony.go — ceremony struct, context/cancel lifecycle
dispatch.go — type-switch dispatch map replacing monolithic handle()
election.go — coordinator election by lowest key hash
doc.go — package godoc with broadcast scaling analysis

Admin tooling:
cmd/hemictl/continuum.go — hemictl continuum subcommand: status, peers, key info
cmd/hemictl/continuum_ceremony.go — keygen, sign, reshare ceremony commands (gated behind continuum_debug build tag)
cmd/transfunctionerd/ — daemon entry point updates
docker/transfunctionerd/Dockerfile

Infrastructure:
Prometheus metrics for ceremony counts, peer gauge, broadcast latency

Testing:
Integration tests (continuum_test.go, rpc_test.go, rpc_tss_test.go) — 5-node keygen with broadcast verification, full keygen→sign→reshare lifecycle, transport write/DNS/outbound verify paths, ceremony dispatch error paths, election fuzzing
Unit tests — dispatch map, wire format (38 round-trip + exhaustive type tests), TTL error paths, hemictl ceremony commands
Reference tests (tss_examples/v3_reference_test.go) — v3 round function API usage examples with pre-computed Paillier params
All test nodes use production tssImpl via rpcTransportAdapter over encrypted TCP
Zero time.Sleep in tests — all synchronization via context waits
Key fixes along the way
Unlock-before-cancel to prevent deadlock during broadcast I/O
handleCeremonyResult must not race SaveKeyShare
Sentinel errors and status constants for ceremony lifecycle
Transport payload hash verification (replay/tampering protection)
Session busy response instead of silent drop
Handshake semaphore to bound concurrent connection setup
Forward DNS verification as a policy gate (configurable, loopback exempt)

@marcopeereboom marcopeereboom requested a review from a team as a code owner December 10, 2025 17:11
@marcopeereboom marcopeereboom added the type: feature, size: XXL, and changelog: skip labels Dec 10, 2025
@marcopeereboom marcopeereboom marked this pull request as draft December 10, 2025 17:11
@github-actions github-actions Bot added the area: docs label Jan 15, 2026
@marcopeereboom marcopeereboom force-pushed the marco_tss branch 4 times, most recently from 0584452 to 6c21c60 February 12, 2026 18:24
@github-actions github-actions Bot added the area: hemictl label Feb 21, 2026
@github-actions github-actions Bot added the area: make label Feb 26, 2026
@github-actions github-actions Bot added the area: docker label Feb 27, 2026
@marcopeereboom marcopeereboom force-pushed the marco_tss branch 2 times, most recently from 5e1957e to 5104340 March 9, 2026 14:25
@joshuasing joshuasing removed the changelog: skip label Mar 11, 2026
@github-actions github-actions Bot added the changelog: required label Mar 11, 2026
Comment thread service/continuum/tss_round.go Fixed
@marcopeereboom marcopeereboom changed the title continuum: Add TSS POC continuum transfunctioner: TSS peer mesh with keygen, signing, resharing Mar 17, 2026
@marcopeereboom marcopeereboom marked this pull request as ready for review March 17, 2026 12:54
marcopeereboom added a commit that referenced this pull request Mar 17, 2026
Remove local filesystem replace directive — CI has no access to
/home/marco/Documents/src/x/tss-lib.  Resolve to the pushed commit
on origin/max/tss_changes (30339d0b0ce1).

Bump go directive from 1.25.0 to 1.26.0 to match main (577d577).
CI runs GOTOOLCHAIN=local with go 1.25.4 which refuses modules
requiring >= 1.26.

Remove stale nolint:prealloc directive — golangci-lint v2 dropped
the prealloc linter.  Add missing trailing newline to preparams.json
fixture files.

Add CHANGELOG entry for #796.
@github-actions github-actions Bot removed the changelog: required label Mar 17, 2026
marcopeereboom and others added 26 commits April 23, 2026 10:13
Bump hemilabs/x/tss-lib/v2 to Max's security fork (112 audit
fixes, SSID domain separation, ReceiverID binding, deterministic
protobuf, secret zeroing).

Wire SetCeremonyID with the 32-byte CeremonyID and SetSSIDNonce
with 0 (attempt counter) in Keygen, Sign, and both Reshare party
constructors.  CeremonyID gives per-instance uniqueness; the
nonce field is Max's retry attempt counter.

Add threshold validation in tss.go (Keygen, Sign) and in the
RPC integration test helpers before calling NewParameters.
The fork panics on invalid threshold/partyCount; we validate
early and return a clean error.
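
A minimal sketch of the early validation described above. The exact rule enforced by the fork is not stated here, so the bounds (`1 <= threshold < partyCount`) and the names `validateThreshold` and `errInvalidThreshold` are assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// errInvalidThreshold is a hypothetical sentinel error for this sketch.
var errInvalidThreshold = errors.New("invalid threshold")

// validateThreshold rejects bad parameters before they reach a library call
// that would panic. Assumed rule: 1 <= threshold < partyCount.
func validateThreshold(threshold, partyCount int) error {
	if partyCount < 2 {
		return fmt.Errorf("%w: party count %d is below 2",
			errInvalidThreshold, partyCount)
	}
	if threshold < 1 || threshold >= partyCount {
		return fmt.Errorf("%w: threshold %d with %d parties",
			errInvalidThreshold, threshold, partyCount)
	}
	return nil
}

func main() {
	fmt.Println(validateThreshold(2, 5))
	fmt.Println(validateThreshold(5, 5))
}
```

Wrapping a sentinel lets callers match the failure with `errors.Is` instead of recovering from a panic.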

NOTE: go.mod has a temporary replace directive pointing at the
local x repo.  Remove after pushing the x commit and running
go get with the real hash.
Rewrite Keygen() and Sign() to use the pure round functions from
tss-lib instead of the channel-based NewLocalParty + goroutine
pump pattern.

Each ceremony gets a single buffered inCh for inbound messages.
HandleMessage delivers parsed messages to inCh; the ceremony
driver (Keygen/Sign) reads with select on ctx.Done().  No pump
goroutine, no outCh/errCh/endCh.

Add msgBuf to handle message reordering: faster peers may send
round N+1 messages before the local node finishes round N.
Messages that don't match the current round's accept filter are
buffered and drained on the next round.
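
The reordering buffer can be sketched roughly as follows. The `msg` type, the integer round tag, and the `collect` signature are assumptions made for this example; the actual msgBuf in tss_round.go uses an accept filter over parsed tss-lib messages.

```go
package main

import (
	"context"
	"fmt"
)

// msg is a hypothetical parsed TSS message tagged with its round number.
type msg struct {
	round int
	from  string
}

// msgBuf holds messages that arrived before their round started.
type msgBuf struct {
	pending []msg
}

// collect gathers want messages for round. It drains matching messages from
// the buffer first, then reads from in, buffering anything for another round.
func (b *msgBuf) collect(ctx context.Context, in <-chan msg, round, want int) ([]msg, error) {
	var got []msg
	keep := b.pending[:0]
	for _, m := range b.pending {
		if m.round == round && len(got) < want {
			got = append(got, m)
		} else {
			keep = append(keep, m)
		}
	}
	b.pending = keep
	for len(got) < want {
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case m := <-in:
			if m.round == round {
				got = append(got, m)
			} else {
				b.pending = append(b.pending, m)
			}
		}
	}
	return got, nil
}

// demo feeds an out-of-order stream through round 1 and then round 2.
func demo() (int, int, error) {
	in := make(chan msg, 3)
	in <- msg{round: 2, from: "B"} // a fast peer is already in round 2
	in <- msg{round: 1, from: "B"}
	in <- msg{round: 1, from: "C"}
	var buf msgBuf
	ctx := context.Background()
	r1, err := buf.collect(ctx, in, 1, 2)
	if err != nil {
		return 0, 0, err
	}
	r2, err := buf.collect(ctx, in, 2, 1)
	if err != nil {
		return 0, 0, err
	}
	return len(r1), len(r2), nil
}

func main() {
	fmt.Println(demo())
}
```

In the demo, the early round-2 message is parked during round 1 and served from the buffer when round 2 starts, without touching the channel again.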

Delete pumpMessages (dead code — keygen/sign no longer use it).
Reshare still uses the channel-based path (pending conversion).

tss_round.go: msgBuf, sendRound helpers (157 lines).
tss.go: +459/-125 lines (net +334).
Convert tssImpl.Reshare from channel-based tss-lib LocalParty
instances to explicit round-function calls (ReshareRound1-5),
completing the pattern established by keygen/sign in 45d1762.

Production code:
- ceremony struct: remove party, outCh, errCh, oldParty,
  oldKeyToID, newKeyToID; ceremony lifecycle uses ctx/cancel
  derived from caller context (no termination channels)
- Reshare(): 5-round driver with msgBuf.collect gated on committee
  membership (old-only nodes skip new->new message collection)
- HandleMessage(ctx, ...): ctx threaded through interface and all
  callers; channel sends select on ctx.Done() + c.ctx.Done()
- sendReshareRound(): new helper encodes committee flags from
  MessageRouting and routes to both committee PID sets
- Delete handleReshareMessage() and pumpReshareMessages()
- FillBytes for pubkey encoding (X/Y padded to 32 bytes)
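
The FillBytes item in the list above can be illustrated with a short sketch. `encodePubKey` and the `X || Y` layout are assumptions for this example; only the use of `big.Int.FillBytes` with 32-byte coordinates comes from the commit message.

```go
package main

import (
	"errors"
	"fmt"
	"math/big"
)

const coordSize = 32 // bytes per coordinate

// encodePubKey returns X || Y with each coordinate left-padded to 32 bytes.
// FillBytes panics if the value does not fit, so the range is checked first.
func encodePubKey(x, y *big.Int) ([]byte, error) {
	for _, c := range []*big.Int{x, y} {
		if c.Sign() < 0 || c.BitLen() > coordSize*8 {
			return nil, errors.New("coordinate out of range")
		}
	}
	out := make([]byte, 2*coordSize)
	x.FillBytes(out[:coordSize])
	y.FillBytes(out[coordSize:])
	return out, nil
}

// demo encodes a point with deliberately small coordinates.
func demo() ([]byte, error) {
	return encodePubKey(big.NewInt(1), big.NewInt(2))
}

func main() {
	enc, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d bytes: %x\n", len(enc), enc)
}
```

Unlike `Bytes()`, which drops leading zeros, this keeps the encoding at a fixed width even when a coordinate happens to be short.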

Server fixes:
- handle(): goroutine watches sessionCtx.Done() and closes
  transport to unblock ReadEnvelope on shutdown
- deleteSession/deleteAllSessions: demote close errors to Debug
  (double-close during shutdown is expected)
- connectRandom: dial gap-many shuffled candidates per maintain
  cycle instead of one random pick (fixes 100-node convergence)

Tests:
- Delete tss_transport_test.go (channel-based, redundant with RPC)
- Delete rpc_integration_test.go; port 3 unique error-path tests
  and 2 fuzz tests to rpc_tss_test.go
- Rewrite rpc_tss_test.go: test nodes use production tssImpl via
  rpcTransportAdapter over encrypted TCP; all 11 tests preserved
- All context.Background() in test code replaced with t.Context()
- All ceremony struct literals in tests carry ctx/cancel
- TestHundredNodeMesh: set InitialPingTimeout=30s, increase
  convergence timeouts to 60s (prevents chain link kills under
  CPU contention)
- .golangci.yaml: replace-local: true for tss-lib fork
Update all imports from tss-lib/v2 to tss-lib/v3.  The v3 module
deletes the channel-based Party/Round/BaseUpdate API and retains
only the pure round function API that continuum already uses.

tss_examples: move old v2 channel-based examples to
testdata/v2_channel_reference/ as documentation (does not compile
against v3).  Add v3_reference_test.go demonstrating keygen+sign
using the round function API.
Wire format byte 0 (message type) and byte 1 (committee flags)
were sharing the wireFlag prefix and colliding at 0x01.  Split
into two namespaces: msgTypeP2P/msgTypeBroadcast for byte 0,
cflagToOld/cflagToNew/cflagFromNew for byte 1.

Add maxWireDataLen (16 MiB) bounds check before the allocation
in sendReshareRound (CodeQL integer-overflow finding).
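
A hedged sketch of the split namespaces and the bounds check. The constant names follow the commit message, but their numeric values, the `encodeWire` helper, and the two-byte header layout are assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	// Byte 0: message type namespace (values are assumptions).
	msgTypeP2P       byte = 0x01
	msgTypeBroadcast byte = 0x02

	// Byte 1: committee flag namespace, used as a bitmask (values are assumptions).
	cflagToOld   byte = 0x01
	cflagToNew   byte = 0x02
	cflagFromNew byte = 0x04

	maxWireDataLen = 16 << 20 // 16 MiB
)

var errWireTooLarge = errors.New("wire data exceeds maxWireDataLen")

// encodeWire prepends the type and committee-flag bytes to data. The length
// is checked before allocating so the size arithmetic cannot overflow.
func encodeWire(msgType, cflags byte, data []byte) ([]byte, error) {
	if len(data) > maxWireDataLen {
		return nil, errWireTooLarge
	}
	out := make([]byte, 0, 2+len(data))
	out = append(out, msgType, cflags)
	return append(out, data...), nil
}

func main() {
	out, err := encodeWire(msgTypeBroadcast, cflagToOld|cflagToNew, []byte("round1"))
	if err != nil {
		panic(err)
	}
	fmt.Printf("type=%#x cflags=%#x len=%d\n", out[0], out[1], len(out))
}
```

With separate constant families, a value such as 0x01 can appear in both bytes without one being mistaken for the other.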

Name remaining bare literals: dialTimeout, promPollInterval in
continuum.go; secp256k1KeySize, handshakeTimeout in protocol.go.

Update all production code and test files.
runtime.Caller(0) does not resolve in CI test binaries, causing
loadTestPreParams and loadPreParams to silently fall back to live
Paillier generation (~30s per node, exceeds test timeout).

Embed tss_examples/preparams.json via go:embed into preparams_test.go.
Both tss_test.go and rpc_tss_test.go now call testPreParams() which
fails hard on missing or corrupt fixture data.
Pick up SA1019 suppression, legacy build tags, coverage
tests, and golangci-lint v2.11.3 sync from the x repo.
The race detector adds ~10x overhead to goroutine scheduling.
With 100 nodes on a CI runner, maintain cycles fire before
handshakes complete, causing duplicate-identity rejections
and convergence timeout.  The test validates gossip scaling,
not concurrency correctness — the smaller mesh tests already
cover race safety.
Pick up KAT hash tests, commitment binding tests, lint fixes,
SA1019 suppression, and legacy build tags from the x repo.
The tss_examples sub-package existed to hold v2/v3 reference
implementations and pre-computed Paillier fixtures.  The v2
channel reference is dead code (v3 replaced it entirely) and
the v3 reference test is redundant with the x repo's own
example tests.

Move preparams.json to testdata/ (used by go:embed in
preparams_test.go).  Delete everything else: v3_reference_test.go,
v2_channel_reference/, README.  -2,912 lines.
Suppress G118 false positive in registerCeremony — cancel is stored
in CeremonyInfo and called on ceremony completion.  Eliminate G115
int-to-uint64 conversion in election shuffle by keeping remaining
as int.  Annotate safe test conversions with nolint:gosec.
Replace bytes.Equal with subtle.ConstantTimeCompare at four sites
where attacker-controlled input is compared against security-critical
values: signature identity verification, payload hash verification,
and both DNS identity checks.

Leave bytes.Equal for zero-sentinel checks (ZeroChallenge, zeroKey)
where the compared value is a public constant.
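
For reference, a minimal sketch of the comparison change, assuming the payload hash is a plain SHA-256 digest. `payloadHash` and `verifyPayloadHash` are hypothetical helpers, not the functions in protocol.go.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"fmt"
)

// payloadHash returns the SHA-256 digest of payload (assumed hash function).
func payloadHash(payload []byte) []byte {
	h := sha256.Sum256(payload)
	return h[:]
}

// verifyPayloadHash compares the recomputed digest with the peer-supplied
// one. ConstantTimeCompare's running time depends only on the length, not on
// where the first differing byte is; it returns 0 at once on length mismatch.
func verifyPayloadHash(payload, claimed []byte) bool {
	return subtle.ConstantTimeCompare(payloadHash(payload), claimed) == 1
}

func main() {
	p := []byte("envelope body")
	good := payloadHash(p)
	bad := append([]byte(nil), good...)
	bad[0] ^= 0x01
	fmt.Println(verifyPayloadHash(p, good), verifyPayloadHash(p, bad))
}
```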
HashTSSMessage: add "continuum-tss-msg-v1" domain separator and
4-byte length prefix before data.  Prevents cross-protocol
signature replay and ambiguous field boundaries.
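
A sketch of the hashing scheme, under stated assumptions: SHA-256 as the hash, a big-endian length prefix, and `data` as the only input. The real HashTSSMessage likely covers more fields; `hashTSSMessage` here is illustrative.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

const tssMsgDomain = "continuum-tss-msg-v1"

// hashTSSMessage hashes domain || len(data) || data. The 4-byte length is
// written big-endian, which is an assumption of this sketch.
func hashTSSMessage(data []byte) [sha256.Size]byte {
	h := sha256.New()
	h.Write([]byte(tssMsgDomain))
	var n [4]byte
	binary.BigEndian.PutUint32(n[:], uint32(len(data)))
	h.Write(n[:])
	h.Write(data)
	var out [sha256.Size]byte
	copy(out[:], h.Sum(nil))
	return out
}

func main() {
	sep := hashTSSMessage([]byte("round-1"))
	raw := sha256.Sum256([]byte("round-1"))
	fmt.Println("differs from raw hash:", sep != raw)
}
```

The domain string keeps a digest computed for this purpose from being valid in another protocol context, and the length prefix pins down where `data` ends if further fields are appended later.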

Transport.Close: zero encryptKey, decryptKey, and nonce key on
session teardown.  Nil the ephemeral private key.  Limits key
material exposure in swap files and core dumps.

Handshake challenge: add "continuum-challenge-v1" domain separator
to Hash256(challenge || ETP) on both signing and verification
sides.  Prevents cross-protocol challenge-response replay.

maintainConnections: replace math/rand/v2 Shuffle with crypto/rand
Fisher-Yates.  Remove math/rand/v2 import from production code.
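
The shuffle replacement can be sketched as a Fisher-Yates pass driven by crypto/rand. `secureShuffle` and the string-typed candidate list are assumptions for this example.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"math/big"
)

// secureShuffle permutes s in place with Fisher-Yates, drawing each index
// from crypto/rand. rand.Int returns a uniform value in [0, max).
func secureShuffle(s []string) error {
	for i := len(s) - 1; i > 0; i-- {
		n, err := rand.Int(rand.Reader, big.NewInt(int64(i+1)))
		if err != nil {
			return err
		}
		j := int(n.Int64())
		s[i], s[j] = s[j], s[i]
	}
	return nil
}

func main() {
	candidates := []string{"10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"}
	if err := secureShuffle(candidates); err != nil {
		panic(err)
	}
	fmt.Println(candidates)
}
```

Unlike math/rand/v2's Shuffle, this variant can fail, so the caller has to decide what to do when the system's randomness source returns an error.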
TestVerifyRejectsWrongIdentity — exercises subtle.ConstantTimeCompare
in Verify(), tests correct/wrong/bit-flipped identity paths.

TestHashTSSMessageDomainSeparation — known-answer test proving the
domain separator is present, verifies it differs from raw hash.

TestHashTSSMessageLengthPrefix — different data lengths produce
different hashes, determinism check.

TestTransportCloseZerosKeys — asserts encryptKey, decryptKey, and
nonce.key are zeroed after Close(), ephemeral private key is nil.

TestChallengeHashDomainSeparation — proves domain-separated challenge
hash differs from unseparated.

TestSealBoxOpenBoxRoundTrip — e2e encryption round trip, positive
path and wrong-sender-key rejection.

Fix TestConnKeyExchange: move clientTransport.Close() after key
assertions since Close() now zeros keys.

Strip internal document references from comments.
Wire-initiated ceremony requests (KeygenRequest, SignRequest,
ReshareRequest) are now only processed when built with the
continuum_debug tag.  Production binaries compile debug_off.go
which returns nil from serverDebugInit(); debug builds compile
debug_on.go which returns a debugInitiator.

Previously newDebugInitiator() was called unconditionally in
NewServer(), making the nil-checks in dispatch.go dead code.
Any peer could trigger a ceremony over the wire.

Add noopInitiator for production ceremonyLoop — blocks on nil
channel until blockchain watcher is wired in.

Tests wire up debug initiation explicitly in newTestServer().
Update hemilabs/x/tss-lib/v3 to 810b4757 which replaces
binance-chain/edwards25519 with standard elliptic.Curve
operations in eddsa/signing and adds pre-computed preparams
fixtures for faster CI.

binance-chain/edwards25519 removed from indirect deps.
Replace cleartext 3-byte size prefix with two-phase secretbox
framing.  Phase 1 is a fixed 44-byte encrypted header containing
the body size.  Phase 2 is the encrypted payload.

Wire format (v2):

  [24-byte nonce_h][secretbox(4-byte body_size)]  <- 44 bytes
  [24-byte nonce_p][secretbox(payload)]           <- body_size bytes

An attacker corrupting any byte of the header causes secretbox.Open
to fail.  The receiver never trusts an unauthenticated length.

TransportVersion bumped from 1 to 2.  TransportMaxSize reduced
from 16 MB to 1 MB (sufficient for 100-party TSS keygen).
Replace static sender NaCl key with per-message ephemeral X25519
keypair (sealed-box pattern).  Sender generates fresh keypair,
encrypts with nacl.box to the recipient's static X25519 key,
ships ephemeral public key in EncryptedPayload, destroys private
key immediately.

Sender authentication via secp256k1 compact signature over
SHA256("continuum-e2e-sig-v1" || EphemeralPub || Nonce ||
Ciphertext).  Receiver verifies signature against Sender identity
before opening the box.  Prevents forged payloads from anyone who
knows the recipient's gossip-advertised X25519 public key.

Provides sender-side forward secrecy: compromising a sender after
the fact cannot recover past ephemeral keys.

SealBox takes *Secret (signs envelope), OpenBox unchanged.
EncryptedPayload adds EphemeralPub and Signature fields.
decryptPayload verifies signature before decrypting.
Mix both parties' ephemeral public keys into the HKDF salt
in canonical order (server first, client second).

Salt: "continuum-hkdf-salt-v2" || serverPub || clientPub

Public keys are fixed-length per curve, no length delimiters
needed.  Validated against the curve's actual key size.
The caller provides them based on Transport.isServer.

Zero the ECDH shared secret after key derivation.  Go 1.26
runtime.ZeroMemory will provide a proper guarantee; for now
we zero the slice contents but cannot prevent GC copies.

Eliminates the static salt shared across all sessions.
Ephemeral ECDH already guarantees unique shared secrets but
session-specific salt prevents theoretical cross-session key
derivation collision.
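
A self-contained sketch of the session-specific salt, using only the standard library. Assumptions: X25519 via crypto/ecdh, HKDF-SHA256 written out by hand per RFC 5869 with a single expand block, one derived key, and a made-up `info` label. The real code derives separate encrypt and decrypt keys.

```go
package main

import (
	"bytes"
	"crypto/ecdh"
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"errors"
	"fmt"
)

const (
	saltTag       = "continuum-hkdf-salt-v2"
	x25519KeySize = 32
)

// sessionSalt builds tag || serverPub || clientPub. Both sides pass the keys
// in the same canonical order, whichever role they play.
func sessionSalt(serverPub, clientPub []byte) ([]byte, error) {
	if len(serverPub) != x25519KeySize || len(clientPub) != x25519KeySize {
		return nil, errors.New("unexpected public key size")
	}
	salt := make([]byte, 0, len(saltTag)+2*x25519KeySize)
	salt = append(salt, saltTag...)
	salt = append(salt, serverPub...)
	return append(salt, clientPub...), nil
}

// hkdf32 derives one 32-byte key: HKDF extract, then the first expand block.
func hkdf32(secret, salt, info []byte) []byte {
	ext := hmac.New(sha256.New, salt)
	ext.Write(secret)
	prk := ext.Sum(nil)
	exp := hmac.New(sha256.New, prk)
	exp.Write(info)
	exp.Write([]byte{0x01})
	return exp.Sum(nil)
}

// deriveKey runs ECDH, derives the key, and zeroes the shared secret.
func deriveKey(priv *ecdh.PrivateKey, peer *ecdh.PublicKey, serverPub, clientPub []byte) ([]byte, error) {
	salt, err := sessionSalt(serverPub, clientPub)
	if err != nil {
		return nil, err
	}
	shared, err := priv.ECDH(peer)
	if err != nil {
		return nil, err
	}
	key := hkdf32(shared, salt, []byte("demo-session-key")) // hypothetical info label
	for i := range shared {
		shared[i] = 0 // best effort; the runtime may hold other copies
	}
	return key, nil
}

// demo derives the key on both sides and reports whether they agree.
func demo() (bool, error) {
	srv, err := ecdh.X25519().GenerateKey(rand.Reader)
	if err != nil {
		return false, err
	}
	cli, err := ecdh.X25519().GenerateKey(rand.Reader)
	if err != nil {
		return false, err
	}
	sp, cp := srv.PublicKey().Bytes(), cli.PublicKey().Bytes()
	k1, err := deriveKey(srv, cli.PublicKey(), sp, cp)
	if err != nil {
		return false, err
	}
	k2, err := deriveKey(cli, srv.PublicKey(), sp, cp)
	if err != nil {
		return false, err
	}
	return bytes.Equal(k1, k2), nil
}

func main() {
	fmt.Println(demo())
}
```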
Add use-after-close guard: all store operations return
ErrStoreClosed after Close().  Previously Close() zeroed the
key and subsequent encrypts silently produced unrecoverable
ciphertext with no error.

Add keyID binding: encrypt prepends a length-prefixed keyID to
plaintext before sealing.  decrypt verifies the bound keyID
matches the expected keyID.  Prevents file-swap attacks where
an attacker with filesystem access renames key files.
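
The keyID binding can be sketched as below. The 2-byte big-endian length prefix and the helper names `bindKeyID` and `unbindKeyID` are assumptions; in the store these steps would run just before sealing and just after opening the secretbox.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

var errKeyIDMismatch = errors.New("bound keyID does not match")

// bindKeyID returns len(keyID) || keyID || plaintext, ready to be sealed.
func bindKeyID(keyID string, plaintext []byte) ([]byte, error) {
	if keyID == "" || len(keyID) > 0xffff {
		return nil, errors.New("invalid keyID length")
	}
	out := make([]byte, 2, 2+len(keyID)+len(plaintext))
	binary.BigEndian.PutUint16(out, uint16(len(keyID)))
	out = append(out, keyID...)
	return append(out, plaintext...), nil
}

// unbindKeyID checks the embedded keyID against the one the caller asked
// for and returns the remaining plaintext.
func unbindKeyID(expected string, bound []byte) ([]byte, error) {
	if len(bound) < 2 {
		return nil, errors.New("short plaintext")
	}
	n := int(binary.BigEndian.Uint16(bound))
	if len(bound) < 2+n {
		return nil, errors.New("truncated keyID")
	}
	if string(bound[2:2+n]) != expected {
		return nil, errKeyIDMismatch
	}
	return bound[2+n:], nil
}

func main() {
	bound, err := bindKeyID("wallet-1", []byte("share"))
	if err != nil {
		panic(err)
	}
	_, err = unbindKeyID("wallet-2", bound) // simulates a renamed file
	fmt.Println(err)
}
```

Because the keyID sits inside the authenticated ciphertext, renaming a file on disk changes what the loader expects but not what the ciphertext claims.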

Add atomic writes: writeAtomic uses temp file + fsync + rename.
A crash at any point leaves either the old or new file, never
a partial write.
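
A rough sketch of the temp file + fsync + rename sequence. The signature and temp-file naming are assumptions, and this version does not fsync the parent directory, which a stricter implementation might add.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeAtomic writes data to a temp file in the target directory, flushes it
// to disk, and renames it over path. On error the temp file is removed.
func writeAtomic(path string, data []byte, perm os.FileMode) (err error) {
	tmp, err := os.CreateTemp(filepath.Dir(path), ".tmp-*")
	if err != nil {
		return err
	}
	defer func() {
		if err != nil {
			tmp.Close() // harmless if already closed
			os.Remove(tmp.Name())
		}
	}()
	if err = tmp.Chmod(perm); err != nil {
		return err
	}
	if _, err = tmp.Write(data); err != nil {
		return err
	}
	if err = tmp.Sync(); err != nil {
		return err
	}
	if err = tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

// demo overwrites a file twice in a scratch directory and reads it back.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "atomic-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "share.key")
	if err := writeAtomic(path, []byte("old"), 0o600); err != nil {
		return "", err
	}
	if err := writeAtomic(path, []byte("new"), 0o600); err != nil {
		return "", err
	}
	got, err := os.ReadFile(path)
	return string(got), err
}

func main() {
	fmt.Println(demo())
}
```

Creating the temp file in the same directory keeps the rename on one filesystem, which is what makes the final step atomic on POSIX systems.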

Add ErrEmptyKeyID validation on all Save/Load/Delete paths.
Zero the encryption key copy after use in encrypt/decrypt.
Copy encKey under mutex to avoid holding the lock during
secretbox operations.
TSS transport falls back to SendEncrypted when no direct
session exists, enabling ceremony completion across sparse
meshes where committee members lack direct TCP sessions.

Link-state routing via gossip topology: PeerRecord carries
session adjacency, generation-gated BFS routing table
rebuilt lazily on topology changes. SendTo and forward use
route table with flood fallback when stale.
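
The generation-gated route table can be sketched as a BFS over the gossiped adjacency map. Peer IDs as strings, the `routeTable` type, and the `lookup` signature are assumptions for this example; a miss (`ok == false`) is where a caller would fall back to flooding.

```go
package main

import "fmt"

// buildRoutes runs BFS from self and records, for every reachable peer, the
// directly connected neighbor to send through.
func buildRoutes(adj map[string][]string, self string) map[string]string {
	next := make(map[string]string)
	visited := map[string]bool{self: true}
	queue := []string{self}
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		for _, nb := range adj[cur] {
			if visited[nb] {
				continue
			}
			visited[nb] = true
			if cur == self {
				next[nb] = nb // direct neighbor
			} else {
				next[nb] = next[cur] // inherit the first hop
			}
			queue = append(queue, nb)
		}
	}
	return next
}

// routeTable caches the BFS result until the topology generation changes.
type routeTable struct {
	builtGen uint64
	next     map[string]string
}

// lookup rebuilds lazily when gen differs from the cached generation.
func (rt *routeTable) lookup(adj map[string][]string, self string, gen uint64, dst string) (string, bool) {
	if rt.next == nil || rt.builtGen != gen {
		rt.next = buildRoutes(adj, self)
		rt.builtGen = gen
	}
	hop, ok := rt.next[dst]
	return hop, ok
}

func main() {
	// Chain topology: A - B - C - D.
	adj := map[string][]string{
		"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"},
	}
	rt := &routeTable{}
	hop, ok := rt.lookup(adj, "A", 1, "D")
	fmt.Println(hop, ok)
}
```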

Admin listener on dedicated port bypasses PeersWanted
capacity limits. No gossip, no ping lifecycle — ceremony
injection only. handle() takes isAdmin flag; admin sessions
skip gossip exchange and rate limiting.

Transport DDoS mitigations: per-session rate limiter drops
messages exceeding messageRate, read deadlines on all I/O,
reconnection cooldown for rejected peers.
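
One way to picture the per-session limiter is a token bucket. The algorithm, the burst parameter, and the names below are assumptions; the commit message only says that messages exceeding messageRate are dropped.

```go
package main

import (
	"fmt"
	"time"
)

// rateLimiter is a token bucket; rate stands in for messageRate.
type rateLimiter struct {
	rate   float64 // tokens added per second
	burst  float64 // bucket capacity
	tokens float64
	last   time.Time
}

func newRateLimiter(rate, burst float64, now time.Time) *rateLimiter {
	return &rateLimiter{rate: rate, burst: burst, tokens: burst, last: now}
}

// allow refills the bucket for the elapsed time and spends one token if
// available. A false result means the message should be dropped.
func (r *rateLimiter) allow(now time.Time) bool {
	if elapsed := now.Sub(r.last).Seconds(); elapsed > 0 {
		r.tokens += elapsed * r.rate
		if r.tokens > r.burst {
			r.tokens = r.burst
		}
		r.last = now
	}
	if r.tokens < 1 {
		return false
	}
	r.tokens--
	return true
}

// countAllowed offers n messages at one instant and n more a second later,
// returning how many pass in each batch.
func countAllowed(rate, burst float64, n int) (first, second int) {
	start := time.Unix(0, 0)
	rl := newRateLimiter(rate, burst, start)
	for i := 0; i < n; i++ {
		if rl.allow(start) {
			first++
		}
	}
	later := start.Add(time.Second)
	for i := 0; i < n; i++ {
		if rl.allow(later) {
			second++
		}
	}
	return first, second
}

func main() {
	fmt.Println(countAllowed(2, 5, 10))
}
```

Passing the clock in as an argument keeps the limiter deterministic under test, which fits the no-sleep testing approach described in the PR summary.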

notifyAllPeers no longer closes transports on write failure;
dead sessions are reaped by pingExpired instead.

PrivateKeyHex neutered in release builds; test code uses
DebugPrivateKeyHex (build-tagged continuum_debug).
Spins up 10 daemons with PeersWanted=3 (sparse mesh) in
chain topology, runs keygen, sign, reshare, post-reshare
sign, and second sign. Forces multi-hop encrypted envelope
delivery for TSS messages between non-adjacent committee
members. Build-tagged continuum_debug; uses admin listener
for ceremony injection.
Transport.encrypt() reads encryptKey and nonce.key without holding
t.mtx. Concurrent Close() zeroes those fields under the lock, causing
a data race detected by -race in TestRPCTSSKeygenCorruptPostSign.

Move lock acquisition in write() above the encrypt() call so the
entire encrypt+write sequence is synchronized with Close().

Reorder Close() to close the conn before zeroing key material so
that in-flight readers blocked in readExact unblock with an I/O
error before decrypt/decryptFrameHeader can touch zeroed keys.
Go function signatures must be on a single line. Godoc
requires it. Wrapped parameters are not idiomatic Go.

Flatten 9 functions across tss.go, tss_round.go,
rpc_tss_test.go, and continuum_e2e_test.go.
Unexport SendTo to sendTo — all TSS traffic uses SendEncrypted,
sendTo is internal delivery for already-encrypted envelopes.

Consolidate scattered const declarations into the main const block.

Use reflect.TypeFor instead of reflect.TypeOf((*X)(nil)) in dispatch
table and registration.

Use early-continue in forward and forwardBroadcast instead of
if/else on error.

Unwrap three if statements in handle() for readability.

Move spew.Sdump calls to Tracef to avoid evaluation when trace
logging is disabled.

Short-circuit isHostname behind DNS config check.

Invert preparams file logic for readability: try open first, fall
through to create on ErrNotExist. Use json.NewEncoder instead of
MarshalIndent to avoid buffering. Fix SetIndent prefix.

Simplify TTL cache initialization — direct field assignment.

Simplify initPaillierPrimes call.

Unwrap if in hemictl continuumStatus.

Remove resolved XXX in continuum_ceremony.go.
Convert all four e2e polling loops to ticker + t.Context() checks.
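
The polling pattern might look roughly like this. `waitFor` is a hypothetical helper, and the sketch uses `context.WithTimeout` in place of `t.Context()` so that it runs outside a test.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// waitFor polls cond on every tick until it returns true or ctx is done.
func waitFor(ctx context.Context, interval time.Duration, cond func() bool) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if cond() {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// demo runs one wait that succeeds and one that hits its deadline.
func demo() (succeeded, timedOut bool) {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	calls := 0
	err := waitFor(ctx, time.Millisecond, func() bool {
		calls++
		return calls >= 3
	})

	ctx2, cancel2 := context.WithTimeout(context.Background(), 20*time.Millisecond)
	defer cancel2()
	err2 := waitFor(ctx2, time.Millisecond, func() bool { return false })

	return err == nil, errors.Is(err2, context.DeadlineExceeded)
}

func main() {
	fmt.Println(demo())
}
```

Tying the loop to a context means a failed condition surfaces as a deadline error rather than a hung test.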

Fix e2e preparams path to use testdata/preparams.json.

Use reflect.TypeFor in dispatch test.

Merge TestDispatchMapCompleteness and signature test into single
test function.

Use json.NewEncoder in continuum_test.go preparams helper.
A TSSMessage must never carry a routing header (Destination != nil).
Legitimate cleartext TSS is one-hop only (Destination == nil, sent
via Write between direct peers).  Multi-hop TSS must be wrapped
in an EncryptedPayload.

A routed cleartext TSSMessage means the sending peer is either
buggy or actively leaking TSS round data to the mesh.  Both
intermediaries and destinations now reject it: the check runs in
the handle() loop before forwarding or dispatch, and the offending
peer is disconnected immediately (handle returns, triggering
session cleanup).
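
The rule above reduces to a small check in the receive loop. The types and `checkEnvelope` below are simplified stand-ins for illustration, not the actual protocol.go definitions.

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-ins for the wire types named in the commit message.
type tssMessage struct{ Data []byte }
type encryptedPayload struct{ Ciphertext []byte }

// envelope carries an optional routing header; nil Destination means one hop.
type envelope struct {
	Destination []byte
	Payload     interface{}
}

var errRoutedCleartextTSS = errors.New("routed cleartext TSS message")

// checkEnvelope rejects a cleartext TSS message that carries a routing
// header. The caller is expected to drop the session on this error.
func checkEnvelope(e envelope) error {
	if _, ok := e.Payload.(tssMessage); ok && e.Destination != nil {
		return errRoutedCleartextTSS
	}
	return nil
}

func main() {
	direct := envelope{Payload: tssMessage{Data: []byte("r1")}}
	routed := envelope{Destination: []byte{0x02}, Payload: tssMessage{Data: []byte("r1")}}
	wrapped := envelope{Destination: []byte{0x02}, Payload: encryptedPayload{}}
	fmt.Println(checkEnvelope(direct), checkEnvelope(routed), checkEnvelope(wrapped))
}
```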

Labels

area: docker
area: docs
area: hemictl
area: make
changelog: done
size: XXL
type: feature

4 participants