
The fedify bench command with its scenario and report schemas #791

Open
dahlia wants to merge 47 commits into fedify-dev:main from dahlia:feat/bench/cli-engine

Conversation

@dahlia
Member

@dahlia dahlia commented Jun 5, 2026

Resolves #783, the second of the five benchmarking steps tracked in #744. It adds the client half of fedify bench to @fedify/cli: a load generator that exercises a Fedify server the way the fediverse does.

Generic HTTP load tools (autocannon, wrk, k6) cannot sign an inbox delivery, build a realistic ActivityStreams payload, or read a target's queue depth, so against a federation server they measure the wrong thing. The server half landed earlier in #782, which added benchmarkMode to FederationOptions together with the cooperative /.well-known/fedify/bench/stats and …/trigger endpoints. This PR is what drives that target.

The command acts as a synthetic remote actor. It generates keys and serves its own actor and key documents over loopback, then discovers the recipient's inbox the way a real peer would. Every delivery is signed with the same @fedify/fedify signer a real sender uses, so the crypto cost lands in the measured latency. It drives the load, reads the target's server-side metrics from the stats endpoint, and renders one report model as text, JSON, or Markdown.

What it includes:

  • A scenario suite in YAML or JSON that declares the target, the actors to sign as, shared defaults, and a list of scenarios, each with an expect block of pass/fail thresholds that doubles as a CI gate.
  • Two runners, inbox (the signed end-to-end delivery benchmark) and webfinger. The format and the schema can express the other types from #744 (actor, object, fanout, collection, failure, mixed), but a scenario whose type has no runner yet is rejected with a clear message rather than silently skipped.
  • Open-loop (rate) and closed-loop (concurrency) load, with coordinated-omission correction so a stalled target shows up as latency instead of disappearing, plus constant or Poisson arrivals and an optional maxInFlight cap.
  • Three signing strategies kept off the send critical path, chosen per scenario: pipeline (background signers fill a bounded buffer), jit, and presign.
  • Supporting machinery: target safety gating, recipient discovery, the synthetic actor/key server, and an in-house log-linear latency histogram. The text, JSON, and Markdown outputs all derive from one report model, so they cannot drift apart.

Scenario format and JSON schema

The schema is dual-maintained. A frozen TypeScript literal embedded in the CLI is what the runtime validates against, using @cfworker/json-schema (pure JavaScript, so it survives deno compile); the committed schema/bench/scenario-v1.json and schema/bench/report-v1.json are the published copies. A test guard keeps the embedded and published forms byte-identical and refuses any edit to an already-published version, so a -v1 URL never changes meaning. The # yaml-language-server: line in a suite gives editors autocomplete and validation against the published URL.

Safety

A run proceeds without friction against a loopback or private target, or any target that advertises benchmark mode. A public target that does not advertise it is refused unless you pass --allow-unsafe-target; the flag is mandatory and there is no interactive prompt, so the behavior is the same in CI. The gate classifies the actual load destination, not only the declared target, so a loopback target paired with a public recipient (or an explicit public inbox:) cannot route load to production behind the gate's back. For the same reason, benchmark traffic does not follow redirects. Signed scenarios additionally need the synthetic actor server to be reachable from the target: a loopback target reaches it automatically, and a non-loopback target requires --advertise-host.

Schema hosting

The schemas live at https://json-schema.fedify.dev/. This PR adds the static assets under schema/: the two JSON files, an index.html landing page, a contributor README.md, and _headers with netlify.toml that set CORS, long-lived immutable caching, and the application/schema+json content type. The hosting itself is configured on Netlify out of band; CI does not upload anything.

Testing and documentation

The benchmark test suite runs under both Node and Deno (about 240 tests), including an end-to-end inbox benchmark against a real benchmarkMode server that verifies the signatures, so the signed delivery path is run rather than mocked. docs/manual/benchmarking.md gains a client section covering the suite format, the actor and signing model, the output formats, and safety; CHANGES.md has an entry under version 2.3.0.

dahlia added 26 commits June 5, 2026 15:38
Add the skeleton for a new `fedify bench` subcommand in @fedify/cli that
will run ActivityPub-specific load benchmarks against a cooperative
Fedify target running in benchmark mode.

This first step wires the command into the CLI without the engine:

 -  Define the Optique `benchCommand` with the suite-file argument and the
    --target, --format, --output, --dry-run, and --allow-unsafe-target
    options, plus a stub `runBench` that is fleshed out in later steps.
 -  Register the command in the runner and dispatcher, and add a `bench`
    section to the configuration schema.
 -  Add the `@cfworker/json-schema` (draft 2020-12 validator) and `yaml`
    dependencies used by the scenario format, to both deno.json and
    package.json.
 -  Cover argument parsing with tests.

fedify-dev#783
fedify-dev#744

Assisted-by: Claude Code:claude-opus-4-8
Assisted-by: Codex:gpt-5.5
Add a lightweight HdrHistogram-style log-linear histogram used by the
benchmark engine to record latency samples and compute percentiles with
bounded relative error.  Values are bucketed by octave and split into
linear sub-buckets, so the relative error stays roughly constant across
the whole range.  The structure is sparse, mergeable, and serializable,
which lets percentiles from several runs be re-aggregated without
coordinated-omission error and lets the report carry an optional
serialized histogram.

Sub-bucket indices are derived from the mantissa ratio to avoid denormal
underflow, and non-positive samples (including -0) are normalized to the
zero bucket.
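
The octave-plus-linear-sub-bucket layout described above can be illustrated with a minimal sketch. This is not the PR's histogram: the `SUB_BUCKETS` value, the `SketchHistogram` name, and the choice to report a bucket's upper bound are assumptions, and it uses `Math.log2` rather than the mantissa-ratio derivation the commit mentions.

```typescript
// Hypothetical sketch of a log-linear histogram; names and constants are assumptions.
const SUB_BUCKETS = 32; // linear sub-buckets per octave (assumed)

// Maps a sample to a sortable bucket index; non-positive samples share one zero bucket.
function bucketIndex(value: number): number {
  if (!(value > 0)) return Number.NEGATIVE_INFINITY; // covers 0, -0, negatives, NaN
  let octave = Math.floor(Math.log2(value));
  // Guard against rounding in log2 at exact powers of two.
  if (2 ** octave > value) octave--;
  else if (2 ** (octave + 1) <= value) octave++;
  const ratio = value / 2 ** octave; // in [1, 2)
  const sub = Math.min(SUB_BUCKETS - 1, Math.floor((ratio - 1) * SUB_BUCKETS));
  return octave * SUB_BUCKETS + sub;
}

// Upper edge of a bucket, used as the reported percentile value.
function bucketUpperBound(index: number): number {
  if (index === Number.NEGATIVE_INFINITY) return 0;
  const octave = Math.floor(index / SUB_BUCKETS);
  const sub = index - octave * SUB_BUCKETS;
  return 2 ** octave * (1 + (sub + 1) / SUB_BUCKETS);
}

class SketchHistogram {
  readonly counts = new Map<number, number>(); // sparse: only touched buckets exist
  total = 0;

  record(value: number): void {
    const i = bucketIndex(value);
    this.counts.set(i, (this.counts.get(i) ?? 0) + 1);
    this.total++;
  }

  // Merging is bucket-wise addition, so runs can be re-aggregated.
  merge(other: SketchHistogram): void {
    for (const [i, n] of other.counts) {
      this.counts.set(i, (this.counts.get(i) ?? 0) + n);
    }
    this.total += other.total;
  }

  percentile(p: number): number {
    if (this.total === 0) return NaN;
    const rank = Math.max(1, Math.ceil((p / 100) * this.total));
    let seen = 0;
    for (const i of [...this.counts.keys()].sort((a, b) => a - b)) {
      seen += this.counts.get(i) ?? 0;
      if (seen >= rank) return bucketUpperBound(i);
    }
    return NaN;
  }
}
```

With 32 sub-buckets per octave, a reported percentile overstates the true value by at most one sub-bucket width, roughly 3% of the value, regardless of magnitude.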

fedify-dev#783

Assisted-by: Claude Code:claude-opus-4-8
Assisted-by: Codex:gpt-5.5
Add the small, pure building blocks the scenario format is built on:

 -  `asList()`: scalar-or-list coercion, so fields such as `recipient`,
    `seed`, `collection`, and `type` can accept either a single value or a
    list while the common single-value case stays terse.
 -  `parseSize()` / `resolveGenerate()`: typed payload-generation
    directives (e.g. `content: { generate: lorem, size: 2KB }`) that
    produce deterministic output of an exact byte size, with the size
    parser bounded to the safe-integer range.
 -  A logic-less GitHub-Actions-style `${{ ... }}` template engine
    (dotted-path resolution plus whitelisted helper calls).  Lookups go
    through own properties only, with a denylist for prototype members,
    and unclosed delimiters, trailing text, and unbalanced quotes are
    rejected rather than silently mishandled, so the format cannot turn
    into a programming language.
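
As an illustration of the logic-less template idea, the following sketch resolves dotted paths only. It is an assumption-laden reduction of what the commit describes: the `render` and `resolvePath` names are hypothetical, helper calls and quote handling are left out, and the denylist contents are a guess.

```typescript
// Hypothetical sketch: dotted-path templates with own-property lookups only.
const DENIED_KEYS = new Set(["__proto__", "prototype", "constructor"]); // assumed list

function resolvePath(context: Record<string, unknown>, path: string): unknown {
  let current: unknown = context;
  for (const key of path.split(".")) {
    if (DENIED_KEYS.has(key)) throw new Error(`Forbidden key: ${key}`);
    if (
      current === null ||
      typeof current !== "object" ||
      !Object.prototype.hasOwnProperty.call(current, key)
    ) {
      throw new Error(`Unknown path: ${path}`);
    }
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}

function render(template: string, context: Record<string, unknown>): string {
  let out = "";
  let pos = 0;
  while (true) {
    const open = template.indexOf("${{", pos);
    if (open < 0) return out + template.slice(pos);
    const close = template.indexOf("}}", open + 3);
    if (close < 0) throw new Error("Unclosed ${{ delimiter");
    const expr = template.slice(open + 3, close).trim();
    // Only identifiers joined by dots; anything else is rejected, not evaluated.
    if (!/^[A-Za-z_]\w*(\.[A-Za-z_]\w*)*$/.test(expr)) {
      throw new Error(`Unsupported expression: ${expr}`);
    }
    out += template.slice(pos, open) + String(resolvePath(context, expr));
    pos = close + 2;
  }
}
```

Because lookups never walk the prototype chain and expressions are matched against a fixed grammar, a template can read values from the context but cannot call methods or reach built-ins.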

fedify-dev#783

Assisted-by: Claude Code:claude-opus-4-8
Assisted-by: Codex:gpt-5.5
Define the `fedify bench` scenario suite format and its published
JSON Schema (draft 2020-12).  The format is a suite of `version`,
`target`, `defaults`, `actors`, and `scenarios`, with an `expect` block
per scenario, and it can express every scenario type discussed for the
tool (inbox, webfinger, actor, object, fanout, collection, failure,
mixed) even though only inbox and webfinger will have runners.

Rather than a schema-first single source, the published JSON Schema and
the TypeScript types are maintained as two artifacts kept identical by a
drift guard.  Runtime validation uses `@cfworker/json-schema`, and a
validated value is narrowed with an `as unknown as` cast.  Three
cross-field rules live in the schema where an editor can flag them:

 -  exactly one HTTP request signature scheme per actor group
    (`contains` + `minContains`/`maxContains`);
 -  `rate` XOR `concurrency` in a load block (`oneOf`);
 -  the allowed `expect` metrics per scenario type (`if`/`then` +
    `propertyNames`).

The embedded schema object is the editing source; *schema/bench/*
holds the hosted copy, regenerated by *scripts/generate-bench-schema.ts*.
Four guards run as tests: structural/meta validation, example-fixture
validation (valid and invalid fixtures covering every scenario type),
drift between the embedded object and the published file, and git-based
immutability of already-published version files.

fedify-dev#783

Assisted-by: Claude Code:claude-opus-4-8
Assisted-by: Codex:gpt-5.5
Add the normalization step that turns a schema-validated suite into the
resolved form the engine runs:

 -  `parseDuration()` and `parseRate()` parse the human-friendly duration
    (`30s`) and rate (`200/s`) units into milliseconds and requests per
    second, rejecting non-positive and overflowing magnitudes.
 -  `normalizeSuite()` applies suite defaults, coerces the top-level
    scalar-or-list fields to arrays, resolves the target (with a
    `--target` override), and determines the open- or closed-loop load
    model, inheriting compatible fields such as `arrival` and
    `maxInFlight` from the defaults while a scenario's `rate`/
    `concurrency` selects the model.

It also enforces the one cross-field rule the JSON Schema cannot express:
the buffered signing modes (`pipeline`, `presign`) pre-sign requests, so
they require the target's signature time window to be off; a
time-windowed target must use `signing: jit`.
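
The unit parsing described above might look roughly like the sketch below. The function names follow the commit, but the bodies are assumptions: in particular, the set of accepted duration units (`ms`, `s`, `m`, `h`) is a guess, since the commit only shows `30s` and `200/s`.

```typescript
// Hypothetical sketch; accepted units and error messages are assumptions.
const DURATION_UNITS_MS: Record<string, number> = {
  ms: 1,
  s: 1000,
  m: 60000,
  h: 3600000,
};

// "30s" becomes 30000 (milliseconds).
function parseDuration(text: string): number {
  const match = /^(\d+(?:\.\d+)?)(ms|s|m|h)$/.exec(text.trim());
  if (match === null) throw new Error(`Invalid duration: ${text}`);
  const ms = Number(match[1]) * DURATION_UNITS_MS[match[2]];
  // Reject zero and magnitudes outside the safe-integer range.
  if (!(ms > 0) || !Number.isSafeInteger(Math.ceil(ms))) {
    throw new Error(`Duration out of range: ${text}`);
  }
  return ms;
}

// "200/s" becomes 200 (requests per second).
function parseRate(text: string): number {
  const match = /^(\d+(?:\.\d+)?)\/s$/.exec(text.trim());
  if (match === null) throw new Error(`Invalid rate: ${text}`);
  const perSecond = Number(match[1]);
  if (!(perSecond > 0) || !Number.isFinite(perSecond)) {
    throw new Error(`Rate out of range: ${text}`);
  }
  return perSecond;
}
```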

fedify-dev#783

Assisted-by: Claude Code:claude-opus-4-8
Assisted-by: Codex:gpt-5.5
Define the canonical benchmark report: the single result model from
which the terminal, JSON, and Markdown renderers all derive, so the
outputs can never drift apart.  JSON is the canonical machine form,
pinned by a published draft-2020-12 schema (schema/bench/report-v1.json).

The model splits `client` and `server` numbers by nesting so it is clear
which the load generator measured and which came from the target's stats
endpoint, bakes the unit into numeric keys (latencyMs, drainMs), turns
each expect assertion into an evaluated record, and carries first-class
environment/target/configHash reproducibility metadata plus an optional
serialized histogram.

The report schema is registered alongside the scenario schema, so the
existing structural, fixture, drift, and immutability guards now cover it
too; a valid report fixture is added.

fedify-dev#783

Assisted-by: Claude Code:claude-opus-4-8
Assisted-by: Codex:gpt-5.5
Turn each scenario's `expect` block into evaluated records that gate a
run.  `parseAssertion()` parses a human assertion (">= 99%", "< 100ms",
"< 2s", ">= 500/s", "== 0") into an operator and a machine-clean
threshold: percentages become ratios, durations milliseconds, rates per
second.  `evaluateExpect()` looks each metric up by name (successRate,
throughputPerSec, errors.4xx/5xx/total, latency.*, signatureVerification.*,
queueDrain.*), checks the assertion's unit is compatible with the
metric's natural unit, and compares.  Equality is tolerant for float
metrics but exact for counts.  A `fail`-severity assertion gates the
build while `warn` only annotates, and a missing or unmeasured metric
fails cleanly.
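
A rough sketch of the assertion parsing described above follows. The `parseAssertion` name comes from the commit, but the types, the `holds` helper, the unit labels, and the float tolerance are assumptions made for illustration.

```typescript
// Hypothetical sketch; types, unit labels, and tolerance are assumptions.
type Operator = "<" | "<=" | ">" | ">=" | "==";
type Unit = "ratio" | "ms" | "perSec" | "count";

interface Assertion {
  operator: Operator;
  threshold: number; // normalized: ratio, milliseconds, or per second
  unit: Unit;
}

function parseAssertion(text: string): Assertion {
  const match = /^(<=|>=|==|<|>)\s*(\d+(?:\.\d+)?)\s*(%|ms|s|\/s)?$/.exec(text.trim());
  if (match === null) throw new Error(`Cannot parse assertion: ${text}`);
  const operator = match[1] as Operator;
  const value = Number(match[2]);
  switch (match[3]) {
    case "%":
      return { operator, threshold: value / 100, unit: "ratio" };
    case "ms":
      return { operator, threshold: value, unit: "ms" };
    case "s":
      return { operator, threshold: value * 1000, unit: "ms" };
    case "/s":
      return { operator, threshold: value, unit: "perSec" };
    default:
      return { operator, threshold: value, unit: "count" };
  }
}

// Equality is exact for counts and tolerant for everything else.
function holds(assertion: Assertion, actual: number): boolean {
  const { operator, threshold, unit } = assertion;
  switch (operator) {
    case "<":
      return actual < threshold;
    case "<=":
      return actual <= threshold;
    case ">":
      return actual > threshold;
    case ">=":
      return actual >= threshold;
    case "==":
      return unit === "count"
        ? actual === threshold
        : Math.abs(actual - threshold) <= 1e-9;
  }
}
```

Normalizing at parse time means the comparison never has to know whether the author wrote `2s` or `2000ms`.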

fedify-dev#783

Assisted-by: Claude Code:claude-opus-4-8
Assisted-by: Codex:gpt-5.5
Assemble the canonical report from measured scenario data and render it
in three forms from that single model:

 -  `buildScenarioResult()`/`buildReport()` turn resolved scenarios and
    their measurements into the report, evaluating each `expect` block,
    summarizing the load model, and computing the overall gate.
 -  `detectEnvironment()` and `configHash()` capture the reproducibility
    metadata (runtime, OS, CPU count, and a stable sha256 over the
    canonicalized configuration, honoring `toJSON()` so URLs hash by
    value).
 -  The JSON renderer is the canonical machine form (pinned by the
    report schema); the terminal-text and Markdown renderers derive from
    the same model.  A shared metric-unit registry keeps the evaluator
    and the renderers in agreement, so measured values display in the
    metric's own unit while an explicit assertion unit stays visible.

fedify-dev#783

Assisted-by: Claude Code:claude-opus-4-8
Assisted-by: Codex:gpt-5.5
Add the client-side safety guard and the discovery that finds where to
deliver:

 -  `classifyTarget()` sorts a target into loopback/private/public from
    its host (IP-literal aware, IPv4-mapped IPv6 decoded), conservatively
    treating anything it cannot confirm as public.
 -  `assertTargetAllowed()` lets loopback/private targets and any target
    advertising benchmark mode run without friction, and refuses only a
    public target that does not advertise benchmark mode unless
    --allow-unsafe-target is given (mandatory, with no interactive
    prompt); --dry-run bypasses the gate since it only inspects.
 -  `probeBenchmarkMode()` reads the cooperative `stats` endpoint to
    detect benchmark mode and the target's Fedify version, never throwing.
 -  `discoverInbox()` resolves a handle or actor URI to its personal and
    shared inbox the way a remote peer would, building
    private-address-allowing loaders for loopback targets, and
    `selectInbox()` picks the inbox for the scenario's mode.
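
The host classification in the first bullet can be sketched as below. This is a simplified assumption, not the PR's `classifyTarget()`: the `classifyHost` name is hypothetical, only a few well-known ranges are recognized, and anything unrecognized falls through to `public`, mirroring the conservative default described above.

```typescript
// Hypothetical sketch; range coverage is deliberately incomplete.
type Tier = "loopback" | "private" | "public";

function classifyIPv4(octets: number[]): Tier {
  const [a, b] = octets;
  if (a === 127) return "loopback"; // 127.0.0.0/8
  if (a === 10) return "private"; // 10.0.0.0/8
  if (a === 172 && b >= 16 && b <= 31) return "private"; // 172.16.0.0/12
  if (a === 192 && b === 168) return "private"; // 192.168.0.0/16
  return "public";
}

function classifyHost(hostname: string): Tier {
  // URL hostnames carry brackets around IPv6 literals; strip them first.
  const host = hostname.toLowerCase().replace(/^\[|\]$/g, "");
  if (host === "localhost" || host.endsWith(".localhost") || host === "::1") {
    return "loopback";
  }
  const v4 = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host);
  if (v4 !== null) {
    const octets = v4.slice(1).map(Number);
    return octets.every((n) => n <= 255) ? classifyIPv4(octets) : "public";
  }
  // IPv4-mapped IPv6 in the hex form URL parsers emit, e.g. ::ffff:7f00:1.
  const mapped = /^::ffff:([0-9a-f]{1,4}):([0-9a-f]{1,4})$/.exec(host);
  if (mapped !== null) {
    const hi = parseInt(mapped[1], 16);
    const lo = parseInt(mapped[2], 16);
    return classifyIPv4([hi >> 8, hi & 0xff, lo >> 8, lo & 0xff]);
  }
  if (/^f[cd][0-9a-f]{2}:/.test(host)) return "private"; // fc00::/7 unique local
  return "public"; // anything unconfirmed is treated as public
}
```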

fedify-dev#783

Assisted-by: Claude Code:claude-opus-4-8
Assisted-by: Codex:gpt-5.5
Stand up the benchmark's own synthetic remote peer.  An author picks
signature standards and the key set is derived: HTTP request signatures
and LD Signatures share one RSA pair, FEP-8b32 uses an Ed25519 pair.
`buildFleet()` expands the actor groups into members with generated keys,
and `spawnSyntheticServer()` serves each member as a normal ActivityPub
actor document with an embedded `publicKey` and `assertionMethod` over
plain loopback HTTP.

The target dereferences a signature's keyId during verification, so
serving exactly the document a real actor exposes lets verification
resolve the key the same way; a fixed actor set keeps this on a cold path
a warm-up window excludes.  A test confirms the served document parses
back into a verifiable actor whose keys resolve.

fedify-dev#783

Assisted-by: Claude Code:claude-opus-4-8
Assisted-by: Codex:gpt-5.5
Sign inbox deliveries reusing the @fedify/fedify signers so the client
pays realistic crypto cost.  `signInboxDelivery()` applies the FEP-8b32
object proof and the LD Signature to the document, then the HTTP request
signature (cavage or rfc9421) to the final body.
`createActivityIdMinter()` mints a unique activity id per request,
satisfying Fedify's always-on inbox idempotency automatically.

`createSigningPipeline()` keeps RSA signing off the send critical path
with three lookahead modes: `jit`, `pipeline` (default; background
signers keep a bounded buffer filled and buffer starvation surfaces the
client as the bottleneck), and `presign`.  The pipeline cannot hang on a
stuck factory, drops transient sign failures, and fails fast on
deterministic ones.  Tests verify the produced cavage and rfc9421
requests pass Fedify's own verifyRequest against synthetic-server keys.

fedify-dev#783

Assisted-by: Claude Code:claude-opus-4-8
Assisted-by: Codex:gpt-5.5
Drive load and turn the raw samples into client-side metrics.
`runLoad()` supports open-loop (a fixed arrival schedule, with latency
measured from each request's scheduled time — the coordinated-omission
correction — so a stalled target or maxInFlight backpressure shows up as
latency rather than being omitted) and closed-loop (N virtual users).
A fair slot-transferring semaphore enforces `maxInFlight` in both models
and reports backpressure as the saturation signal; arrivals are a lazy
generator (constant or seeded Poisson) and only in-flight dispatches are
retained, so memory stays flat on long runs.
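
To make the coordinated-omission correction concrete, here is a small deterministic simulation rather than a real load generator. Everything in it is an assumption for illustration: `simulateOpenLoop` is a hypothetical name, time is simulated with plain numbers, and the sender is serial (equivalent to a `maxInFlight` of 1).

```typescript
// Hypothetical simulation: compares latency measured from the actual send time
// with latency measured from the scheduled time (the corrected form).
interface SimulatedLatencies {
  naive: number[]; // completion minus actual send time
  corrected: number[]; // completion minus scheduled time
}

function simulateOpenLoop(intervalMs: number, serviceMs: number[]): SimulatedLatencies {
  const naive: number[] = [];
  const corrected: number[] = [];
  let freeAt = 0; // when the serial sender can send again
  serviceMs.forEach((service, i) => {
    const scheduledAt = i * intervalMs; // fixed arrival schedule
    const sentAt = Math.max(scheduledAt, freeAt); // delayed by backpressure
    const doneAt = sentAt + service;
    naive.push(doneAt - sentAt);
    corrected.push(doneAt - scheduledAt);
    freeAt = doneAt;
  });
  return { naive, corrected };
}

// One 100 ms stall in an otherwise 1 ms service, scheduled every 10 ms.
const result = simulateOpenLoop(10, [1, 100, 1, 1]);
console.log(result.naive); // [1, 100, 1, 1]
console.log(result.corrected); // [1, 100, 91, 82]
```

The naive series records a single slow request, while the corrected series shows that the two requests scheduled behind the stall also waited, which is the delay a real sender on that schedule would have experienced.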

`aggregateSamples()` excludes warm-up samples and produces request
counts, success rate, throughput over the measured window, latency
percentiles from the log-linear histogram, and errors grouped by kind,
status, and reason.

fedify-dev#783

Assisted-by: Claude Code:claude-opus-4-8
Assisted-by: Codex:gpt-5.5
Wire the engine into runnable scenarios.  The stats client reads the
cooperative `stats` endpoint and projects the signature-verification
histogram and queue depth into the report's server section, robust to
malformed snapshots.  The inbox runner discovers the recipient inbox,
builds a signing factory over the synthetic fleet, drives the signing
pipeline and load generator, aggregates the client metrics, and attaches
the server metrics; the webfinger runner drives handle-resolution
lookups.  A registry dispatches by type and reports a clear error for the
scenario types that the format expresses but this version does not run.

`presign` signing now requires an open-loop load (a closed-loop run has
no fixed request count to pre-sign).  An end-to-end test stands up a real
`benchmarkMode` Fedify federation and confirms signed inbox deliveries
verify, the inbox listener runs, and server-side signature-verification
metrics are read back.

fedify-dev#783

Assisted-by: Claude Code:claude-opus-4-8
Assisted-by: Codex:gpt-5.5
Implement `runBench`: load, validate, and normalize the suite (any
configuration error logs a friendly message and exits 2), preflight the
scenario runners so an unsupported type fails fast, classify and probe
the target, and apply the safety gate.  A `--dry-run` prints the plan and
sends nothing.  For a real run it builds the synthetic actor server once
when a signed scenario needs it, runs each scenario, assembles the
report, renders it to the chosen format (stdout or a file), and sets the
exit code to 0 when the gate passes and 1 otherwise.

The default exit sets `process.exitCode` so cleanup and output flushing
finish first.  Signed scenarios are refused against a public target,
since the synthetic actor server is only reachable on the client's
loopback.  Dependencies are injectable, and tests cover the passing and
failing gates, dry run, the unsafe-target and public-signed refusals, and
an invalid suite.

fedify-dev#783

Assisted-by: Claude Code:claude-opus-4-8
Assisted-by: Codex:gpt-5.5
Wire the logic-less `${{ ... }}` template engine into the load pipeline:
`renderSuiteTemplates()` expands templates in a parsed suite with a
context exposing the target (host, hostname, port, origin, href,
protocol) plus the default helpers, and `runBench` runs it between
loading and validation.  This is what makes `recipient:
"http://${{ target.host }}/users/alice"` resolve to a concrete URL.

The target comes from `--target` or the suite's own `target`, neither of
which is templated.  Tests cover rendering and the end-to-end inbox run
now uses templating.

fedify-dev#783

Assisted-by: Claude Code:claude-opus-4-8
Assisted-by: Codex:gpt-5.5
Extend the benchmarking manual with the client side: a getting-started
scenario suite, the actors and signature-standards model, `${{ }}`
templating, open- and closed-loop load with the signing modes, the
output formats and CI usage, the safety gate, and the http/loopback
caveats.  Add the @fedify/cli changelog entry for the new command.

fedify-dev#783
fedify-dev#744

Assisted-by: Claude Code:claude-opus-4-8
Assisted-by: Codex:gpt-5.5
Address four behavioral gaps where the bench engine silently accepted
options it did not actually apply:

 -  Reject `runs` greater than 1 during normalization.  Repeated runs are
    not implemented yet, so accepting the field gave a single run while
    implying several.

 -  Fail a scenario that measured zero requests instead of letting every
    `expect` assertion pass vacuously, and reject a `warmup` that is not
    shorter than the `duration` (which would leave no measured window).

 -  Reject inbox `activity` options the runner cannot honor.  The runner
    always delivers a `Create` carrying an embedded `Note`, so a
    non-`Create` activity type, a non-`Note` `object.type`, or
    `embedObject: false` is now refused up front through a new optional
    `validate()` on the runner, called during preflight.  Scalar-or-list
    type fields are checked in full, not just their first element.

 -  Implement multi-recipient delivery in the inbox runner: every
    recipient's inbox is discovered, and deliveries (with the synthetic
    actors that sign them) are rotated across the recipients, modeling a
    server receiving from many peers into many local inboxes.

The scenario format and JSON Schema still express these options; only the
inbox/webfinger runners constrain what they execute in this version.

fedify-dev#783
fedify-dev#744

Assisted-by: Claude Code:claude-opus-4-8
Assisted-by: Codex:gpt-5.5
A malformed `expect` assertion was only parsed while evaluating
results, which happens after the entire benchmark load has been sent.
Worse, the run loop has no catch around result building, so the
resulting AssertionParseError escaped uncaught and crashed the command
instead of failing as a configuration error.

Add validateExpectBlock(), which parses every assertion in a scenario's
`expect` block, and run it in the preflight step (alongside runner
validation) before any probe or load.  A typo in a CI gate now exits 2
without sending traffic, with a message naming the offending metric.

fedify-dev#783

Assisted-by: Claude Code:claude-opus-4-8
Assisted-by: Codex:gpt-5.5
The cooperative `stats` endpoint is cumulative and has no reset, but the
inbox and webfinger runners read it once at the end, so the reported
server numbers (and any signatureVerification.* expectations) folded in
warm-up traffic and every earlier scenario in the suite.  Client samples
were already windowed; the server side was not, so the two disagreed.

Take a server snapshot at the measured-window boundary and diff it
against the end snapshot:

 -  stats-client.ts gains a raw `ServerSnapshot` (signature histogram and
    queue-depth gauge), `parseServerSnapshot`, `diffSnapshots` (subtracts
    bucket counts; the gauge is not cumulative, so the end value is kept),
    and `snapshotToMetrics`.  `fetchServerSnapshot` returns `null` only on
    transport or parse failure; an available-but-empty snapshot is
    non-null, so an unavailable baseline is never mistaken for an empty
    one.  Histogram subtraction requires identical bucket boundaries, and
    refuses (yields no signature metric) otherwise.

 -  runner.ts gains `withMeasuredWindowStart`, which gates every measured
    send on a one-shot boundary callback so the baseline is captured
    before any measured request reaches the target.

 -  The inbox and webfinger runners snapshot the baseline at the boundary
    and report server metrics only when both ends of the window were
    captured, instead of falling back to the cumulative snapshot.
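
The histogram subtraction in the first bullet could be sketched as follows. The `HistogramSnapshot` shape and the `diffHistogram` name are hypothetical, and refusing on a negative difference (for example after a target restart) is an added assumption the commit does not mention.

```typescript
// Hypothetical sketch; the snapshot shape is an assumption.
interface HistogramSnapshot {
  boundaries: number[]; // bucket upper bounds
  counts: number[]; // cumulative-since-start count per bucket
}

// Returns the counts accumulated between two snapshots, or null when the
// snapshots cannot be compared.
function diffHistogram(
  start: HistogramSnapshot,
  end: HistogramSnapshot,
): HistogramSnapshot | null {
  const sameBoundaries =
    start.boundaries.length === end.boundaries.length &&
    start.boundaries.every((b, i) => b === end.boundaries[i]);
  if (!sameBoundaries || start.counts.length !== end.counts.length) return null;
  const counts = end.counts.map((c, i) => c - start.counts[i]);
  // Assumption: a negative delta means the counters were reset, so refuse.
  if (counts.some((c) => c < 0)) return null;
  return { boundaries: [...end.boundaries], counts };
}
```

Returning `null` instead of a best-effort result keeps the caller from reporting a signature metric that mixes warm-up traffic into the measured window.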

A few warm-up requests still in flight at the boundary may be attributed
to the window; a hard drain would distort the coordinated-omission client
latency, so that bounded residue is accepted and documented.

fedify-dev#783

Assisted-by: Claude Code:claude-opus-4-8
Assisted-by: Codex:gpt-5.5
The scenario schema's `load` object required exactly one of `rate` or
`concurrency`, so a block that set only `arrival` or `maxInFlight` and
inherited its load model was rejected before normalization, even though
`resolveLoad()` already supports such partial overrides (inheriting the
model, or falling back to the default open-loop rate).

Relax the constraint to forbid only `rate` and `concurrency` together,
allowing either or neither.  This lets a suite write, for example,
`defaults: { load: { maxInFlight: 100 } }` or override just `arrival` on
one scenario.  The embedded schema literal and the published
schema/bench/scenario-v1.json are regenerated together (the v1 file is
new on this branch, so it is not yet immutable).

fedify-dev#783

Assisted-by: Claude Code:claude-opus-4-8
Assisted-by: Codex:gpt-5.5
The synthetic actor/key server bound loopback and advertised
`127.0.0.1` actor and key IDs, which the target dereferences to verify
HTTP signatures.  A same-machine (loopback) target reaches it, but a
non-loopback target dereferences its own `127.0.0.1`, fails key lookup,
and rejects every signed delivery.  The command nonetheless allowed
signed scenarios against private targets, so they failed silently.

Add a `--advertise-host` option.  When set, the synthetic server binds
every interface (`0.0.0.0`, or `::` for an IPv6 host) and advertises the
given host in its actor, key, and base URLs, so a non-loopback target
can dereference them.  `resolveAdvertiseHost()` validates the value as a
bare host name, IPv4 address, or IPv6 literal (bracketing IPv6 for the
URL authority and binding the matching family), rejecting a scheme,
port, path, or other URL syntax with a clear configuration error.

Signed scenarios are now refused (exit 2) when the target is
non-loopback and no `--advertise-host` is given, instead of running and
failing on the target.  The documentation is updated accordingly.

fedify-dev#783

Assisted-by: Claude Code:claude-opus-4-8
Assisted-by: Codex:gpt-5.5
The `--user-agent` value was passed only to the document loader, so the
benchmark's main requests — the runners' inbox POSTs and WebFinger GETs,
the benchmark-mode probe, and the server stats reads — went out with the
runtime's default User-Agent.  A target that inspects, logs, or
rate-limits by User-Agent saw the wrong value, so the option was
silently ineffective for the traffic that matters.

Wrap the fetch implementation once with withUserAgent(), so every
benchmark request carries the configured User-Agent.  A prebuilt request
(the signed inbox delivery, a WebFinger GET) has the header set in place
rather than recloned, leaving the already-signed body and digest
untouched; the User-Agent is not part of the signed header set, so this
does not affect verification.  A User-Agent the caller already set is
left as-is.
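
The wrapping approach might be sketched as below. The `withUserAgent` name is taken from the commit, but this body is an assumption; it is written against small structural interfaces (`HeadersLike`, `RequestLike`, both hypothetical) so it does not depend on a particular runtime's `Request` type, which has the same `has` and `set` methods on its headers.

```typescript
// Hypothetical sketch; interfaces are minimal stand-ins for Request/Headers.
interface HeadersLike {
  has(name: string): boolean;
  set(name: string, value: string): void;
}

interface RequestLike {
  headers: HeadersLike;
}

// Wraps a send function once so every request carries the User-Agent.
function withUserAgent<Req extends RequestLike, Res>(
  send: (request: Req) => Promise<Res>,
  userAgent: string,
): (request: Req) => Promise<Res> {
  return (request) => {
    // Set the header in place instead of cloning, so an already-signed body
    // is untouched; a User-Agent the caller set is left as-is.
    if (!request.headers.has("User-Agent")) {
      request.headers.set("User-Agent", userAgent);
    }
    return send(request);
  };
}
```

A real `Headers` object matches names case-insensitively; a plain `Map` used as a stand-in does not, so tests against this sketch should use consistent casing.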

fedify-dev#783

Assisted-by: Claude Code:claude-opus-4-8
Assisted-by: Codex:gpt-5.5
The text and Markdown renderers only surfaced server queue metrics when
a drain-latency histogram was present, with the depth shown merely as a
suffix to that line.  The current stats reader supplies
`queue.depthMax` without `drainMs`, so queue depth never appeared in the
human-readable output even though it was in the JSON model; the Markdown
form rendered no queue metrics at all.

Render queue depth on its own:

 -  text: keep the combined drain line (now only when it has at least one
    percentile), otherwise print a standalone `Server queue depth max`
    line whenever a depth is reported.
 -  Markdown: add a queue drain p95 row when present and a queue depth max
    row whenever a depth is reported.

fedify-dev#783

Assisted-by: Claude Code:claude-opus-4-8
Assisted-by: Codex:gpt-5.5
`new URL("localhost:3000")` parses as the `localhost:` scheme with an
empty host, a common typo for a missing `http://`.  Normalization
accepted it, so `--dry-run` succeeded while a real run would misclassify
the target or build an unsupported fetch URL.  Targets carrying
credentials (`http://user:pass@host`) were likewise accepted even though
`fetch` rejects them.

Reject, during normalization, any target whose protocol is not `http:`
or `https:`, whose host is empty, or that carries embedded credentials,
with a message pointing at the likely fix.  The probe and runners only
make bare HTTP(S) requests, so these never produce a working run.
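
The three rejections can be sketched with the standard `URL` class. The `validateTarget` name and the error wording are assumptions; only the checks themselves come from the commit.

```typescript
// Hypothetical sketch of the target checks; name and messages are assumptions.
function validateTarget(raw: string): URL {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    throw new Error(`Invalid target URL: ${raw}`);
  }
  // new URL("localhost:3000") parses with protocol "localhost:" and no host.
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    throw new Error(
      `Unsupported target protocol "${url.protocol}"; did you mean http://${raw}?`,
    );
  }
  if (url.host === "") {
    throw new Error(`Target has no host: ${raw}`);
  }
  if (url.username !== "" || url.password !== "") {
    throw new Error("Target must not carry embedded credentials");
  }
  return url;
}
```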

fedify-dev#783

Assisted-by: Claude Code:claude-opus-4-8
Assisted-by: Codex:gpt-5.5
The safety gate classified only the suite `target`, but an `inbox`
scenario's actual signed-load destination is the discovered inbox (or an
explicit `inbox:` URL), which can differ from the target.  A loopback
`target` with a public `recipient`, or `inbox: https://prod.example/inbox`,
would send benchmark POST load to a public inbox with no gate at all,
bypassing the guard against accidentally benchmarking production.  The
synthetic-reachability rule was likewise only checked against the target
tier, not the destination that actually verifies signatures.

Gate each resolved inbox destination before any load reaches it:

 -  assertInboxDestinationAllowed() refuses a public destination unless it
    shares the gated target's origin while the target advertises benchmark
    mode (inheriting its gate), or --allow-unsafe-target is given; and
    refuses a non-loopback destination unless a reachable synthetic host
    was advertised (--advertise-host).  Origins are compared (scheme, host,
    effective port), so an http inbox does not inherit an https target.
 -  The inbox runner calls an injected destination gate for each resolved
    inbox before sending; the orchestrator maps a refusal to exit 2.

Discovery (a read) still runs, but no benchmark load is sent to an
ungated destination.
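
The public-destination half of that rule can be sketched as follows. The names (`sameOrigin`, `publicDestinationAllowed`, `DestinationCheck`) are hypothetical, the classification results are passed in as booleans, and the `--advertise-host` reachability check is left out.

```typescript
// Hypothetical sketch covering only the public-destination rule.
interface DestinationCheck {
  destination: string; // resolved inbox URL
  target: string; // suite target URL
  destinationIsPublic: boolean;
  targetAdvertisesBenchmarkMode: boolean;
  allowUnsafeTarget: boolean;
}

// URL.origin compares scheme, host, and effective port in one step.
function sameOrigin(a: string, b: string): boolean {
  return new URL(a).origin === new URL(b).origin;
}

function publicDestinationAllowed(check: DestinationCheck): boolean {
  if (!check.destinationIsPublic) return true;
  if (check.allowUnsafeTarget) return true;
  // A public destination inherits the target's gate only on the same origin.
  return (
    check.targetAdvertisesBenchmarkMode &&
    sameOrigin(check.destination, check.target)
  );
}
```

Comparing origins rather than host names is what keeps an `http:` inbox from inheriting the gate of an `https:` target on the same host.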

fedify-dev#783

Assisted-by: Claude Code:claude-opus-4-8
Assisted-by: Codex:gpt-5.5
The default fetch follows redirects, which let two safety checks be
bypassed.  A public target whose `stats` endpoint redirected to a host
serving benchmark-mode JSON was marked as advertising benchmark mode, so
the gate allowed load against it.  And a gated loopback, private, or
benchmark target that answered a WebFinger GET or a signed inbox POST
with a 307/308 could carry that load to an ungated public service,
slipping past the destination gate.

Make every benchmark request non-following:

 -  The benchmark-mode probe and the server stats read use
    `redirect: "manual"`, so a redirect is treated as "not advertised"
    and "unavailable" respectively rather than trusted.
 -  `sendRequest` re-wraps any non-manual request as `redirect: "manual"`
    and records a redirect (opaque or 3xx) as a failed send, so no signed
    load reaches the redirect target; the WebFinger and inbox requests are
    built with `redirect: "manual"` so the common path needs no re-clone.
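
How a non-followed response is recognized as a redirect can be sketched in a few lines. The `isRedirect` helper and the `ResponseLike` shape are hypothetical; the two cases mirror the "opaque or 3xx" wording above, assuming a response obtained with `redirect: "manual"`.

```typescript
// Hypothetical sketch; ResponseLike is a minimal stand-in for Response.
interface ResponseLike {
  type: string;
  status: number;
}

// True when a response fetched with redirect: "manual" is a redirect, either
// as an opaque-redirect response or as a visible 3xx status.
function isRedirect(response: ResponseLike): boolean {
  return (
    response.type === "opaqueredirect" ||
    (response.status >= 300 && response.status < 400)
  );
}
```

A send whose response satisfies this check would be counted as failed, so the request is never replayed against the redirect location.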

fedify-dev#783

Assisted-by: Claude Code:claude-opus-4-8
Assisted-by: Codex:gpt-5.5
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a comprehensive benchmarking toolchain for Fedify centered around the new fedify bench command. It allows driving ActivityPub-specific load, such as signed inbox deliveries and WebFinger lookups, against a cooperative target running in benchmark mode. The implementation includes scenario loading, validation against a published JSON Schema, safety gating to prevent unsafe benchmarking of public targets, a load generator supporting open-loop and closed-loop models, and report generation in text, JSON, and Markdown formats. Feedback on the changes highlights two key improvements: wrapping onMeasuredWindowStart in .then() to safely catch synchronous errors, and passing the Uint8Array body directly instead of body.buffer to avoid issues with pooled or sliced buffers.


Comment thread packages/cli/src/bench/scenarios/runner.ts Outdated
Comment thread packages/cli/src/bench/signing/signer.ts Outdated
@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Swish!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

dahlia added 2 commits June 5, 2026 20:51
withMeasuredWindowStart wrapped the callback as
Promise.resolve(onMeasuredWindowStart()), which runs it synchronously
before Promise.resolve, so a synchronous throw in the callback would
escape the promise chain instead of becoming a rejection.  Invoke it
through Promise.resolve().then(...), matching the signing pipeline's
pattern, so a sync throw rejects the gated send.
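
The difference can be sketched as follows. `wrapEager` and `wrapDeferred` are hypothetical names standing in for the before and after shapes of the wrapper; only the contrast between `Promise.resolve(cb())` and `Promise.resolve().then(cb)` comes from the commit message above.

```typescript
// Hypothetical stand-in for the measured-window callback type.
type Callback = () => void | Promise<void>;

// Before: cb() runs while the argument is being evaluated, so a
// synchronous throw propagates to the caller instead of rejecting.
function wrapEager(cb: Callback): Promise<void> {
  return Promise.resolve(cb());
}

// After: cb runs inside the promise chain, so a synchronous throw
// becomes a rejection of the returned promise.
function wrapDeferred(cb: Callback): Promise<void> {
  return Promise.resolve().then(cb);
}
```

With the deferred form, a caller that only attaches `.catch()` or uses `await` inside `try` sees both synchronous and asynchronous failures through the same path.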

fedify-dev#791 (comment)

Assisted-by: Claude Code:claude-opus-4-8

signInboxDelivery passed body.buffer to signRequest.  body comes from
TextEncoder().encode() (an exact-fit view), so this was correct, but it
would include trailing bytes were body ever a view into a larger buffer,
breaking the digest.  Slice the exact view bytes instead.  signRequest's
body option is an ArrayBuffer, so passing the Uint8Array directly would
not type-check.
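
A short illustration of why `.buffer` is unsafe for a view into a larger allocation. `exactBytes` is a hypothetical helper name, and the sample bytes are made up for the sketch:

```typescript
// A view into a larger buffer: bytes 2..4 of an 8-byte allocation.
const backing = new Uint8Array([0, 1, 2, 3, 4, 5, 6, 7]);
const view = backing.subarray(2, 5);

// view.buffer is the whole 8-byte allocation, not just the view's 3 bytes.
const whole = view.buffer;

// Hypothetical helper: copy exactly the bytes the view covers into a
// fresh ArrayBuffer, using the view's own offset and length.
function exactBytes(u8: Uint8Array): ArrayBuffer {
  return u8.buffer.slice(
    u8.byteOffset,
    u8.byteOffset + u8.byteLength,
  ) as ArrayBuffer;
}

const exact = exactBytes(view);
```

Here `whole.byteLength` is 8 while `exact.byteLength` is 3, so a digest computed over `whole` would cover bytes that are not part of the body.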

fedify-dev#791 (comment)

Assisted-by: Claude Code:claude-opus-4-8
@dahlia
Member Author

dahlia commented Jun 5, 2026

@codex review

@dahlia
Member Author

dahlia commented Jun 5, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces the fedify bench command to @fedify/cli for benchmarking federation workloads, along with runners for inbox and webfinger scenarios, a synthetic actor/key server, and JSON schemas for scenario suites and reports. The review feedback highlights two important issues: a bug in the template argument parser where escaped quotes inside string arguments are not handled correctly, and a memory efficiency concern where draining response bodies using response.arrayBuffer() could lead to out-of-memory errors under load, which can be resolved by canceling the body stream directly.

Comment thread packages/cli/src/bench/template/template.ts
Comment thread packages/cli/src/bench/scenarios/runner.ts

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 180abef664

Comment thread packages/cli/src/bench/server/synthetic.ts Outdated
Comment thread packages/cli/src/bench/action.ts
dahlia added 3 commits June 5, 2026 21:21
splitTopLevel did not track the backslash escape, so an escaped quote
inside a helper string argument was treated as a closing quote and split
the arguments wrongly; parseArg's regex also forbade any embedded quote.
Track the escape state when splitting and accept (then unescape) escaped
quotes when parsing the argument.
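
The escape tracking can be sketched as below. This is an illustration under assumptions, not the PR's `splitTopLevel` or `parseArg`: the names `splitArgs` and `unquote`, the comma separator, and the accepted quote characters are all hypothetical.

```typescript
// Hypothetical helper: split on commas that sit outside quoted strings,
// honouring backslash escapes inside the quotes.
function splitArgs(input: string): string[] {
  const parts: string[] = [];
  let current = "";
  let quote: string | null = null; // the open quote character, if any
  let escaped = false; // true when the previous character was a backslash
  for (let i = 0; i < input.length; i++) {
    const ch = input.charAt(i);
    if (escaped) {
      current += ch; // an escaped character never closes the string
      escaped = false;
    } else if (quote !== null && ch === "\\") {
      current += ch;
      escaped = true;
    } else if (quote !== null) {
      if (ch === quote) quote = null;
      current += ch;
    } else if (ch === '"' || ch === "'") {
      quote = ch;
      current += ch;
    } else if (ch === ",") {
      parts.push(current.trim());
      current = "";
    } else {
      current += ch;
    }
  }
  parts.push(current.trim());
  return parts;
}

// Hypothetical helper: strip the surrounding quotes and unescape.
function unquote(arg: string): string {
  const q = arg.charAt(0);
  if (arg.length >= 2 && (q === '"' || q === "'") && arg.endsWith(q)) {
    return arg.slice(1, -1).replace(/\\(.)/g, "$1");
  }
  return arg;
}
```

Without the `escaped` flag, the first `\"` in an argument such as `"a \"b, c\" d"` would close the string and the comma after `b` would split it in two.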

fedify-dev#791 (comment)

Assisted-by: Claude Code:claude-opus-4-8

resolveAdvertiseHost bound an advertised hostname to the IPv4 wildcard
(0.0.0.0).  If the hostname resolves to an AAAA record (or the target
prefers IPv6), the target dereferences the actor URLs over IPv6 with
nothing listening, so signed deliveries fail key lookup.  A hostname can
resolve to either family, so bind dual-stack (::); an IPv4 literal still
binds 0.0.0.0 and an IPv6 literal still binds ::.
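
The binding rule reads roughly like the sketch below. `bindAddressFor` is a hypothetical name, not the PR's `resolveAdvertiseHost`, and the bracket stripping is an assumption about how an IPv6 literal might be written; `isIP` is the standard `node:net` function.

```typescript
import { isIP } from "node:net";

// Hypothetical helper mirroring the rule in the commit message above.
function bindAddressFor(advertiseHost: string): string {
  // Assumption: an IPv6 literal may arrive bracketed, as in "[::1]".
  const host = advertiseHost.replace(/^\[|\]$/g, "");
  switch (isIP(host)) {
    case 4:
      return "0.0.0.0"; // IPv4 literal: IPv4 wildcard
    case 6:
      return "::"; // IPv6 literal: IPv6 wildcard
    default:
      // A hostname can resolve to A or AAAA records, so bind "::",
      // which also accepts IPv4 where dual-stack sockets are enabled.
      return "::";
  }
}
```

Whether `::` actually accepts IPv4 connections depends on the operating system and on the listener not being restricted to IPv6 only.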

fedify-dev#791 (comment)

Assisted-by: Claude Code:claude-opus-4-8

The --dry-run help promised to "resolve discovery", but the command
returns right after printing the normalized plan: it never contacts the
target, performs recipient discovery, or gates the resolved inbox, so a
bad recipient or off-target inbox can look valid in a dry run and only
fail in the real run.  Match the help (and the gate's comment) to what
dry-run actually does, consistent with the manual: print the plan
without contacting the target or sending load.

fedify-dev#791 (comment)

Assisted-by: Claude Code:claude-opus-4-8
@dahlia
Member Author

dahlia commented Jun 5, 2026

@codex review

@dahlia
Member Author

dahlia commented Jun 5, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces the fedify bench command to @fedify/cli for benchmarking federation workloads against a cooperative target. It adds comprehensive support for parsing, validating, and executing benchmarking scenarios—specifically inbox and webfinger runners—and generating detailed reports. Key features include open-loop and closed-loop load generation, a synthetic actor server for signature verification, and log-linear histogram aggregation. The reviewer feedback highlights two robustness improvements in stats-client.ts: defensively verifying matching bucket boundaries before merging histogram data points to avoid misaligned buckets, and filtering out null or undefined metrics in flattenMetrics to prevent a TypeError from silently failing the entire snapshot parsing.

Comment thread packages/cli/src/bench/metrics/stats-client.ts Outdated
Comment thread packages/cli/src/bench/metrics/stats-client.ts Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bf94455946

Comment thread packages/cli/src/bench/command.ts Outdated
dahlia added 2 commits June 5, 2026 21:38
Two defensive gaps in parsing the target's stats JSON: a null or
undefined entry in a scope's metrics array made the whole parse throw
(caught, but silently dropping every server metric), and merging
histogram data points summed counts whenever the lengths matched without
checking that the bucket boundaries were identical, which would misalign
buckets and skew percentiles.

Filter out null metric entries so the rest still parse, and only sum
histogram points that share the exact same boundaries.
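
A minimal sketch of both guards, under assumptions: the `HistogramPoint` shape and the names `mergePoints` and `dropNullMetrics` are hypothetical, and returning `null` for mismatched buckets is one possible policy rather than what `stats-client.ts` necessarily does.

```typescript
// Assumed shape: explicit bucket upper bounds plus per-bucket counts.
interface HistogramPoint {
  boundaries: number[];
  counts: number[];
}

function sameBoundaries(a: number[], b: number[]): boolean {
  return a.length === b.length && a.every((v, i) => v === b[i]);
}

// Sum two points only when their buckets line up exactly; a length
// check alone would let differently spaced buckets be added together.
function mergePoints(
  a: HistogramPoint,
  b: HistogramPoint,
): HistogramPoint | null {
  if (
    !sameBoundaries(a.boundaries, b.boundaries) ||
    a.counts.length !== b.counts.length
  ) {
    return null;
  }
  return {
    boundaries: [...a.boundaries],
    counts: a.counts.map((c, i) => c + b.counts[i]),
  };
}

// Drop null or undefined entries so the remaining metrics still parse.
function dropNullMetrics<T>(metrics: ReadonlyArray<T | null | undefined>): T[] {
  return metrics.filter((m): m is T => m != null);
}
```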

fedify-dev#791 (comment)
fedify-dev#791 (comment)

Assisted-by: Claude Code:claude-opus-4-8

--allow-unsafe-target was config-backed, so a system, user, or project
.fedify.toml with bench.allowUnsafeTarget = true would make every run
behave as if the flag were passed, silently disabling the safety gate and
letting load reach a public, non-benchmark target.  The override is meant
to be an explicit per-run acknowledgment, not a persisted default.

Make the flag a plain CLI flag (no config binding) and drop
allowUnsafeTarget from the bench config schema, so it can only be given on
the command line.

fedify-dev#791 (comment)

Assisted-by: Claude Code:claude-opus-4-8
@dahlia
Member Author

dahlia commented Jun 5, 2026

@codex review

@dahlia
Member Author

dahlia commented Jun 5, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces the fedify bench command for benchmarking Fedify federation workloads, including scenario validation, load generation, metrics aggregation, and a synthetic actor server. Feedback on the implementation highlights two potential runtime errors: a possible TypeError when attempting to mutate immutable headers on a Request object in withUserAgent, and another TypeError in the WebFinger runner when falling back to a schemeless host if recipients is empty.

Comment thread packages/cli/src/bench/action.ts Outdated
Comment thread packages/cli/src/bench/scenarios/webfinger.ts Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1f6d30f268

Comment thread packages/cli/src/bench/metrics/stats-client.ts
Comment thread packages/cli/src/bench/schema-paths.ts Outdated
dahlia added 3 commits June 5, 2026 22:04
withUserAgent set the User-Agent on a prebuilt Request in place.  If such
a request ever has immutable headers, set() throws a TypeError and the
send crashes.  Try the in-place set (the fast path for the requests this
tool builds, which have mutable headers) and fall back to a cloned
Request with merged headers if it throws.
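
The try-then-clone pattern might look like this. It is a sketch of the approach the message describes, written against the standard Fetch API, and is not copied from the PR's `withUserAgent`:

```typescript
// Sketch only: set the header in place when possible, otherwise clone.
function withUserAgent(request: Request, userAgent: string): Request {
  try {
    // Fast path: requests built with `new Request(...)` normally have
    // mutable headers, so this succeeds without allocating.
    request.headers.set("User-Agent", userAgent);
    return request;
  } catch (error) {
    // Immutable headers make set() throw a TypeError; anything else
    // is unexpected and is rethrown.
    if (!(error instanceof TypeError)) throw error;
    const headers = new Headers(request.headers);
    headers.set("User-Agent", userAgent);
    return new Request(request, { headers });
  }
}
```

The fallback returns a different `Request` object, so callers should use the return value rather than assume the argument was modified.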

fedify-dev#791 (comment)

Assisted-by: Claude Code:claude-opus-4-8

When a webfinger scenario has no recipients, the runner fell back to the
target's schemeless host (for example localhost:3000), which
convertUrlIfHandle cannot parse as a URL and would throw.  Fall back to
the target's full href, which is always a valid URL.

fedify-dev#791 (comment)

Assisted-by: Claude Code:claude-opus-4-8

import.meta.dirname is only available on Node >= 20.11, but the package
supports Node >= 20.0, so on Node 20.0 to 20.10 it is undefined and feeds
undefined into join(), aborting the schema and render tests before they
run.  Derive the directory from dirname(fileURLToPath(import.meta.url))
instead, which works across all supported Node versions.
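
For reference, the portable form is sketched below. `moduleDir` is a hypothetical wrapper that takes the module URL as a parameter so the snippet can run anywhere; in an ES module the argument would be `import.meta.url`.

```typescript
import { dirname } from "node:path";
import { fileURLToPath } from "node:url";

// Hypothetical wrapper: derive a module's directory from its file: URL.
// In an ES module, call it as moduleDir(import.meta.url).
function moduleDir(moduleUrl: string): string {
  return dirname(fileURLToPath(moduleUrl));
}
```

Both `dirname` and `fileURLToPath` predate Node 20, so this form does not depend on `import.meta.dirname` being defined.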

fedify-dev#791 (comment)

Assisted-by: Claude Code:claude-opus-4-8
@dahlia
Member Author

dahlia commented Jun 5, 2026

@codex review

@dahlia
Member Author

dahlia commented Jun 5, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces the fedify bench command to @fedify/cli for benchmarking federation workloads. It adds a comprehensive benchmarking suite, including scenario runners for inbox and webfinger, load generation, metrics aggregation, safety gating, and a synthetic actor/key server. It also publishes JSON Schemas for scenarios and reports. Feedback on the changes includes a minor formatting correction in schema/README.md to remove an unnecessary backslash in a file path to ensure standard Markdown rendering.

Comment thread schema/README.md
@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Already looking forward to the next diff.

@dahlia dahlia requested review from 2chanhaeng and sij411 June 5, 2026 13:21

Labels

component/cli CLI tools related

Development

Successfully merging this pull request may close these issues.

Benchmarking: fedify bench engine, scenario format, and JSON schema hosting

1 participant