johnpatrickwarren-oss/tessera

Tessera

Statistically-rigorous behavioral observation for AI training/inference clusters.

Tessera detects deviations in AI cluster behavior at the per-shard level and across the cluster as a whole, surfacing issues before they cause impact. It uses the statistical-detector engine derived from DeploySignal (Family A/C/D/E detectors, Ville-bounded e-processes, hierarchical baseline pooling), applied to a fundamentally different operational scope: a running, tightly coupled AI cluster (100 to 10,000 GPU shards in the exemplar case) rather than a single deployment decision gate.
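To make "Ville-bounded e-process" concrete, here is a minimal sketch of a fixed-bet e-process. It is illustrative only and is not the vendored engine code: the name `bettingEProcess`, the bounded [0, 1] observation model, and the fixed bet size `lambda` are all assumptions. Under the null hypothesis that observations have mean `mu0`, the wealth process is a nonnegative martingale starting at 1, so Ville's inequality bounds the probability that it ever reaches 1/α by α.

```typescript
// Hypothetical sketch: a fixed-bet e-process for observations in [0, 1].
// Not Tessera's engine API; names and the observation model are assumptions.

interface EProcessResult {
  wealth: number[];        // wealth after each observation
  firedAt: number | null;  // 1-based step of the first crossing of 1/alpha
}

function bettingEProcess(
  xs: number[],
  mu0: number,    // null-hypothesis mean, strictly inside (0, 1)
  lambda: number, // bet size; must keep every factor nonnegative
  alpha: number,  // error budget; the firing threshold is 1 / alpha
): EProcessResult {
  if (!(mu0 > 0 && mu0 < 1)) throw new Error("mu0 must lie in (0, 1)");
  if (lambda < -1 / (1 - mu0) || lambda > 1 / mu0) {
    throw new Error("lambda outside the range that keeps wealth nonnegative");
  }
  const threshold = 1 / alpha;
  const wealth: number[] = [];
  let w = 1;
  let firedAt: number | null = null;
  xs.forEach((x, i) => {
    // Each factor has expectation 1 under the null, so w is a martingale.
    w *= 1 + lambda * (x - mu0);
    wealth.push(w);
    if (firedAt === null && w >= threshold) firedAt = i + 1;
  });
  return { wealth, firedAt };
}

// A shard whose metric sits at 0.8 against a null mean of 0.5, and a healthy one.
const drifted = bettingEProcess(new Array<number>(30).fill(0.8), 0.5, 1, 0.005);
const healthy = bettingEProcess(new Array<number>(30).fill(0.5), 0.5, 1, 0.005);
console.log(drifted.firedAt, healthy.firedAt); // 21 null
```

With α = 0.005 the threshold is 200. The drifted series multiplies wealth by 1.3 per window and crosses on the 21st observation, while the healthy series stays at wealth 1 and never fires.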

Status

Phase 3 closed; v1 publication candidate (2026-05-20). 67+ rounds of iterative-spec-with-cold-eye-Reviewer development have shipped vendor coverage across the major AI compute substrates plus a bi-directional integration interface with DeploySignal.

| Phase | Scope | Status |
| --- | --- | --- |
| Phase 1 | Engine vendoring + SCOPING-MEMO-v0.3 foundations | Closed |
| Phase 2 | Per-shard residual semantics + hierarchical e-value combination + e-BH FDR + freeze-hook | Closed |
| Phase 3 SLICE 1 | AWS Trainium + AWS Inferentia (Neuron Link topology) adapters | Closed |
| Phase 3 SLICE 2 | Google TPU/ICI adapter + fetchSnapshot(ctx) live-fetch interface across 5 adapters | Closed |
| Phase 3 SLICE 3 | DS integration interface contract + Tessera→DS feed + DS→Tessera event consumer + freeze-hook real-event factory | Closed |
| Phase 4 (candidate) | Engine npm extract (dedicated design cycle); real-cluster DCGM validation; methodology framework consolidation | Pending |

What Tessera does

Per-shard observation primitives:

  • TopologySnapshot ingestion from 6 vendor adapters (Slurm, Kubernetes, NVLink, AWS Neuron Trainium + Inferentia, Google TPU/ICI)
  • TopologySource.fetchSnapshot(ctx?) interface with sparse-data resilience
  • Per-shard residual semantics + topology-aware freeze-hook
  • Hierarchical e-value combination across shard/host/rack layers
  • e-BH FDR control over the per-shard verdict surface
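The last two primitives in the list above can be sketched in a few lines. This is a hedged illustration, not Tessera's engine code: `meanMerge` and `eBH` are hypothetical names, and using the arithmetic mean to roll shard e-values up to a host or rack is an assumption (it is one standard merge that remains valid under arbitrary dependence). The e-BH step follows the usual definition: with K hypotheses, reject the k* largest e-values, where k* is the largest k whose k-th largest e-value is at least K / (α·k).

```typescript
// Hypothetical sketch of e-value merging and e-BH; not Tessera's engine API.

// The arithmetic mean of e-values is itself an e-value under arbitrary
// dependence, so it is a safe way to combine shards within one layer.
function meanMerge(eValues: number[]): number {
  if (eValues.length === 0) throw new Error("need at least one e-value");
  return eValues.reduce((sum, e) => sum + e, 0) / eValues.length;
}

// e-BH: returns the indices of rejected hypotheses, in ascending order.
function eBH(eValues: number[], alpha: number): number[] {
  const K = eValues.length;
  const order = eValues
    .map((e, index) => ({ e, index }))
    .sort((a, b) => b.e - a.e); // largest e-value first
  let kStar = 0;
  order.forEach((item, i) => {
    const k = i + 1;
    if (item.e >= K / (alpha * k)) kStar = k; // keep the largest passing k
  });
  return order
    .slice(0, kStar)
    .map((item) => item.index)
    .sort((a, b) => a - b);
}

// Eight shards; shards 0 and 3 carry strong evidence of deviation.
const shardE = [400, 1.2, 0.8, 250, 0.5, 1.0, 0.9, 1.1];
const rejected = eBH(shardE, 0.05);          // indices 0 and 3
const rackE = meanMerge(shardE.slice(0, 4)); // first four shards as one rack
console.log(rejected, rackE);
```

Here K = 8 and α = 0.05, so the k-th largest e-value must reach 160 / k. The values 400 and 250 pass at k = 1 and k = 2, and nothing else comes close, so exactly two shards are flagged.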

DeploySignal integration:

  • HTTP API contract (TypeScript types + endpoint metadata) at engine/ds-integration/
  • Tessera→DS feed adapter: per-shard VerdictGroup observations → DS correlation layer
  • DS→Tessera event consumer + factory: real deploy-event-driven freeze-hook activation
  • Bi-directional contract eliminates engine duplication without requiring npm-package extraction (Phase 4 candidate)

What Tessera does NOT do

  • Hardware diagnosis. Tessera observes counter behavior; per-GPU fault attribution remains anti-scope (A10 carve-out preserved across all phases).
  • Real-cluster DCGM validation as of v1. Tessera v1 is validated only against synthetic fixtures derived from public Neuron SDK + JAX topology code + TPU v4/v5 papers. Real-cluster rental validation is a Phase 4 candidate (Path B selected at Phase 3 SLICE 1/2 gate per OQ-P3-9).
  • Customer telemetry consumption. A8/A11 inherited from Phase 1; only operator-controlled rental environments or synthetic fixtures are in-scope.

Tessera vs DeploySignal

| | DeploySignal | Tessera |
| --- | --- | --- |
| Scope | One canary deployment → one verdict | N shards of a running cluster → per-shard + cluster-wide observation |
| Stakeholder | Production SRE / deployment owner | Cluster oncall / AI infra operator |
| Output | Proceed / extend / rollback decision | Per-shard deviation attribution + fleet-event vs shard-fault distinction |
| Trigger | Each deployment | Continuous |
| Failure class | Pre-existing-detector classes applied to canary metrics | Same engine; per-shard SDC-class faults that DCGM/NVML don't catch; topology-localized common-mode failures; event-conditional drift attribution |

Tessera is not a fork or extension of DeploySignal — it's a separate product that reuses the statistical engine. The two integrate via HTTP contract (engine/ds-integration/) rather than runtime code sharing.

Engine sourcing

Tessera vendors the load-bearing engine subset from DeploySignal at SHA 5a72371. Each vendored file carries a header noting:

  • Source: DeploySignal path + SHA
  • Sync policy: vendored-at-pin (byte-identical) or vendored-with-deltas (Tessera extensions added)

Engine npm extract (eliminating vendoring drift via shared package) is deferred to a Phase 4 dedicated design cycle. The R61 architectural-reality discovery surfaced that a clean extract requires resolving the types-barrel coupling between vendored-with-deltas surfaces and the detection algorithms — a project-close-magnitude decision that deserves its own design phase rather than absorption into a SLICE 3 wave.
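Because every vendored file carries a source and sync-policy header, drift checks can in principle be automated. The sketch below shows one way to read such a header. The exact header layout is not documented here, so the comment format (`// Source: DeploySignal <path> @ <sha>` and `// Sync policy: ...`), the placeholder path, and the function name `parseVendorHeader` are all assumptions made for illustration.

```typescript
// Hypothetical sketch: parse an assumed vendoring header from a source file.
// The real header layout in Tessera may differ.

type SyncPolicy = "vendored-at-pin" | "vendored-with-deltas";

interface VendorHeader {
  sourcePath: string;
  sha: string;
  syncPolicy: SyncPolicy;
}

function parseVendorHeader(fileText: string): VendorHeader | null {
  const source = /^\/\/ Source: DeploySignal (\S+) @ ([0-9a-f]{7,40})$/m.exec(fileText);
  const policy = /^\/\/ Sync policy: (vendored-at-pin|vendored-with-deltas)$/m.exec(fileText);
  if (!source || !policy) return null; // not a vendored file, or header malformed
  return {
    sourcePath: source[1] ?? "",
    sha: source[2] ?? "",
    syncPolicy: policy[1] as SyncPolicy,
  };
}

// Placeholder path; only the SHA comes from this README.
const sample = [
  "// Source: DeploySignal path/to/file.ts @ 5a72371",
  "// Sync policy: vendored-at-pin",
  "const placeholder = 1;",
].join("\n");

console.log(parseVendorHeader(sample));
```

A check built on this could flag any `vendored-at-pin` file whose pinned SHA differs from the expected one, leaving `vendored-with-deltas` files for manual review.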

Getting started

Requires Node ≥ 20 and pnpm ≥ 11.

git clone https://github.com/johnpatrickwarren-oss/tessera.git
cd tessera
pnpm install
pnpm test      # runs the full test suite (~440 tests)
pnpm build     # tsc compile

Quick demo

Tessera ships two demo surfaces — a CLI for terminal walk-through (R70) and a browser dashboard for clickable exploration (R71).

Browser dashboard

open demos/demo.html      # macOS (use xdg-open on Linux); opens in the default browser, no install or server required

The dashboard pages through 8 pre-recorded scenarios (clean baseline, single-shard SDC drift, rack-localized common mode, event-conditional freeze, FDR control, hierarchical e-value combination, sparse-data resilience, and topology-spanning common mode) with Play / Pause / Reset / Speed controls, an audit-trail panel, a reasoning panel, and a suggested-next-actions panel. All scenarios are deterministic and regeneratable via pnpm build:demos. The dashboard ships as a single static HTML file with vanilla CSS/JS — no external dependencies, opens from file://.

The dashboard ships a Live mode toggle at the top of the page (R85). Switching to Live activates the parameter control panel (drift magnitude, window count, α threshold, target shard, topology size, detector families) and routes the Run button through a Web Worker that loads the engine bundle in-browser and streams per-window state back to the UI. Use the scrubber to replay the run at any speed; click Cancel to terminate mid-stream. See demos/DEMO-SCRIPT.md § Minute 10:00 – 12:00 for the live-mode walkthrough.

CLI scenarios

Run any of four canned scenarios in the terminal:

pnpm demo clean-baseline       # healthy fleet — no firings
pnpm demo sdc-drift            # silent SDC drift on shard-04 → Family A betting fires
pnpm demo common-mode-rack     # 3 shards on shared rack → 1 common-mode candidate
pnpm demo event-conditional    # firmware-push event → freeze-hook activates

Each scenario runs in under 30 seconds, produces deterministic ASCII output, and exercises one real engine surface against synthetic inputs (no live cluster needed). Source: tools/demo-scenario.ts.

Regenerating canned scenarios

pnpm build:demos        # regenerates demos/scenarios/*.json + demos/demo.html

Idempotent: re-running produces byte-identical files. The 8 scenario JSON files double as audit-inspectable evidence of what the dashboard shows. Source: tools/build-canned-demos.ts.

Methodology

Tessera was developed using the Anchor coordination methodology — a four-role pipeline (Architect → Implementer → Reviewer → Memorial-Updater) with cold-eye discipline, threshold-aware reinforcement accretion, and explicit ESCALATE patterns for spec/reality mismatches.

The full audit trail is preserved in this repo's commit history (every round's role-tagged commits, cold-eye Reviewer reports, Memorial-Updater outputs, and ESCALATE-resolution patterns are public). The coordination/ directory contains:

  • PRD.md — Product requirements (per-phase scope)
  • SCOPING-MEMO-v0.3.md — Engine vendoring policy + cross-cutting anti-scope
  • WAVE-PLAN-*.md — Coordinator wave plans (PRD decomposition + DAG analysis)
  • WAVE-GATE-*.md — Wave-close attestations
  • MEMORIAL.md — Cross-round violation + confirmation ledger
  • specs/Q-RNN-SPEC.md — Per-round Architect specifications

CLAUDE-*.md files at the repo root hold the per-role pipeline disciplines (CLAUDE-COMMON.md + CLAUDE-ARCHITECT.md + CLAUDE-IMPLEMENTER.md + CLAUDE-REVIEWER.md + CLAUDE-MEMORIAL.md + CLAUDE-COORDINATOR.md).

Layout

tessera/
├── README.md                     # This file
├── LICENSE                       # Apache 2.0
├── package.json                  # pnpm-managed (packageManager: pnpm@11.x)
├── pnpm-lock.yaml
├── tsconfig.json + tsconfig.test.json
├── CLAUDE-*.md                   # Anchor pipeline role disciplines
├── coordination/                 # PRD + specs + wave plans + memorial + reviews + logs
├── engine/                       # Statistical-detector engine (vendored from DS) + per-shard extensions
│   ├── core.ts
│   ├── detectors/                # Family A/C/D/E detector implementations
│   ├── topology/                 # Vendor adapters: slurm, k8s, nvlink, neuron, tpu, + base
│   ├── types/                    # Verdict + config + policy + audit schemas (Tessera-extended)
│   ├── events/                   # Cluster event feed + freeze-hook
│   ├── ds-integration/           # HTTP API contract + adapters (Tessera↔DS bi-directional)
│   ├── per-shard/                # Per-shard residual semantics
│   └── l0/, l1/, fleet/, o0/     # Layered analysis primitives
├── test/                         # 440+ tests (per-AC; per-round test files q01–q66)
├── scripts/                      # Pipeline scripts (run-pipeline.sh, verify-*.sh, finalize-round.sh)
├── run-pipeline.sh               # Anchor four-role pipeline orchestrator
└── tools/                        # Synthetic fixtures + topology injection harness

Coverage

Tessera R72 validates the engine against 6 failure types × 20 parameter variations = 120 cases. Generate the matrix with:

pnpm coverage

See coverage-matrices/R72-saturation-matrix.md for the human-readable summary; coverage-matrices/R72-saturation-matrix.json is the machine-readable data. The matrix is deterministic — re-running produces byte-identical output.

| Type | Detection floor | Attribution floor |
| --- | --- | --- |
| sdc-drift | 16 / 20 | ≥ 95% |
| common-mode-rack | 20 / 20 | ≥ 95% |
| event-conditional | 20 / 20 | ≥ 95% |
| fdr-multiple-testing | 16 / 20 | ≥ 95% |
| hierarchical-evalue | 12 / 20 | ≥ 95% (and ≥ 80% fleet-fires-before-per-shard) |
| topology-spanning-common-mode | 16 / 20 | ≥ 95% |

Detection envelope (R77)

Tessera R77 characterizes the per-shard detector's detection probability across drift magnitude × window count × α threshold × detector family (Family A betting vs Family C ONS comparison). Generate the envelope matrix with:

pnpm detector-envelope

See coverage-matrices/R77-detection-envelope.md for the human-readable summary with detection curves; coverage-matrices/R77-detection-envelope-matrix.json is the machine-readable data (504 cells, 2520 trials).

At default settings (α=0.005, window_count=200, Family A): ≈100% detection for all drift magnitudes from 0.050 to 0.375. The transitional detection band is at window_count=30 with magnitude < 0.10. Family A outperforms Family C in the short-window/low-magnitude regime (the boundary cells where tuning choices matter most).

Operator tuning guidance: see scripts/detector-tuning-recommendation.md.

Topology-walk tuning envelope (R78)

Tessera R78 characterizes the tuning envelope of attributeCommonMode along two operator-visible dials, max_hop_distance and min_member_count, across 5 scenario classes: 30 cells in total at 5 trials per cell (150 trials). Generate the envelope matrix with:

pnpm topology-walk-tuning

See coverage-matrices/R78-topology-walk-tuning.md for the human-readable per-scenario summary; coverage-matrices/R78-topology-walk-tuning-matrix.json is the machine-readable data (30 cells, 150 trials).

Key findings: at the Tessera default max_hop_distance=1, the cooling_zone node is structurally unreachable (shard→rack→cz is hop=2). Lifting to max_hop_distance=2 catches all cross-rack CZ common-modes with no shadow-rack false-positives. max_hop_distance=3 introduces structural false-positive attribution — not recommended for 2-tier topologies.
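The reachability finding above is easy to reproduce on a toy topology. The following is a minimal sketch, not the attributeCommonMode source: the parent-map representation and the names `ancestorsWithin` and `commonModeCandidates` are assumptions. It walks upward from each firing shard for at most max_hop_distance hops and reports any ancestor shared by at least min_member_count shards.

```typescript
// Hypothetical sketch of hop-limited common-mode attribution on a 2-tier
// topology (shard -> rack -> cooling zone). Not Tessera's implementation.

type ParentMap = Record<string, string | undefined>;

// Ancestors of a node reachable within maxHop upward steps.
function ancestorsWithin(node: string, parents: ParentMap, maxHop: number): string[] {
  const found: string[] = [];
  let current: string | undefined = parents[node];
  for (let hop = 1; hop <= maxHop && current !== undefined; hop++) {
    found.push(current);
    current = parents[current];
  }
  return found;
}

// Ancestors shared by at least minMemberCount of the firing shards.
function commonModeCandidates(
  firing: string[],
  parents: ParentMap,
  maxHopDistance: number,
  minMemberCount: number,
): string[] {
  const counts: Record<string, number> = {};
  for (const shard of firing) {
    for (const ancestor of ancestorsWithin(shard, parents, maxHopDistance)) {
      counts[ancestor] = (counts[ancestor] ?? 0) + 1;
    }
  }
  return Object.keys(counts)
    .filter((name) => (counts[name] ?? 0) >= minMemberCount)
    .sort();
}

// Three firing shards on three different racks that share one cooling zone.
const parents: ParentMap = {
  "shard-01": "rack-A", "shard-02": "rack-B", "shard-03": "rack-C",
  "rack-A": "cz-1", "rack-B": "cz-1", "rack-C": "cz-1",
};
const firing = ["shard-01", "shard-02", "shard-03"];
console.log(commonModeCandidates(firing, parents, 1, 3)); // []
console.log(commonModeCandidates(firing, parents, 2, 3)); // ["cz-1"]
```

At max_hop_distance=1 each shard only sees its own rack, so no node reaches the member threshold. At max_hop_distance=2 the walk reaches the cooling zone, which all three shards share.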

Operator tuning guidance: see scripts/topology-walk-tuning-recommendation.md.

Baseline curation

Tessera R88 ships a one-command operator entry point that composes the baseline curation pipeline (Stage 2a per-shard MCD-Mahalanobis screening + Stage 2b FCP-1 fleet-correlated e-process) and produces a validated baseline corpus plus a human-readable report.

pnpm curate-baseline path/to/raw-baseline.json
# defaults to writing curated-baseline/ in the cwd
# add --out <dir> to change; --allow-high-drop to override the >15% HALT

The wrapper applies conservative defaults inherited from tools/curate-baseline-fleet-correlated.ts (α_fleet=1e-3, χ²ₚ=0.975, MCD α=0.75), runs an auto-validation pass (Family C detector quiescence on the curated baseline via Stage 2a/2b idempotency), and gates the exit code on drop rate:

| Drop rate | Headline | Exit |
| --- | --- | --- |
| < 5% | Baseline ready | 0 |
| 5–15% | Baseline ready (with warning) | 0 |
| ≥ 15% | Heterogeneous corpus | 1 (use --allow-high-drop to override) |
| validation failed | Review needed | 1 (never overridable) |
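The gating rules in the table can be written as a small pure function. This is a sketch under stated assumptions rather than the code in tools/curate-baseline.ts: the name `gateOnDropRate` is hypothetical, a drop rate of exactly 15% is treated as a halt (following the table's "≥ 15%" row), and the override is assumed to turn the exit code into 0.

```typescript
// Hypothetical sketch of the exit-code gate described in the table above.

interface GateResult {
  headline: string;
  exitCode: 0 | 1;
}

function gateOnDropRate(
  dropRate: number,          // fraction of the raw baseline dropped, in [0, 1]
  validationPassed: boolean, // result of the auto-validation pass
  allowHighDrop = false,     // mirrors the --allow-high-drop flag
): GateResult {
  // A failed validation is never overridable, so it is checked first.
  if (!validationPassed) return { headline: "Review needed", exitCode: 1 };
  if (dropRate < 0.05) return { headline: "Baseline ready", exitCode: 0 };
  if (dropRate < 0.15) return { headline: "Baseline ready (with warning)", exitCode: 0 };
  // Assumption: the override converts the halt into a successful exit.
  return { headline: "Heterogeneous corpus", exitCode: allowHighDrop ? 0 : 1 };
}

console.log(gateOnDropRate(0.03, true));
console.log(gateOnDropRate(0.2, true));
console.log(gateOnDropRate(0.2, true, true));
console.log(gateOnDropRate(0.01, false, true));
```

Checking validation before the drop rate is what makes that row non-overridable: the flag is never consulted on that path.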

Three artifacts land under <out-dir>/: the curated curated-baseline.json, the markdown curation-report.md, and the per-decision audit trail curation-decisions.jsonl (one JSON line per BaselineCurationDecision record — D11 Stage 2a, D12 Stage 2b, D13 Stage 3b wire format).

Source: tools/curate-baseline.ts.
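For readers unfamiliar with the Stage 2a screening step, the sketch below shows the Mahalanobis-distance test it is built around, on 2-D feature vectors. It is a simplification and not Tessera's implementation: a real MCD step estimates the center and covariance robustly from the data, whereas here they are supplied by the caller, and the names `mahalanobisSq`, `chiSquare2Quantile`, and `screenPoints` are hypothetical. The 0.975 level matches the χ² default quoted above; for 2 degrees of freedom that quantile has the closed form −2·ln(1 − p).

```typescript
// Hypothetical sketch: Mahalanobis screening of 2-D per-shard feature vectors.
// Center and covariance are given here; MCD would estimate them robustly.

type Vec2 = [number, number];
type Mat2 = [[number, number], [number, number]];

function mahalanobisSq(x: Vec2, center: Vec2, cov: Mat2): number {
  const [[a, b], [c, d]] = cov;
  const det = a * d - b * c;
  if (a <= 0 || det <= 0) throw new Error("covariance must be positive definite");
  const dx = x[0] - center[0];
  const dy = x[1] - center[1];
  // v^T * inverse(cov) * v, with the 2x2 inverse written out explicitly.
  return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det;
}

// Chi-square quantile for exactly 2 degrees of freedom (closed form).
function chiSquare2Quantile(p: number): number {
  return -2 * Math.log(1 - p);
}

// true = keep the point in the baseline, false = screen it out.
function screenPoints(points: Vec2[], center: Vec2, cov: Mat2, p = 0.975): boolean[] {
  const cutoff = chiSquare2Quantile(p);
  return points.map((pt) => mahalanobisSq(pt, center, cov) <= cutoff);
}

const identity: Mat2 = [[1, 0], [0, 1]];
const kept = screenPoints([[1, 1], [3, 0]], [0, 0], identity);
console.log(kept); // [true, false]
```

With an identity covariance the squared distances are 2 and 9 against a cutoff of about 7.38, so the first point stays in the baseline and the second is screened out.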

Quick demo

Open demos/demo.html directly in any modern browser — no server required. Eight pre-recorded scenarios cover clean, drift, common-mode, event-conditional, FDR, hierarchical, sparse, and topology-spanning behaviors. Each runs deterministically from an LCG-seeded synthetic substrate.
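Determinism from an LCG seed works as in the sketch below. The generator shown is a generic 32-bit linear congruential generator using the widely published Numerical Recipes constants. Those constants, and the name `makeLcg`, are assumptions and are not necessarily what Tessera's scenario builder uses.

```typescript
// Hypothetical sketch of an LCG-seeded deterministic random source.
// Constants are the common Numerical Recipes parameters, not Tessera's.

function makeLcg(seed: number): () => number {
  let state = seed >>> 0; // force the seed into unsigned 32-bit range
  return () => {
    // state = (a * state + c) mod 2^32, using imul for an exact 32-bit product.
    state = (Math.imul(1664525, state) + 1013904223) >>> 0;
    return state / 0x100000000; // uniform in [0, 1)
  };
}

// Two generators with the same seed replay the same synthetic sequence.
const runA = makeLcg(42);
const runB = makeLcg(42);
const seqA = [runA(), runA(), runA()];
const seqB = [runB(), runB(), runB()];
console.log(seqA, seqB);
```

Because the sequence depends only on the seed, any scenario built from it can be regenerated exactly, which is the property that byte-identical rebuilds rely on.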

Controls

  • Scrubber — drag the slider in the top controls to jump to any window (0 through 29). Scrubbing pauses playback automatically; release the slider to resume manual control.
  • Keyboard: space toggles play/pause; the arrow keys step forward and backward one window; r resets the current scenario.
  • Speed — 1×, 2×, 4× playback (500ms / 250ms / 125ms per window).
  • Per-firing receipts — the provenance panel collapses individual firing receipts; click any receipt summary to expand its evidence JSON.

10-minute walkthrough

See demos/DEMO-SCRIPT.md for a minute-by-minute script that walks through clean-baseline → SDC-drift → common-mode-rack → event-conditional with talking points matched to the dashboard's per-tick state. Analogous to DeploySignal's DEMO-SCRIPT-10MIN.md.

License

Apache 2.0. See LICENSE.

Contact

John Warren · john.patrick.warren@gmail.com
