bradygaster · tamirdresher · Jun 14, 2026 · Jun 11, 2026 · Jun 14, 2026
diff --git a/cspell.json b/cspell.json
@@ -39,7 +39,9 @@
     "benleane", "TELMU", "Automator", "kedacore",
     "DEVBOX", "myaccount", "graphify", "Graphify", "graphifyy", "safishamsi", "jagilber",
     "benchmarkfn", "filterissuesbystream", "filterissuesbyworkstream", "Recognised", "recognised",
-    "squadified", "TOCTOU", "unflushed", "unparseable", "pluggability"
+    "squadified", "TOCTOU", "unflushed", "unparseable", "pluggability",
+    "PKCE", "MSAL", "AKIA", "runas",
+    "typoo"
   ],
   "dictionaries": ["en_US", "typescript", "node", "npm", "bash"],
   "allowCompoundWords": true

diff --git a/docs/src/content/docs/features/dual-mode-deployment.md b/docs/src/content/docs/features/dual-mode-deployment.md
@@ -0,0 +1,132 @@
+---
+title: Dual-Mode Deployment — Pod-Aware Capabilities
+description: Run Squad in either agent-per-node or squad-per-pod deployment modes with pod-specific machine capability manifests, controlled by SQUAD_POD_ID and SQUAD_DEPLOYMENT_MODE env vars.
+---
+
+# Dual-Mode Deployment — Pod-Aware Capabilities
+
+> ⚠️ **Experimental** — Squad is alpha software. APIs, commands, and behavior may change between releases.
+
+Dual-mode deployment extends [Capability Routing](/squad/docs/features/capability-routing/) to support both classic single-machine setups and modern containerized/Kubernetes deployments where multiple Squad pods may share an organization's workload — each with potentially different machine capabilities.
+
+It introduces two environment variables and a pod-specific manifest lookup pattern so the same Squad config can run identically in either deployment shape.
+
+---
+
+## The two deployment modes
+
+| Mode | What it means | Capability manifest |
+|------|---------------|---------------------|
+| **`agent-per-node`** (default) | One Squad instance per machine; the machine's capabilities are the squad's capabilities | `.squad/machine-capabilities.json` (shared) |
+| **`squad-per-pod`** | Multiple Squad pods may run on different machines/containers, each with potentially different capabilities | `.squad/machine-capabilities-{podId}.json` (pod-specific) with fallback chain |
+
+Choose the mode via the `SQUAD_DEPLOYMENT_MODE` environment variable:
+
+```bash
+# Classic single-machine setup (default)
+export SQUAD_DEPLOYMENT_MODE=agent-per-node
+
+# Kubernetes / multi-pod setup
+export SQUAD_DEPLOYMENT_MODE=squad-per-pod
+export SQUAD_POD_ID=worker-1
+```
+
+If `SQUAD_DEPLOYMENT_MODE` is unset, the SDK defaults to `agent-per-node` for backward compatibility.
+
+---
+
+## Environment variables
+
+### `SQUAD_DEPLOYMENT_MODE`
+
+| Value | Behavior |
+|-------|----------|
+| `agent-per-node` | Single shared `machine-capabilities.json` |
+| `squad-per-pod` | Pod-specific manifests with fallback chain |
+| (unset) | Same as `agent-per-node` |
+
+### `SQUAD_POD_ID`
+
+Pod identifier used to construct the pod-specific manifest path. Required when `SQUAD_DEPLOYMENT_MODE=squad-per-pod`; ignored otherwise.
+
+```bash
+SQUAD_POD_ID=worker-1          # → .squad/machine-capabilities-worker-1.json
+SQUAD_POD_ID=gpu-pool-node-3   # → .squad/machine-capabilities-gpu-pool-node-3.json
+```
+
+---
+
+## The fallback chain (squad-per-pod mode)
+
+When `SQUAD_DEPLOYMENT_MODE=squad-per-pod` AND `SQUAD_POD_ID` is set, the SDK looks up capabilities in this order:
+
+1. **`.squad/machine-capabilities-{podId}.json`** — pod-specific (highest priority)
+2. **`.squad/machine-capabilities.json`** — shared fallback for capabilities that apply to all pods
+3. **`~/.squad/machine-capabilities.json`** — user-home fallback (rarely useful in container deployments)
+4. **`null`** — opt-out; capability routing falls back to label-only routing
+
+The first manifest that exists is loaded; the search stops there (no merging). If you need different pods to see different capability sets, give each its own pod-specific file. If you need a shared baseline plus pod-specific additions, merge at the deployment-config level (Helm, Kustomize, etc.) — the SDK doesn't merge automatically.
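+
+The lookup order above can be sketched as follows. This is a minimal illustration, not the SDK's implementation: `resolveManifestPath` and its injectable `exists` parameter are hypothetical names, and the sketch assumes `.squad/` is resolved relative to a project root.
+
+```typescript
+import { existsSync } from 'node:fs';
+import { homedir } from 'node:os';
+import { join } from 'node:path';
+
+// Hypothetical helper: returns the first manifest path that exists, or null.
+// `exists` is injectable so the lookup order can be exercised without touching disk.
+function resolveManifestPath(
+  projectRoot: string,
+  podId: string,
+  exists: (path: string) => boolean = existsSync,
+): string | null {
+  const candidates = [
+    join(projectRoot, '.squad', `machine-capabilities-${podId}.json`), // 1. pod-specific
+    join(projectRoot, '.squad', 'machine-capabilities.json'),          // 2. shared
+    join(homedir(), '.squad', 'machine-capabilities.json'),            // 3. user home
+  ];
+  // First hit wins; later candidates are never read or merged.
+  return candidates.find((candidate) => exists(candidate)) ?? null;    // 4. null
+}
+```
+
+Because the search stops at the first hit, a pod-specific file fully replaces the shared one for that pod.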
+
+---
+
+## SDK programmatic access
+
+`@bradygaster/squad-sdk/ralph/capabilities` adds the following exports:
+
+```typescript
+import {
+  getDeploymentMode,
+  getPodId,
+  type DeploymentMode,
+} from '@bradygaster/squad-sdk/ralph/capabilities';
+
+const mode: DeploymentMode = getDeploymentMode();  // 'agent-per-node' | 'squad-per-pod'
+const podId: string | undefined = getPodId();       // e.g. 'worker-1', or undefined
+```
+
+These are pure env-var readers. They don't cache or memoize — each call reads `process.env` directly so changes between reads are visible.
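+
+For illustration, readers with this behavior could look like the sketch below. It is not the SDK's source: `readDeploymentMode` and `readPodId` are hypothetical stand-ins, and treating an unrecognised mode or an empty pod ID as unset is an assumption.
+
+```typescript
+type DeploymentMode = 'agent-per-node' | 'squad-per-pod';
+type Env = Record<string, string | undefined>;
+
+// Hypothetical stand-in for getDeploymentMode(): reads the env on every call.
+function readDeploymentMode(env: Env = process.env): DeploymentMode {
+  // Assumption: anything other than 'squad-per-pod' falls back to the default.
+  return env.SQUAD_DEPLOYMENT_MODE === 'squad-per-pod' ? 'squad-per-pod' : 'agent-per-node';
+}
+
+// Hypothetical stand-in for getPodId(): an empty string is treated as unset (assumption).
+function readPodId(env: Env = process.env): string | undefined {
+  const podId = env.SQUAD_POD_ID;
+  return podId ? podId : undefined;
+}
+```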
+
+---
+
+## Typical Kubernetes deployment shape
+
+In a KEDA-scaled deployment (see [KEDA Scaling](/squad/docs/features/keda-scaling/)), each scaled pod gets a unique `SQUAD_POD_ID` from the pod's name or hash:
+
+```yaml
+# Deployment env block
+env:
+  - name: SQUAD_DEPLOYMENT_MODE
+    value: squad-per-pod
+  - name: SQUAD_POD_ID
+    valueFrom:
+      fieldRef:
+        fieldPath: metadata.name
+```
+
+The pod's mounted volume contains per-pod manifests baked in by the image build or pulled from a ConfigMap, e.g.:
+
+```
+/app/.squad/
+├── machine-capabilities.json           # shared baseline (CPU, memory)
+├── machine-capabilities-gpu-pool-node-1.json   # full manifest: baseline plus GPU (no merging)
+├── machine-capabilities-gpu-pool-node-2.json   # same shape
+└── machine-capabilities-cpu-pool-node-1.json   # no GPU declaration
+```
+
+Pods scheduled onto GPU nodes load a manifest declaring GPU capability; pods on CPU-only nodes get a manifest without GPU. Ralph's issue dispatcher routes `needs:gpu`-labeled work only to pods with the GPU capability.
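+
+The matching rule can be illustrated with a small sketch. The manifest shape (`capabilities: string[]`) and the `canHandle` helper are assumptions made for this example; the real schema is described in Capability Routing and may differ.
+
+```typescript
+// Assumed manifest shape, for illustration only.
+interface CapabilityManifest {
+  capabilities: string[];
+}
+
+// Hypothetical check: a pod can take an issue only if every `needs:*` label
+// on the issue is covered by a capability in the pod's loaded manifest.
+function canHandle(issueLabels: string[], manifest: CapabilityManifest): boolean {
+  const required = issueLabels
+    .filter((label) => label.startsWith('needs:'))
+    .map((label) => label.slice('needs:'.length));
+  return required.every((capability) => manifest.capabilities.includes(capability));
+}
+```
+
+Each pod runs this kind of check against its own manifest; nothing is shared between pods.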
+
+---
+
+## Limitations
+
+- **No automatic pod discovery.** The SDK reads env vars to know who it is; it doesn't enumerate sibling pods or coordinate work distribution. That's the deployment orchestrator's job (KEDA, scheduler).
+- **No central capability registry.** Pods don't publish their capabilities back to anything; each pod evaluates issues against its own loaded manifest independently. If you need a central view, your orchestrator must aggregate.
+- **Changing which manifest a pod uses requires a redeploy or restart.** The fallback lookup runs on every capability resolution and manifest content is re-read from disk each time, so edits to an existing file are picked up. The manifest *path*, however, is derived from env vars that are set at process start.
+
+---
+
+## See also
+
+- [Capability Routing](/squad/docs/features/capability-routing/) — the broader machine-capability system
+- [KEDA Scaling](/squad/docs/features/keda-scaling/) — autoscaling Squad pods on demand
+- [Labels](/squad/docs/features/labels/) — `needs:*` label conventions used for capability matching
diff --git a/docs/src/content/docs/features/error-recovery.md b/docs/src/content/docs/features/error-recovery.md
@@ -0,0 +1,95 @@
+---
+title: Error Recovery — Standard Failure Patterns
+description: Built-in skill teaching agents to adapt when things fail, using retry with backoff, fallback alternatives, diagnose-and-fix, reframe-and-retry, and escalation patterns.
+---
+
+# Error Recovery — Standard Failure Patterns
+
+> ⚠️ **Experimental** — Squad is alpha software. APIs, commands, and behavior may change between releases.
+
+The `error-recovery` skill teaches every squad agent to **adapt** when something fails, not just report the failure. It ships as a built-in skill at `.copilot/skills/error-recovery/SKILL.md` and is available to every spawned agent.
+
+Without this skill, agents tend to encounter a failure (CI test red, API timeout, missing dependency) and stop. With it, they apply standard patterns to diagnose, retry, or escalate the right way.
+
+---
+
+## The five recovery patterns
+
+### 1. Retry with Backoff
+
+**When:** Transient failures — API timeouts, rate limits, network errors, temporary service unavailability.
+
+**Pattern:**
+1. Wait briefly, then retry (start at 2s, double each attempt)
+2. Maximum 3 retries before escalating
+3. Log each attempt with the error received
+
+**Example:** API call returns `429 Too Many Requests` → wait 2s → retry → wait 4s → retry → wait 8s → retry → escalate if still failing.
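+
+As a rough sketch of this pattern in code (the skill itself is prose guidance for agents, so `retryWithBackoff` and its options are hypothetical names chosen for illustration):
+
+```typescript
+// Hypothetical helper illustrating the pattern; names and defaults are assumptions.
+async function retryWithBackoff<T>(
+  operation: () => Promise<T>,
+  options: { maxRetries?: number; initialDelayMs?: number; sleep?: (ms: number) => Promise<void> } = {},
+): Promise<T> {
+  const {
+    maxRetries = 3,
+    initialDelayMs = 2000,
+    sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)),
+  } = options;
+
+  let delayMs = initialDelayMs;
+  for (let retry = 0; ; retry++) {
+    try {
+      return await operation();
+    } catch (error) {
+      // Retries exhausted: rethrow so the caller can escalate.
+      if (retry >= maxRetries) throw error;
+      console.warn(`Attempt ${retry + 1} failed: ${String(error)}; retrying in ${delayMs}ms`);
+      await sleep(delayMs);
+      delayMs *= 2; // 2s, then 4s, then 8s
+    }
+  }
+}
+```
+
+With the defaults, a persistently failing call is attempted four times in total (the initial attempt plus three retries), waiting 2s, 4s, and 8s in between.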
+
+### 2. Fallback Alternatives
+
+**When:** Primary tool or approach fails and an alternative exists.
+
+**Pattern:**
+1. Attempt primary approach
+2. On failure, identify alternative tool/method
+3. Try the alternative with the same intent
+4. Document which alternative was used and why
+
+**Example:** Primary CLI tool fails → fall back to direct API call for the same operation. Or: `gh pr comment` rate-limited → fall back to `gh api -X POST .../issues/{n}/comments`.
+
+### 3. Diagnose-and-Fix
+
+**When:** Build failures, test failures, linting errors — structured errors with actionable output.
+
+**Pattern:**
+1. Read the full error output carefully (not just the last line)
+2. Identify the root cause from error messages
+3. Attempt a targeted fix
+4. Re-run to verify the fix
+5. If 3 fix attempts fail, escalate with a diagnostic summary
+
+**Example:** TypeScript build fails with `Cannot find module '@x/y'` → check `package.json`, run `npm install`, re-run build.
+
+### 4. Reframe-and-Retry
+
+**When:** The approach itself is wrong (not just the implementation). User feedback like *"that won't work because..."* or *"try a different way"*.
+
+**Pattern:**
+1. Stop the current approach immediately
+2. Re-read the original task description
+3. Identify what assumption was wrong
+4. Propose 2 alternative approaches before picking one
+5. Get user confirmation if the cost of being wrong again is high
+
+### 5. Escalation
+
+**When:** Three attempts have failed, OR the failure is outside the agent's domain, OR fixing it would violate a team decision.
+
+**Pattern:**
+1. Stop attempting fixes
+2. Summarize: what was tried, what failed, what's known
+3. Surface to coordinator with a clear ask (*"need lead's call on architecture"* vs. *"need human approval"* vs. *"need access to X system"*)
+4. Document the escalation in `decisions/inbox/` if it's a recurring pattern
+
+---
+
+## When NOT to apply these patterns
+
+- **Don't retry on user-input errors.** If the user typed `gh repo create my-typo`, don't retry with `my-typoo`. Surface and ask.
+- **Don't fall back silently on security-sensitive operations.** If `git push origin main` fails because of branch protection, do NOT fall back to `--force`.
+- **Don't escalate without context.** *"It failed"* isn't an escalation; *"three attempts, each with `EACCES`, suggests user lacks write to `.squad/`, recommend chmod or different storage path"* is.
+
+---
+
+## Integration with Reviewer Rejection Protocol
+
+When the failure is a Reviewer rejection (a Reviewer agent rejects an artifact), the [Reviewer Rejection Protocol](/squad/docs/features/reviewer-protocol/) takes precedence. The original author is locked out and a different agent must own the revision. Error-recovery patterns apply within that constraint — the revision agent can use retry/fallback/diagnose patterns freely.
+
+---
+
+## See also
+
+- [Reflect](/squad/docs/features/reflect/) — learning from corrections
+- [Reviewer Protocol](/squad/docs/features/reviewer-protocol/) — when a Reviewer rejects work
+- [Skills](/squad/docs/features/skills/) — how built-in skills work
diff --git a/docs/src/content/docs/features/export-import.md b/docs/src/content/docs/features/export-import.md
@@ -31,6 +31,28 @@ Creates `squad-export.json` in the current directory — a portable snapshot of
 squad export --out ./backups/my-team.json
 ```
 
+### Push directly to a GitHub repository
+
+Instead of writing to a local file, you can push the export straight to a GitHub repo via the GitHub Contents API. This is the easiest way to back up your team to a private repo or share it with collaborators without sending a file.
+
+```bash
+# Export to a GitHub repo (uses default branch)
+squad export --repo myorg/squad-backups
+
+# Export to a specific branch
+squad export --repo myorg/squad-backups --branch nightly
+```
+
+Requirements:
+- GitHub CLI (`gh`) installed and authenticated with permission to push to the target repo
+- The repo must exist (the export does NOT create it)
+
+The export lands at the repo root as `squad-export.json` by default. Combine with `--out` to control the filename inside the repo:
+
+```bash
+squad export --repo myorg/squad-backups --out my-team-2026-06-11.json
+```
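+
+For readers unfamiliar with the GitHub Contents API, the sketch below shows the general shape of a "create or update file contents" request (`PUT /repos/{owner}/{repo}/contents/{path}`). It is not Squad's implementation: `buildContentsPut` is a hypothetical helper and the commit message is a placeholder.
+
+```typescript
+// Hypothetical helper: shapes a Contents API request for uploading one file.
+interface ContentsPut {
+  endpoint: string;
+  body: { message: string; content: string; branch?: string; sha?: string };
+}
+
+function buildContentsPut(
+  repo: string,          // "owner/name"
+  path: string,          // file path inside the repo
+  exportJson: string,    // serialized snapshot
+  branch?: string,       // omitted: the repo's default branch is used
+  existingSha?: string,  // blob SHA, required by the API when overwriting a file
+): ContentsPut {
+  const encodedPath = path.split('/').map(encodeURIComponent).join('/');
+  return {
+    endpoint: `repos/${repo}/contents/${encodedPath}`,
+    body: {
+      message: `Update ${path}`, // placeholder commit message
+      content: Buffer.from(exportJson, 'utf8').toString('base64'), // API expects base64
+      ...(branch ? { branch } : {}),
+      ...(existingSha ? { sha: existingSha } : {}),
+    },
+  };
+}
+```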
+
 ### What's included
 
 | Data | Included |
@@ -53,6 +75,25 @@ squad import squad-export.json
 
 Imports the snapshot into the current repo's `.squad/` directory.
 
+### Pull directly from a GitHub repository
+
+You can import a snapshot directly from a GitHub repo without downloading the file first:
+
+```bash
+# Import from default branch of a repo
+squad import --repo myorg/squad-backups
+
+# Import from a specific branch, or a specific filename
+squad import --repo myorg/squad-backups --branch nightly
+squad import --repo myorg/squad-backups --out my-team-2026-06-11.json
+```
+
+Requirements:
+- GitHub CLI (`gh`) installed and authenticated with read access to the source repo
+- The export file must exist at the named path in the repo (default: `squad-export.json` at repo root)
+
+Use `--force` together with `--repo` for the same archive-then-replace behavior as the file-based import.
+
 ### Collision detection
 
 If `.squad/` already exists, Squad warns you and stops. To archive the existing team and replace it: