4 changes: 3 additions & 1 deletion cspell.json
@@ -39,7 +39,9 @@
"benleane", "TELMU", "Automator", "kedacore",
"DEVBOX", "myaccount", "graphify", "Graphify", "graphifyy", "safishamsi", "jagilber",
"benchmarkfn", "filterissuesbystream", "filterissuesbyworkstream", "Recognised", "recognised",
- "squadified", "TOCTOU", "unflushed", "unparseable", "pluggability"
+ "squadified", "TOCTOU", "unflushed", "unparseable", "pluggability",
+ "PKCE", "MSAL", "AKIA", "runas",
+ "typoo"
],
"dictionaries": ["en_US", "typescript", "node", "npm", "bash"],
"allowCompoundWords": true
132 changes: 132 additions & 0 deletions docs/src/content/docs/features/dual-mode-deployment.md
@@ -0,0 +1,132 @@
---
title: Dual-Mode Deployment — Pod-Aware Capabilities
description: Run Squad in either agent-per-node or squad-per-pod deployment modes with pod-specific machine capability manifests, controlled by SQUAD_POD_ID and SQUAD_DEPLOYMENT_MODE env vars.
---

# Dual-Mode Deployment — Pod-Aware Capabilities

> ⚠️ **Experimental** — Squad is alpha software. APIs, commands, and behavior may change between releases.

Dual-mode deployment extends [Capability Routing](/squad/docs/features/capability-routing/) to support both classic single-machine setups and modern containerized/Kubernetes deployments where multiple Squad pods may share an organization's workload — each with potentially different machine capabilities.

It introduces two environment variables and a pod-specific manifest lookup pattern so the same Squad config can run identically in either deployment shape.

---

## The two deployment modes

| Mode | What it means | Capability manifest |
|------|---------------|---------------------|
| **`agent-per-node`** (default) | One Squad instance per machine; the machine's capabilities are the squad's capabilities | `.squad/machine-capabilities.json` (shared) |
| **`squad-per-pod`** | Multiple Squad pods may run on different machines/containers, each with potentially different capabilities | `.squad/machine-capabilities-{podId}.json` (pod-specific) with fallback chain |

Choose the mode via the `SQUAD_DEPLOYMENT_MODE` environment variable:

```bash
# Classic single-machine setup (default)
export SQUAD_DEPLOYMENT_MODE=agent-per-node

# Kubernetes / multi-pod setup
export SQUAD_DEPLOYMENT_MODE=squad-per-pod
export SQUAD_POD_ID=worker-1
```

If `SQUAD_DEPLOYMENT_MODE` is unset, the SDK defaults to `agent-per-node` for backward compatibility.

---

## Environment variables

### `SQUAD_DEPLOYMENT_MODE`

| Value | Behavior |
|-------|----------|
| `agent-per-node` | Single shared `machine-capabilities.json` |
| `squad-per-pod` | Pod-specific manifests with fallback chain |
| (unset) | Same as `agent-per-node` |

### `SQUAD_POD_ID`

Pod identifier used to construct the pod-specific manifest path. Required when `SQUAD_DEPLOYMENT_MODE=squad-per-pod`; ignored otherwise.

```bash
SQUAD_POD_ID=worker-1 # → .squad/machine-capabilities-worker-1.json
SQUAD_POD_ID=gpu-pool-node-3 # → .squad/machine-capabilities-gpu-pool-node-3.json
```

---

## The fallback chain (squad-per-pod mode)

When `SQUAD_DEPLOYMENT_MODE=squad-per-pod` AND `SQUAD_POD_ID` is set, the SDK looks up capabilities in this order:

1. **`.squad/machine-capabilities-{podId}.json`** — pod-specific (highest priority)
2. **`.squad/machine-capabilities.json`** — shared fallback for capabilities that apply to all pods
3. **`~/.squad/machine-capabilities.json`** — user-home fallback (rarely useful in container deployments)
4. **`null`** — opt-out; capability routing falls back to label-only routing

The first manifest that exists is loaded; the search stops there (no merging). If you need different pods to see different capability sets, give each its own pod-specific file. If you need a shared baseline plus pod-specific additions, merge at the deployment-config level (Helm, Kustomize, etc.) — the SDK doesn't merge automatically.
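The lookup order above can be sketched as a first-match search over candidate paths. This is a minimal illustration, not the SDK's source: `candidateManifestPaths` and `resolveManifestPath` are hypothetical names, and the project root and home directory are passed in as parameters so the logic stays easy to test.

```typescript
import { existsSync } from 'node:fs';
import { homedir } from 'node:os';
import { join } from 'node:path';

// Hypothetical helper (not an SDK export): candidate manifest paths for
// squad-per-pod mode, listed from highest to lowest priority.
function candidateManifestPaths(
  projectRoot: string,
  podId: string,
  home: string = homedir(),
): string[] {
  return [
    join(projectRoot, '.squad', `machine-capabilities-${podId}.json`), // pod-specific
    join(projectRoot, '.squad', 'machine-capabilities.json'), // shared fallback
    join(home, '.squad', 'machine-capabilities.json'), // user-home fallback
  ];
}

// Returns the first candidate that exists, or null when none does.
// No merging happens: the search stops at the first hit.
function resolveManifestPath(
  candidates: string[],
  exists: (path: string) => boolean = existsSync,
): string | null {
  return candidates.find((path) => exists(path)) ?? null;
}
```

A `null` result corresponds to step 4 in the list: no manifest is loaded and routing relies on labels alone.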

---

## SDK programmatic access

The `@bradygaster/squad-sdk/ralph/capabilities` entry point adds these exports:

```typescript
import {
  getDeploymentMode,
  getPodId,
  type DeploymentMode,
} from '@bradygaster/squad-sdk/ralph/capabilities';

const mode: DeploymentMode = getDeploymentMode(); // 'agent-per-node' | 'squad-per-pod'
const podId: string | undefined = getPodId(); // e.g. 'worker-1', or undefined
```

These are pure env-var readers. They don't cache or memoize — each call reads `process.env` directly so changes between reads are visible.
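To illustrate what "no caching" means in practice, here is a rough sketch using local stand-ins for the two readers. `readDeploymentMode` and `readPodId` are hypothetical names, not the SDK functions, and the handling of unrecognized values is an assumption of this sketch rather than documented SDK behavior.

```typescript
// Hypothetical stand-ins for illustration; the SDK's real implementation
// is not shown in this document.
type Mode = 'agent-per-node' | 'squad-per-pod';
type Env = Record<string, string | undefined>;

function readDeploymentMode(env: Env = process.env): Mode {
  // Assumption: only the exact value 'squad-per-pod' switches modes;
  // unset (and, in this sketch, any other value) resolves to the default.
  return env.SQUAD_DEPLOYMENT_MODE === 'squad-per-pod' ? 'squad-per-pod' : 'agent-per-node';
}

function readPodId(env: Env = process.env): string | undefined {
  // Treat an empty string the same as unset (an assumption of this sketch).
  return env.SQUAD_POD_ID || undefined;
}

// Because nothing is memoized, a change between two reads is visible.
const demoEnv: Env = {};
const modeBefore = readDeploymentMode(demoEnv);
demoEnv.SQUAD_DEPLOYMENT_MODE = 'squad-per-pod';
demoEnv.SQUAD_POD_ID = 'worker-1';
const modeAfter = readDeploymentMode(demoEnv);
console.log(modeBefore, modeAfter, readPodId(demoEnv));
// prints "agent-per-node squad-per-pod worker-1"
```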

---

## Typical Kubernetes deployment shape

In a KEDA-scaled deployment (see [KEDA Scaling](/squad/docs/features/keda-scaling/)), each scaled pod gets a unique `SQUAD_POD_ID` from the pod's name or hash:

```yaml
# Deployment env block
env:
  - name: SQUAD_DEPLOYMENT_MODE
    value: squad-per-pod
  - name: SQUAD_POD_ID
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
```

The pod's mounted volume contains per-pod manifests baked in by the image build or pulled from a ConfigMap, e.g.:

```
/app/.squad/
├── machine-capabilities.json # shared baseline (CPU, memory)
├── machine-capabilities-gpu-pool-node-1.json # complete manifest: baseline plus GPU (the SDK does not merge files)
├── machine-capabilities-gpu-pool-node-2.json # same shape
└── machine-capabilities-cpu-pool-node-1.json # no GPU declaration
```

Pods scheduled onto GPU nodes load a manifest declaring GPU capability; pods on CPU-only nodes get a manifest without GPU. Ralph's issue dispatcher routes `needs:gpu`-labeled work only to pods with the GPU capability.
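As a rough sketch of the matching idea, each pod can compare an issue's `needs:*` labels against its own loaded manifest. The manifest shape used here (`capabilities: string[]`) and the function names are assumptions made for illustration; the actual schema and dispatcher logic belong to Capability Routing and are not shown in this document.

```typescript
// Assumed manifest shape, for illustration only.
interface CapabilityManifest {
  capabilities: string[];
}

// Extracts capability names from `needs:*` labels, e.g. 'needs:gpu' -> 'gpu'.
function requiredCapabilities(labels: string[]): string[] {
  const prefix = 'needs:';
  return labels
    .filter((label) => label.startsWith(prefix))
    .map((label) => label.slice(prefix.length));
}

// A pod can take an issue only if its manifest declares every required capability.
function podCanTake(labels: string[], manifest: CapabilityManifest): boolean {
  const available = new Set(manifest.capabilities);
  return requiredCapabilities(labels).every((capability) => available.has(capability));
}

const gpuPod: CapabilityManifest = { capabilities: ['cpu', 'gpu'] };
const cpuPod: CapabilityManifest = { capabilities: ['cpu'] };
console.log(podCanTake(['bug', 'needs:gpu'], gpuPod)); // true
console.log(podCanTake(['bug', 'needs:gpu'], cpuPod)); // false
```

Each pod evaluates this independently against its own manifest, which matches the "no central registry" limitation described in this document.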

---

## Limitations

- **No automatic pod discovery.** The SDK reads env vars to know who it is; it doesn't enumerate sibling pods or coordinate work distribution. That's the deployment orchestrator's job (KEDA, scheduler).
- **No central capability registry.** Pods don't publish their capabilities back to anything; each pod evaluates issues against its own loaded manifest independently. If you need a central view, your orchestrator must aggregate.
- **Switching manifests requires a redeploy or restart.** The fallback lookup runs on each capability resolution and manifest content is read from disk each time, but the manifest *path* is decided by env vars set at process start.

---

## See also

- [Capability Routing](/squad/docs/features/capability-routing/) — the broader machine-capability system
- [KEDA Scaling](/squad/docs/features/keda-scaling/) — autoscaling Squad pods on demand
- [Labels](/squad/docs/features/labels/) — `needs:*` label conventions used for capability matching
95 changes: 95 additions & 0 deletions docs/src/content/docs/features/error-recovery.md
@@ -0,0 +1,95 @@
---
title: Error Recovery — Standard Failure Patterns
description: Built-in skill teaching agents to adapt when things fail — retry with backoff, fallback alternatives, diagnose-and-fix, and escalation patterns.
---

# Error Recovery — Standard Failure Patterns

> ⚠️ **Experimental** — Squad is alpha software. APIs, commands, and behavior may change between releases.

The `error-recovery` skill teaches every squad agent to **adapt** when something fails, not just report the failure. It ships as a built-in skill at `.copilot/skills/error-recovery/SKILL.md` and is available to every spawned agent.

Without this skill, agents tend to encounter a failure (CI test red, API timeout, missing dependency) and stop. With it, they apply standard patterns to diagnose, retry, or escalate the right way.

---

## The five recovery patterns

### 1. Retry with Backoff

**When:** Transient failures — API timeouts, rate limits, network errors, temporary service unavailability.

**Pattern:**
1. Wait briefly, then retry (start at 2s, double each attempt)
2. Maximum 3 retries before escalating
3. Log each attempt with the error received

**Example:** API call returns `429 Too Many Requests` → wait 2s → retry → wait 4s → retry → wait 8s → retry → escalate if still failing.
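The pattern above can be sketched as a small helper. This is an illustrative sketch, not code from the skill: `retryWithBackoff`, `backoffDelays`, and the option names are hypothetical, and `sleep` is injectable so the schedule can be exercised without real waiting.

```typescript
// Delays before each retry: base, 2x base, 4x base, ...
function backoffDelays(baseMs: number, maxRetries: number): number[] {
  return Array.from({ length: maxRetries }, (_, i) => baseMs * 2 ** i);
}

interface RetryOptions {
  baseMs?: number; // first delay, 2s by default
  maxRetries?: number; // retries after the initial attempt, 3 by default
  sleep?: (ms: number) => Promise<void>;
  log?: (message: string) => void;
}

const realSleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Runs `operation`, retrying with doubling delays; throws once retries are
// exhausted so the caller can escalate.
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  options: RetryOptions = {},
): Promise<T> {
  const { baseMs = 2000, maxRetries = 3, sleep = realSleep, log = console.warn } = options;
  const delays = backoffDelays(baseMs, maxRetries);
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    if (attempt > 0) await sleep(delays[attempt - 1]);
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      log(`attempt ${attempt + 1} failed: ${String(error)}`); // log each attempt
    }
  }
  throw new Error(`escalate: still failing after ${maxRetries} retries (${String(lastError)})`);
}
```

With the defaults, the schedule matches the example: one initial attempt, then retries after 2s, 4s, and 8s. A real caller would typically retry only on errors it recognizes as transient.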

### 2. Fallback Alternatives

**When:** Primary tool or approach fails and an alternative exists.

**Pattern:**
1. Attempt primary approach
2. On failure, identify alternative tool/method
3. Try the alternative with the same intent
4. Document which alternative was used and why

**Example:** Primary CLI tool fails → fall back to direct API call for the same operation. Or: `gh pr comment` rate-limited → fall back to `gh api -X POST .../issues/{n}/comments`.
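One way to sketch this pattern is an ordered list of approaches that share the same intent, tried in turn, with the result recording which one succeeded and why earlier ones were skipped. The names `Approach`, `FallbackResult`, and `withFallback` are hypothetical and chosen for illustration.

```typescript
interface Approach<T> {
  name: string;
  run: () => Promise<T>;
}

interface FallbackResult<T> {
  value: T;
  usedApproach: string; // which approach produced the value
  failures: string[]; // why the earlier approaches were abandoned
}

// Tries each approach in order and returns the first success, along with
// a record of what failed before it.
async function withFallback<T>(approaches: Approach<T>[]): Promise<FallbackResult<T>> {
  const failures: string[] = [];
  for (const approach of approaches) {
    try {
      const value = await approach.run();
      return { value, usedApproach: approach.name, failures };
    } catch (error) {
      failures.push(`${approach.name}: ${String(error)}`);
    }
  }
  throw new Error(`all approaches failed: ${failures.join('; ')}`);
}
```

The `usedApproach` and `failures` fields cover step 4 of the pattern (document which alternative was used and why). The list should contain only alternatives that are safe substitutes for the primary approach.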

### 3. Diagnose-and-Fix

**When:** Build failures, test failures, linting errors — structured errors with actionable output.

**Pattern:**
1. Read the full error output carefully (not just the last line)
2. Identify the root cause from error messages
3. Attempt a targeted fix
4. Re-run to verify the fix
5. If 3 fix attempts fail, escalate with a diagnostic summary

**Example:** TypeScript build fails with `Cannot find module '@x/y'` → check `package.json`, run `npm install`, re-run build.
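The loop structure of this pattern can be sketched as follows. `diagnoseAndFix`, `CheckResult`, and the callback signatures are hypothetical; the callbacks stand in for whatever build, test, or lint command and fix step apply in a given case.

```typescript
interface CheckResult {
  ok: boolean;
  output: string; // full error output, not just the last line
}

interface FixOutcome {
  fixed: boolean;
  attempts: number;
  diagnostics: string[]; // outputs collected for an escalation summary
}

// Runs the check, applies a targeted fix on failure, and re-runs to verify.
// Gives up after `maxFixAttempts` so the caller can escalate with diagnostics.
function diagnoseAndFix(
  runCheck: () => CheckResult,
  attemptFix: (fullOutput: string, attempt: number) => void,
  maxFixAttempts = 3,
): FixOutcome {
  const diagnostics: string[] = [];
  let result = runCheck();
  let attempts = 0;
  while (!result.ok && attempts < maxFixAttempts) {
    diagnostics.push(result.output);
    attempts += 1;
    attemptFix(result.output, attempts);
    result = runCheck(); // verify the fix
  }
  if (!result.ok) diagnostics.push(result.output);
  return { fixed: result.ok, attempts, diagnostics };
}
```

When `fixed` is `false`, `diagnostics` holds every failing output in order, which is the raw material for the diagnostic summary in step 5.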

### 4. Reframe-and-Retry

**When:** The approach itself is wrong (not just the implementation), typically signaled by user feedback like *"that won't work because..."* or *"try a different way"*.

**Pattern:**
1. Stop the current approach immediately
2. Re-read the original task description
3. Identify what assumption was wrong
4. Propose 2 alternative approaches before picking one
5. Get user confirmation if the cost of being wrong again is high

### 5. Escalation

**When:** Three attempts have failed, OR the failure is outside the agent's domain, OR fixing it would violate a team decision.

**Pattern:**
1. Stop attempting fixes
2. Summarize: what was tried, what failed, what's known
3. Surface to coordinator with a clear ask (*"need lead's call on architecture"* vs. *"need human approval"* vs. *"need access to X system"*)
4. Document the escalation in `decisions/inbox/` if it's a recurring pattern

---

## When NOT to apply these patterns

- **Don't retry on user-input errors.** If the user typed `gh repo create my-typo`, don't retry with `my-typoo`. Surface and ask.
- **Don't fall back silently on security-sensitive operations.** If `git push origin main` fails because of branch protection, do NOT fall back to `--force`.
- **Don't escalate without context.** *"It failed"* isn't an escalation; *"three attempts, each with `EACCES`, suggests user lacks write to `.squad/`, recommend chmod or different storage path"* is.

---

## Integration with Reviewer Rejection Protocol

When the failure is a Reviewer rejection (a Reviewer agent rejects an artifact), the [Reviewer Rejection Protocol](/squad/docs/features/reviewer-protocol/) takes precedence. The original author is locked out and a different agent must own the revision. Error-recovery patterns apply within that constraint — the revision agent can use retry/fallback/diagnose patterns freely.

---

## See also

- [Reflect](/squad/docs/features/reflect/) — learning from corrections
- [Reviewer Protocol](/squad/docs/features/reviewer-protocol/) — when a Reviewer rejects work
- [Skills](/squad/docs/features/skills/) — how built-in skills work
41 changes: 41 additions & 0 deletions docs/src/content/docs/features/export-import.md
@@ -31,6 +31,28 @@ Creates `squad-export.json` in the current directory — a portable snapshot of
squad export --out ./backups/my-team.json
```

### Push directly to a GitHub repository

Instead of writing to a local file, you can push the export straight to a GitHub repo via the GitHub Contents API. This is the easiest way to back up your team to a private repo or share it with collaborators without sending a file.

```bash
# Export to a GitHub repo (uses default branch)
squad export --repo myorg/squad-backups

# Export to a specific branch
squad export --repo myorg/squad-backups --branch nightly
```

Requirements:
- GitHub CLI (`gh`) installed and authenticated with permission to push to the target repo
- The repo must exist (the export does NOT create it)

The export lands at the repo root as `squad-export.json` by default. Combine with `--out` to control the filename inside the repo:

```bash
squad export --repo myorg/squad-backups --out my-team-2026-06-11.json
```
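For orientation, here is a rough sketch of the request shape that the GitHub Contents API expects when creating or updating a file. It is not Squad's implementation: `buildContentsPut`, its options, and the commit message are hypothetical. What the API itself requires is a `PUT` to `/repos/{owner}/{repo}/contents/{path}` with a commit `message`, base64-encoded `content`, an optional `branch`, and the existing blob `sha` when overwriting a file.

```typescript
interface ContentsPutRequest {
  method: 'PUT';
  endpoint: string;
  body: { message: string; content: string; branch?: string; sha?: string };
}

// Builds the request for creating or updating one file through the Contents API.
// `existingSha` is required by GitHub when the file already exists.
function buildContentsPut(
  repo: string, // "owner/name", e.g. "myorg/squad-backups"
  path: string, // file path inside the repo; simple names only in this sketch
  fileText: string,
  opts: { branch?: string; existingSha?: string } = {},
): ContentsPutRequest {
  const body: ContentsPutRequest['body'] = {
    message: `Update ${path}`, // hypothetical commit message
    content: Buffer.from(fileText, 'utf8').toString('base64'),
  };
  if (opts.branch) body.branch = opts.branch; // omitted means the default branch
  if (opts.existingSha) body.sha = opts.existingSha;
  return { method: 'PUT', endpoint: `/repos/${repo}/contents/${path}`, body };
}

const exportRequest = buildContentsPut(
  'myorg/squad-backups',
  'squad-export.json',
  JSON.stringify({ team: 'example' }),
  { branch: 'nightly' },
);
console.log(exportRequest.endpoint);
// prints "/repos/myorg/squad-backups/contents/squad-export.json"
```

Paths containing spaces or other special characters would need URL encoding, which this sketch leaves out.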

### What's included

| Data | Included |
@@ -53,6 +75,25 @@ squad import squad-export.json

Imports the snapshot into the current repo's `.squad/` directory.

### Pull directly from a GitHub repository

You can import a snapshot directly from a GitHub repo without downloading the file first:

```bash
# Import from default branch of a repo
squad import --repo myorg/squad-backups

# Import a specific filename or branch
squad import --repo myorg/squad-backups --branch nightly
squad import --repo myorg/squad-backups --out my-team-2026-06-11.json
```

Requirements:
- GitHub CLI (`gh`) installed and authenticated with read access to the source repo
- The export file must exist at the named path in the repo (default: `squad-export.json` at repo root)

Use `--force` together with `--repo` for the same archive-then-replace behavior as the file-based import.

### Collision detection

If `.squad/` already exists, Squad warns you and stops. To archive the existing team and replace it: