bun install # install dependencies
bun test # run free tests (browse + snapshot + skill validation)
bun run test:evals # run paid evals: LLM judge + E2E (diff-based, ~$4/run max)
bun run test:evals:all # run ALL paid evals regardless of diff
bun run test:e2e # run E2E tests only (diff-based, ~$3.85/run max)
bun run test:e2e:all # run ALL E2E tests regardless of diff
bun run eval:select # show which tests would run based on current diff
bun run dev <cmd> # run CLI in dev mode, e.g. bun run dev goto https://example.com
bun run build # gen docs + compile binaries
bun run gen:skill-docs # regenerate SKILL.md files from templates
bun run skill:check # health dashboard for all skills
bun run dev:skill # watch mode: auto-regen + validate on change
bun run eval:list # list all eval runs from ~/.gstack-dev/evals/
bun run eval:compare # compare two eval runs (auto-picks most recent)
bun run eval:summary # aggregate stats across all eval runstest:evals requires ANTHROPIC_API_KEY. Codex E2E tests (test/codex-e2e.test.ts)
use Codex's own auth from ~/.codex/ config — no OPENAI_API_KEY env var needed.
E2E tests stream progress in real-time (tool-by-tool via --output-format stream-json --verbose). Results are persisted to ~/.gstack-dev/evals/ with auto-comparison
against the previous run.
Diff-based test selection: test:evals and test:e2e auto-select tests based
on git diff against the base branch. Each test declares its file dependencies in
test/helpers/touchfiles.ts. Changes to global touchfiles (session-runner, eval-store,
llm-judge, gen-skill-docs) trigger all tests. Use EVALS_ALL=1 or the :all script
variants to force all tests. Run eval:select to preview which tests would run.
bun test # run before every commit — free, <2s
bun run test:evals # run before shipping — paid, diff-based (~$4/run max)bun test runs skill validation, gen-skill-docs quality checks, and browse
integration tests. bun run test:evals runs LLM-judge quality evals and E2E
tests via claude -p. Both must pass before creating a PR.
gstack/
├── browse/ # Headless browser CLI (Playwright)
│ ├── src/ # CLI + server + commands
│ │ ├── commands.ts # Command registry (single source of truth)
│ │ └── snapshot.ts # SNAPSHOT_FLAGS metadata array
│ ├── test/ # Integration tests + fixtures
│ └── dist/ # Compiled binary
├── scripts/ # Build + DX tooling
│ ├── gen-skill-docs.ts # Template → SKILL.md generator
│ ├── skill-check.ts # Health dashboard
│ └── dev-skill.ts # Watch mode
├── test/ # Skill validation + eval tests
│ ├── helpers/ # skill-parser.ts, session-runner.ts, llm-judge.ts, eval-store.ts
│ ├── fixtures/ # Ground truth JSON, planted-bug fixtures, eval baselines
│ ├── skill-validation.test.ts # Tier 1: static validation (free, <1s)
│ ├── gen-skill-docs.test.ts # Tier 1: generator quality (free, <1s)
│ ├── skill-llm-eval.test.ts # Tier 3: LLM-as-judge (~$0.15/run)
│ └── skill-e2e.test.ts # Tier 2: E2E via claude -p (~$3.85/run)
├── qa-only/ # /qa-only skill (report-only QA, no fixes)
├── plan-design-review/ # /plan-design-review skill (report-only design audit)
├── design-review/ # /design-review skill (design audit + fix loop)
├── ship/ # Ship workflow skill
├── review/ # PR review skill
├── plan-ceo-review/ # /plan-ceo-review skill
├── plan-eng-review/ # /plan-eng-review skill
├── office-hours/ # /office-hours skill (YC Office Hours — startup diagnostic + builder brainstorm)
├── investigate/ # /investigate skill (systematic root-cause debugging)
├── retro/ # Retrospective skill
├── document-release/ # /document-release skill (post-ship doc updates)
├── setup # One-time setup: build binary + symlink skills
├── SKILL.md # Generated from SKILL.md.tmpl (don't edit directly)
├── SKILL.md.tmpl # Template: edit this, run gen:skill-docs
└── package.json # Build scripts for browse
SKILL.md files are generated from .tmpl templates. To update docs:
- Edit the
.tmplfile (e.g.SKILL.md.tmplorbrowse/SKILL.md.tmpl) - Run
bun run gen:skill-docs(orbun run buildwhich does it automatically) - Commit both the
.tmpland generated.mdfiles
To add a new browse command: add it to browse/src/commands.ts and rebuild.
To add a snapshot flag: add it to SNAPSHOT_FLAGS in browse/src/snapshot.ts and rebuild.
Skills must NEVER hardcode framework-specific commands, file patterns, or directory structures. Instead:
- Read CLAUDE.md for project-specific config (test commands, eval commands, etc.)
- If missing, AskUserQuestion — let the user tell you or let gstack search the repo
- Persist the answer to CLAUDE.md so we never have to ask again
This applies to test commands, eval commands, deploy commands, and any other project-specific behavior. The project owns its config; gstack reads it.
SKILL.md.tmpl files are prompt templates read by Claude, not bash scripts. Each bash code block runs in a separate shell — variables do not persist between blocks.
Rules:
- Use natural language for logic and state. Don't use shell variables to pass state between code blocks. Instead, tell Claude what to remember and reference it in prose (e.g., "the base branch detected in Step 0").
- Don't hardcode branch names. Detect
main/master/etc dynamically viagh pr vieworgh repo view. Use{{BASE_BRANCH_DETECT}}for PR-targeting skills. Use "the base branch" in prose,<base>in code block placeholders. - Keep bash blocks self-contained. Each code block should work independently. If a block needs context from a previous step, restate it in the prose above.
- Express conditionals as English. Instead of nested
if/elif/elsein bash, write numbered decision steps: "1. If X, do Y. 2. Otherwise, do Z."
When you need to interact with a browser (QA, dogfooding, cookie setup), use the
/browse skill or run the browse binary directly via $B <command>. NEVER use
mcp__claude-in-chrome__* tools — they are slow, unreliable, and not what this
project uses.
When developing gstack, .claude/skills/gstack may be a symlink back to this
working directory (gitignored). This means skill changes are live immediately —
great for rapid iteration, risky during big refactors where half-written skills
could break other Claude Code sessions using gstack concurrently.
Check once per session: Run ls -la .claude/skills/gstack to see if it's a
symlink or a real copy. If it's a symlink to your working directory, be aware that:
- Template changes +
bun run gen:skill-docsimmediately affect all gstack invocations - Breaking changes to SKILL.md.tmpl files can break concurrent gstack sessions
- During large refactors, remove the symlink (
rm .claude/skills/gstack) so the global install at~/.claude/skills/gstack/is used instead
For plan reviews: When reviewing plans that modify skill templates or the gen-skill-docs pipeline, consider whether the changes should be tested in isolation before going live (especially if the user is actively using gstack in other windows).
Always bisect commits. Every commit should be a single logical change. When you've made multiple changes (e.g., a rename + a rewrite + new tests), split them into separate commits before pushing. Each commit should be independently understandable and revertable.
Examples of good bisection:
- Rename/move separate from behavior changes
- Test infrastructure (touchfiles, helpers) separate from test implementations
- Template changes separate from generated file regeneration
- Mechanical refactors separate from new features
When the user says "bisect commit" or "bisect and push," split staged/unstaged changes into logical commits and push.
CHANGELOG.md is for users, not contributors. Write it like product release notes:
- Lead with what the user can now do that they couldn't before. Sell the feature.
- Use plain language, not implementation details. "You can now..." not "Refactored the..."
- Never mention TODOS.md, internal tracking, eval infrastructure, or contributor-facing details. These are invisible to users and meaningless to them.
- Put contributor/internal changes in a separate "For contributors" section at the bottom.
- Every entry should make someone think "oh nice, I want to try that."
- No jargon: say "every question now tells you which project and branch you're in" not "AskUserQuestion format standardized across skill templates via preamble resolver."
When estimating or discussing effort, always show both human-team and CC+gstack time:
| Task type | Human team | CC+gstack | Compression |
|---|---|---|---|
| Boilerplate / scaffolding | 2 days | 15 min | ~100x |
| Test writing | 1 day | 15 min | ~50x |
| Feature implementation | 1 week | 30 min | ~30x |
| Bug fix + regression test | 4 hours | 15 min | ~20x |
| Architecture / design | 2 days | 4 hours | ~5x |
| Research / exploration | 1 day | 3 hours | ~3x |
Completeness is cheap. Don't recommend shortcuts when the complete implementation is a "lake" (achievable) not an "ocean" (multi-quarter migration). See the Completeness Principle in the skill preamble for the full philosophy.
Contributors can store long-range vision docs and design documents in ~/.gstack-dev/plans/.
These are local-only (not checked in). When reviewing TODOS.md, check plans/ for candidates
that may be ready to promote to TODOs or implement.
When an E2E eval fails during /ship or any other workflow, never claim "not
related to our changes" without proving it. These systems have invisible couplings —
a preamble text change affects agent behavior, a new helper changes timing, a
regenerated SKILL.md shifts prompt context.
Required before attributing a failure to "pre-existing":
- Run the same eval on main (or base branch) and show it fails there too
- If it passes on main but fails on the branch — it IS your change. Trace the blame.
- If you can't run on main, say "unverified — may or may not be related" and flag it as a risk in the PR body
"Pre-existing" without receipts is a lazy claim. Prove it or don't say it.
The active skill lives at ~/.claude/skills/gstack/. After making changes:
- Push your branch
- Fetch and reset in the skill directory:
cd ~/.claude/skills/gstack && git fetch origin && git reset --hard origin/main - Rebuild:
cd ~/.claude/skills/gstack && bun run build
Or copy the binary directly: cp browse/dist/browse ~/.claude/skills/gstack/browse/dist/browse