From f8e527ae0db26214f0b244d915a30ced98bf2cd2 Mon Sep 17 00:00:00 2001
From: Facundo
Date: Mon, 6 Apr 2026 10:08:37 -0700
Subject: [PATCH] docs(launch): reposition README + cookbook + sanitization docs
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The 5 code PRs (#2-#6) deliver the engineering for the launch. This PR
delivers everything user-facing: the README pitch, the package metadata,
the docs/ folder, and the CHANGELOG.

This PR depends on PRs #2-#6 being merged first because the new README
references commands those PRs add (sow sandbox, sow doctor , the
--allow-unsafe flag, the sub-second reset). Land them first, rebase this
against main, then merge.

README.md
  Hero rewritten from "Safe test databases from production Postgres" to
  "Stop letting Claude touch your prod database". Body explains the
  anxiety-reduction pitch: a coding agent is about to do something
  database-adjacent and you feel that quiet pang. sow is the safety layer.
  New "Why sow" section. New "How It Works" diagram showing the template-DB
  shape (one container per connector, N branch DBs, reset in <1s). New
  "Cookbook" stub linking to docs/cookbook.md. New "Documentation" section
  with the docs/ index.

packages/cli/package.json
  Description: "Stop letting Claude touch your prod database. PII-safe
  local Postgres sandbox for coding agents."
  Keywords: added ai-agents, coding-agents, claude-code, cursor, sandbox,
  mcp.

packages/core/package.json
  Description: "sow core engine — analyze, sample, sanitize, and branch
  Postgres databases for safe coding-agent sandboxes"
  Keywords: added ai-agents, coding-agents, sandbox.

packages/mcp/package.json
  Description corrected from "15 tools" to "22 tools" (the actual count in
  packages/mcp/src/index.ts) and repositioned: "sow MCP server — 22 tools
  for coding agents (Claude Code, Cursor, Codex) to safely manage Postgres
  sandboxes"
  Keywords: added claude-code, cursor, codex, coding-agents, sandbox.

docs/sandbox.md (new)
  The sow sandbox flagship command — what it does, the flags, the
  .env.local backup/revert flow, when not to use it, and what's actually
  in the sandbox.

docs/sanitization.md (new)
  What sow sanitizes (the PII type table), how JSONB walking works, the
  fail-closed gate, the --allow-unsafe escape hatch, custom rules via
  .sow.yml, what sow does NOT do (free-text NER, etc.), and the
  read-only-on-the-source guarantee.

docs/cookbook.md (new)
  Three end-to-end workflows with concrete prompts:
    1. Let Claude refactor your schema without fear
    2. Let Cursor generate seed data for a new feature
    3. Let your coding agent debug a failing migration
  Plus the "agent reset loop" pattern diagram, the MCP tool list, and
  operational tips (one long-running sandbox per project, checkpoints for
  known-good states, sow doctor as the inspection surface, principle of
  least privilege on the source DB user).

CHANGELOG.md (new)
  Scaffold with three sections:
    - [Unreleased] documenting the planned PR #2-#6 features under
      Added/Changed
    - [0.1.14] documenting the SQL injection security fix that already
      shipped (PR #1, merged earlier in the session)
    - [0.1.13] one-line summary of the initial public release

Test/build/lint all clean (89/89 tests, 3/3 packages built, no source code
changed in this PR).
Co-Authored-By: Claude Opus 4.6 (1M context)
---
 CHANGELOG.md               | 117 ++++++++++++++++++++++++++
 README.md                  |  76 ++++++++++++-----
 docs/cookbook.md           | 167 +++++++++++++++++++++++++++++++++++++
 docs/sandbox.md            |  74 ++++++++++++++++
 docs/sanitization.md       | 159 +++++++++++++++++++++++++++++++++++
 packages/cli/package.json  |  10 ++-
 packages/core/package.json |   7 +-
 packages/mcp/package.json  |   8 +-
 8 files changed, 590 insertions(+), 28 deletions(-)
 create mode 100644 CHANGELOG.md
 create mode 100644 docs/cookbook.md
 create mode 100644 docs/sandbox.md
 create mode 100644 docs/sanitization.md

diff --git a/CHANGELOG.md b/CHANGELOG.md
new file mode 100644
index 0000000..9028694
--- /dev/null
+++ b/CHANGELOG.md
@@ -0,0 +1,117 @@
+# Changelog
+
+All notable changes to sow are documented here. The format is loosely based on
+[Keep a Changelog](https://keepachangelog.com/en/1.1.0/), and the project follows
+[Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+## [Unreleased]
+
+The next release lands the launch positioning ("Stop letting Claude touch your
+prod database") plus the rest of the eng-review plan as five parallel PRs:
+
+### Added (planned)
+
+- **`sow sandbox`** — flagship zero-config command. Auto-detects your project's
+  Postgres source, samples + sanitizes, spins up a local sandbox, and patches
+  `.env.local` with the new `DATABASE_URL`. One command from clone to working
+  sandbox. (PR #4)
+- **`sow env revert`** — restores `.env.local` from the `.env.local.sow.bak`
+  backup that `sow sandbox` writes. (PR #4)
+- **JSONB sanitization.** sow now walks JSONB columns recursively and replaces
+  values whose key matches a PII pattern. Closes the biggest PII leak vector in
+  modern Postgres schemas. (PR #3)
+- **Postgres type coverage.** Built-in transformers for `inet`, `cidr`,
+  `macaddr`, `macaddr8`, plus passthrough handling for `bytea`, `xml`, `money`,
+  `interval`, range types, array types, and custom enums. (PR #3)
+- **`--allow-unsafe` flag.** sow's sanitizer is now fail-closed: it aborts
+  `sow connect` if it sees a Postgres type it can't verify. Pass `--allow-unsafe`
+  to NULL out unhandled columns instead. (PR #3)
+- **`sow doctor `** — drill into a single connector's referential
+  integrity warnings. Surfaces orphaned FKs, transient read errors, and
+  sanitization warnings. (PR #6)
+- **Tag-driven release workflow.** New `version-bump.yml` workflow lets you cut
+  a major/minor/patch/prerelease via the GitHub Actions UI; the existing
+  `release.yml` is now triggered only by tag pushes (not every merge to main).
+  Prevents accidental releases on README typos. (PR #5)
+
+### Changed (planned)
+
+- **`sow branch reset` is now sub-second** on a 10k-row schema. Refactored the
+  Docker provider to use Postgres template databases (one long-lived container
+  per connector, N branch databases inside). Old reset path was 5-15s; new path
+  is ~200-800ms. Enables tight agent reset loops (50 iterations in a minute).
+  (PR #2)
+- **Sampler integrity warnings** — the referential-integrity pass now collects
+  structured warnings (`parent_fetch_failed`, `parent_not_found`,
+  `child_fetch_failed`, `implicit_ref_fetch_failed`) instead of silently
+  swallowing them in `catch {}` blocks. Surfaced via `sow doctor `.
+  (PR #6)
+- **Implicit reference resolution is now batched.** The sampler used to fire
+  one query per (source_table, source_column) pair when resolving implicit FKs;
+  it now collects missing ids by target table across all sources and fires one
+  `IN (...)` query per target. ~10x reduction in `sow connect` round-trips on a
+  50-table schema. (PR #6)
+- **Skip-list for implicit references is now dynamic.** The old hardcoded
+  English-only `["id", "user_id", "owner_id", "created_by"]` set is replaced
+  with a dynamic check against the actual formal Relationships from the
+  schema. Works for non-English column names and unusual FK layouts. (PR #6)
+- **MCP tool count corrected.** Package descriptions now correctly state 22
+  tools (was: incorrectly listed as 15).
+- **README repositioned** around "Stop letting Claude touch your prod database"
+  with new sections on the agent reset loop, the cookbook of three workflows,
+  and a docs index.
+
+## [0.1.14] — 2026-04-06
+
+### Fixed
+
+- **SQL injection across the sampler and branching layer (security).** A class
+  of bugs where dynamic SQL was built by string-interpolating values from
+  sampled source data has been closed. Seven call sites parameterized:
+  - `packages/core/src/sampler/referential.ts` — three formal-FK and
+    implicit-reference call sites (regression: a text PK like `O'Brien` used
+    to crash silently and drop the parent row)
+  - `packages/core/src/branching/manager.ts:getBranchSample` — the `table`
+    argument from user/agent input is now `quoteIdent`-quoted, the `limit` is
+    bound via `$1`
+  - `packages/core/src/branching/providers/supabase.ts:fetchAuthUserMappings`
+    — the `IN (...)` clause now uses `$1, $2, ...` placeholders, batched at
+    1000 ids per query, with UUID-shape pre-filter
+  - `packages/core/src/branching/supabase.ts` — eight RLS DDL and auth-user
+    INSERT/DELETE sites now use parameterized values and `quoteIdent`
+    identifiers
+- **`packages/core/src/adapters/postgres.ts`** — the `query()` method's
+  `params` argument was previously declared in the interface but silently
+  dropped at runtime (`_params?: unknown[]`). Now actually passes through to
+  `postgres@3`'s `sql.unsafe(query, parameters)` for real bind-parameter
+  safety.
+- **Fail-safe RLS setup in the Supabase provider.** A previous structure
+  could DISABLE row-level security on a table when a transient introspection
+  error occurred during sandbox setup. RLS introspection now lives in its own
+  per-table try block that `continue`s on error rather than falling into the
+  policy-disable fallback path.
+- **Identifier quoting helper** — new `packages/core/src/sql/identifiers.ts` + exports `quoteIdent()`, the SQL-standard double-quote escape used wherever + table or column names are interpolated into dynamic SQL. Throws on empty + identifiers and embedded NUL bytes. +- **`sow branch sample` limit clamping** — accepts `LIMIT 0` (a valid request + for an empty result set), falls back to the documented default of 5 for + non-finite inputs, and clamps the upper bound at 100. + +### Tests + +- 89 unit tests passing. 10 new regression tests in + `packages/core/src/sampler/referential.test.ts` covering `quoteIdent` + edge cases, the `O'Brien` single-quote regression, composite FK + parameterization, and hostile-payload defense. +- Cross-model adversarial review (Claude + Codex) — both passes clean, + Codex structured P1 gate passed. + +## [0.1.13] — earlier + +Initial public release. Functional CLI, MCP server, Docker-backed branches, +deterministic PII sanitization, schema introspection, edge-case sampling, +checkpoint save/load, branch diff. Auto-detection from env files and the +common ORMs (Prisma, Drizzle, Knex, TypeORM, Sequelize, Docker Compose). +Provider hints for Supabase, Neon, Vercel Postgres, and Railway. + diff --git a/README.md b/README.md index 32e60fd..368f27a 100644 --- a/README.md +++ b/README.md @@ -9,7 +9,7 @@ ╚══════╝ ╚═════╝ ╚══╝╚══╝ ``` -**Safe test databases from production Postgres.** +**Stop letting Claude touch your prod database.** [![GitHub stars](https://img.shields.io/github/stars/Bugsterapp/sow)](https://github.com/Bugsterapp/sow) [![npm version](https://img.shields.io/npm/v/@sowdb/cli)](https://www.npmjs.com/package/@sowdb/cli) @@ -20,35 +20,41 @@ -sow connects to your production Postgres, samples representative data with edge cases, replaces all PII with realistic fakes, and gives you isolated database branches that start in seconds. 100% local, zero API calls, zero cost. 
+You're using Claude Code or Cursor against a real codebase with a real database. Every time the agent is about to do something database-adjacent, you feel that quiet pang of "wait, should I let it do that?" + +sow is the safety layer. One command points it at your prod Postgres, samples the data, scrubs every PII column with realistic fakes, and gives your coding agent a sandboxed local copy to hammer. Prod never gets touched. The sandbox runs in seconds, resets in under one. 100% local. Zero API calls. Zero cost. Never writes to your source database. ## Install & First Use ```bash npm install -g @sowdb/cli -sow connect postgresql://user:pass@host:5432/mydb -sow branch create my-feature -# -> postgresql://sow:sow@localhost:54320/sow +cd your-project +sow sandbox ``` -## Why sow? +`sow sandbox` auto-detects your database from your project's env files, samples it, sanitizes PII, and patches `.env.local` with a safe `DATABASE_URL`. Now any coding agent on your laptop talks to the sandbox instead of prod. + +## Why sow -- **PII Safe** — All personal data is detected and replaced with realistic fakes. -- **Agent-First** — MCP server, `--json` mode, SKILL.md for agent context. -- **Fast** — First snapshot in 30-60s. Branches in ~5s. Resets in ~1s. -- **Checkpoints** — Save and restore branch state instantly. -- **Diff** — See exactly what changed: rows added, deleted, modified, schema changes. -- **Deterministic** — Same seed produces identical output every time. -- **Read-Only** — sow never writes to your source database. -- **Auto-Detect** — Scans .env files, Prisma, Drizzle, Knex, TypeORM, Sequelize, Docker Compose. +- **Built for coding agents.** MCP server with 22 tools, `--json` mode for every command, `SKILL.md` for agent context, deterministic seeds so bugs reproduce across sessions. +- **PII-safe by default.** Detects emails, phones, names, addresses, SSNs, JSONB-embedded fields. 
Fail-closed: aborts if it sees a Postgres type it can't verify, with `--allow-unsafe` to override explicitly. +- **Reset in under 1 second.** Postgres template-database backed. Your agent can try a destructive change, verify the result, reset, try again — 50 iterations in a minute. +- **Zero config.** Auto-detects env files, Prisma, Drizzle, Knex, TypeORM, Sequelize, Docker Compose. Identifies Supabase, Neon, Vercel Postgres, and Railway projects. +- **Read-only on the source.** sow never writes to your production database. Parameterized queries, identifier escaping, and a security-audited code path verified by both Claude and Codex adversarial review. +- **100% local.** No cloud round-trip, no third party holding your sanitized data, no account, no API key. The sandbox lives on your laptop. ## Quick Start ```bash +# Zero-config: detect your DB, sample, sanitize, patch .env.local +sow sandbox + +# Or do it explicitly sow connect postgresql://user:pass@host:5432/mydb # analyze, sample, sanitize sow branch create my-feature # isolated Postgres in ~5s DATABASE_URL=postgresql://sow:sow@localhost:54320/sow npm run dev -sow branch diff my-feature # see what changed +sow branch reset my-feature # back to seed state in <1s +sow branch diff my-feature # see what your agent changed sow branch delete my-feature # clean up ``` @@ -56,10 +62,10 @@ sow branch delete my-feature # clean up ```bash npm install -g @sowdb/mcp -sow mcp --agent cursor # or claude-code, windsurf, codex +sow mcp --agent claude-code # or cursor, windsurf, codex ``` -Or add manually to your MCP config: +Or add to your MCP config manually: ```json { @@ -75,26 +81,50 @@ Install the agent skill for context: npx skills add Bugsterapp/sow ``` +The MCP server exposes 22 tools: `sow_sandbox`, `sow_connect`, `sow_detect`, `sow_branch_create`, `sow_branch_reset`, `sow_branch_diff`, `sow_branch_save`, `sow_branch_load`, `sow_branch_exec`, `sow_branch_users`, `sow_branch_tables`, `sow_branch_sample`, and more. 
Every tool returns structured JSON. Agents drive the full sample → branch → exec → diff → reset loop without a human in the middle. + ## How It Works ``` -Production DB sow Pipeline Local Branches +Production DB sow Pipeline Local Sandbox ┌──────────┐ ┌──────────────────────┐ ┌──────────────┐ │ Schema │ │ 1. Analyze │ │ Branch A │ - │ Stats │────>│ 2. Sample (N rows) │────>│ :54320 │ + │ Stats │────>│ 2. Sample (N rows) │────>│ :54320/A │ │ Data │ │ 3. Sanitize PII │ │ │ │ (read │ │ 4. Save snapshot │ │ Branch B │ - │ only) │ │ (~2 MB) │ │ :54321 │ - └──────────┘ └──────────────────────┘ └──────────────┘ - Provider-managed + │ only) │ │ (~2 MB) │ │ :54320/B │ + └──────────┘ └──────────────────────┘ │ │ + │ Branch C │ + │ :54320/C │ + └──────────────┘ + One container + per connector, + N branch DBs, + reset in <1s. ``` +## Cookbook + +Three workflows that show the full agent loop. See [`docs/cookbook.md`](docs/cookbook.md) for the prompts and full walkthrough. + +1. **Let Claude refactor your schema without fear** — `sow sandbox`, then ask Claude to add a column, drop an index, rename a table. Verify, reset, try a different approach. +2. **Let Cursor generate seed data for a new feature** — point your agent at the sandbox and ask for "100 realistic users with orders." Inspect with `sow branch sample`. Reset and ask for a different distribution. +3. **Let your coding agent debug a failing migration** — replay your last migration on the sandbox. If it fails, reset and try a fix. No prod risk. 
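
The "verify, reset, try a different approach" loop that the three workflows share can be sketched roughly as follows. This is an illustration only, not code from sow: `tryUntilClean`, the `Run` type, and the fake runner are hypothetical names introduced here, and the only sow commands used are the `sow branch exec ... --file`, `sow branch diff`, and `sow branch reset` invocations shown in this patch's docs.

```typescript
// Hypothetical sketch, not part of sow. `Run` is an injected command runner
// (an assumption) so the loop can be exercised without sow or Docker.
type Run = (command: string) => { ok: boolean; output: string };

// Try each candidate SQL file in turn; reset the branch after every failure.
function tryUntilClean(files: string[], run: Run, branch: string = "sandbox"): string | null {
  for (let i = 0; i < files.length; i++) {
    const attempt = run("sow branch exec " + branch + " --file " + files[i]);
    if (attempt.ok) {
      run("sow branch diff " + branch); // review what the attempt changed
      return files[i];
    }
    run("sow branch reset " + branch); // back to seed state before retrying
  }
  return null; // no candidate succeeded
}

// Fake runner for demonstration: the first attempt fails, the second succeeds.
const log: string[] = [];
const fakeRun: Run = (command) => {
  log.push(command);
  const fails = command.indexOf("attempt-1.sql") !== -1;
  return { ok: !fails, output: fails ? "ERROR" : "OK" };
};

const winner = tryUntilClean(["attempt-1.sql", "attempt-2.sql"], fakeRun);
console.log(winner, log);
```

In real use the runner would shell out to the CLI or call the matching MCP tool; injecting it keeps the control flow easy to reason about and test in isolation.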
+ +## Documentation + +- [`docs/sandbox.md`](docs/sandbox.md) — the `sow sandbox` flagship command, flags, and `.env.local` patching with backup/revert +- [`docs/sanitization.md`](docs/sanitization.md) — what sow sanitizes, the fail-closed gate, JSONB handling, and the `--allow-unsafe` flag +- [`docs/cookbook.md`](docs/cookbook.md) — three end-to-end workflows for coding agents +- [`CHANGELOG.md`](CHANGELOG.md) — release history +- [`CONTRIBUTING.md`](CONTRIBUTING.md) — building from source, running tests, the lane structure + ## sow Cloud — coming soon sow CLI is free, open source, and works 100% locally. Always will be. -sow Cloud is for teams: shared connectors, CI/CD without Docker-in-Docker, compliance (data never touches dev laptops), and a team dashboard. +sow Cloud is for teams: shared connectors, CI/CD without Docker-in-Docker, compliance (sanitized data never touches dev laptops), and a team dashboard. [Join the waitlist →](https://tally.so/r/0QvzZN) diff --git a/docs/cookbook.md b/docs/cookbook.md new file mode 100644 index 0000000..135bc11 --- /dev/null +++ b/docs/cookbook.md @@ -0,0 +1,167 @@ +# Cookbook + +Three end-to-end workflows that show what sow actually unlocks. Every workflow assumes you've installed sow and have a project with a Postgres database. + +```bash +npm install -g @sowdb/cli +cd your-project +sow sandbox +``` + +After that, your `.env.local` has `DATABASE_URL` pointing at the local sandbox. Your coding agent reads it like any other env var. + +--- + +## 1. Let Claude refactor your schema without fear + +**The scenario.** You want Claude Code to add a column, drop an unused index, rename a poorly-named table. The kind of work you'd never let an agent do against prod, but the kind that's safe and useful in a sandbox. + +**The setup.** + +```bash +cd your-project +sow sandbox +``` + +`.env.local` now has `DATABASE_URL=postgresql://sow:sow@localhost:54320/sow_sandbox`. 
Your existing migration tooling (Prisma, Drizzle, Knex, raw SQL — doesn't matter) reads from there. + +**The prompt.** Open Claude Code in the project. Ask it: + +> Look at the current schema in our Prisma file. The `user_profiles.bio_text` column is going unused. Add a migration to drop it, then run the migration against the sandbox to verify it works. If it breaks something, tell me what. + +**What happens.** + +1. Claude reads `prisma/schema.prisma`, identifies the `bio_text` column. +2. Claude runs `npx prisma migrate dev --name drop_user_profiles_bio_text` against your sandbox. +3. The migration executes against the local sandbox Postgres. Prod is untouched. +4. Claude reports back: "Migration ran cleanly. Verified by running `prisma migrate status`. All existing tests still pass." + +**If it breaks something:** + +```bash +sow branch reset sandbox # back to seed state in <1s +``` + +Now Claude can try a different approach with a clean slate. Five iterations in a minute. Without sow, every "let me try a different migration" round-trip would either be against a stale local copy (data drift) or against staging (pollution). + +--- + +## 2. Let Cursor generate seed data for a new feature + +**The scenario.** You're shipping a new "team workspaces" feature. You need realistic test data: 100 users, ~30 teams, each user belonging to 1-3 teams, with realistic email distributions and signup dates spread over 6 months. + +Writing this seed script by hand is tedious. Letting an agent do it against the *real* user table in staging is unsafe (it pollutes the table for everyone else, and the real users have constraints you don't want to violate). + +**The setup.** + +```bash +sow sandbox +``` + +**The prompt.** Open Cursor in the project. Ask it: + +> Look at the `users`, `teams`, and `team_memberships` tables in our schema. Write a SQL script that inserts 100 realistic users, 30 teams, and team memberships such that each user belongs to 1-3 teams. 
Use realistic email distributions and spread signup dates over the last 6 months. Run it against the sandbox using `sow branch exec`. + +**What happens.** + +1. Cursor reads the schema, understands the foreign key relationships. +2. Cursor writes `seeds/team_workspaces.sql` with the inserts. +3. Cursor runs `sow branch exec sandbox --file seeds/team_workspaces.sql`. +4. Sandbox now has 100 users + 30 teams + ~200 memberships. Real users in staging are untouched. + +**Inspect what got created:** + +```bash +sow branch sample sandbox users +sow branch sample sandbox teams +sow branch tables sandbox # row counts for every table +``` + +**Don't like the distribution?** + +```bash +sow branch reset sandbox +``` + +And ask Cursor to try a different approach. + +--- + +## 3. Let your coding agent debug a failing migration + +**The scenario.** Your last migration broke something in CI. You don't know exactly what — it ran fine locally, fails on staging. You want to replay it against a sandbox built from the actual prod schema (not your stale local copy) and have the agent figure out what's wrong. + +**The setup.** + +```bash +cd your-project +sow sandbox # samples from prod, gives you a fresh sandbox +``` + +The sandbox now has the *current* prod schema, not the schema you had locally last week. + +**The prompt.** Open Claude Code: + +> Our migration `2026_04_06_add_team_workspaces.sql` is failing in CI but I can't reproduce it locally. Run it against the sandbox using `sow branch exec` and tell me the exact error. Then fix the migration so it works. + +**What happens.** + +1. Claude runs `sow branch exec sandbox --file db/migrations/2026_04_06_add_team_workspaces.sql`. +2. Postgres returns the actual error (e.g. `ERROR: column "user_id" referenced in foreign key constraint does not exist`). +3. Claude reads the migration, sees the bug (maybe a typo, maybe a missing prerequisite column). +4. 
Claude proposes a fix and runs it: `sow branch reset sandbox && sow branch exec sandbox --file db/migrations/2026_04_06_add_team_workspaces.sql`. +5. Iterates until the migration runs cleanly. + +**Verify what changed:** + +```bash +sow branch diff sandbox +``` + +Shows you exactly which tables, columns, indexes, and rows the migration touched. You see the same diff Claude saw. + +--- + +## Pattern: the agent reset loop + +Every workflow above follows the same loop: + +``` +┌──────────────────────────────────────────┐ +│ 1. Agent does something destructive │ +│ sow branch exec sandbox ... │ +│ │ +│ 2. Agent verifies the result │ +│ sow branch diff sandbox │ +│ sow branch sample sandbox │ +│ │ +│ 3. Wrong? Reset and try again │ +│ sow branch reset sandbox (~200ms) │ +│ │ +│ 4. Right? Move on, prod still untouched │ +└──────────────────────────────────────────┘ +``` + +The reset is the magic. Without it, "let me try a different approach" means "let me clobber my stale local copy and hope I remember to refresh it." With it, every attempt starts from a clean, sanitized, prod-shaped database. + +## MCP tools your agent can call directly + +If your agent supports MCP (Claude Code, Cursor, Windsurf, Codex), `sow mcp --agent ` configures it to call sow's tools directly without any shell-out. The 22 tools cover the full loop: + +- `sow_sandbox` — the flagship zero-config flow +- `sow_detect`, `sow_connect`, `sow_connector_list/refresh/delete` +- `sow_branch_create/list/info/delete/reset/diff/exec/sample/tables/users/env` +- `sow_branch_save/load` (named checkpoints — like git commits for your sandbox) + +Every tool returns structured JSON. Every tool is idempotent where it can be. Every tool is documented so the agent picks the right one without prompting. + +## Tips + +**Keep one long-running sandbox per project.** Don't `sow branch delete sandbox` between sessions — the reset is fast, the recreate is fast, but reusing keeps the connector and Docker container warm. 
+ +**Use checkpoints for "known good states."** Mid-debug, run `sow branch save sandbox before-fix`. After a few attempts, `sow branch load sandbox before-fix` brings you back. Like `git stash` for databases. + +**Use `sow doctor sandbox` if something feels off.** It surfaces sanitization warnings, integrity warnings, and snapshot stats so you can tell whether the sandbox shape matches prod. + +**Don't `sow connect` with a wide-permission user.** Even though sow is read-only, the principle of least privilege applies. Create a read-only Postgres user just for sow. + diff --git a/docs/sandbox.md b/docs/sandbox.md new file mode 100644 index 0000000..8af1119 --- /dev/null +++ b/docs/sandbox.md @@ -0,0 +1,74 @@ +# `sow sandbox` — the flagship command + +`sow sandbox` is the one-command zero-config flow. Run it inside any project that has a Postgres database, and you get a local sanitized sandbox with `DATABASE_URL` already wired up. + +```bash +cd your-project +sow sandbox +``` + +That's it. Your coding agent (Claude Code, Cursor, Codex, anything that reads `DATABASE_URL` from the environment or `.env.local`) now talks to a local Postgres copy with PII scrubbed. Prod is untouched. + +## What it does, in order + +1. **Detects your source database.** Scans `.env`, `.env.local`, Prisma `schema.prisma`, Drizzle config, Knex config, TypeORM config, Sequelize config, `docker-compose.yml`, and `package.json` for a `DATABASE_URL` or equivalent. Identifies Supabase, Neon, Vercel Postgres, and Railway projects via the env vars they use. +2. **Reuses an existing connector if one is set up,** or runs `sow connect` against the detected URL. The connect step samples representative rows (default 200 per table, with edge cases), scrubs every PII column with deterministic Faker output, and saves a snapshot to `~/.sow/snapshots//init.sql`. +3. **Creates a branch** named `sandbox` (override with `--name`). 
On first run for this connector, this spins up a long-lived Docker Postgres container holding a frozen seed database plus your branch database. On subsequent runs, branches are cloned from the seed in under 1 second. +4. **Patches `.env.local`** with the new `DATABASE_URL` and `SOW_BRANCH=sandbox`. Other variables in the file are preserved. A backup is written to `.env.local.sow.bak` so you can revert. +5. **Prints the connection string** and a one-line confirmation: + ``` + ✓ Sandbox ready at :54320/sow_sandbox + DATABASE_URL=postgresql://sow:sow@localhost:54320/sow_sandbox + Patched .env.local (backup: .env.local.sow.bak) + ``` + +Run your dev server normally — `npm run dev`, `bun dev`, whatever you already use — and your app reads from the sandbox. + +## Flags + +| Flag | Default | Purpose | +|---|---|---| +| `[url]` (positional) | auto-detected | Override the source connection string | +| `--name ` | `sandbox` | Branch name | +| `--env-file ` | `.env.local` | Which env file to patch | +| `--no-env-file` | off | Skip the env patch — just print the URL | +| `--yes` / `-y` | off | Skip the interactive confirmation prompt | +| `--max-rows ` | 200 | Rows per table during sampling | +| `--seed ` | 42 | Reproducibility seed | +| `--full` | off | Copy all rows instead of sampling | +| `--no-sanitize` | off | Skip PII sanitization (NOT recommended) | +| `--allow-unsafe` | off | Allow Postgres types sow doesn't recognize (see [`sanitization.md`](sanitization.md)) | +| `--json` | off | JSON output for agent consumption | +| `--quiet` / `-q` | off | Minimal output | + +## Reverting + +If you want to undo the `.env.local` patch and restore the original file: + +```bash +sow env revert +``` + +This reads `.env.local.sow.bak` and writes it back to `.env.local`, then deletes the backup. 
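
The patch-and-backup behaviour described above can be sketched as a pure function over the file's text. This is a minimal sketch under stated assumptions, not sow's actual implementation: `patchEnv` and its line-matching rule are hypothetical, and only the variable names (`DATABASE_URL`, `SOW_BRANCH`), the backup file name, and the "other variables are preserved" behaviour come from this document.

```typescript
// Hypothetical sketch, not sow's real code. Replaces or appends KEY=value
// lines in dotenv-style text while leaving every other line untouched.
function patchEnv(original: string, updates: Record<string, string>): string {
  const seen: Record<string, boolean> = {};
  const lines = original.split("\n").map((line) => {
    // Only lines that start with "KEY=" are candidates for replacement.
    const match = /^([A-Za-z_][A-Za-z0-9_]*)=/.exec(line);
    if (match && Object.prototype.hasOwnProperty.call(updates, match[1])) {
      seen[match[1]] = true;
      return match[1] + "=" + updates[match[1]];
    }
    return line; // comments, blanks, and unrelated variables are preserved
  });
  // Append any keys that were not already present in the file.
  Object.keys(updates).forEach((key) => {
    if (!seen[key]) lines.push(key + "=" + updates[key]);
  });
  return lines.join("\n");
}

const before = "# app config\nDATABASE_URL=postgresql://prod-host/app\nPORT=3000";
const backup = before; // the text a .env.local.sow.bak copy would hold
const sandboxVars = {
  DATABASE_URL: "postgresql://sow:sow@localhost:54320/sow_sandbox",
  SOW_BRANCH: "sandbox",
};
const after = patchEnv(before, sandboxVars);
console.log(after);
```

Because the function is idempotent, patching an already-patched file returns the same text, which is one way to get the "skipped if already correct" behaviour on re-runs. Reverting is then a matter of writing the backup text back and removing the backup file.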
+ +## Re-running + +Running `sow sandbox` again when a sandbox already exists: + +- Reuses the existing connector (no re-sampling) +- Reuses the existing branch (no re-creation) +- Re-patches `.env.local` if needed (skipped if already correct) +- Exits in under a second + +If you want a fresh sandbox with new sampled data, run `sow connector refresh sandbox` first. + +## When NOT to use `sow sandbox` + +- You want to create *multiple* differently-named branches (use `sow branch create ` directly) +- You want to point at a specific non-detected source URL once and don't want it stored as a connector (use `sow connect ` then `sow branch create`) +- You're running in CI and don't want the `.env.local` patch (use `sow connect && sow branch create dev --env-file ci.env --yes`) + +## What's actually in the sandbox + +Run `sow doctor sandbox` to see snapshot stats and any sanitization warnings. Run `sow branch tables sandbox` to list tables with row counts. Run `sow branch sample sandbox
` to peek at a table's first few rows (the values are sanitized: emails are Faker emails, names are Faker names, etc., but the *shape* matches your real data).
+
diff --git a/docs/sanitization.md b/docs/sanitization.md
new file mode 100644
index 0000000..c7eeb9a
--- /dev/null
+++ b/docs/sanitization.md
@@ -0,0 +1,159 @@
+# Sanitization
+
+sow's job is to give your coding agent a database that *looks like prod* but contains *zero real PII*. This document explains exactly what sow scrubs, what it doesn't, and how to add custom rules.
+
+## What gets sanitized automatically
+
+sow runs every column through two detectors before sampling:
+
+1. **Type-based detection.** Some Postgres types are inherently sensitive: `inet`, `cidr`, `macaddr`, `macaddr8`. sow has built-in transformers for each.
+2. **Name-based detection.** Column names are matched against patterns for these PII categories:
+
+| Category | Example column names | Transformer |
+|---|---|---|
+| Email | `email`, `email_address`, `user_email` | Faker `internet.email()` |
+| Phone | `phone`, `phone_number`, `mobile`, `cell` | Faker `phone.number()` |
+| Name | `first_name`, `last_name`, `full_name`, `name` | Faker `person.firstName/lastName` |
+| Address | `address`, `street`, `street_address` | Faker `location.streetAddress()` |
+| SSN | `ssn`, `social_security_number` | Faker formatted SSN |
+| Credit card | `credit_card`, `card_number`, `cc_number` | Faker `finance.creditCardNumber()` |
+| IP address | `ip`, `ip_address` | Faker IPv4 or IPv6 |
+| MAC address | `mac`, `mac_address` | Faker `internet.mac()` |
+| URL | `url`, `website` | Faker `internet.url()` |
+| UUID | `id`, `*_id` (when type is `uuid`) | Faker `string.uuid()` |
+| Date of birth | `dob`, `date_of_birth`, `birthday` | Faker `date.birthdate()` ±30 days |
+| Password hash | `password`, `password_hash`, `encrypted_password` | bcrypt hash of `password123` |
+| Free text | `bio`, `description`, `notes` (when no other rule applies) | Faker `lorem.paragraph()` |
+
+Every transformer is **deterministic**: the same input value always produces the same fake output. This means foreign keys stay consistent across tables — if `users.email = "alice@corp.com"` is referenced by `audit_log.actor_email`, both get the same Faker replacement.
+
+## JSONB columns
+
+JSONB is the most common PII leak vector in modern Postgres schemas. A `profiles.metadata::jsonb` column might contain:
+
+```json
+{
+  "email": "alice@corp.com",
+  "phone": "+1-555-0100",
+  "preferences": { "theme": "dark", "newsletter": true },
+  "contact": { "billing_email": "alice@corp.com" }
+}
+```
+
+sow walks the JSONB structure recursively and replaces values whose **key** matches a PII pattern. The example above becomes:
+
+```json
+{
+  "email": "",
+  "phone": "",
+  "preferences": { "theme": "dark", "newsletter": true },
+  "contact": { "billing_email": "" }
+}
+```
+
+Scalar JSONB (a bare string, number, or null) is passed through. Arrays of objects are walked element-by-element. Invalid JSON passes through unchanged with a warning.
+
+## The fail-closed gate
+
+sow refuses to proceed when a column has a Postgres type it has no explicit handler for. If your schema has:
+
+- A `tsvector` column (full-text search)
+- A custom enum type
+- An `hstore` column
+- A `pg_lsn` or other system type
+
+...sow will abort `sow connect` with a clear error:
+
+```
+Sanitization aborted — 2 columns have types sow cannot verify:
+  - audit.tags (tsvector) — no tsvector handler configured
+  - users.role (user_role) — custom enum type
+
+These columns would be copied to the sandbox AS-IS, potentially leaking
+PII that exists in them. Pass --allow-unsafe to skip sanitization of
+these columns (they will be NULLed out in the branch).
+
+To add explicit handling, edit .sow.yml:
+  sanitize:
+    rules:
+      - table: audit
+        column: tags
+        type: free_text
+```
+
+This is the **fail-closed default**: anxiety reduction is the whole pitch, and silently passing unknown types through would break it.
+
+## The `--allow-unsafe` escape hatch
+
+When you know what you're doing and want to proceed anyway:
+
+```bash
+sow connect --allow-unsafe postgres://...
+sow sandbox --allow-unsafe
+```
+
+With this flag, columns of unknown types are **NULLed out** in the sandbox (not passed through!). The user is saying "I know there may be gaps; strip those columns to NULL rather than leaking them." A warning summary is printed and surfaces in `sow doctor `.
+
+## Custom rules
+
+You can override or extend the built-in detection with a `.sow.yml` file in your project root:
+
+```yaml
+sanitize:
+  enabled: true
+  rules:
+    # Sanitize a column the auto-detector missed
+    - table: audit
+      column: actor_email
+      type: email
+
+    # Explicitly pass a custom enum through unchanged
+    - table: users
+      column: role
+      type: passthrough  # don't touch; this is fine to copy as-is
+
+    # Treat a tsvector column as free-text
+    - table: posts
+      column: search_index
+      type: free_text
+
+  # Skip these columns entirely (they will appear in the sandbox unchanged)
+  skip_columns:
+    - users.created_at
+    - users.id
+```
+
+## Inspecting what was sanitized
+
+After `sow connect`, sow records every PII column it detected and every rule it applied in `~/.sow/snapshots//metadata.json`. To see them:
+
+```bash
+sow doctor
+```
+
+Output includes:
+- Column count, row count, snapshot size
+- PII columns detected (with the type sow assigned to each)
+- Any sanitization warnings (e.g. JSONB columns that failed to parse)
+- Any referential integrity warnings from the sampler (FK relationships that couldn't be fully resolved)
+
+## What sow does NOT do
+
+These are out-of-scope by design:
+
+- **Free-text PII detection.** sow does NOT scan a free-text field for embedded emails, phone numbers, or names. The whole field is replaced with Lorem Ipsum if it matches the `free_text` pattern. This is a known limitation; see the design doc TODO.
+- **Schema-level auditing.** sow doesn't tell you "your schema is leaking PII" or grade your data classification. It scrubs what it sees.
+- **Encryption.** Sanitization is replacement, not encryption. The sandbox is plaintext by design (your local agent needs to read it).
+- **Cloud relay.** sow runs 100% locally. PII never leaves your laptop. There is no "send to sow Cloud for processing" path.
+
+## Read-only on the source
+
+sow's source database access is **strictly read-only in intent and effect**:
+
+- All SQL is parameterized via `$1, $2, ...` placeholders. No string interpolation.
+- All identifiers (table and column names) are quoted using standard SQL identifier escaping (`quoteIdent`).
+- The connector code path was security-audited through adversarial review by both Claude and Codex (see the v0.1.14 security fix in the changelog).
+- sow never issues `INSERT`, `UPDATE`, `DELETE`, `DROP`, `ALTER`, `TRUNCATE`, or any DDL against the source database.
+
+If you point sow at a database with read-only credentials, it will still work. We recommend it.
+
diff --git a/packages/cli/package.json b/packages/cli/package.json
index bcfdba0..1122011 100644
--- a/packages/cli/package.json
+++ b/packages/cli/package.json
@@ -1,7 +1,7 @@
 {
   "name": "@sowdb/cli",
   "version": "0.1.14",
-  "description": "Safe test databases from production Postgres",
+  "description": "Stop letting Claude touch your prod database. PII-safe local Postgres sandbox for coding agents.",
   "type": "module",
   "license": "MIT",
   "repository": {
@@ -17,7 +17,13 @@
     "docker",
     "sow",
     "pii",
-    "sanitize"
+    "sanitize",
+    "ai-agents",
+    "coding-agents",
+    "claude-code",
+    "cursor",
+    "sandbox",
+    "mcp"
   ],
   "bin": {
     "sow": "dist/cli.js"
diff --git a/packages/core/package.json b/packages/core/package.json
index 376749b..027b6e9 100644
--- a/packages/core/package.json
+++ b/packages/core/package.json
@@ -1,7 +1,7 @@
 {
   "name": "@sowdb/core",
   "version": "0.1.14",
-  "description": "sow core engine: analyze, sample, sanitize, and branch Postgres databases",
+  "description": "sow core engine — analyze, sample, sanitize, and branch Postgres databases for safe coding-agent sandboxes",
   "type": "module",
   "license": "MIT",
   "repository": {
@@ -15,7 +15,10 @@
     "database",
     "sanitize",
     "pii",
-    "sow"
+    "sow",
+    "ai-agents",
+    "coding-agents",
+    "sandbox"
   ],
   "main": "./dist/index.js",
   "types": "./dist/index.d.ts",
diff --git a/packages/mcp/package.json b/packages/mcp/package.json
index 49bce66..4e7c5d2 100644
--- a/packages/mcp/package.json
+++ b/packages/mcp/package.json
@@ -1,7 +1,7 @@
 {
   "name": "@sowdb/mcp",
   "version": "0.1.14",
-  "description": "sow MCP server: 15 tools for AI agents to manage test database branches",
+  "description": "sow MCP server — 22 tools for coding agents (Claude Code, Cursor, Codex) to safely manage Postgres sandboxes",
   "type": "module",
   "license": "MIT",
   "repository": {
@@ -14,6 +14,12 @@
     "postgres",
     "test-data",
     "ai-agent",
+    "ai-agents",
+    "coding-agents",
+    "claude-code",
+    "cursor",
+    "codex",
+    "sandbox",
     "sow"
   ],
   "main": "./dist/index.js",