diff --git a/CHANGELOG.md b/CHANGELOG.md
new file mode 100644
index 0000000..9028694
--- /dev/null
+++ b/CHANGELOG.md
@@ -0,0 +1,117 @@
+# Changelog
+
+All notable changes to sow are documented here. The format is loosely based on
+[Keep a Changelog](https://keepachangelog.com/en/1.1.0/), and the project follows
+[Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+## [Unreleased]
+
+The next release lands the launch positioning ("Stop letting Claude touch your
+prod database") plus the rest of the eng-review plan as five parallel PRs:
+
+### Added (planned)
+
+- **`sow sandbox`** — flagship zero-config command. Auto-detects your project's
+  Postgres source, samples + sanitizes, spins up a local sandbox, and patches
+  `.env.local` with the new `DATABASE_URL`. One command from clone to working
+  sandbox. (PR #4)
+- **`sow env revert`** — restores `.env.local` from the `.env.local.sow.bak`
+  backup that `sow sandbox` writes. (PR #4)
+- **JSONB sanitization.** sow now walks JSONB columns recursively and replaces
+  values whose key matches a PII pattern. Closes the biggest PII leak vector in
+  modern Postgres schemas. (PR #3)
+- **Postgres type coverage.** Built-in transformers for `inet`, `cidr`,
+  `macaddr`, `macaddr8`, plus passthrough handling for `bytea`, `xml`, `money`,
+  `interval`, range types, array types, and custom enums. (PR #3)
+- **`--allow-unsafe` flag.** sow's sanitizer is now fail-closed: it aborts
+  `sow connect` if it sees a Postgres type it can't verify. Pass `--allow-unsafe`
+  to NULL out unhandled columns instead. (PR #3)
+- **`sow doctor `** — drill into a single connector's referential
+  integrity warnings. Surfaces orphaned FKs, transient read errors, and
+  sanitization warnings. (PR #6)
+- **Tag-driven release workflow.** New `version-bump.yml` workflow lets you cut
+  a major/minor/patch/prerelease via the GitHub Actions UI; the existing
+  `release.yml` is now triggered only by tag pushes (not every merge to main).
+  Prevents accidental releases on README typos. (PR #5)
+
+### Changed (planned)
+
+- **`sow branch reset` is now sub-second** on a 10k-row schema. Refactored the
+  Docker provider to use Postgres template databases (one long-lived container
+  per connector, N branch databases inside). Old reset path was 5-15s; new path
+  is ~200-800ms. Enables tight agent reset loops (50 iterations in a minute).
+  (PR #2)
+- **Sampler integrity warnings** — the referential-integrity pass now collects
+  structured warnings (`parent_fetch_failed`, `parent_not_found`,
+  `child_fetch_failed`, `implicit_ref_fetch_failed`) instead of silently
+  swallowing them in `catch {}` blocks. Surfaced via `sow doctor `.
+  (PR #6)
+- **Implicit reference resolution is now batched.** The sampler used to fire
+  one query per (source_table, source_column) pair when resolving implicit FKs;
+  it now collects missing ids by target table across all sources and fires one
+  `IN (...)` query per target. ~10x reduction in `sow connect` round-trips on a
+  50-table schema. (PR #6)
+- **Skip-list for implicit references is now dynamic.** The old hardcoded
+  English-only `["id", "user_id", "owner_id", "created_by"]` set is replaced
+  with a dynamic check against the actual formal Relationships from the
+  schema. Works for non-English column names and unusual FK layouts. (PR #6)
+- **MCP tool count corrected.** Package descriptions now correctly state 22
+  tools (was: incorrectly listed as 15).
+- **README repositioned** around "Stop letting Claude touch your prod database"
+  with new sections on the agent reset loop, the cookbook of three workflows,
+  and a docs index.
+
+## [0.1.14] — 2026-04-06
+
+### Fixed
+
+- **SQL injection across the sampler and branching layer (security).** A class
+  of bugs where dynamic SQL was built by string-interpolating values from
+  sampled source data has been closed. Seven call sites parameterized:
+  - `packages/core/src/sampler/referential.ts` — three formal-FK and
+    implicit-reference call sites (regression: a text PK like `O'Brien` used
+    to crash silently and drop the parent row)
+  - `packages/core/src/branching/manager.ts:getBranchSample` — the `table`
+    argument from user/agent input is now `quoteIdent`-quoted, the `limit` is
+    bound via `$1`
+  - `packages/core/src/branching/providers/supabase.ts:fetchAuthUserMappings`
+    — the `IN (...)` clause now uses `$1, $2, ...` placeholders, batched at
+    1000 ids per query, with UUID-shape pre-filter
+  - `packages/core/src/branching/supabase.ts` — eight RLS DDL and auth-user
+    INSERT/DELETE sites now use parameterized values and `quoteIdent`
+    identifiers
+- **`packages/core/src/adapters/postgres.ts`** — the `query()` method's
+  `params` argument was previously declared in the interface but silently
+  dropped at runtime (`_params?: unknown[]`). Now actually passes through to
+  `postgres@3`'s `sql.unsafe(query, parameters)` for real bind-parameter
+  safety.
+- **Fail-safe RLS setup in the Supabase provider.** A previous structure
+  could DISABLE row-level security on a table when a transient introspection
+  error occurred during sandbox setup. RLS introspection now lives in its own
+  per-table try block that `continue`s on error rather than falling into the
+  policy-disable fallback path.
+- **Identifier quoting helper** — new `packages/core/src/sql/identifiers.ts`
+  exports `quoteIdent()`, the SQL-standard double-quote escape used wherever
+  table or column names are interpolated into dynamic SQL. Throws on empty
+  identifiers and embedded NUL bytes.
+- **`sow branch sample` limit clamping** — accepts `LIMIT 0` (a valid request
+  for an empty result set), falls back to the documented default of 5 for
+  non-finite inputs, and clamps the upper bound at 100.
+
+### Tests
+
+- 89 unit tests passing. 10 new regression tests in
+  `packages/core/src/sampler/referential.test.ts` covering `quoteIdent`
+  edge cases, the `O'Brien` single-quote regression, composite FK
+  parameterization, and hostile-payload defense.
+- Cross-model adversarial review (Claude + Codex) — both passes clean,
+  Codex structured P1 gate passed.
+
+## [0.1.13] — earlier
+
+Initial public release. Functional CLI, MCP server, Docker-backed branches,
+deterministic PII sanitization, schema introspection, edge-case sampling,
+checkpoint save/load, branch diff. Auto-detection from env files and the
+common ORMs (Prisma, Drizzle, Knex, TypeORM, Sequelize, Docker Compose).
+Provider hints for Supabase, Neon, Vercel Postgres, and Railway.
+
diff --git a/README.md b/README.md
index 32e60fd..368f27a 100644
--- a/README.md
+++ b/README.md
@@ -9,7 +9,7 @@
 ╚══════╝ ╚═════╝ ╚══╝╚══╝
 ```

-**Safe test databases from production Postgres.**
+**Stop letting Claude touch your prod database.**

 [![GitHub stars](https://img.shields.io/github/stars/Bugsterapp/sow)](https://github.com/Bugsterapp/sow)
 [![npm version](https://img.shields.io/npm/v/@sowdb/cli)](https://www.npmjs.com/package/@sowdb/cli)
@@ -20,35 +20,41 @@
-sow connects to your production Postgres, samples representative data with edge cases, replaces all PII with realistic fakes, and gives you isolated database branches that start in seconds. 100% local, zero API calls, zero cost.
+You're using Claude Code or Cursor against a real codebase with a real database. Every time the agent is about to do something database-adjacent, you feel that quiet pang of "wait, should I let it do that?"
+
+sow is the safety layer. One command points it at your prod Postgres, samples the data, scrubs every PII column with realistic fakes, and gives your coding agent a sandboxed local copy to hammer. Prod never gets touched. The sandbox runs in seconds, resets in under one. 100% local. Zero API calls. Zero cost. Never writes to your source database.

 ## Install & First Use

 ```bash
 npm install -g @sowdb/cli
-sow connect postgresql://user:pass@host:5432/mydb
-sow branch create my-feature
-# -> postgresql://sow:sow@localhost:54320/sow
+cd your-project
+sow sandbox
 ```

-## Why sow?
+`sow sandbox` auto-detects your database from your project's env files, samples it, sanitizes PII, and patches `.env.local` with a safe `DATABASE_URL`. Now any coding agent on your laptop talks to the sandbox instead of prod.
+
+## Why sow

-- **PII Safe** — All personal data is detected and replaced with realistic fakes.
-- **Agent-First** — MCP server, `--json` mode, SKILL.md for agent context.
-- **Fast** — First snapshot in 30-60s. Branches in ~5s. Resets in ~1s.
-- **Checkpoints** — Save and restore branch state instantly.
-- **Diff** — See exactly what changed: rows added, deleted, modified, schema changes.
-- **Deterministic** — Same seed produces identical output every time.
-- **Read-Only** — sow never writes to your source database.
-- **Auto-Detect** — Scans .env files, Prisma, Drizzle, Knex, TypeORM, Sequelize, Docker Compose.
+- **Built for coding agents.** MCP server with 22 tools, `--json` mode for every command, `SKILL.md` for agent context, deterministic seeds so bugs reproduce across sessions.
+- **PII-safe by default.** Detects emails, phones, names, addresses, SSNs, JSONB-embedded fields. Fail-closed: aborts if it sees a Postgres type it can't verify, with `--allow-unsafe` to override explicitly.
+- **Reset in under 1 second.** Postgres template-database backed. Your agent can try a destructive change, verify the result, reset, try again — 50 iterations in a minute.
+- **Zero config.** Auto-detects env files, Prisma, Drizzle, Knex, TypeORM, Sequelize, Docker Compose. Identifies Supabase, Neon, Vercel Postgres, and Railway projects.
+- **Read-only on the source.** sow never writes to your production database. Parameterized queries, identifier escaping, and a security-audited code path verified by both Claude and Codex adversarial review.
+- **100% local.** No cloud round-trip, no third party holding your sanitized data, no account, no API key. The sandbox lives on your laptop.

 ## Quick Start

 ```bash
+# Zero-config: detect your DB, sample, sanitize, patch .env.local
+sow sandbox
+
+# Or do it explicitly
 sow connect postgresql://user:pass@host:5432/mydb   # analyze, sample, sanitize
 sow branch create my-feature                        # isolated Postgres in ~5s
 DATABASE_URL=postgresql://sow:sow@localhost:54320/sow npm run dev
-sow branch diff my-feature                          # see what changed
+sow branch reset my-feature                         # back to seed state in <1s
+sow branch diff my-feature                          # see what your agent changed
 sow branch delete my-feature                        # clean up
 ```

@@ -56,10 +62,10 @@ sow branch delete my-feature # clean up

 ```bash
 npm install -g @sowdb/mcp
-sow mcp --agent cursor          # or claude-code, windsurf, codex
+sow mcp --agent claude-code     # or cursor, windsurf, codex
 ```

-Or add manually to your MCP config:
+Or add to your MCP config manually:

 ```json
 {
@@ -75,26 +81,50 @@ Install the agent skill for context:
 npx skills add Bugsterapp/sow
 ```

+The MCP server exposes 22 tools: `sow_sandbox`, `sow_connect`, `sow_detect`, `sow_branch_create`, `sow_branch_reset`, `sow_branch_diff`, `sow_branch_save`, `sow_branch_load`, `sow_branch_exec`, `sow_branch_users`, `sow_branch_tables`, `sow_branch_sample`, and more. Every tool returns structured JSON. Agents drive the full sample → branch → exec → diff → reset loop without a human in the middle.
+
 ## How It Works

 ```
-Production DB          sow Pipeline              Local Branches
+Production DB          sow Pipeline              Local Sandbox
  ┌──────────┐     ┌──────────────────────┐     ┌──────────────┐
  │ Schema   │     │ 1. Analyze           │     │ Branch A     │
- │ Stats    │────>│ 2. Sample (N rows)   │────>│ :54320       │
+ │ Stats    │────>│ 2. Sample (N rows)   │────>│ :54320/A     │
  │ Data     │     │ 3. Sanitize PII      │     │              │
  │ (read    │     │ 4. Save snapshot     │     │ Branch B     │
- │  only)   │     │    (~2 MB)           │     │ :54321       │
- └──────────┘     └──────────────────────┘     └──────────────┘
-                                                Provider-managed
+ │  only)   │     │    (~2 MB)           │     │ :54320/B     │
+ └──────────┘     └──────────────────────┘     │              │
+                                               │ Branch C     │
+                                               │ :54320/C     │
+                                               └──────────────┘
+                                               One container
+                                               per connector,
+                                               N branch DBs,
+                                               reset in <1s.
 ```

+## Cookbook
+
+Three workflows that show the full agent loop. See [`docs/cookbook.md`](docs/cookbook.md) for the prompts and full walkthrough.
+
+1. **Let Claude refactor your schema without fear** — `sow sandbox`, then ask Claude to add a column, drop an index, rename a table. Verify, reset, try a different approach.
+2. **Let Cursor generate seed data for a new feature** — point your agent at the sandbox and ask for "100 realistic users with orders." Inspect with `sow branch sample`. Reset and ask for a different distribution.
+3. **Let your coding agent debug a failing migration** — replay your last migration on the sandbox. If it fails, reset and try a fix. No prod risk.
+
+## Documentation
+
+- [`docs/sandbox.md`](docs/sandbox.md) — the `sow sandbox` flagship command, flags, and `.env.local` patching with backup/revert
+- [`docs/sanitization.md`](docs/sanitization.md) — what sow sanitizes, the fail-closed gate, JSONB handling, and the `--allow-unsafe` flag
+- [`docs/cookbook.md`](docs/cookbook.md) — three end-to-end workflows for coding agents
+- [`CHANGELOG.md`](CHANGELOG.md) — release history
+- [`CONTRIBUTING.md`](CONTRIBUTING.md) — building from source, running tests, the lane structure
+
 ## sow Cloud — coming soon

 sow CLI is free, open source, and works 100% locally. Always will be.

-sow Cloud is for teams: shared connectors, CI/CD without Docker-in-Docker, compliance (data never touches dev laptops), and a team dashboard.
+sow Cloud is for teams: shared connectors, CI/CD without Docker-in-Docker, compliance (sanitized data never touches dev laptops), and a team dashboard.

 [Join the waitlist →](https://tally.so/r/0QvzZN)
diff --git a/docs/cookbook.md b/docs/cookbook.md
new file mode 100644
index 0000000..135bc11
--- /dev/null
+++ b/docs/cookbook.md
@@ -0,0 +1,167 @@
+# Cookbook
+
+Three end-to-end workflows that show what sow actually unlocks. Every workflow assumes you've installed sow and have a project with a Postgres database.
+
+```bash
+npm install -g @sowdb/cli
+cd your-project
+sow sandbox
+```
+
+After that, your `.env.local` has `DATABASE_URL` pointing at the local sandbox. Your coding agent reads it like any other env var.
+
+---
+
+## 1. Let Claude refactor your schema without fear
+
+**The scenario.** You want Claude Code to add a column, drop an unused index, rename a poorly-named table. The kind of work you'd never let an agent do against prod, but the kind that's safe and useful in a sandbox.
+
+**The setup.**
+
+```bash
+cd your-project
+sow sandbox
+```
+
+`.env.local` now has `DATABASE_URL=postgresql://sow:sow@localhost:54320/sow_sandbox`. Your existing migration tooling (Prisma, Drizzle, Knex, raw SQL — doesn't matter) reads from there.
+
+**The prompt.** Open Claude Code in the project. Ask it:
+
+> Look at the current schema in our Prisma file. The `user_profiles.bio_text` column is going unused. Add a migration to drop it, then run the migration against the sandbox to verify it works. If it breaks something, tell me what.
+
+**What happens.**
+
+1. Claude reads `prisma/schema.prisma`, identifies the `bio_text` column.
+2. Claude runs `npx prisma migrate dev --name drop_user_profiles_bio_text` against your sandbox.
+3. The migration executes against the local sandbox Postgres. Prod is untouched.
+4. Claude reports back: "Migration ran cleanly. Verified by running `prisma migrate status`. All existing tests still pass."
+
+**If it breaks something:**
+
+```bash
+sow branch reset sandbox # back to seed state in <1s
+```
+
+Now Claude can try a different approach with a clean slate. Five iterations in a minute. Without sow, every "let me try a different migration" round-trip would either be against a stale local copy (data drift) or against staging (pollution).
+
+---
+
+## 2. Let Cursor generate seed data for a new feature
+
+**The scenario.** You're shipping a new "team workspaces" feature. You need realistic test data: 100 users, ~30 teams, each user belonging to 1-3 teams, with realistic email distributions and signup dates spread over 6 months.
+
+Writing this seed script by hand is tedious. Letting an agent do it against the *real* user table in staging is unsafe (it pollutes the table for everyone else, and the real users have constraints you don't want to violate).
+
+**The setup.**
+
+```bash
+sow sandbox
+```
+
+**The prompt.** Open Cursor in the project. Ask it:
+
+> Look at the `users`, `teams`, and `team_memberships` tables in our schema. Write a SQL script that inserts 100 realistic users, 30 teams, and team memberships such that each user belongs to 1-3 teams. Use realistic email distributions and spread signup dates over the last 6 months. Run it against the sandbox using `sow branch exec`.
+
+**What happens.**
+
+1. Cursor reads the schema, understands the foreign key relationships.
+2. Cursor writes `seeds/team_workspaces.sql` with the inserts.
+3. Cursor runs `sow branch exec sandbox --file seeds/team_workspaces.sql`.
+4. Sandbox now has 100 users + 30 teams + ~200 memberships. Real users in staging are untouched.
+
+**Inspect what got created:**
+
+```bash
+sow branch sample sandbox users
+sow branch sample sandbox teams
+sow branch tables sandbox # row counts for every table
+```
+
+**Don't like the distribution?**
+
+```bash
+sow branch reset sandbox
+```
+
+And ask Cursor to try a different approach.
+
+---
+
+## 3. Let your coding agent debug a failing migration
+
+**The scenario.** Your last migration broke something in CI. You don't know exactly what — it ran fine locally, fails on staging. You want to replay it against a sandbox built from the actual prod schema (not your stale local copy) and have the agent figure out what's wrong.
+
+**The setup.**
+
+```bash
+cd your-project
+sow sandbox # samples from prod, gives you a fresh sandbox
+```
+
+The sandbox now has the *current* prod schema, not the schema you had locally last week.
+
+**The prompt.** Open Claude Code:
+
+> Our migration `2026_04_06_add_team_workspaces.sql` is failing in CI but I can't reproduce it locally. Run it against the sandbox using `sow branch exec` and tell me the exact error. Then fix the migration so it works.
+
+**What happens.**
+
+1. Claude runs `sow branch exec sandbox --file db/migrations/2026_04_06_add_team_workspaces.sql`.
+2. Postgres returns the actual error (e.g. `ERROR: column "user_id" referenced in foreign key constraint does not exist`).
+3. Claude reads the migration, sees the bug (maybe a typo, maybe a missing prerequisite column).
+4. Claude proposes a fix and runs it: `sow branch reset sandbox && sow branch exec sandbox --file db/migrations/2026_04_06_add_team_workspaces.sql`.
+5. Iterates until the migration runs cleanly.
+
+**Verify what changed:**
+
+```bash
+sow branch diff sandbox
+```
+
+Shows you exactly which tables, columns, indexes, and rows the migration touched. You see the same diff Claude saw.
+
+---
+
+## Pattern: the agent reset loop
+
+Every workflow above follows the same loop:
+
+```
+┌──────────────────────────────────────────┐
+│ 1. Agent does something destructive      │
+│    sow branch exec sandbox ...           │
+│                                          │
+│ 2. Agent verifies the result             │
+│    sow branch diff sandbox               │
+│    sow branch sample sandbox             │
+│                                          │
+│ 3. Wrong? Reset and try again            │
+│    sow branch reset sandbox (~200ms)     │
+│                                          │
+│ 4. Right? Move on, prod still untouched  │
+└──────────────────────────────────────────┘
+```
+
+The reset is the magic. Without it, "let me try a different approach" means "let me clobber my stale local copy and hope I remember to refresh it." With it, every attempt starts from a clean, sanitized, prod-shaped database.
+
+## MCP tools your agent can call directly
+
+If your agent supports MCP (Claude Code, Cursor, Windsurf, Codex), `sow mcp --agent ` configures it to call sow's tools directly without any shell-out. The 22 tools cover the full loop:
+
+- `sow_sandbox` — the flagship zero-config flow
+- `sow_detect`, `sow_connect`, `sow_connector_list/refresh/delete`
+- `sow_branch_create/list/info/delete/reset/diff/exec/sample/tables/users/env`
+- `sow_branch_save/load` (named checkpoints — like git commits for your sandbox)
+
+Every tool returns structured JSON. Every tool is idempotent where it can be. Every tool is documented so the agent picks the right one without prompting.
+
+## Tips
+
+**Keep one long-running sandbox per project.** Don't `sow branch delete sandbox` between sessions — the reset is fast, the recreate is fast, but reusing keeps the connector and Docker container warm.
+
+**Use checkpoints for "known good states."** Mid-debug, run `sow branch save sandbox before-fix`. After a few attempts, `sow branch load sandbox before-fix` brings you back. Like `git stash` for databases.
+
+**Use `sow doctor sandbox` if something feels off.** It surfaces sanitization warnings, integrity warnings, and snapshot stats so you can tell whether the sandbox shape matches prod.
+
+**Don't `sow connect` with a wide-permission user.** Even though sow is read-only, the principle of least privilege applies. Create a read-only Postgres user just for sow.
+
diff --git a/docs/sandbox.md b/docs/sandbox.md
new file mode 100644
index 0000000..8af1119
--- /dev/null
+++ b/docs/sandbox.md
@@ -0,0 +1,74 @@
+# `sow sandbox` — the flagship command
+
+`sow sandbox` is the one-command zero-config flow. Run it inside any project that has a Postgres database, and you get a local sanitized sandbox with `DATABASE_URL` already wired up.
+
+```bash
+cd your-project
+sow sandbox
+```
+
+That's it. Your coding agent (Claude Code, Cursor, Codex, anything that reads `DATABASE_URL` from the environment or `.env.local`) now talks to a local Postgres copy with PII scrubbed. Prod is untouched.
+
+## What it does, in order
+
+1. **Detects your source database.** Scans `.env`, `.env.local`, Prisma `schema.prisma`, Drizzle config, Knex config, TypeORM config, Sequelize config, `docker-compose.yml`, and `package.json` for a `DATABASE_URL` or equivalent. Identifies Supabase, Neon, Vercel Postgres, and Railway projects via the env vars they use.
+2. **Reuses an existing connector if one is set up,** or runs `sow connect` against the detected URL. The connect step samples representative rows (default 200 per table, with edge cases), scrubs every PII column with deterministic Faker output, and saves a snapshot to `~/.sow/snapshots//init.sql`.
+3. **Creates a branch** named `sandbox` (override with `--name`). On first run for this connector, this spins up a long-lived Docker Postgres container holding a frozen seed database plus your branch database. On subsequent runs, branches are cloned from the seed in under 1 second.
+4. **Patches `.env.local`** with the new `DATABASE_URL` and `SOW_BRANCH=sandbox`. Other variables in the file are preserved. A backup is written to `.env.local.sow.bak` so you can revert.
+5. **Prints the connection string** and a one-line confirmation:
+   ```
+   ✓ Sandbox ready at :54320/sow_sandbox
+   DATABASE_URL=postgresql://sow:sow@localhost:54320/sow_sandbox
+   Patched .env.local (backup: .env.local.sow.bak)
+   ```
+
+Run your dev server normally — `npm run dev`, `bun dev`, whatever you already use — and your app reads from the sandbox.
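
Step 4 above (patching `.env.local` while preserving every other variable) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not sow's actual implementation: `patchEnvText` and its signature are hypothetical names introduced here, and the sketch works on the file's text only, so the caller would still be responsible for writing the `.env.local.sow.bak` backup before overwriting the file.

```typescript
// Hypothetical helper, not part of sow: returns a patched copy of an env
// file's text. Keys present in `updates` are rewritten in place; keys that
// are missing are appended. Comments, blank lines, and unrelated variables
// are left untouched.
function patchEnvText(original: string, updates: Record<string, string>): string {
  const seen = new Set<string>();
  const lines: string[] = original === "" ? [] : original.split("\n");

  // A trailing newline yields one empty final element; drop it so that
  // appended keys do not end up after a blank line.
  if (lines.length > 0 && lines[lines.length - 1] === "") {
    lines.pop();
  }

  const patched = lines.map((line) => {
    // Matches `KEY=...` and `export KEY=...` assignments.
    const match = /^\s*(?:export\s+)?([A-Za-z_][A-Za-z0-9_]*)\s*=/.exec(line);
    const key = match?.[1];
    if (key === undefined || !Object.prototype.hasOwnProperty.call(updates, key)) {
      return line;
    }
    seen.add(key);
    // Assumption: a rewritten line is normalized to plain KEY=value form.
    return `${key}=${updates[key]}`;
  });

  for (const [key, value] of Object.entries(updates)) {
    if (!seen.has(key)) {
      patched.push(`${key}=${value}`);
    }
  }

  return patched.join("\n") + "\n";
}

console.log(
  patchEnvText("API_KEY=abc\nDATABASE_URL=postgres://prod-host/app\n", {
    DATABASE_URL: "postgresql://sow:sow@localhost:54320/sow_sandbox",
    SOW_BRANCH: "sandbox",
  }),
);
```

Keeping the patch a pure text-to-text function makes the "other variables are preserved" guarantee easy to test in isolation; the backup and the final write are then two ordinary file operations around it.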
+
+## Flags
+
+| Flag | Default | Purpose |
+|---|---|---|
+| `[url]` (positional) | auto-detected | Override the source connection string |
+| `--name ` | `sandbox` | Branch name |
+| `--env-file ` | `.env.local` | Which env file to patch |
+| `--no-env-file` | off | Skip the env patch — just print the URL |
+| `--yes` / `-y` | off | Skip the interactive confirmation prompt |
+| `--max-rows ` | 200 | Rows per table during sampling |
+| `--seed ` | 42 | Reproducibility seed |
+| `--full` | off | Copy all rows instead of sampling |
+| `--no-sanitize` | off | Skip PII sanitization (NOT recommended) |
+| `--allow-unsafe` | off | Allow Postgres types sow doesn't recognize (see [`sanitization.md`](sanitization.md)) |
+| `--json` | off | JSON output for agent consumption |
+| `--quiet` / `-q` | off | Minimal output |
+
+## Reverting
+
+If you want to undo the `.env.local` patch and restore the original file:
+
+```bash
+sow env revert
+```
+
+This reads `.env.local.sow.bak` and writes it back to `.env.local`, then deletes the backup.
+
+## Re-running
+
+Running `sow sandbox` again when a sandbox already exists:
+
+- Reuses the existing connector (no re-sampling)
+- Reuses the existing branch (no re-creation)
+- Re-patches `.env.local` if needed (skipped if already correct)
+- Exits in under a second
+
+If you want a fresh sandbox with new sampled data, run `sow connector refresh sandbox` first.
+
+## When NOT to use `sow sandbox`
+
+- You want to create *multiple* differently-named branches (use `sow branch create ` directly)
+- You want to point at a specific non-detected source URL once and don't want it stored as a connector (use `sow connect ` then `sow branch create`)
+- You're running in CI and don't want the `.env.local` patch (use `sow connect && sow branch create dev --env-file ci.env --yes`)
+
+## What's actually in the sandbox
+
+Run `sow doctor sandbox` to see snapshot stats and any sanitization warnings. Run `sow branch tables sandbox` to list tables with row counts. Run `sow branch sample sandbox ` to peek at a table's first few rows (the values are sanitized — emails are Faker emails, names are Faker names, etc., but the *shape* matches your real data).
+
diff --git a/docs/sanitization.md b/docs/sanitization.md
new file mode 100644
index 0000000..c7eeb9a
--- /dev/null
+++ b/docs/sanitization.md
@@ -0,0 +1,159 @@
+# Sanitization
+
+sow's job is to give your coding agent a database that *looks like prod* but contains *zero real PII*. This document explains exactly what sow scrubs, what it doesn't, and how to add custom rules.
+
+## What gets sanitized automatically
+
+sow runs every column through two detectors before sampling:
+
+1. **Type-based detection.** Some Postgres types are inherently sensitive: `inet`, `cidr`, `macaddr`, `macaddr8`. sow has built-in transformers for each.
+2. **Name-based detection.** Column names are matched against patterns for these PII categories:
+
+| Category | Example column names | Transformer |
+|---|---|---|
+| Email | `email`, `email_address`, `user_email` | Faker `internet.email()` |
+| Phone | `phone`, `phone_number`, `mobile`, `cell` | Faker `phone.number()` |
+| Name | `first_name`, `last_name`, `full_name`, `name` | Faker `person.firstName/lastName` |
+| Address | `address`, `street`, `street_address` | Faker `location.streetAddress()` |
+| SSN | `ssn`, `social_security_number` | Faker formatted SSN |
+| Credit card | `credit_card`, `card_number`, `cc_number` | Faker `finance.creditCardNumber()` |
+| IP address | `ip`, `ip_address` | Faker IPv4 or IPv6 |
+| MAC address | `mac`, `mac_address` | Faker `internet.mac()` |
+| URL | `url`, `website` | Faker `internet.url()` |
+| UUID | `id`, `*_id` (when type is `uuid`) | Faker `string.uuid()` |
+| Date of birth | `dob`, `date_of_birth`, `birthday` | Faker `date.birthdate()` ±30 days |
+| Password hash | `password`, `password_hash`, `encrypted_password` | bcrypt hash of `password123` |
+| Free text | `bio`, `description`, `notes` (when no other rule applies) | Faker `lorem.paragraph()` |
+
+Every transformer is **deterministic**: the same input value always produces the same fake output. This means foreign keys stay consistent across tables — if `users.email = "alice@corp.com"` is referenced by `audit_log.actor_email`, both get the same Faker replacement.
+
+## JSONB columns
+
+JSONB is the most common PII leak vector in modern Postgres schemas. A `profiles.metadata::jsonb` column might contain:
+
+```json
+{
+  "email": "alice@corp.com",
+  "phone": "+1-555-0100",
+  "preferences": { "theme": "dark", "newsletter": true },
+  "contact": { "billing_email": "alice@corp.com" }
+}
+```
+
+sow walks the JSONB structure recursively and replaces values whose **key** matches a PII pattern. The example above becomes:
+
+```json
+{
+  "email": "",
+  "phone": "",
+  "preferences": { "theme": "dark", "newsletter": true },
+  "contact": { "billing_email": "" }
+}
+```
+
+Scalar JSONB (a bare string, number, or null) is passed through. Arrays of objects are walked element-by-element. Invalid JSON passes through unchanged with a warning.
+
+## The fail-closed gate
+
+sow refuses to sanitize a column whose Postgres type it doesn't have an explicit handler for. If your schema has:
+
+- A `tsvector` column (full-text search)
+- A custom enum type
+- An `hstore` column
+- A `pg_lsn` or other system type
+
+...sow will abort `sow connect` with a clear error:
+
+```
+Sanitization aborted — 2 columns have types sow cannot verify:
+  - audit.tags (tsvector) — no tsvector handler configured
+  - users.role (user_role) — custom enum type
+
+These columns would be copied to the sandbox AS-IS, potentially leaking
+PII that exists in them. Pass --allow-unsafe to skip sanitization of
+these columns (they will be NULLed out in the branch).
+
+To add explicit handling, edit .sow.yml:
+  sanitize:
+    rules:
+      - table: audit
+        column: tags
+        type: free_text
+```
+
+This is the **fail-closed default** — anxiety reduction is the whole pitch, and silently passing unknown types through would break it.
+
+## The `--allow-unsafe` escape hatch
+
+When you know what you're doing and want to proceed anyway:
+
+```bash
+sow connect --allow-unsafe postgres://...
+sow sandbox --allow-unsafe
+```
+
+With this flag, columns of unknown types are **NULLed out** in the sandbox (not passed through!). The user is saying "I know there may be gaps; strip those columns to NULL rather than leaking them." A warning summary is printed and surfaces in `sow doctor `.
+
+## Custom rules
+
+You can override or extend the built-in detection with a `.sow.yml` file in your project root:
+
+```yaml
+sanitize:
+  enabled: true
+  rules:
+    # Sanitize a column the auto-detector missed
+    - table: audit
+      column: actor_email
+      type: email
+
+    # Use a custom transformer for a custom enum
+    - table: users
+      column: role
+      type: passthrough # don't touch — this is fine to copy as-is
+
+    # Treat a tsvector column as free-text
+    - table: posts
+      column: search_index
+      type: free_text
+
+  # Skip these columns entirely (they will appear in the sandbox unchanged)
+  skip_columns:
+    - users.created_at
+    - users.id
+```
+
+## Inspecting what was sanitized
+
+After `sow connect`, sow records every PII column it detected and every rule it applied in `~/.sow/snapshots//metadata.json`. To see them:
+
+```bash
+sow doctor
+```
+
+Output includes:
+- Column count, row count, snapshot size
+- PII columns detected (with the type sow assigned to each)
+- Any sanitization warnings (e.g. JSONB columns that failed to parse)
+- Any referential integrity warnings from the sampler (FK relationships that couldn't be fully resolved)
+
+## What sow does NOT do
+
+These are out-of-scope by design:
+
+- **Free-text PII detection.** sow does NOT scan a free-text field for embedded emails, phone numbers, or names. The whole field is replaced with Lorem Ipsum if it matches the `free_text` pattern. This is a known limitation — see the design doc TODO.
+- **Schema-level auditing.** sow doesn't tell you "your schema is leaking PII" or grade your data classification. It scrubs what it sees.
+- **Encryption.** Sanitization is replacement, not encryption. The sandbox is plaintext by design (your local agent needs to read it).
+- **Cloud relay.** sow runs 100% locally. PII never leaves your laptop. There is no "send to sow Cloud for processing" path.
+
+## Read-only on the source
+
+sow's source database access is **strictly read-only in intent and effect**:
+
+- All SQL is parameterized via `$1, $2, ...` placeholders. No string interpolation.
+- All identifiers (table and column names) are quoted via the SQL standard escape (`quoteIdent`).
+- The connector code path was security-audited by both Claude and Codex adversarial review (see the v0.1.14 security fix in the changelog).
+- sow never issues `INSERT`, `UPDATE`, `DELETE`, `DROP`, `ALTER`, `TRUNCATE`, or any DDL against the source database.
+
+If you point sow at a database with read-only credentials, it will still work. We recommend it.
+
diff --git a/packages/cli/package.json b/packages/cli/package.json
index bcfdba0..1122011 100644
--- a/packages/cli/package.json
+++ b/packages/cli/package.json
@@ -1,7 +1,7 @@
 {
   "name": "@sowdb/cli",
   "version": "0.1.14",
-  "description": "Safe test databases from production Postgres",
+  "description": "Stop letting Claude touch your prod database. PII-safe local Postgres sandbox for coding agents.",
   "type": "module",
   "license": "MIT",
   "repository": {
@@ -17,7 +17,13 @@
     "docker",
     "sow",
     "pii",
-    "sanitize"
+    "sanitize",
+    "ai-agents",
+    "coding-agents",
+    "claude-code",
+    "cursor",
+    "sandbox",
+    "mcp"
   ],
   "bin": {
     "sow": "dist/cli.js"
diff --git a/packages/core/package.json b/packages/core/package.json
index 376749b..027b6e9 100644
--- a/packages/core/package.json
+++ b/packages/core/package.json
@@ -1,7 +1,7 @@
 {
   "name": "@sowdb/core",
   "version": "0.1.14",
-  "description": "sow core engine: analyze, sample, sanitize, and branch Postgres databases",
+  "description": "sow core engine — analyze, sample, sanitize, and branch Postgres databases for safe coding-agent sandboxes",
   "type": "module",
   "license": "MIT",
   "repository": {
@@ -15,7 +15,10 @@
     "database",
     "sanitize",
     "pii",
-    "sow"
+    "sow",
+    "ai-agents",
+    "coding-agents",
+    "sandbox"
   ],
   "main": "./dist/index.js",
   "types": "./dist/index.d.ts",
diff --git a/packages/mcp/package.json b/packages/mcp/package.json
index 49bce66..4e7c5d2 100644
--- a/packages/mcp/package.json
+++ b/packages/mcp/package.json
@@ -1,7 +1,7 @@
 {
   "name": "@sowdb/mcp",
   "version": "0.1.14",
-  "description": "sow MCP server: 15 tools for AI agents to manage test database branches",
+  "description": "sow MCP server — 22 tools for coding agents (Claude Code, Cursor, Codex) to safely manage Postgres sandboxes",
   "type": "module",
   "license": "MIT",
   "repository": {
@@ -14,6 +14,12 @@
     "postgres",
     "test-data",
     "ai-agent",
+    "ai-agents",
+    "coding-agents",
+    "claude-code",
+    "cursor",
+    "codex",
+    "sandbox",
     "sow"
   ],
   "main": "./dist/index.js",
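
For readers of this diff who have not seen the helper, the `quoteIdent()` behavior described in the 0.1.14 changelog entry (SQL-standard double-quote escaping, throwing on empty identifiers and embedded NUL bytes) might look roughly like the sketch below. It is written only from that description, so treat it as an assumption about the shape of the code in `packages/core/src/sql/identifiers.ts`, not a copy of it; the error messages in particular are invented for illustration.

```typescript
// Sketch of an identifier-quoting helper, reconstructed from the changelog
// description only. The real implementation may differ.
function quoteIdent(name: string): string {
  if (name.length === 0) {
    // Illustrative message; the actual wording is unknown.
    throw new Error("identifier must not be empty");
  }
  if (name.includes("\0")) {
    // Illustrative message; the actual wording is unknown.
    throw new Error("identifier must not contain NUL bytes");
  }
  // SQL-standard escape: double every embedded double quote, then wrap the
  // whole identifier in double quotes.
  return `"${name.replace(/"/g, '""')}"`;
}

// Identifiers are quoted into the SQL text; values still go through bind
// parameters such as $1.
const sampleQuery = `SELECT * FROM ${quoteIdent("user profiles")} LIMIT $1`;
console.log(sampleQuery);
```

The split matters because bind parameters can only carry values, never table or column names, which is why the changelog pairs `quoteIdent` for identifiers with `$1, $2, ...` placeholders for data.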