MCP-native experiment knowledge system for coding agents.
This is not a generic agent memory product and not another eval dashboard. The project studies how to turn coding-agent work into reusable experimental knowledge:
hypothesis
-> test design
-> agent run
-> observed failures
-> metric movement
-> interpretation
-> intervention
-> next experiment
-> accumulated policy
The main agent-facing artifact is a work brief: a compact, source-backed packet of known risks, dependency-specific issue knowledge, prior failures, and approved interventions that an agent should read before editing code.
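As an illustration of what such a packet could look like in code, here is a minimal sketch. The class names, field names, and the rendered layout are assumptions made for this example; they are not the project's actual agent_work_context.v1 schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class BriefItem:
    """One claim in the brief, tied to the source that backs it."""

    text: str
    source_id: str  # hypothetical identifier of a wiki page or issue snapshot


@dataclass
class WorkBrief:
    """Hypothetical brief shape; the real schema may differ."""

    task: str
    known_risks: list[BriefItem] = field(default_factory=list)
    dependency_issues: list[BriefItem] = field(default_factory=list)
    prior_failures: list[BriefItem] = field(default_factory=list)
    approved_interventions: list[BriefItem] = field(default_factory=list)

    def render(self) -> str:
        """Render a compact text packet an agent could read before editing."""
        sections = [
            ("Known risks", self.known_risks),
            ("Dependency issues", self.dependency_issues),
            ("Prior failures", self.prior_failures),
            ("Approved interventions", self.approved_interventions),
        ]
        lines = [f"Task: {self.task}"]
        for title, items in sections:
            if not items:
                continue  # keep the packet compact: skip empty sections
            lines.append(f"{title}:")
            lines.extend(f"- {item.text} [source: {item.source_id}]" for item in items)
        return "\n".join(lines)
```

Keeping a source identifier on every item is one way to make the "source-backed" property checkable: an item without a source cannot be constructed.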
The main research artifact is a matrix report: repeated agent runs that separate final task success, protocol compliance, clean-pass rate, red-green churn, forbidden edits, and policy candidates.
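To make the separation between these metrics concrete, the sketch below aggregates a list of run records. The record fields and the definition of a clean pass (final pass with no red-green flips and no forbidden edits) are assumptions for illustration, not the project's actual definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    final_passed: bool        # did the final verification pass?
    protocol_compliant: bool  # were the required protocol steps recorded?
    red_green_flips: int      # assumed: fail-then-pass cycles before the final result
    forbidden_edits: int      # assumed: edits to files the task marked off-limits


def summarize(runs):
    """Aggregate repeated runs into separately tracked rates."""
    total = len(runs)
    if total == 0:
        raise ValueError("summarize() needs at least one run")
    clean = sum(
        1
        for r in runs
        if r.final_passed and r.red_green_flips == 0 and r.forbidden_edits == 0
    )
    return {
        "final_pass_rate": sum(r.final_passed for r in runs) / total,
        "protocol_compliance_rate": sum(r.protocol_compliant for r in runs) / total,
        "clean_pass_rate": clean / total,
        "mean_red_green_flips": sum(r.red_green_flips for r in runs) / total,
        "runs_with_forbidden_edits": sum(1 for r in runs if r.forbidden_edits > 0),
    }
```

Because each rate is computed independently, a run set can score high on final pass rate while scoring low on clean-pass rate, which is the distinction the report is meant to expose.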
Public homepage:
https://netsky-lab.github.io/agent-experiment-os/
Research prototype. The repo already contains:
- MCP server for agent pre-work protocols;
- Postgres + pgvector-backed knowledge retrieval;
- source-backed wiki pages with dependsOn edges;
- agent-facing agent_work_context.v1;
- agent-facing agent_presentation_contract.v1 with must_load, dependsOn, decision rules, known failures, and evidence boundaries;
- Codex matrix runners for Drizzle version traps and Python API drift;
- run/event metrics, churn drill-downs, reports, and review workflows;
- matrix comparison read models for protocol compliance vs execution quality;
- strict adapter completion gates that require pre-work, dependency loading, verification, and final-answer recording before a gated run can complete;
- issue-ingestion review boundaries that keep GitHub claims as evidence-only until local verification and human review;
- Next.js frontend scaffold for the product dashboard.
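The strict completion gate in the list above can be pictured as a check over the events recorded for a run. This is a minimal sketch: the event names and the fixed reporting order are assumptions for illustration, and the function is not the implementation behind the project's adapters or endpoints.

```python
# Hypothetical event names for the four requirements named in the list above.
REQUIRED_STEPS = (
    "pre_work",
    "dependency_loading",
    "verification",
    "final_answer",
)


def next_required_action(recorded_events):
    """Return the first required step that has not been recorded, or None."""
    seen = set(recorded_events)
    for step in REQUIRED_STEPS:
        if step not in seen:
            return step
    return None


def can_complete(recorded_events):
    """A gated run may complete only when no required step is missing."""
    return next_required_action(recorded_events) is None
```

This version only checks that each step was recorded at least once; it does not enforce the order in which the steps happened.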
Documentation:
- Thesis
- Architecture
- Agent adapter layer
- Agent work context
- Codex quickstart
- Codex MCP contract
- Backend API contract
- Local stack
- Product dashboard
- Experiment methodology
- Public matrix report
- Issue evidence model
- Issue evidence security
- Security notes
- Knowledge wiki model
- MCP dependency flow
- Research agenda
- Roadmap
- v0.1.0 research preview notes
- v0.1.1 research preview notes
Start the full local stack:
make up
This starts Postgres, runs migrations, seeds wiki/demo knowledge, starts the FastAPI backend, and starts the Next.js dashboard.
Open:
http://127.0.0.1:3000
On hosts where Docker publishes ports but HTTP requests hang, keep the stack running and start the fallback proxy:
NEXT_PUBLIC_EXPERIMENT_OS_API_URL=http://127.0.0.1:8091 docker compose up -d --force-recreate frontend
make dev-proxy
Then open:
http://127.0.0.1:3019
Run tests:
make test
Reset the seeded demo dataset:
make seed-reset
Build the production-style stack:
make up-prod
Run only the API manually:
make api
Open the legacy static product UI after the API starts:
http://127.0.0.1:8080/app/
Run the Next.js dashboard directly from the frontend package:
cd frontend
NEXT_PUBLIC_EXPERIMENT_OS_API_URL=http://127.0.0.1:8080 npm run dev
If the API URL is omitted, the dashboard uses a built-in research-preview dataset so the UI can still be reviewed without a running backend.
Run the MCP server:
make mcp
Run the next matrix:
make api-drift-matrix
Check the database connection from the Docker network:
docker compose run --rm app uv run experiment-os db check
Run a deterministic experiment fixture:
docker compose run --rm app uv run experiment-os experiments run-drizzle-fixture
Run a shell-command agent condition and capture transcript artifacts:
docker compose run --rm app uv run experiment-os experiments run-shell \
  --condition-id condition.001-drizzle-brief-assisted \
  --command "echo drizzle-orm@1.0.0-beta.22 && echo 'rg migration drizzle/migrations' && echo 'npm run db:generate passed'" \
  --workdir /workspace
Run a Codex CLI condition through codex exec:
docker compose run --rm app uv run experiment-os experiments run-codex \
  --condition-id condition.001-drizzle-brief-assisted \
  --prompt "Fix the Drizzle migration default-value issue with minimal changes." \
  --workdir /workspace \
  --sandbox workspace-write \
  --approval-policy never
Run Codex against the disposable Drizzle toy fixture:
docker compose run --rm app uv run experiment-os experiments run-codex-toy \
  --condition-id condition.001-drizzle-brief-assisted \
  --sandbox workspace-write \
  --approval-policy never
The fixture is copied from fixtures/drizzle-toy-repo into ignored artifacts/workdirs/... before execution, and the run writes transcript/report artifacts under artifacts/<run-id>/.
Run baseline vs brief-assisted Codex conditions:
docker compose run --rm app uv run experiment-os experiments run-codex-toy-comparison
Run the version-trap fixture where issue evidence conflicts with local package versions:
docker compose run --rm app uv run experiment-os experiments run-codex-version-trap
Run baseline vs brief-assisted Codex conditions on the version-trap fixture:
docker compose run --rm app uv run experiment-os experiments run-codex-version-trap-comparison
Run Codex with Experiment OS mounted as an MCP server for the task:
docker compose run --rm app uv run experiment-os experiments run-codex-mcp-version-trap
Run a repeated baseline/static-brief/MCP-brief matrix:
docker compose run --rm app uv run experiment-os experiments run-codex-version-trap-matrix \
  --repeat-count 3 \
  --sandbox danger-full-access
Run the harder version-trap matrix with a stricter oracle:
docker compose run --rm app uv run experiment-os experiments run-codex-version-trap-hard-matrix \
  --repeat-count 3 \
  --sandbox danger-full-access
Progress events are written as JSONL to stderr. The final JSON includes the matrix summary and, by default, a markdown result artifact under experiments/001-drizzle-brief/results/.
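A wrapper script that consumes the JSONL progress stream might look like the sketch below. The helper name is hypothetical, and nothing here assumes particular event fields; it only relies on the stream being one JSON object per line and tolerates unrelated lines mixed into stderr.

```python
import json


def parse_progress_lines(lines):
    """Collect JSON-object events from an iterable of stderr lines."""
    events = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate non-JSON noise such as tool warnings
        if isinstance(event, dict):
            events.append(event)  # keep only object-shaped events
    return events
```

With subprocess.run(..., capture_output=True, text=True), the captured stderr can be passed in as result.stderr.splitlines(). Reading the final JSON from stdout is an assumption about where the summary is printed, so check the actual output before relying on it.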
Run the Python API-drift matrix:
docker compose run --rm app uv run experiment-os experiments run-codex-api-drift-matrix \
  --repeat-count 3 \
  --sandbox danger-full-access \
  --approval-policy never
Run the nested API-drift matrix:
docker compose run --rm app uv run experiment-os experiments run-codex-api-drift-nested-matrix \
  --repeat-count 3 \
  --sandbox danger-full-access \
  --approval-policy never
Run the non-saturated API-drift matrix where issue evidence is intentionally misleading:
docker compose run --rm app uv run experiment-os experiments run-codex-api-drift-misleading-matrix \
  --repeat-count 3 \
  --sandbox danger-full-access \
  --approval-policy never
Register Experiment OS as a Codex MCP server:
codex mcp add experiment-os -- docker compose -f "$(pwd)/docker-compose.yml" run --rm app uv run experiment-os mcp serve
Smoke-check the MCP presentation contract:
docker compose run --rm app uv run experiment-os demo mcp-smoke
Run a model matrix by repeating --model:
docker compose run --rm app uv run experiment-os experiments run-codex-version-trap-hard-matrix \
  --model gpt-5.4-mini \
  --model gpt-5.4
Search local knowledge with full-text + pgvector retrieval:
docker compose run --rm app uv run experiment-os knowledge search "drizzle migration default"
Ingest GitHub issues as source snapshots and agent-readable source pages:
docker compose run --rm app uv run experiment-os issues ingest \
  --repo drizzle-team/drizzle-orm \
  --query "migration default" \
  --limit 3
Use a local GitHub-search JSON payload for reproducible issue-ingestion tests:
docker compose run --rm app uv run experiment-os issues ingest \
  --repo openai/openai-python \
  --query "responses migration" \
  --input-json research/issues/openai-python-responses-search.json
Run several issue-ingestion jobs from a config:
docker compose run --rm app uv run experiment-os issues batch \
  --config research/issues/issue-ingestion-batch.example.json
Refresh issue-derived knowledge from GitHub. If GITHUB_TOKEN is present, it is used for the API call:
docker compose run --rm app uv run experiment-os issues refresh \
  --repo openai/openai-python \
  --query "responses migration" \
  --limit 5
Check whether issue-derived version evidence matches the local project:
docker compose run --rm app uv run experiment-os issues version-alignment \
  --page-id claim.github-issue.drizzle-team.drizzle-orm.5661.versions \
  --local-version drizzle-orm=0.44.5
Prune local test pages from a dev database:
docker compose run --rm app uv run experiment-os db prune-test-pages
Run the MCP server over stdio:
docker compose run --rm app uv run experiment-os mcp serve
Run the MCP server over streamable HTTP:
docker compose run --rm app uv run experiment-os mcp serve --transport streamable-http
Run the dashboard/backend HTTP API:
docker compose run --rm --service-ports app uv run experiment-os api serve --host 0.0.0.0 --port 8080
Useful UI/read-model endpoints:
- GET /experiments
- GET /experiments/{experiment_id}/matrix
- GET /experiments/{experiment_id}/matrix/compare?left_matrix_id=...&right_matrix_id=...
- GET /experiments/{experiment_id}/matrix/regression?left_matrix_id=...&right_matrix_id=...
- POST /experiments/{experiment_id}/status
- GET /experiments/{experiment_id}/protocol-compliance
- GET /experiments/{experiment_id}/churn?matrix_id=...
- GET /runs/{run_id}
- GET /runs/{run_id}/completion-contract
- GET /runs/{run_id}/next-required-action
- GET /runs/{run_id}/churn
- GET /briefs/{brief_id}/agent-work-context
- GET /briefs/{brief_id}/presentation-preview
- GET /wiki/graph
- GET /knowledge/stale
- GET /knowledge/duplicates
- GET /pages/{page_id}/provenance
- POST /issue-knowledge/{page_id}/version-alignment
- POST /issue-knowledge/ingest
- GET /policy-candidates
- GET /ui/contract
- GET /ui/bootstrap
Write endpoints are open by default for local research. Set EXPERIMENT_OS_API_KEY to require x-api-key on mutations, and pass the same value to the dashboard as NEXT_PUBLIC_EXPERIMENT_OS_API_KEY for local protected demos.
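When EXPERIMENT_OS_API_KEY is set, a script that calls a write endpoint has to send the x-api-key header. The sketch below builds such a request with the standard library and does not send it. The helper name and the JSON payload shape are assumptions for illustration; the request body each endpoint expects is not documented here.

```python
import json
import urllib.request


def build_mutation_request(base_url, path, payload, api_key=None):
    """Build a JSON POST request, adding x-api-key only when a key is given."""
    request = urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )
    if api_key:
        request.add_header("x-api-key", api_key)
    return request
```

A caller could read the key with os.environ.get("EXPERIMENT_OS_API_KEY") and pass the result to urllib.request.urlopen once the backend is running. HTTP header names are case-insensitive, so the way urllib normalizes the header's capitalization should not matter to the server.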
On machines where Docker port publishing works normally, the same check can also run from the host:
uv run experiment-os db check
The v0 backend is split by responsibility:
- src/experiment_os/db/ - SQLAlchemy ORM models.
- src/experiment_os/domain/ - Pydantic input/output schemas.
- src/experiment_os/repositories/ - database access.
- src/experiment_os/retrieval/ - full-text + pgvector retrieval.
- src/experiment_os/services/ - application use cases, split into contracts, matrix comparison, regression, provenance, issue ingestion, review, and dashboard read models.
- src/experiment_os/mcp_server/ - MCP transport adapter.
- src/experiment_os/cli.py - developer CLI only.
The MCP tools and CLI commands share the same service layer.
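The shared-service-layer idea can be sketched as two thin adapters over one use-case object. Everything below is hypothetical: the class, the function names, and the in-memory substring search stand in for the real services and retrieval code, and no real MCP SDK or CLI framework is used.

```python
import json


class KnowledgeSearchService:
    """Hypothetical use case; stands in for a class under services/."""

    def __init__(self, pages):
        self._pages = pages  # in-memory stand-in for a repository

    def search(self, query):
        """Return pages whose text contains every query term."""
        terms = query.lower().split()
        return [p for p in self._pages if all(t in p["text"].lower() for t in terms)]


def cli_search(service, argv):
    """CLI-style adapter: join argv into a query, print-friendly output."""
    results = service.search(" ".join(argv))
    return "\n".join(page["id"] for page in results)


def mcp_search_tool(service, arguments):
    """MCP-style adapter: take a tool-arguments dict, return JSON text."""
    return json.dumps({"results": service.search(arguments["query"])})
```

Both adapters only translate input and output formats, so a change to search behavior happens in one place and reaches the CLI and the MCP tool together.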
- fixtures/drizzle-version-trap-repo - easy version trap, now mostly solved by current Codex.
- fixtures/drizzle-version-trap-hard-repo - stricter Drizzle oracle with one correct schema edit.
- fixtures/python-api-drift-repo - second-domain scaffold for Python SDK/API drift.
- fixtures/python-api-drift-nested-repo - API-drift fixture where the correct adapter is behind a nested module.
- fixtures/python-api-drift-hard-nested-repo - harder router fixture with dependency-upgrade bait.
The first flat-vs-nested API-drift comparison saturated final pass rate, so the useful signal moved to execution quality:
- gated Codex reached full protocol compliance but produced red-green churn;
- gated OpenCode reached full protocol compliance with clean passes;
- final pass rate, protocol compliance, and clean pass rate must be tracked separately.
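One way to see why a saturated metric stops carrying signal is to compare two matrix summaries metric by metric and flag the ones that sit at 1.0 on both sides. This is a sketch under assumptions: the summaries are plain dicts of rates keyed by metric name, and it is not the project's comparison read model.

```python
def compare_matrices(left, right, tolerance=1e-9):
    """Compare rate metrics present in both summaries.

    A metric counts as saturated when both sides are at 1.0 (within
    tolerance), so it cannot distinguish the two matrices.
    """
    report = {}
    for name in sorted(set(left) & set(right)):
        left_value, right_value = left[name], right[name]
        report[name] = {
            "left": left_value,
            "right": right_value,
            "delta": right_value - left_value,
            "saturated": (
                abs(left_value - 1.0) <= tolerance
                and abs(right_value - 1.0) <= tolerance
            ),
        }
    return report


def informative_metrics(report):
    """Names of metrics that are not saturated in both matrices."""
    return [name for name, row in report.items() if not row["saturated"]]
```

If final pass rate is saturated in both matrices, informative_metrics leaves it out and returns only the metrics that can still separate the two conditions.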
See:
- experiments/002-python-api-drift/results/2026-04-28-matrix-api-drift-ab282831e5a2.md
- experiments/002-python-api-drift/results/2026-04-28-matrix-api-drift-nested-3bafb86522e5.md
- experiments/002-python-api-drift/results/2026-04-28-flat-vs-nested-matrix-comparison.md