Agent Experiment OS

MCP-native experiment knowledge system for coding agents.

This is not a generic agent memory product and not another eval dashboard. The project studies how to turn coding-agent work into reusable experimental knowledge:

hypothesis
-> test design
-> agent run
-> observed failures
-> metric movement
-> interpretation
-> intervention
-> next experiment
-> accumulated policy

The main agent-facing artifact is a work brief: a compact, source-backed packet of known risks, dependency-specific issue knowledge, prior failures, and approved interventions that an agent should read before editing code.

The main research artifact is a matrix report: repeated agent runs that separate final task success, protocol compliance, clean-pass rate, red-green churn, forbidden edits, and policy candidates.
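
To make the separation concrete, here is a minimal sketch of how those rates could be computed side by side. It is not the project's report code: the `RunRecord` fields and the definition of a clean pass (passed, no red-green cycles, no forbidden edits) are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RunRecord:
    """One agent run. All field names are illustrative assumptions."""
    passed: bool              # final task success
    protocol_compliant: bool  # required pre-work steps were followed
    red_green_cycles: int     # failing verifications before the final pass
    forbidden_edits: int      # edits to files the task marked off-limits


def summarize(runs: List[RunRecord]) -> Dict[str, float]:
    """Report each rate separately instead of folding them into one score."""
    total = len(runs)
    if total == 0:
        return {"final_pass_rate": 0.0, "protocol_compliance_rate": 0.0, "clean_pass_rate": 0.0}
    # Assumed definition: a clean pass succeeds without churn or forbidden edits.
    clean = [r for r in runs if r.passed and r.red_green_cycles == 0 and r.forbidden_edits == 0]
    return {
        "final_pass_rate": sum(r.passed for r in runs) / total,
        "protocol_compliance_rate": sum(r.protocol_compliant for r in runs) / total,
        "clean_pass_rate": len(clean) / total,
    }


runs = [
    RunRecord(passed=True, protocol_compliant=True, red_green_cycles=0, forbidden_edits=0),
    RunRecord(passed=True, protocol_compliant=True, red_green_cycles=2, forbidden_edits=0),
    RunRecord(passed=True, protocol_compliant=False, red_green_cycles=0, forbidden_edits=1),
    RunRecord(passed=False, protocol_compliant=True, red_green_cycles=1, forbidden_edits=0),
]
summary = summarize(runs)
```

In this made-up sample, three of four runs pass but only one passes cleanly, which is the kind of gap a single pass-rate number would hide.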

Public homepage:

https://netsky-lab.github.io/agent-experiment-os/

Current Status

Research prototype. The repo already contains:

MCP server for agent pre-work protocols;
Postgres + pgvector-backed knowledge retrieval;
source-backed wiki pages with dependsOn edges;
agent-facing agent_work_context.v1;
agent-facing agent_presentation_contract.v1 with must_load, dependsOn, decision rules, known failures, and evidence boundaries;
Codex matrix runners for Drizzle version traps and Python API drift;
run/event metrics, churn drill-downs, reports, and review workflows;
matrix comparison read models for protocol compliance vs execution quality;
strict adapter completion gates that require pre-work, dependency loading, verification, and final-answer recording before a gated run can complete;
issue-ingestion review boundaries that keep GitHub claims as evidence-only until local verification and human review;
Next.js frontend scaffold for the product dashboard.
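
The strict completion gates in the list above can be pictured as a checklist over recorded run events. The sketch below is an assumption-laden illustration, not the adapter's actual code: the step names in `REQUIRED_STEPS` and both helper functions are hypothetical.

```python
from typing import List

# Hypothetical event names for the four requirements named in the list above.
REQUIRED_STEPS = ("pre_work", "dependency_loading", "verification", "final_answer_recorded")


def missing_steps(recorded_events: List[str]) -> List[str]:
    """Return required steps that have not been recorded, in protocol order."""
    seen = set(recorded_events)
    return [step for step in REQUIRED_STEPS if step not in seen]


def can_complete(recorded_events: List[str]) -> bool:
    """A gated run may complete only when no required step is missing."""
    return not missing_steps(recorded_events)


partial_run = ["pre_work", "verification"]
blocked_on = missing_steps(partial_run)
```

Returning the missing steps in order, rather than a bare boolean, lets a caller tell the agent what to do next instead of only refusing completion.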

Research Corpus

Design Docs

Quickstart

Start the full local stack:

make up

This starts Postgres, runs migrations, seeds wiki/demo knowledge, starts the FastAPI backend, and starts the Next.js dashboard.

Open:

http://127.0.0.1:3000

On hosts where Docker publishes ports but HTTP requests hang, keep the stack running and start the fallback proxy:

NEXT_PUBLIC_EXPERIMENT_OS_API_URL=http://127.0.0.1:8091 docker compose up -d --force-recreate frontend
make dev-proxy

Then open:

http://127.0.0.1:3019

Run tests:

make test

Reset the seeded demo dataset:

make seed-reset

Build the production-style stack:

make up-prod

Run only the API manually:

make api

Open the legacy static product UI after the API starts:

http://127.0.0.1:8080/app/

Run the Next.js dashboard directly from the frontend package:

cd frontend
NEXT_PUBLIC_EXPERIMENT_OS_API_URL=http://127.0.0.1:8080 npm run dev

If the API URL is omitted, the dashboard uses a built-in research-preview dataset so the UI can still be reviewed without a running backend.

Run the MCP server:

make mcp

Run the API-drift matrix:

make api-drift-matrix

Local Development

Check the database connection from the Docker network:

docker compose run --rm app uv run experiment-os db check

Run a deterministic experiment fixture:

docker compose run --rm app uv run experiment-os experiments run-drizzle-fixture

Run a shell-command agent condition and capture transcript artifacts:

docker compose run --rm app uv run experiment-os experiments run-shell \
  --condition-id condition.001-drizzle-brief-assisted \
  --command "echo drizzle-orm@1.0.0-beta.22 && echo 'rg migration drizzle/migrations' && echo 'npm run db:generate passed'" \
  --workdir /workspace

Run a Codex CLI condition through codex exec:

docker compose run --rm app uv run experiment-os experiments run-codex \
  --condition-id condition.001-drizzle-brief-assisted \
  --prompt "Fix the Drizzle migration default-value issue with minimal changes." \
  --workdir /workspace \
  --sandbox workspace-write \
  --approval-policy never

Run Codex against the disposable Drizzle toy fixture:

docker compose run --rm app uv run experiment-os experiments run-codex-toy \
  --condition-id condition.001-drizzle-brief-assisted \
  --sandbox workspace-write \
  --approval-policy never

The fixture is copied from fixtures/drizzle-toy-repo into ignored artifacts/workdirs/... before execution, and the run writes transcript/report artifacts under artifacts/<run-id>/.
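
The copy-before-run step keeps the checked-in fixture pristine while the agent edits a throwaway copy. A minimal sketch of that idea follows; `prepare_workdir`, the `run_id` argument, and the directory layout under the workdirs root are hypothetical, since the layout below artifacts/workdirs/ is not spelled out here.

```python
import shutil
import tempfile
from pathlib import Path


def prepare_workdir(fixture_dir: Path, workdirs_root: Path, run_id: str) -> Path:
    """Copy a fixture into a fresh per-run directory and return the copy's path."""
    target = workdirs_root / run_id
    # copytree creates missing parents and raises FileExistsError if target exists,
    # so a stale workdir is never silently reused.
    shutil.copytree(fixture_dir, target)
    return target


# Offline demonstration with temporary directories instead of the real fixture.
with tempfile.TemporaryDirectory() as tmp:
    fixture = Path(tmp) / "fixture"
    fixture.mkdir()
    (fixture / "schema.ts").write_text("original\n")
    workdir = prepare_workdir(fixture, Path(tmp) / "workdirs", "run-demo")
    (workdir / "schema.ts").write_text("edited by agent\n")
    fixture_unchanged = (fixture / "schema.ts").read_text() == "original\n"
    copy_was_edited = (workdir / "schema.ts").read_text() == "edited by agent\n"
```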

Run baseline vs brief-assisted Codex conditions:

docker compose run --rm app uv run experiment-os experiments run-codex-toy-comparison

Run the version-trap fixture where issue evidence conflicts with local package versions:

docker compose run --rm app uv run experiment-os experiments run-codex-version-trap

Run baseline vs brief-assisted Codex conditions on the version-trap fixture:

docker compose run --rm app uv run experiment-os experiments run-codex-version-trap-comparison

Run Codex with Experiment OS mounted as an MCP server for the task:

docker compose run --rm app uv run experiment-os experiments run-codex-mcp-version-trap

Run a repeated baseline/static-brief/MCP-brief matrix:

docker compose run --rm app uv run experiment-os experiments run-codex-version-trap-matrix \
  --repeat-count 3 \
  --sandbox danger-full-access

Run the harder version-trap matrix with a stricter oracle:

docker compose run --rm app uv run experiment-os experiments run-codex-version-trap-hard-matrix \
  --repeat-count 3 \
  --sandbox danger-full-access

Progress events are written as JSONL to stderr. The final JSON includes the matrix summary and, by default, a markdown result artifact under experiments/001-drizzle-brief/results/.
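
Because progress events arrive as JSONL, a wrapper script can follow a long matrix run line by line. The parser below is a small sketch under stated assumptions: the event field names in `sample` are invented, and the real event schema may differ.

```python
import json
from typing import Dict, List


def parse_progress_events(stderr_text: str) -> List[Dict]:
    """Parse JSONL: one JSON object per non-empty line, skipping non-JSON noise."""
    events = []
    for line in stderr_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            value = json.loads(line)
        except json.JSONDecodeError:
            continue  # for example, a warning printed by another tool
        if isinstance(value, dict):
            events.append(value)
    return events


# Invented sample; the field names are assumptions, not the project's schema.
sample = '{"event": "run_started", "run": 1}\nnot json\n\n{"event": "run_finished", "run": 1}\n'
events = parse_progress_events(sample)
```

Skipping unparseable lines is a deliberate choice here, since stderr can also carry ordinary warnings from Docker or other tools.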

Run the Python API-drift matrix:

docker compose run --rm app uv run experiment-os experiments run-codex-api-drift-matrix \
  --repeat-count 3 \
  --sandbox danger-full-access \
  --approval-policy never

Run the nested API-drift matrix:

docker compose run --rm app uv run experiment-os experiments run-codex-api-drift-nested-matrix \
  --repeat-count 3 \
  --sandbox danger-full-access \
  --approval-policy never

Run the non-saturated API-drift matrix where issue evidence is intentionally misleading:

docker compose run --rm app uv run experiment-os experiments run-codex-api-drift-misleading-matrix \
  --repeat-count 3 \
  --sandbox danger-full-access \
  --approval-policy never

Register Experiment OS as a Codex MCP server:

codex mcp add experiment-os -- docker compose -f "$(pwd)/docker-compose.yml" run --rm app uv run experiment-os mcp serve

Smoke-check the MCP presentation contract:

docker compose run --rm app uv run experiment-os demo mcp-smoke

Run a model matrix by repeating --model:

docker compose run --rm app uv run experiment-os experiments run-codex-version-trap-hard-matrix \
  --model gpt-5.4-mini \
  --model gpt-5.4
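
Repeating --model multiplies the number of runs, so it helps to estimate the matrix size first. The sketch below assumes the matrix is a full cross product of models, conditions, and repeats, and that the three conditions are the baseline, static-brief, and MCP-brief conditions mentioned earlier; `expand_matrix` and the cell keys are hypothetical.

```python
from itertools import product
from typing import Dict, List, Sequence

# Assumed condition labels, taken from the baseline/static-brief/MCP-brief wording.
CONDITIONS = ("baseline", "static-brief", "mcp-brief")


def expand_matrix(
    models: Sequence[str],
    repeat_count: int,
    conditions: Sequence[str] = CONDITIONS,
) -> List[Dict[str, object]]:
    """Enumerate every (model, condition, repeat) cell of a full cross product."""
    return [
        {"model": model, "condition": condition, "repeat": repeat}
        for model, condition, repeat in product(models, conditions, range(1, repeat_count + 1))
    ]


cells = expand_matrix(["gpt-5.4-mini", "gpt-5.4"], repeat_count=3)
```

Under these assumptions, two models with three conditions and three repeats produce 18 agent runs.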

Search local knowledge with full-text + pgvector retrieval:

docker compose run --rm app uv run experiment-os knowledge search "drizzle migration default"
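
This README does not say how the full-text and pgvector rankings are merged. One common way to combine two ranked lists is reciprocal rank fusion, sketched below purely as an illustration; the page IDs are made up, and the project may use a different scoring scheme.

```python
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per item."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, page_id in enumerate(ranking, start=1):
            scores[page_id] = scores.get(page_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a deterministic order.
    return sorted(scores, key=lambda page_id: (-scores[page_id], page_id))


# Made-up rankings standing in for a full-text result and a vector result.
full_text_ranking = ["page.a", "page.b", "page.c"]
vector_ranking = ["page.b", "page.d", "page.a"]
fused = reciprocal_rank_fusion([full_text_ranking, vector_ranking])
```

Pages that appear in both lists rise to the top, which is the main reason to run both retrieval modes instead of picking one.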

Ingest GitHub issues as source snapshots and agent-readable source pages:

docker compose run --rm app uv run experiment-os issues ingest \
  --repo drizzle-team/drizzle-orm \
  --query "migration default" \
  --limit 3

Use a local GitHub-search JSON payload for reproducible issue-ingestion tests:

docker compose run --rm app uv run experiment-os issues ingest \
  --repo openai/openai-python \
  --query "responses migration" \
  --input-json research/issues/openai-python-responses-search.json

Run several issue-ingestion jobs from a config:

docker compose run --rm app uv run experiment-os issues batch \
  --config research/issues/issue-ingestion-batch.example.json

Refresh issue-derived knowledge from GitHub. If GITHUB_TOKEN is set, it is used for the API call:

docker compose run --rm app uv run experiment-os issues refresh \
  --repo openai/openai-python \
  --query "responses migration" \
  --limit 5

Check whether issue-derived version evidence matches the local project:

docker compose run --rm app uv run experiment-os issues version-alignment \
  --page-id claim.github-issue.drizzle-team.drizzle-orm.5661.versions \
  --local-version drizzle-orm=0.44.5
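
The idea behind version alignment is that an issue only counts as relevant evidence when it was reported against the version installed locally. A simplified sketch, assuming exact string matching and a hypothetical `issue_versions` mapping (the real command may use range semantics and a different data model):

```python
from typing import Dict, List, Tuple


def parse_local_version(spec: str) -> Tuple[str, str]:
    """Split a NAME=VERSION string such as 'drizzle-orm=0.44.5'."""
    name, sep, version = spec.partition("=")
    if not sep or not name.strip() or not version.strip():
        raise ValueError(f"expected NAME=VERSION, got {spec!r}")
    return name.strip(), version.strip()


def version_alignment(issue_versions: Dict[str, List[str]], local_spec: str) -> str:
    """Classify issue evidence as 'aligned', 'mismatch', or 'unknown'."""
    name, local_version = parse_local_version(local_spec)
    reported = issue_versions.get(name)
    if not reported:
        return "unknown"  # the issue names no version for this package
    return "aligned" if local_version in reported else "mismatch"


# Sample data for illustration only, not the content of a real issue.
issue_versions = {"drizzle-orm": ["1.0.0-beta.22"]}
result = version_alignment(issue_versions, "drizzle-orm=0.44.5")
```

A "mismatch" result is the version-trap case: the issue describes a different release, so its suggested fix should stay evidence-only.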

Prune local test pages from a dev database:

docker compose run --rm app uv run experiment-os db prune-test-pages

Run the MCP server over stdio:

docker compose run --rm app uv run experiment-os mcp serve

Run the MCP server over streamable HTTP:

docker compose run --rm app uv run experiment-os mcp serve --transport streamable-http

Run the dashboard/backend HTTP API:

docker compose run --rm --service-ports app uv run experiment-os api serve --host 0.0.0.0 --port 8080

Useful UI/read-model endpoints:

GET /experiments
GET /experiments/{experiment_id}/matrix
GET /experiments/{experiment_id}/matrix/compare?left_matrix_id=...&right_matrix_id=...
GET /experiments/{experiment_id}/matrix/regression?left_matrix_id=...&right_matrix_id=...
POST /experiments/{experiment_id}/status
GET /experiments/{experiment_id}/protocol-compliance
GET /experiments/{experiment_id}/churn?matrix_id=...
GET /runs/{run_id}
GET /runs/{run_id}/completion-contract
GET /runs/{run_id}/next-required-action
GET /runs/{run_id}/churn
GET /briefs/{brief_id}/agent-work-context
GET /briefs/{brief_id}/presentation-preview
GET /wiki/graph
GET /knowledge/stale
GET /knowledge/duplicates
GET /pages/{page_id}/provenance
POST /issue-knowledge/{page_id}/version-alignment
POST /issue-knowledge/ingest
GET /policy-candidates
GET /ui/contract
GET /ui/bootstrap

Write endpoints are open by default for local research. Set EXPERIMENT_OS_API_KEY to require x-api-key on mutations, and pass the same value to the dashboard as NEXT_PUBLIC_EXPERIMENT_OS_API_KEY for local protected demos.
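
When EXPERIMENT_OS_API_KEY is set, a client has to send the same value in the x-api-key header on mutating requests. The sketch below builds such a request with the standard library and does not send it; the experiment ID and the JSON body are hypothetical, since the request schemas are not documented here.

```python
import json
import os
import urllib.request
from typing import Optional


def build_mutation_request(
    base_url: str, path: str, payload: dict, api_key: Optional[str] = None
) -> urllib.request.Request:
    """Build a JSON POST request, attaching x-api-key only when a key is given."""
    headers = {"content-type": "application/json"}
    if api_key:
        headers["x-api-key"] = api_key
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


request = build_mutation_request(
    "http://127.0.0.1:8080",
    "/experiments/example-experiment/status",  # hypothetical experiment ID
    {"status": "reviewed"},                    # hypothetical body shape
    api_key=os.environ.get("EXPERIMENT_OS_API_KEY"),
)
# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
```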

On machines where Docker port publishing works normally, the database connection check can also run directly from the host:

uv run experiment-os db check

Code Layout

The v0 backend is split by responsibility:

src/experiment_os/db/ - SQLAlchemy ORM models.
src/experiment_os/domain/ - Pydantic input/output schemas.
src/experiment_os/repositories/ - database access.
src/experiment_os/retrieval/ - full-text + pgvector retrieval.
src/experiment_os/services/ - application use cases, split into contracts, matrix comparison, regression, provenance, issue ingestion, review, and dashboard read models.
src/experiment_os/mcp_server/ - MCP transport adapter.
src/experiment_os/cli.py - developer CLI only.

The MCP tools and CLI commands share the same service layer.
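
A shared service layer means each entry point only translates inputs and outputs, while the behavior lives in one place. The toy example below shows that shape; `KnowledgeService`, `cli_search`, and `mcp_search_tool` are hypothetical names, not classes or functions from src/experiment_os.

```python
import json
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class KnowledgeService:
    """Toy stand-in for a service: all search behavior lives here."""
    pages: Dict[str, str]

    def search(self, query: str) -> List[str]:
        terms = query.lower().split()
        return sorted(
            page_id
            for page_id, text in self.pages.items()
            if all(term in text.lower() for term in terms)
        )


def cli_search(service: KnowledgeService, argv: List[str]) -> str:
    """CLI-style adapter: joins arguments into a query, prints one ID per line."""
    return "\n".join(service.search(" ".join(argv)))


def mcp_search_tool(service: KnowledgeService, arguments: dict) -> str:
    """MCP-style adapter: takes a JSON-like arguments object, returns JSON text."""
    return json.dumps({"page_ids": service.search(arguments["query"])})


service = KnowledgeService(pages={
    "page.drizzle-defaults": "Drizzle migration default values",
    "page.python-drift": "Python SDK API drift",
})
cli_output = cli_search(service, ["drizzle", "migration"])
mcp_output = mcp_search_tool(service, {"query": "drizzle migration"})
```

Both adapters return the same page IDs in different envelopes, so a fix in the service reaches agents and developers at the same time.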

Experimental Fixtures

fixtures/drizzle-version-trap-repo - easy version trap, now mostly solved by current Codex.
fixtures/drizzle-version-trap-hard-repo - stricter Drizzle oracle with one correct schema edit.
fixtures/python-api-drift-repo - second-domain scaffold for Python SDK/API drift.
fixtures/python-api-drift-nested-repo - API-drift fixture where the correct adapter is behind a nested module.
fixtures/python-api-drift-hard-nested-repo - harder router fixture with dependency-upgrade bait.

Current R&D Signal

The first flat-vs-nested API-drift comparison saturated final pass rate, so the useful signal moved to execution quality:

gated Codex reached full protocol compliance but produced red-green churn;
gated OpenCode reached full protocol compliance with clean passes;
final pass rate, protocol compliance, and clean pass rate must be tracked separately.

See:

experiments/002-python-api-drift/results/2026-04-28-matrix-api-drift-ab282831e5a2.md
experiments/002-python-api-drift/results/2026-04-28-matrix-api-drift-nested-3bafb86522e5.md
experiments/002-python-api-drift/results/2026-04-28-flat-vs-nested-matrix-comparison.md
