vmbench is a formal VM benchmark and inspectable reasoning runtime for testing whether language models can follow machine-like execution semantics on synthetic tasks.
It gives you:
- a toy ISA and reference VM
- synthetic dataset generation for execution tasks
- bounded evaluation and stage gating
- an inspectable search runtime with verification and budget control
- SFT export and bounded training/eval helpers
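To make the first two bullets concrete, here is a rough sketch of the kind of record the benchmark traffics in. The field names and instruction syntax are illustrative assumptions, not the schema emitted by `dataset_pipeline.py`:

```python
# Hypothetical benchmark record; field names and ISA syntax are illustrative
# assumptions, not the actual schema from dataset_pipeline.py.
record = {
    "program": [                      # toy-ISA program the model must execute
        "LOAD r0, 3",
        "LOAD r1, 4",
        "ADD r2, r0, r1",
    ],
    "input_state": {"r0": 0, "r1": 0, "r2": 0},
    "expected_state": {"r0": 3, "r1": 4, "r2": 7},  # oracle from the reference VM
}
```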
This repo started as a research workbench around AI-native execution ideas.
The strongest reusable outcome is not a new programming language inside the model. It is a practical benchmark/runtime stack for asking a narrower question:
Can a model follow formal machine semantics on synthetic execution tasks?
The current public story is:
- formal VM benchmark
- verifier-guided search runtime
- MCP-first tool surface
Not:
- proof that a latent internal executor emerged in the weights
- a general-purpose reasoning system
- a productized wrapper around every research branch in the repo
Primary surface:
- MCP-first runtime and benchmark tool layer
Secondary surface:
- CLI
Optional surface:
- agent / IDE integrations
Stable v1
- MCP server
- CLI
- machine-readable outputs
- smoke-tested Docker/local flows
- release-facing docs
Experimental
- bounded training/eval helpers
- advanced compare/report workflows
- branch-specific research probes
Internal-only
- negative-result `next_2_*` branches
- exploratory reports and ad hoc research scaffolding
Key components:

- `reference_vm.py` - formal VM/parser/executor
- `dataset_pipeline.py` - main benchmark dataset generator
- `baseline_trainer.py` - local inference/eval harness
- `curriculum_gate.py` - stage gate evaluation
- `sft_export.py` - SFT export from benchmark datasets
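To illustrate how these pieces compose, here is a self-contained toy executor in the spirit of the reference VM. This is not `reference_vm.py`'s actual ISA or implementation, just a minimal sketch of the execute-and-compare loop:

```python
# Toy stand-in for the reference VM: deterministically execute a small
# register program to produce the oracle state. NOT reference_vm.py's code.
def toy_execute(program):
    regs = {}
    for line in program:
        op, rest = line.split(" ", 1)
        args = [a.strip() for a in rest.split(",")]
        if op == "LOAD":
            regs[args[0]] = int(args[1])                   # LOAD rX, imm
        elif op == "ADD":
            regs[args[0]] = regs[args[1]] + regs[args[2]]  # ADD rD, rA, rB
    return regs

# Benchmark scoring reduces to exact comparison against the oracle state.
oracle = toy_execute(["LOAD r0, 3", "LOAD r1, 4", "ADD r2, r0, r1"])
assert oracle == {"r0": 3, "r1": 4, "r2": 7}
```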
The runtime surface is meant to make reasoning visible rather than hidden. A user can:
- run a benchmark and get a compact machine-readable summary
- inspect the next-step choice with ranked candidates and verifier outcomes
- solve one record under a fixed node budget and see the path and explored-node count
- ask for a structured failure explanation
- compare `none`, `heuristic`, and `learned` policies on the same record
- gate a run and compare two runs stage-by-stage
The key difference from a normal black-box LLM answer is that the user can see:
- which candidates existed
- which one was verified
- how much budget was spent
- why a run failed
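A hedged sketch of what that visibility might look like in a solve result (every field name below is an assumption chosen to mirror the four bullets, not a documented schema):

```python
# Hypothetical budgeted-solve result; field names are illustrative assumptions.
result = {
    "status": "success",
    "payload": {
        "candidates": [                          # which candidates existed
            {"step": "ADD r2, r0, r1", "rank": 1, "verified": True},
            {"step": "ADD r2, r1, r1", "rank": 2, "verified": False},
        ],
        "chosen": "ADD r2, r0, r1",              # which one was verified
        "explored_nodes": 17,                    # how much budget was spent
        "node_budget": 64,
        "failure": None,                         # why a run failed, when it does
    },
    "error": None,
}
```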
Generate the core benchmark dataset:
```bash
python3 vmbench_cli.py generate
```

Run a local baseline against the generated dataset:

```bash
python3 vmbench_cli.py eval \
  --model llama3.2:latest \
  --host http://127.0.0.1:11434
```

Host contract:

- `vmbench_cli.py` runs on the host, so `http://127.0.0.1:11434` is the normal local default.
- The Dockerized MCP server automatically remaps `localhost` and `127.0.0.1` to `http://host.docker.internal:11434` for Ollama-backed baseline runs.
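A minimal sketch of that remapping behavior, assuming a URL-rewriting helper; the real logic lives inside the Dockerized MCP server and may differ:

```python
from urllib.parse import urlparse, urlunparse

# Hypothetical helper mirroring the remap described above: inside Docker,
# the host machine is reachable as host.docker.internal, not localhost.
def remap_ollama_host(host: str) -> str:
    parsed = urlparse(host)
    if parsed.hostname in ("localhost", "127.0.0.1"):
        port = parsed.port or 11434
        return urlunparse(parsed._replace(netloc=f"host.docker.internal:{port}"))
    return host

assert remap_ollama_host("http://127.0.0.1:11434") == "http://host.docker.internal:11434"
```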
Inspect the manifest, repo map, and available commands with structured JSON output:
```bash
python3 vmbench_cli.py status
```

Evaluate curriculum gates on a summary.json file:

```bash
python3 vmbench_cli.py gate \
  --summary reports/baseline/<run>/summary.json
```

Compare two summary.json files:

```bash
python3 vmbench_cli.py compare \
  --base-summary /path/to/base-summary.json \
  --candidate-summary /path/to/candidate-summary.json
```

Export benchmark data into prompt/completion SFT format:

```bash
python3 vmbench_cli.py export-sft
```

All CLI commands return structured JSON: status (success/error), payload, and error details when applicable.
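A short sketch of consuming that envelope from Python (the status/payload/error keys come from the contract above; everything else is an assumption):

```python
import json
import subprocess

# Run a CLI command and unwrap its structured JSON envelope.
result = subprocess.run(
    ["python3", "vmbench_cli.py", "status"],
    capture_output=True, text=True, check=True,
)
envelope = json.loads(result.stdout)
if envelope["status"] == "success":
    payload = envelope["payload"]   # command-specific payload
else:
    raise RuntimeError(envelope.get("error"))
```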
Available commands:

- `generate` - generate benchmark datasets
- `eval` - run local baseline evaluation
- `gate` - score a summary against curriculum gates
- `compare` - compare two summary files stage-by-stage
- `export-sft` - export SFT-ready prompt/completion datasets
- `repo-map` - print the current product map path
- `status` - show manifest, repo map, and command overview in machine-readable JSON
The MCP server is the primary public interface. It exposes both benchmark/tooling flows and runtime-inspection flows.
Core benchmark flows:
- `vmbench_manifest`
- `vmbench_generate_dataset`
- `vmbench_run_baseline`
- `vmbench_run_checkpoint_eval`
- `vmbench_gate_summary`
- `vmbench_export_sft`
- `vmbench_compare_reports`
- `vmbench_status`
Reasoning-runtime flows:
- `vmbench_choose_next_step`
- `vmbench_solve_with_budget`
- `vmbench_explain_failure`
- `vmbench_compare_policies`
- `vmbench_demo_reasoning_runtime`
- `vmbench_generate_demo_payload`
All tools return structured envelopes with status, payload, and error.
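As a rough illustration, a call to the budgeted solver over MCP would be a standard JSON-RPC 2.0 `tools/call` request; the argument names here are hypothetical, not the server's documented schema:

```python
# Hypothetical MCP tools/call request; argument names are illustrative
# assumptions. The result carries the status/payload/error envelope above.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "vmbench_solve_with_budget",
        "arguments": {
            "record_id": "mvp-000123",  # hypothetical record identifier
            "node_budget": 64,          # hypothetical budget parameter
        },
    },
}
```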
Key documents and directories:

- `REPO_PRODUCT_MAP.md` - product-oriented map of the repo
- `PRODUCTIZATION_PLAN.md` - productization work plan
- `VMBENCH_EXECUTION_PLAN.md` - active milestone execution plan with agent ownership
- `datasets/mvp` - main benchmark datasets
- `training` - bounded fine-tuning / eval tooling
- `reports` - research and validation outputs
This repo contains a usable benchmark/runtime core.
It also contains a large amount of experimental research history, especially around the failed `next_2_steps` unlock branch.
The current product direction is:
formal VM benchmark + dataset factory + eval toolkit + inspectable runtime
not:
proof of internal multi-step execution inside the model
- The multi-step `next_2_steps` branch is currently a negative result; gains depend on benchmark structure and ranking/verification design.
- Results are on synthetic execution tasks and the current ISA/dataset family; performance may not generalize to other task distributions or ISAs.
- The learned ranker may partly compress heuristic behavior rather than discover a materially different general policy.
- Shortcut structure, especially candidate-source information, still matters on harder cases.
- MCP exposes runtime behavior and tool use; it does not prove the model’s internal chain-of-thought or latent execution.
- Many reports in the repo are research artifacts; not every helper in the tree belongs to the public product surface.
Near-term direction:

- keep a clean product surface
- keep MCP and CLI contracts aligned
- strengthen failure analysis, harder benchmarks, and budget curves
- archive research-heavy branches out of the default user path