vmbench is a formal VM benchmark and inspectable reasoning runtime for testing whether language models can follow machine-like execution semantics on synthetic tasks.
It gives you:
- a toy ISA and reference VM
- synthetic dataset generation for execution tasks
- bounded evaluation and stage gating
- an inspectable search runtime with verification and budget control
- SFT export and bounded training/eval helpers
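To make the first two bullets concrete, here is a rough sketch of the kind of record the benchmark traffics in. The field names and instruction syntax are illustrative assumptions, not the schema emitted by `dataset_pipeline.py`:

```python
# Hypothetical benchmark record; field names and ISA syntax are illustrative
# assumptions, not the actual schema from dataset_pipeline.py.
record = {
    "program": [                      # toy-ISA program the model must execute
        "LOAD r0, 3",
        "LOAD r1, 4",
        "ADD r2, r0, r1",
    ],
    "input_state": {"r0": 0, "r1": 0, "r2": 0},
    "expected_state": {"r0": 3, "r1": 4, "r2": 7},  # oracle from the reference VM
}
```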
This repo started as a research workbench around AI-native execution ideas.
The strongest reusable outcome is not a new programming language inside the model. It is a practical benchmark/runtime stack for asking a narrower question:
Can a model follow formal machine semantics on synthetic execution tasks?
The current public story is:
- formal VM benchmark
- verifier-guided search runtime
- MCP-first tool surface
Not:
- proof that a latent internal executor emerged in the weights
- a general-purpose reasoning system
- a productized wrapper around every research branch in the repo
Primary surface:
- MCP-first runtime and benchmark tool layer
Secondary surface:
- CLI
Optional surface:
- agent / IDE integrations
Stable v1
- MCP server
- CLI
- machine-readable outputs
- smoke-tested Docker/local flows
- release-facing docs
Experimental
- bounded training/eval helpers
- advanced compare/report workflows
- branch-specific research probes
Internal-only
- negative-result `next_2_*` branches
- exploratory reports and ad hoc research scaffolding
Key components:

- `reference_vm.py` - formal VM/parser/executor
- `dataset_pipeline.py` - main benchmark dataset generator
- `baseline_trainer.py` - local inference/eval harness
- `curriculum_gate.py` - stage gate evaluation
- `sft_export.py` - SFT export from benchmark datasets
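To illustrate how these pieces compose, here is a self-contained toy executor in the spirit of the reference VM. This is not `reference_vm.py`'s actual ISA or implementation, just a minimal sketch of the execute-and-compare loop:

```python
# Toy stand-in for the reference VM: deterministically execute a small
# register program to produce the oracle state. NOT reference_vm.py's code.
def toy_execute(program):
    regs = {}
    for line in program:
        op, rest = line.split(" ", 1)
        args = [a.strip() for a in rest.split(",")]
        if op == "LOAD":
            regs[args[0]] = int(args[1])                   # LOAD rX, imm
        elif op == "ADD":
            regs[args[0]] = regs[args[1]] + regs[args[2]]  # ADD rD, rA, rB
    return regs

# Benchmark scoring reduces to exact comparison against the oracle state.
oracle = toy_execute(["LOAD r0, 3", "LOAD r1, 4", "ADD r2, r0, r1"])
assert oracle == {"r0": 3, "r1": 4, "r2": 7}
```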
The runtime surface is meant to make reasoning visible rather than hidden. A user can:
- run a benchmark and get a compact machine-readable summary
- inspect the next-step choice with ranked candidates and verifier outcomes
- solve one record under a fixed node budget and see the path and explored-node count
- ask for a structured failure explanation
- compare `none`, `heuristic`, and `learned` policies on the same record
- gate a run and compare two runs stage-by-stage
The key difference from a normal black-box LLM answer is that the user can see:
- which candidates existed
- which one was verified
- how much budget was spent
- why a run failed
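A hedged sketch of what that visibility might look like in a solve result (every field name below is an assumption chosen to mirror the four bullets, not a documented schema):

```python
# Hypothetical budgeted-solve result; field names are illustrative assumptions.
result = {
    "status": "success",
    "payload": {
        "candidates": [                          # which candidates existed
            {"step": "ADD r2, r0, r1", "rank": 1, "verified": True},
            {"step": "ADD r2, r1, r1", "rank": 2, "verified": False},
        ],
        "chosen": "ADD r2, r0, r1",              # which one was verified
        "explored_nodes": 17,                    # how much budget was spent
        "node_budget": 64,
        "failure": None,                         # why a run failed, when it does
    },
    "error": None,
}
```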
Generate the core benchmark dataset:
```bash
python3 vmbench_cli.py generate
```

Run a local baseline against the generated dataset:

```bash
python3 vmbench_cli.py eval \
  --model llama3.2:latest \
  --host http://127.0.0.1:11434
```

Host contract:

- `vmbench_cli.py` runs on the host, so `http://127.0.0.1:11434` is the normal local default.
- The Dockerized MCP server automatically remaps `localhost` and `127.0.0.1` to `http://host.docker.internal:11434` for Ollama-backed baseline runs.
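A minimal sketch of that remapping behavior, assuming a URL-rewriting helper; the real logic lives inside the Dockerized MCP server and may differ:

```python
from urllib.parse import urlparse, urlunparse

# Hypothetical helper mirroring the remap described above: inside Docker,
# the host machine is reachable as host.docker.internal, not localhost.
def remap_ollama_host(host: str) -> str:
    parsed = urlparse(host)
    if parsed.hostname in ("localhost", "127.0.0.1"):
        port = parsed.port or 11434
        return urlunparse(parsed._replace(netloc=f"host.docker.internal:{port}"))
    return host

assert remap_ollama_host("http://127.0.0.1:11434") == "http://host.docker.internal:11434"
```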
Inspect the manifest, repo map, and available commands with structured JSON output:
```bash
python3 vmbench_cli.py status
```

Evaluate curriculum gates on a summary.json file:

```bash
python3 vmbench_cli.py gate \
  --summary reports/baseline/<run>/summary.json
```

Compare two summary.json files:

```bash
python3 vmbench_cli.py compare \
  --base-summary /path/to/base-summary.json \
  --candidate-summary /path/to/candidate-summary.json
```

Export benchmark data into prompt/completion SFT format:

```bash
python3 vmbench_cli.py export-sft
```

All CLI commands return structured JSON: status (success/error), payload, and error details when applicable.
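A short sketch of consuming that envelope from Python (the status/payload/error keys come from the contract above; everything else is an assumption):

```python
import json
import subprocess

# Run a CLI command and unwrap its structured JSON envelope.
result = subprocess.run(
    ["python3", "vmbench_cli.py", "status"],
    capture_output=True, text=True, check=True,
)
envelope = json.loads(result.stdout)
if envelope["status"] == "success":
    payload = envelope["payload"]   # command-specific payload
else:
    raise RuntimeError(envelope.get("error"))
```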
Available commands:

- `generate` - generate benchmark datasets
- `eval` - run local baseline evaluation
- `gate` - score a summary against curriculum gates
- `compare` - compare two summary files stage-by-stage
- `export-sft` - export SFT-ready prompt/completion datasets
- `repo-map` - print the current product map path
- `status` - show manifest, repo map, and command overview in machine-readable JSON
The MCP server is the primary public interface. It exposes both benchmark/tooling flows and runtime-inspection flows.
Core benchmark flows:
- `vmbench_manifest`
- `vmbench_generate_dataset`
- `vmbench_run_baseline`
- `vmbench_run_checkpoint_eval`
- `vmbench_gate_summary`
- `vmbench_export_sft`
- `vmbench_compare_reports`
- `vmbench_status`
Reasoning-runtime flows:
- `vmbench_choose_next_step`
- `vmbench_solve_with_budget`
- `vmbench_explain_failure`
- `vmbench_compare_policies`
- `vmbench_demo_reasoning_runtime`
- `vmbench_generate_demo_payload`
All tools return structured envelopes with status, payload, and error.
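As a rough illustration, a call to the budgeted solver over MCP would be a standard JSON-RPC 2.0 `tools/call` request; the argument names here are hypothetical, not the server's documented schema:

```python
# Hypothetical MCP tools/call request; argument names are illustrative
# assumptions. The result carries the status/payload/error envelope above.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "vmbench_solve_with_budget",
        "arguments": {
            "record_id": "mvp-000123",  # hypothetical record identifier
            "node_budget": 64,          # hypothetical budget parameter
        },
    },
}
```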
Key documents and directories:

- `REPO_PRODUCT_MAP.md` - product-oriented map of the repo
- `PRODUCTIZATION_PLAN.md` - productization work plan
- `VMBENCH_EXECUTION_PLAN.md` - active milestone execution plan with agent ownership
- `datasets/mvp` - main benchmark datasets
- `training` - bounded fine-tuning / eval tooling
- `reports` - research and validation outputs
This repo contains a usable benchmark/runtime core.
It also contains a large amount of experimental research history, especially around the failed `next_2_steps` unlock branch.
The current product direction is:
formal VM benchmark + dataset factory + eval toolkit + inspectable runtime
not:
proof of internal multi-step execution inside the model
- The multi-step `next_2_steps` branch is currently a negative result; gains depend on benchmark structure and ranking/verification design.
- Results are on synthetic execution tasks and the current ISA/dataset family; performance may not generalize to other task distributions or ISAs.
- The learned ranker may partly compress heuristic behavior rather than discover a materially different general policy.
- Shortcut structure, especially candidate-source information, still matters on harder cases.
- MCP exposes runtime behavior and tool use; it does not prove the model’s internal chain-of-thought or latent execution.
- Many reports in the repo are research artifacts; not every helper in the tree belongs to the public product surface.
Near-term direction:

- keep a clean product surface
- keep MCP and CLI contracts aligned
- strengthen failure analysis, harder benchmarks, and budget curves
- archive research-heavy branches out of the default user path