FBA-Bench

View the Live Benchmark Leaderboard

Two Benchmark Modes

	Prompt Battery (`prompt`)	Agentic Simulation (`agentic`)
Tests	Raw model capability	Your full agent system
Memory/RAG	None	Bring your own
If it fails	Model's fault	System's fault
Typical runtime	Minutes	Hours to days
Typical calls	Dozens of prompts	180–365 decision steps
Use when	Comparing LLMs	Comparing architectures

The prompt battery is the cheap, fast gate; the agentic simulation is the high-fidelity benchmark. The live site supports both through the query parameters ?mode=prompt and ?mode=agentic. For why the agentic benchmark is slow, see docs/why_it_takes_hours.md.

What is FBA-Bench?

A business simulation benchmark for evaluating AI in complex e-commerce scenarios: inventory, pricing, competitors, and adversarial market events.

Unlike academic benchmarks that finish in minutes, the agentic simulation plays out consequences over time. Each decision affects the next day's state, so bad choices compound and good strategies only emerge across many ticks.
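
The compounding effect can be illustrated with a toy tick loop. This is a minimal sketch, not FBA-Bench's engine: `State`, `step`, `run`, and the linear demand rule are hypothetical names and assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class State:
    """Hypothetical world state carried from one simulated day to the next."""
    day: int
    cash: float
    inventory: int


def step(state: State, price: float) -> State:
    """Advance one day using an invented demand rule: lower prices sell more."""
    demand = max(0, int(20 - price))     # toy linear demand, not the real model
    sold = min(demand, state.inventory)  # cannot sell more than is in stock
    return State(
        day=state.day + 1,
        cash=state.cash + sold * price,
        inventory=state.inventory - sold,
    )


def run(policy: Callable[[State], float], days: int, start: State) -> State:
    """Ask the policy for one decision per tick; each tick sees the prior result."""
    state = start
    for _ in range(days):
        state = step(state, policy(state))
    return state
```

With 50 units in stock, a fixed price of 10 sells out after five days and earns nothing afterwards, while a fixed price of 15 is still selling on day seven. In the real benchmark the policy would be an LLM call per tick rather than a lambda.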

Key Features

Tick-Based Simulation: Each day is a separate LLM call with real feedback loops.
Double-Entry Ledger Subsystem: GAAP-style accounting primitives and an optional integrity check ("Panic Button") for hard-stop validation on math violations.
Red Team Gauntlet: Automated adversarial attacks (phishing, compliance traps) to test agent security.
Long-Term Memory (Per-Day Consolidation): Agents reflect nightly to promote/forget memories (prevents context saturation).
Competition Awareness Modes: Agents can be configured to be "aware" vs "unaware" of competition.
Agent-Based Consumer Modeling: Customers make utility-based purchase decisions, not simple demand curves.
Budget & Cost Constraints: Enforce token/cost budgets per tick/run and per tool.
Reproducibility Toolkit: Deterministic seeding + LLM response caching + golden-master regression checks.
Plugin Framework: Extend with scenario/agent/tool/metrics plugins.
Modular Agent Ecosystem: Supports CrewAI, LangChain, and custom frameworks via src/agent_runners/.
Rich Scenarios: Supply chain shocks, price wars, demand spikes, and compliance traps.
Observer-Mode Visualization: Godot Simulation Theater (cinematic camera, live feed, end-of-run recap) for recording runs.
API & Dashboard: FastAPI backend with WebSocket streaming.
Observability: ClearML, Prometheus, and OpenTelemetry integration.
Settings File: Configure everything in simulation_settings.yaml.
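
To make the ledger integrity check from the feature list concrete, here is a minimal sketch of a double-entry balance check that hard-stops on a violation. The names `Posting`, `LedgerImbalance`, and `check_balanced` are hypothetical and do not reflect the project's actual ledger API; the sketch only shows the underlying rule that total debits must equal total credits.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Posting:
    """One line of a transaction (hypothetical shape, for illustration)."""
    account: str
    debit: Decimal = Decimal("0")
    credit: Decimal = Decimal("0")


class LedgerImbalance(Exception):
    """Raised when a transaction's debits and credits differ."""


def check_balanced(postings: list[Posting]) -> None:
    """Raise LedgerImbalance unless total debits equal total credits."""
    total_debit = sum((p.debit for p in postings), Decimal("0"))
    total_credit = sum((p.credit for p in postings), Decimal("0"))
    if total_debit != total_credit:
        raise LedgerImbalance(
            f"debits {total_debit} != credits {total_credit}"
        )
```

Using `Decimal` rather than `float` avoids rounding drift, which matters when the check is meant to stop a run on any math violation.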

Quick Start

One-Click Local Demo (Docker)

docker compose -f docker-compose.oneclick.yml up -d --build

Open http://localhost:8080

API health (proxied): curl -sS http://localhost:8080/api/v1/health (in Windows PowerShell, use curl.exe so the Invoke-WebRequest alias is bypassed)
FastAPI docs (proxied): http://localhost:8080/docs
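
Because the containers can take a moment to come up, a script may want to poll the health endpoint before proceeding. The sketch below is one way to do that with the standard library. It assumes the endpoint answers HTTP 200 when healthy, which the text above does not state, and `fetch_status` and `wait_until_healthy` are hypothetical helper names.

```python
import time
import urllib.request
from typing import Callable


def fetch_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code of a GET request to `url`."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


def wait_until_healthy(
    url: str,
    attempts: int = 30,
    delay: float = 1.0,
    fetch: Callable[[str], int] = fetch_status,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `url` until it returns 200 (assumed healthy) or attempts run out."""
    for _ in range(attempts):
        try:
            if fetch(url) == 200:
                return True
        except OSError:
            # Connection refused and urllib's URLError/HTTPError are OSErrors.
            pass
        sleep(delay)
    return False
```

Example use: `wait_until_healthy("http://localhost:8080/api/v1/health")`. The `fetch` and `sleep` parameters exist only so the loop can be exercised without a running stack.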

Backend Only (Local, No Docker)

poetry install
poetry run uvicorn fba_bench_api.main:get_app --factory --reload --host 127.0.0.1 --port 8000

Swagger UI: http://localhost:8000/docs

Godot GUI (Local)

The GUI reads connection settings from environment variables:

FBA_BENCH_HTTP_BASE_URL (default: http://localhost:8080)
FBA_BENCH_WS_URL (default: derived from HTTP base, /ws/realtime)
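
The defaulting rule above can be sketched in Python as follows. The GUI itself is a Godot project, so this is an illustration of the described behavior and not the GUI's code: `resolve_urls` is a hypothetical helper, and the http-to-ws / https-to-wss scheme mapping is an assumption about what "derived from HTTP base" means.

```python
import os
from collections.abc import Mapping
from typing import Optional
from urllib.parse import urlsplit, urlunsplit

DEFAULT_HTTP_BASE = "http://localhost:8080"


def resolve_urls(env: Optional[Mapping[str, str]] = None) -> tuple[str, str]:
    """Return (http_base, ws_url) from environment-style settings."""
    env = os.environ if env is None else env
    http_base = env.get("FBA_BENCH_HTTP_BASE_URL", DEFAULT_HTTP_BASE).rstrip("/")
    ws_url = env.get("FBA_BENCH_WS_URL")
    if not ws_url:
        parts = urlsplit(http_base)
        # Assumed mapping: https -> wss, anything else -> ws.
        scheme = "wss" if parts.scheme == "https" else "ws"
        ws_url = urlunsplit(
            (scheme, parts.netloc, parts.path + "/ws/realtime", "", "")
        )
    return http_base, ws_url
```

With no variables set, this yields `http://localhost:8080` and `ws://localhost:8080/ws/realtime`; setting FBA_BENCH_WS_URL explicitly overrides the derived value.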

Option 1: Use the launcher (starts backend if needed):

poetry run python launch_godot_gui.py

If Godot is not on PATH, set GODOT_EXE to your Godot executable path.

Option 2: Connect the GUI to the one-click Docker stack (nginx on :8080):

docker compose -f docker-compose.oneclick.yml up -d --build
poetry run python launch_godot_gui.py --no-backend --port 8080

Tip: toggle "Cinematic Mode" (or press C) to hide UI, enable auto-camera, and show the end-of-run recap.

Development Setup

See DEV_SETUP.md for detailed instructions, including Makefile commands for linting (make lint), testing (make test-all), and local CI (make ci-local).

Project Structure

src/: Core packages (fba_bench_core/, fba_bench_api/, agents/, agent_runners/, benchmarking/, scenarios/, plugins/, fba_events/).
godot_gui/: Immersive Godot 4 GUI for simulation visualization, leaderboards, and sandbox experimentation.
tests/: Unit/integration tests with pytest markers.
config/ and configs/: YAML configurations and templates.
docs/: Architecture, API, and deployment guides.
scripts/: Utility scripts for experiments and validation.
alembic/: Database migrations.

Detailed Documentation

Architecture Overview: System design and module relationships.
API Reference: Endpoints, auth, and realtime WebSocket.
Testing Strategy: Guidelines for unit, integration, and performance tests.
Deployment Guide: Docker Compose setups for dev/prod.
Features Overview: Map of major systems and where they live.
Ledger System: Double-entry accounting primitives and integrity checks.
Red Team Gauntlet: Adversarial injection (phishing, compliance traps, manipulation).
Long-Term Memory & Modes: Per-day memory consolidation + competition awareness.
Agent-Based Consumer Modeling: Utility-based shoppers + visibility multipliers.
Simulation Services: WorldStore + market simulation + supply chain disruptions.
Market Dynamics: Competitors, reviews, ranking, and marketing/ads.
Agent Runners: Runner adapters, modes, and configuration entry points.
Services Catalog: Index of src/services/ modules.
Benchmarking System: Benchmark engine, configs, validators, and adapters.
Metrics Suite: Finance/ops/trust/stress/adversarial/cost scoring.
Budget Constraints: Token/cost budgets and tier configs.
Reproducibility Toolkit: Deterministic seeding, caches, and golden masters.
Audit & Replay: Where artifacts land, how to replay runs, and audit layers.
Plugin Framework: Extension points for scenarios, agents, tools, and metrics.
Contribution Guidelines: Coding standards and PR process.

Press / Recording

Promo Video Runbook: Record an observer-mode run (Godot + ffmpeg).
Social Post Copy: Ready-to-paste launch content.
Ad Creatives: Captions, voiceover script, and thumbnail text.
Outreach Tracker: Lightweight template to track provider outreach.

Contributing

We welcome contributions! Follow CONTRIBUTING.md for setup, coding standards (ruff, black, mypy), and Conventional Commits. Run make ci-local before submitting PRs.

Sponsorship / Compute Credits

Running long-horizon sims costs money (tokens + infra). If you want the leaderboard updated more often, want a specific model evaluated, or want to sponsor compute credits, see docs/sponsorship.md or email support@fbabench.com.

Get On The Leaderboard

Open a GitHub issue using the "Leaderboard Run Request" template, or see docs/leaderboard_submissions.md.

License

This project is source-available under a non-commercial license. You can use it for personal/research/educational purposes, but commercial use requires a separate license. See LICENSE for details. Commercial licensing: support@fbabench.com.

Support

Issues: GitHub Issues
Discussions: GitHub Discussions

