Enterprise organizations spend $5M-$50M+ annually on cloud compute, yet procurement decisions remain dominated by static reservation policies and manual spot-bidding rules. Industry analyses consistently show that 10-40% of cloud spend is wasted through over-provisioning, poorly timed purchases, and failure to exploit spot-market volatility. Closing this gap requires procurement agents capable of strategic reasoning: anticipating competitor behavior, forming purchasing coalitions, and adapting to market disruptions in real time. Nexus provides the training ground for developing these capabilities.
Training LLM agents for complex multi-party interactions requires environments that are simultaneously rich enough to produce emergent strategy and structured enough to yield measurable reward signals. Compute cluster resource allocation, where teams compete for GPU time, memory, and bandwidth under budget constraints and shifting demand, provides a natural testbed. Agents must reason about hidden information (opponents' private job queues), form and dissolve coalitions, negotiate prices in a double-auction market, and manage deadlines, all while being monitored by a supervisory agent tasked with detecting collusion and market manipulation.
Nexus is designed around three complementary tracks that map to open problems in LLM agent research:
- Multi-Agent Negotiation. Agents interact through structured actions (bids, offers, coalition proposals, free-text messages) and must develop theory-of-mind to predict opponents' behavior from partial observations.
- Fleet AI / Scalable Oversight. A privileged supervisor agent observes all actions and must detect anomalous behavior (collusion, hoarding, free-riding) using interpretable machine learning probes, forming a closed feedback loop with the agents it monitors.
- Multi-Actor Management (Halluminate). A CTO agent issues high-level directives to semi-autonomous worker agents that may misinterpret or partially execute instructions, requiring the CTO to learn each worker's reliability profile.
The environment's core technical contributions include:
- Monte Carlo Tree Search (MCTS). Negotiation planning under hidden information, with UCB1 balancing exploration of new tactics against exploitation of proven strategies.
- Mixture-of-Experts (MoE) Coalition Voting. Coalition members vote on resource splits using confidence-weighted per-option voting, with an elimination pre-pass acting as a veto round.
- CART Ensemble Probes. Real-time behavioral anomaly detection where the supervisor probes agent behavioral streams (trade patterns, messages) for collusion signals via interpretable feature importance.
- Market Dynamics. Commission, price impact, and portfolio accounting create realistic market physics, where large trades move prices, commissions discourage churn, and Sharpe ratio provides risk-adjusted evaluation.
- Closed-Loop Oversight Mechanism. Supervisor detects anomaly, issues structured feedback, agent's BehavioralPolicy adjusts strategy weights, behavior changes, and the supervisor observes the new behavior.
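The market-dynamics point can be made concrete with a toy linear-impact fill model. This is only a sketch: the function name, `depth` parameter, and coefficients are illustrative assumptions, not the actual `market.py` API.

```python
def execute_trade(price: float, quantity: float, depth: float,
                  commission_rate: float = 0.01) -> tuple[float, float]:
    """Fill a buy order with linear price impact plus commission (illustrative).

    depth models liquidity: the larger the trade relative to depth,
    the more the fill moves the quoted price.
    """
    impact = quantity / depth              # fractional price move caused by the trade
    new_price = price * (1 + impact)       # post-trade quote
    avg_fill = price * (1 + impact / 2)    # average price paid across the fill
    cost = avg_fill * quantity
    commission = commission_rate * cost    # per-trade fee that discourages churn
    return new_price, cost + commission
```

Under this model a 10-unit buy against a depth of 1000 moves the quote up 1%, so repeated churn is costly even before commissions.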
Trained agents exhibit theory-of-mind reasoning (anticipating competitor bids, forming strategic coalitions, and detecting deceptive signaling), behaviors that emerge from Group Relative Policy Optimization (GRPO) training rather than hand-coded rules. Nexus integrates with the OpenEnv 0.2.1 framework via a ProxyAgent HTTP bridge and supports GRPO fine-tuning of Qwen2.5-1.5B with Unsloth 4-bit quantization and LoRA.
Nexus is a turn-based simulation of a shared compute cluster where multiple LLM agents each manage a team's workload. Agents must negotiate, cooperate, compete, and form coalitions to allocate scarce resources (GPU, CPU, memory, bandwidth) across competing job queues, all under partial observability and monitored by an oversight agent.
This scenario exemplifies compute-allocation negotiations with scalable oversight and multi-actor management.
- Theory-of-mind: Agents must model opponents' hidden job queues and budgets to negotiate effectively.
- Emergent strategy: Coalition formation, bluffing, and reputation dynamics emerge naturally.
- Scalable oversight: The supervisor agent must detect collusion, resource hoarding, and inefficiency across N agents, a direct Fleet AI challenge.
- Multi-actor management: In Halluminate mode, a single "CTO agent" orchestrates multiple worker agents toward organizational goals.
ClusterState:
  resources:
    gpu_units: 100        # refreshed each round
    cpu_units: 200
    memory_gb: 512
    bandwidth_gbps: 50
  round: int              # current tick (1..max_rounds)
  max_rounds: 50
  market:                 # public auction board
    active_bids: []
    completed_trades: []
  global_events: []       # random disruptions (outages, surges)
Each agent sees:
- Own state (full): job queue, budget, reputation score, resource holdings
- Others (partial): only their public bids, reputation scores, and past trade history
- Cluster (partial): total remaining resources, but NOT individual allocations of others
AgentState:
  id: str
  team_name: str
  budget: float              # virtual currency
  reputation: float          # 0-100, affects negotiation leverage
  resource_holdings:
    gpu, cpu, memory, bandwidth    # currently held
  job_queue: List[Job]       # private, not visible to others
  completed_jobs: List[Job]
  score: float               # cumulative reward

Job:
  id: str
  description: str           # natural language task description
  resource_requirements:
    gpu, cpu, memory, bandwidth    # minimum needed
  deadline: int              # must complete by this round
  reward: float              # points earned on completion
  priority: "low" | "medium" | "high" | "critical"
  collaborative: bool        # if True, can be split across agents
  penalty_on_miss: float     # score deducted if deadline missed
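A minimal Python sketch of these schemas using stdlib dataclasses (the plan's `state.py` is slated to use pydantic for validation; field names and defaults follow the schemas above, with the defaults themselves being assumptions):

```python
from dataclasses import dataclass, field
from typing import List, Literal

@dataclass
class ResourceBundle:
    gpu: int = 0
    cpu: int = 0
    memory: int = 0      # GB
    bandwidth: int = 0   # Gbps

@dataclass
class Job:
    id: str
    description: str
    resource_requirements: ResourceBundle
    deadline: int                  # must complete by this round
    reward: float
    priority: Literal["low", "medium", "high", "critical"] = "medium"
    collaborative: bool = False    # if True, can be split across agents
    penalty_on_miss: float = 0.0

@dataclass
class AgentState:
    id: str
    team_name: str
    budget: float
    reputation: float = 50.0       # 0-100, affects negotiation leverage
    resource_holdings: ResourceBundle = field(default_factory=ResourceBundle)
    job_queue: List[Job] = field(default_factory=list)     # private
    completed_jobs: List[Job] = field(default_factory=list)
    score: float = 0.0
```

A pydantic version would add range validation (e.g. reputation clamped to 0-100) on top of the same field layout.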
Each round, agents submit a structured action (one of):
| Action | Description |
|---|---|
| `allocate(job_id)` | Assign held resources to run a job |
| `bid(resource_type, quantity, price)` | Place a public bid to buy resources |
| `offer(resource_type, quantity, price)` | Offer to sell resources |
| `accept_bid(bid_id)` | Accept another agent's bid |
| `propose_coalition(agent_ids, job_id)` | Propose splitting a collaborative job |
| `accept_coalition(proposal_id)` | Join a proposed coalition |
| `reject_coalition(proposal_id)` | Decline a coalition proposal |
| `send_message(agent_id, text)` | Free-text negotiation message |
| `pass` | Do nothing this round |
Agents can take up to 3 actions per round (encouraging prioritization).
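A sketch of how these structured actions might be parsed from LLM output. The real `actions.py` validator is richer; this regex version is illustrative and, for simplicity, does not handle commas inside `send_message` free text.

```python
import re
from dataclasses import dataclass
from typing import List

VALID_ACTIONS = {
    "allocate", "bid", "offer", "accept_bid", "propose_coalition",
    "accept_coalition", "reject_coalition", "send_message", "pass",
}

@dataclass
class Action:
    name: str
    args: List[str]

def parse_action(text: str) -> Action:
    """Parse one 'name(arg1, arg2, ...)' call; a bare 'pass' takes no parentheses."""
    text = text.strip()
    if text == "pass":
        return Action("pass", [])
    m = re.fullmatch(r"(\w+)\((.*)\)", text, flags=re.DOTALL)
    if not m or m.group(1) not in VALID_ACTIONS:
        raise ValueError(f"unparseable action: {text!r}")
    raw = m.group(2).strip()
    args = [a.strip() for a in raw.split(",")] if raw else []
    return Action(m.group(1), args)
```

Anything outside the whitelist raises, which keeps malformed LLM output from reaching the engine.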
1. EVENT PHASE → Random events (GPU outage, demand surge, bonus job drops)
2. OBSERVE PHASE → Each agent receives their observation (own state + public info)
3. NEGOTIATE PHASE → Agents exchange messages, bids, and coalition proposals (2 sub-rounds)
4. ACTION PHASE → Agents submit final actions (allocate, trade, etc.)
5. EXECUTE PHASE → Engine resolves all actions, updates state
6. SCORE PHASE → Completed jobs earn rewards, missed deadlines incur penalties
7. OVERSIGHT PHASE → Supervisor agent analyzes the round and flags anomalies
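The phase ordering above can be sketched as a skeleton loop. The stub agents, the event probability, and the one-line scoring are placeholders; the real `engine.py` resolves trades, deadlines, and coalitions in the EXECUTE and SCORE phases.

```python
import random
from dataclasses import dataclass, field

@dataclass
class StubAgent:
    id: str
    score: float = 0.0
    def negotiate(self, obs):       # NEGOTIATE-phase hook (no-op stub)
        return None
    def act(self, obs):             # ACTION-phase hook (always passes)
        return ("pass", [])

@dataclass
class Cluster:
    round: int = 1
    max_rounds: int = 50
    events: list = field(default_factory=list)

def run_round(cluster, agents, supervisor_review, rng):
    """One simulation round in the 7-phase order described above."""
    if rng.random() < 0.1:                                  # 1. EVENT
        cluster.events.append("gpu_outage")
    obs = {a.id: {"round": cluster.round} for a in agents}  # 2. OBSERVE
    for _ in range(2):                                      # 3. NEGOTIATE (2 sub-rounds)
        for a in agents:
            a.negotiate(obs[a.id])
    actions = {a.id: a.act(obs[a.id]) for a in agents}      # 4. ACTION
    for a in agents:                                        # 5-6. EXECUTE + SCORE (stubbed)
        if actions[a.id][0] != "pass":
            a.score += 1.0
    supervisor_review(actions)                              # 7. OVERSIGHT
    cluster.round += 1
```

Passing the supervisor in as a callable keeps the oversight hook decoupled from the engine, so the same loop runs with or without the Fleet AI track enabled.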
Observations are rendered as structured natural language:
=== ROUND 5 of 50 ===
CLUSTER STATUS:
Available: 45 GPU | 120 CPU | 256 GB RAM | 30 Gbps BW
Event: "GPU cluster sector B offline, 20 GPU units unavailable this round"
YOUR STATE (Team Alpha):
Budget: $1,240 | Reputation: 78/100 | Score: 450
Holdings: 12 GPU | 30 CPU | 64 GB RAM | 10 Gbps BW
Jobs:
[J-12] "Train recommendation model" needs 20 GPU, 40 CPU, 128 GB — deadline round 8 — reward $500 — COLLABORATIVE
[J-15] "Run batch inference" needs 5 GPU, 10 CPU, 32 GB — deadline round 6 — reward $150
MARKET:
BID #201 by Team Beta: wants 10 GPU @ $80 each
OFFER #202 by Team Gamma: selling 20 CPU @ $15 each
MESSAGES:
Team Beta → You: "Want to split J-12? I have 15 GPU available."
REPUTATION BOARD:
Team Alpha: 78 | Team Beta: 85 | Team Gamma: 62 | Team Delta: 71
A dedicated Supervisor Agent runs in parallel, receiving a privileged view:
- Full visibility into all agent actions, messages, and resource flows
- Aggregated statistics (Gini coefficient of resources, trade volume, coalition patterns)
- Historical behavior patterns per agent
| Action | Description |
|---|---|
| `flag(agent_id, reason)` | Flag suspicious behavior for review |
| `explain(agent_id, summary)` | Generate natural language explanation of an agent's strategy |
| `alert(description)` | Raise a system-wide alert about market manipulation |
| `recommend(action)` | Suggest policy changes (price floors, resource caps) |
| `report()` | Produce an end-of-round oversight summary |
- Collusion: Two agents repeatedly trading at off-market prices
- Resource hoarding: Agent holding resources they can't use before deadline
- Free-riding: Agent in coalition not contributing fair share
- Market manipulation: Artificially inflating/deflating resource prices
- Deadline gaming: Deliberately missing low-penalty jobs to focus on high-reward ones at others' expense
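As a stand-in for the CART probes, here is a single hand-computed collusion feature, the fraction of a pair's trades executed off-market. It illustrates the kind of signal the supervisor would split on; the tolerance and threshold values are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Trade = Tuple[str, str, float, float]   # (buyer, seller, trade_price, market_price)

def off_market_fraction(trades: List[Trade], tol: float = 0.2) -> Dict[tuple, float]:
    """Per agent pair: fraction of trades priced more than tol away from market.
    This is one feature a CART probe could split on to surface collusion."""
    off, total = Counter(), Counter()
    for buyer, seller, price, market in trades:
        pair = tuple(sorted((buyer, seller)))
        total[pair] += 1
        if market > 0 and abs(price / market - 1) > tol:
            off[pair] += 1
    return {pair: off[pair] / total[pair] for pair in total}

def flag_pairs(trades: List[Trade], threshold: float = 0.5) -> List[tuple]:
    """Flag pairs whose off-market fraction exceeds a (hypothetical) threshold."""
    return [p for p, s in off_market_fraction(trades).items() if s >= threshold]
```

A real CART ensemble would combine many such features (trade timing, message frequency, coalition overlap) and expose feature importances for interpretability.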
In this variant, a single CTO Agent manages 3-4 Worker Agents:
CTO Agent (the LLM being trained)
├── Worker Alpha (semi-autonomous agent)
├── Worker Beta (semi-autonomous agent)
└── Worker Gamma (semi-autonomous agent)
- The CTO sees the full team state but must issue high-level directives (not direct actions)
- Workers have their own simple policies and may misinterpret, delay, or partially execute directives
- The CTO must discover each worker's strengths/biases and adapt its management style
- Other teams in the cluster are NPC agents with scripted strategies
direct(worker_id, instruction) → "Alpha, bid for 10 GPUs under $90 each"
reassign(worker_id, job_id) → Move a worker to a different job
set_priority(worker_id, policy) → "Focus on high-reward jobs, ignore low-priority"
query(worker_id, question) → "Beta, what's your current resource utilization?"
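The worker misinterpretation model can be sketched with a per-worker reliability profile. The field names and probabilities here are hypothetical illustrations, not the `multi_actor.py` API.

```python
import random
from dataclasses import dataclass

@dataclass
class WorkerProfile:
    """Per-worker reliability knobs the CTO must discover (hypothetical fields)."""
    follow_prob: float = 0.8   # chance the directive is executed as given
    delay_prob: float = 0.1    # chance it is deferred to the next round
    # any remaining probability mass: partial execution (half the quantity)

def execute_directive(profile: WorkerProfile, quantity: int,
                      rng: random.Random) -> int:
    """Return the quantity actually acted on: full, zero (delayed), or partial."""
    r = rng.random()
    if r < profile.follow_prob:
        return quantity
    if r < profile.follow_prob + profile.delay_prob:
        return 0
    return quantity // 2
```

Because outcomes are stochastic per worker, the CTO can only learn each profile by issuing directives and comparing requested versus realized quantities over many rounds.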
Agent reward:

reward = (
    + job_completion_rewards     # primary: complete jobs
    - deadline_miss_penalties    # penalty: miss deadlines
    + 0.1 * reputation_delta     # small bonus for reputation growth
    - 0.05 * idle_resource_cost  # holding unused resources costs money
    + coalition_bonus            # bonus for successful coalition jobs
)

Supervisor reward:

reward = (
    + correct_flag_bonus         # correctly identified bad behavior
    - false_flag_penalty         # flagged an innocent agent
    + explanation_quality_score  # rated by held-out evaluator
    + anomaly_detection_recall   # caught X% of injected anomalies
)

CTO reward:

reward = (
    + team_total_score           # overall team performance
    - directive_overhead         # penalty for excessive micromanagement
    + worker_utilization_bonus   # workers stayed busy and effective
    - miscommunication_penalty   # directives that were misunderstood
)

nexus/
├── PLAN.md # This file (in repo root)
├── pyproject.toml # Package config
├── nexus/
│ ├── __init__.py
│ ├── state.py # Resource, Job, AgentState, ClusterState, MarketState
│ ├── config.py # NexusConfig, presets (tiny/standard/oversight/multi_actor)
│ ├── actions.py # ActionType enum, Action, parse_llm_output, validate
│ ├── observations.py # render(state, agent_id) -> str, render_supervisor
│ ├── market.py # ResourceMarket: bid/offer matching, impact, commission
│ ├── coalitions.py # CoalitionManager: MoE voting on splits
│ ├── events.py # EventGenerator: disruptions, job spawns
│ ├── rewards.py # RewardComputer: weighted multi-signal + Sharpe ratio
│ ├── engine.py # SimulationEngine: 7-phase round loop
│ ├── oversight.py # SupervisorInterface, BehaviorProbe (CART ensemble)
│ ├── multi_actor.py # CTOInterface, WorkerAgent, DirectiveParser
│ ├── journal.py # DualFormatJournal: JSONL + Markdown
│ └── persistence.py # SimulationState save/load
├── agents/
│ ├── base.py # BaseAgent ABC, ExperienceReplayBuffer, BehavioralPolicy
│ ├── random_agent.py # Random baseline
│ ├── greedy_agent.py # Greedy heuristic (priority-based job allocation)
│ ├── llm_agent.py # Anthropic tool-use agent (Claude API)
│ ├── strategic_agent.py # MCTS/UCB1 negotiation planning
│ ├── supervisor_agent.py # Oversight: CART probes + collusion detection
│ └── cto_agent.py # Multi-actor CTO (directive-based management)
├── scripts/
│ ├── run_simulation.py # Typer CLI: run, evaluate
│ ├── evaluate.py # Metric computation + comparison
│ └── visualize.py # Rich terminal replay
└── tests/
├── test_engine.py
├── test_market.py
├── test_coalitions.py
├── test_observations.py
└── test_oversight.py
- Python 3.11+
- `pydantic` — state and action validation
- `gymnasium` — standard RL environment interface
- `anthropic` / `openai` — LLM agent backends (optional)
- `rich` — terminal visualization
- `pytest` — testing
| Preset | Agents | Rounds | Resources | Use Case |
|---|---|---|---|---|
| `tiny` | 2 | 10 | Low | Unit testing, debugging |
| `standard` | 4 | 50 | Medium | Normal training runs |
| `large` | 8 | 100 | High | Emergent strategy research |
| `oversight` | 4 + supervisor | 50 | Medium | Fleet AI track |
| `multi_actor` | 1 CTO + 3 workers vs NPCs | 50 | Medium | Halluminate track |
- Run simulations with scripted/heuristic agents to generate trajectories
- Extract (observation, action, reward) tuples
- Fine-tune LLMs on high-reward trajectories (rejection sampling / DPO)
- Self-play: pit fine-tuned agents against each other, repeat
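The rejection-sampling step above can be sketched as ranking trajectories by return and keeping a top fraction as fine-tuning data. The helper name and `keep` fraction are illustrative assumptions, not a planned module.

```python
from typing import List, Tuple

Step = Tuple[str, str, float]   # (observation, action, reward)

def top_fraction(trajectories: List[List[Step]],
                 keep: float = 0.25) -> List[List[Step]]:
    """Rejection sampling: rank trajectories by total return and keep the
    top fraction as supervised fine-tuning data."""
    ranked = sorted(trajectories, key=lambda t: sum(r for _, _, r in t),
                    reverse=True)
    k = max(1, int(len(ranked) * keep))
    return ranked[:k]
```

For DPO, the same ranking would instead be used to form (preferred, rejected) trajectory pairs rather than discarding the losers.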
- Provide environment rules in system prompt
- Feed observations as user messages, collect actions as assistant responses
- Use reward signals to update prompting strategies
- Evaluate theory-of-mind via held-out negotiation scenarios
| Metric | What It Measures |
|---|---|
| Job Completion Rate | % of jobs completed before deadline |
| Negotiation Efficiency | Average surplus captured in trades |
| Coalition Success Rate | % of proposed coalitions that succeed |
| Theory-of-Mind Score | Accuracy of predicting other agents' next actions |
| Oversight Precision/Recall | Supervisor's ability to detect anomalies |
| Directive Effectiveness | CTO directive → worker outcome alignment |
| Social Welfare | Total score across all agents (cooperation measure) |
| Gini Coefficient | Resource distribution fairness |
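The Gini coefficient in the table can be computed directly from per-agent holdings with the standard closed form; a minimal sketch:

```python
def gini(values) -> float:
    """Gini coefficient of a resource distribution.
    0 = perfectly equal; approaches 1 as holdings concentrate in one agent."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Closed form over sorted values: G = sum_i (2i - n - 1) * x_i / (n * total)
    return sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1)) / (n * total)
```

For example, four agents holding (0, 0, 0, 10) GPUs score 0.75, while an even (1, 1, 1, 1) split scores 0.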
- State dataclasses (`state.py`)
- Action parsing and validation (`actions.py`)
- Round loop and execution engine (`engine.py`)
- Job generation and deadline tracking
- Basic observation rendering (`observations.py`)
- Reward computation (`rewards.py`)
- Random + greedy baseline agents
- Bid/offer matching engine (`market.py`)
- Free-text message passing between agents
- Coalition proposal/acceptance flow (`coalitions.py`)
- Random events (outages, surges) (`events.py`)
- LLM agent with structured action output (`llm_agent.py`)
- Observation → prompt template
- Action parsing from LLM output
- Simulation runner with LLM agents (`run_simulation.py`)
- Supervisor privileged observation view
- Anomaly injection (scripted collusion, hoarding scenarios)
- Supervisor action interface and reward
- Oversight evaluation metrics
- CTO directive interface
- Worker agent with noise/misinterpretation model
- NPC opponent teams
- CTO reward function
- Gymnasium env wrapper
- Terminal visualization with `rich`
- Evaluation suite and leaderboard
- Documentation and examples
| Feature | Why It Stands Out |
|---|---|
| Natural language observations | LLM-native — no tensor encoding needed |
| Structured action space | Parseable but expressive — no free-form action ambiguity |
| Partial observability built-in | Forces theory-of-mind, not just pattern matching |
| Unified env for all modes | One codebase covers multi-agent negotiation, oversight, and multi-actor management |
| Scalable agent count | 2 to 8+ agents with consistent mechanics |
| Real-world analog | Compute allocation is a genuine industry problem |
These 5 combinations define the core technical contributions:
- MCTS for Negotiation Strategy: MCTS tree search plans negotiation moves under hidden information, with UCB1 balancing exploring new tactics vs exploiting proven ones.
- MoE Coalition Voting: Coalition members vote on resource splits using confidence-weighted per-option voting. Each member is an "expert"; the elimination pre-pass acts as a veto round.
- CART Behavioral Probes for Oversight: Instead of probing residual streams for backdoor signals, the supervisor probes agent behavioral streams (trade patterns, messages) for collusion signals via a CART ensemble with interpretable feature importance.
- Market Physics from Portfolio Simulation: Commission, price impact, and portfolio accounting create realistic market dynamics: large trades move prices, commissions discourage churn, and the Sharpe ratio provides risk-adjusted evaluation.
- Feedback-Driven Oversight Loop: Supervisor detects an anomaly → issues structured feedback → agent's BehavioralPolicy adjusts strategy weights → behavior changes → supervisor observes the new behavior. A closed oversight loop.
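The UCB1 rule referenced in the MCTS contribution, in minimal form. The `strategic_agent.py` planner wraps this in a full search tree; `select_move` and its stats layout are illustrative.

```python
import math

def ucb1(total_reward: float, visits: int, parent_visits: int,
         c: float = math.sqrt(2)) -> float:
    """UCB1 score: mean reward (exploitation) plus an exploration bonus
    that shrinks as a move is tried more often."""
    if visits == 0:
        return float("inf")      # always try an untested tactic first
    return total_reward / visits + c * math.sqrt(math.log(parent_visits) / visits)

def select_move(stats: dict, parent_visits: int) -> str:
    """Pick the move maximizing UCB1; stats maps move -> (total_reward, visits)."""
    return max(stats, key=lambda m: ucb1(stats[m][0], stats[m][1], parent_visits))
```

An untried tactic always wins selection once; after that, a lowball offer that keeps failing loses out to a fair offer with a higher mean payoff.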
| Category | Existing Tools | What They Miss |
|---|---|---|
| Cloud Cost Optimization | Spot by NetApp, nOps | Algorithmic optimization only, no multi-agent negotiation, no adversarial reasoning |
| Procurement AI | Pactum AI, Fairmarkit, Coupa | Agent-to-human negotiation or process automation, no agent-to-agent competition |
| Multi-Agent RL Environments | PettingZoo, OpenSpiel, Melting Pot | Classical RL agents in abstract domains, no LLM-native design, no market physics |
| LLM Multi-Agent Benchmarks | MultiAgentBench / MARBLE | Evaluation only, no training environment, no resource negotiation mechanics |
| Agent Observability | AgentOps, Arize Phoenix, LangSmith | Passive logging, no active behavioral classification or strategy detection |
- Only framework combining multi-agent LLM training, resource negotiation, behavioral oversight, and GRPO fine-tuning in a single environment.
- MCTS for negotiation planning: no existing tool or paper applies Monte Carlo Tree Search to multi-agent resource negotiation with hidden information.
- Active behavioral oversight: CART ensemble probes classify agent strategies (collusion, hoarding, free-riding) in real time, going beyond passive observability.
- Emergent theory-of-mind: trained agents develop bid anticipation, coalition strategy, and counter-manipulation without hand-coded rules, in line with current ToM research.
- Direct simulation-to-reality mapping: GPU/CPU resources map to EC2 instance types, job deadlines to SLA workloads, and market mechanics to spot pricing dynamics.
- Simulation-to-reality gap: real cloud markets have API rate limits, multi-region pricing, and contractual constraints not yet modeled.
- Well-funded incumbents: Spot by NetApp and Pactum AI have production deployments and enterprise sales teams, and could add agent-based negotiation to existing platforms.
- Scale validation needed: current evaluation covers 2–4 agents over 10–50 rounds; behavior at 8+ agents or hundreds of rounds is untested.
The technical details, evaluation, and theoretical foundations are described in the accompanying IEEE-format paper.
If you use Nexus in your research, please cite:
@inproceedings{mauer2026nexus,
title={Nexus: A Multi-Agent Negotiation Environment for Training Procurement AI with Theory-of-Mind Reasoning},
author={Mauer, Nate},
year={2026}
}