CombatRLGymEnv is a single-agent Gymnasium wrapper around the deterministic
simulator. In the default P7 setup, one controlled agent acts alongside one
scripted teammate against two scripted opponents. Gymnasium does not own game
rules: movement, combat, cooldowns, deaths, terminal state, events, and
invariants remain in SimulationEngine.

Default config: `configs/env/gym_2v2_controlled_ranged.yaml`

- Controlled agent: `team0_ranged_dps_0`
- Teammate policy: `protector`
- Opponent policies: `aggressive`, `random`
- Simulation config: `configs/env/mvp_2v2_elimination.yaml`
- `render_mode=None` by default
- Replay capture disabled by default


Optional explicit policy assignment:

    scripted_policy_by_agent_id:
      team0_tank_0: protector
      team1_tank_0: aggressive
      team1_ranged_dps_0: random

When this mapping is present, every non-controlled agent must be assigned a
known scripted policy, and the controlled agent must be omitted.

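
The mapping rule above can be sketched as a small validator. This is an illustrative sketch, not the project's actual loader: `validate_policy_map` and `KNOWN_POLICIES` are hypothetical names, and the set of known policies is assumed to be only the three names that appear in this README.

```python
# Hypothetical helper: the real config loader may validate differently.
# Assumed policy names, taken only from the examples in this README.
KNOWN_POLICIES = frozenset({"protector", "aggressive", "random"})


def validate_policy_map(policy_by_agent, all_agent_ids, controlled_agent_id,
                        known_policies=KNOWN_POLICIES):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []

    # The controlled agent is driven by the RL policy, so it must be omitted.
    if controlled_agent_id in policy_by_agent:
        errors.append(f"controlled agent {controlled_agent_id} must be omitted")

    # Every other agent needs an explicit assignment.
    expected = set(all_agent_ids) - {controlled_agent_id}
    for agent_id in sorted(expected - set(policy_by_agent)):
        errors.append(f"agent {agent_id} has no scripted policy")

    # Reject agents that do not exist in the match.
    for agent_id in sorted(set(policy_by_agent) - set(all_agent_ids)):
        errors.append(f"unknown agent {agent_id}")

    # Reject policy names that are not registered.
    for agent_id, policy in sorted(policy_by_agent.items()):
        if policy not in known_policies:
            errors.append(f"agent {agent_id} uses unknown policy {policy}")

    return errors
```

Returning a list of problems instead of raising on the first one lets a config check report every mistake in a single pass.
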

Optional profile assignment:

    teammate_profile_id: protective
    opponent_profile_ids:
      - aggressive
      - kiter
    profile_by_agent_id:
      team0_tank_0: protective
    controlled_profile_id: defensive
    rerank_controlled_action_with_profile: false

Scripted teammate and opponent policies can be wrapped with profiles. The
controlled RL action is not reranked unless
`rerank_controlled_action_with_profile` is explicitly enabled. P8 does not
change observation shape or train behavior-conditioned policies.


    from combatrl.envs import CombatRLGymEnv

    # Build the env from the default config and start a seeded episode.
    env = CombatRLGymEnv("configs/env/gym_2v2_controlled_ranged.yaml")
    observation, info = env.reset(seed=42)
    # Action 0 is NO_OP.
    observation, reward, terminated, truncated, info = env.step(0)
    env.close()

Spaces:

- `observation_space = Box(low=-1.0, high=1.0, shape=(49,), dtype=np.float32)`
- `action_space = Discrete(10)`

Termination and truncation:

- Elimination win/loss: `terminated=True`, `truncated=False`
- Max ticks: `terminated=False`, `truncated=True`
- Invariant failure: `terminated=False`, `truncated=True`, with `info["error"]`

Step info includes match identity, controlled team, ally/enemy IDs, alive
counts, terminal reason, winner, reward breakdown, invalid-action flag, next
action mask, and event count.
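
The termination rules above can be consumed with a plain rollout loop. The sketch below relies only on the standard Gymnasium `reset`/`step` signatures and on the `info["error"]` key mentioned above; `run_episode`, `choose_action`, and the outcome labels are hypothetical names chosen for illustration.

```python
def run_episode(env, choose_action, seed=None, max_steps=10_000):
    """Roll out one episode and classify how it ended.

    `env` is any object following the Gymnasium reset/step API.
    `choose_action(observation, info)` returns an action ID.
    """
    observation, info = env.reset(seed=seed)
    total_reward = 0.0
    steps = 0
    terminated = truncated = False

    # Stop on either flag; max_steps is only a safety net for this sketch.
    while not (terminated or truncated) and steps < max_steps:
        action = choose_action(observation, info)
        observation, reward, terminated, truncated, info = env.step(action)
        total_reward += float(reward)
        steps += 1

    if terminated:
        outcome = "terminated"  # elimination win or loss
    elif truncated and "error" in info:
        outcome = "truncated_with_error"  # invariant failure
    elif truncated:
        outcome = "truncated"  # max ticks reached
    else:
        outcome = "step_limit"  # this sketch's own safety limit was hit

    return {"total_reward": total_reward, "steps": steps, "outcome": outcome}
```

With the real environment, a call such as `run_episode(env, lambda obs, info: 0, seed=42)` would play an all-`NO_OP` episode.
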
| ID | Action |
|---|---|
| 0 | NO_OP |
| 1 | MOVE_UP |
| 2 | MOVE_DOWN |
| 3 | MOVE_LEFT |
| 4 | MOVE_RIGHT |
| 5 | MOVE_UP_LEFT |
| 6 | MOVE_UP_RIGHT |
| 7 | MOVE_DOWN_LEFT |
| 8 | MOVE_DOWN_RIGHT |
| 9 | ATTACK_NEAREST |
Invalid action IDs and mask-invalid actions fall back to NO_OP, set
info["invalid_action"] = True, and receive the invalid-action reward penalty.
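
The fallback rule can be sketched as follows. The action IDs mirror the table above; `Action` and `resolve_action` are hypothetical names, and the mask is assumed to be a length-10 sequence in which a truthy entry means the action is allowed. The reward penalty itself is not modeled here.

```python
from enum import IntEnum
from typing import Sequence


class Action(IntEnum):
    """Action IDs as listed in the table above."""
    NO_OP = 0
    MOVE_UP = 1
    MOVE_DOWN = 2
    MOVE_LEFT = 3
    MOVE_RIGHT = 4
    MOVE_UP_LEFT = 5
    MOVE_UP_RIGHT = 6
    MOVE_DOWN_LEFT = 7
    MOVE_DOWN_RIGHT = 8
    ATTACK_NEAREST = 9


def resolve_action(action_id: int, action_mask: Sequence[bool]) -> tuple[Action, bool]:
    """Return (effective_action, invalid_flag) for a requested action ID."""
    in_range = 0 <= action_id < len(Action)
    # Only consult the mask when the ID is a real action.
    allowed = in_range and bool(action_mask[action_id])
    if not allowed:
        # Out-of-range IDs and masked-out actions both degrade to NO_OP.
        return Action.NO_OP, True
    return Action(action_id), False
```
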
The observation vector has 49 named features:
- Self, 10: HP, normalized position and velocity, cooldowns, role one-hot.
- Ally slot, 9: alive flag, relative position, distance, HP, role, threat flag.
- Enemy slot 1, 9: alive flag, relative position, distance, HP, role, in-range flag.
- Enemy slot 2, 9: same layout as enemy slot 1.
- Arena, 6: wall distances and relative center vector.
- Tactical, 6: nearest enemy/ally distances, outnumbered flag, recent damage placeholders, attack-ready flag.
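
The group widths above add up to 49. The sketch below derives index ranges from those widths, assuming the groups are concatenated in the order listed; that ordering is an assumption, not something this README states, and `FEATURE_GROUPS` and `build_slices` are hypothetical names.

```python
# Group widths come from the list above; the concatenation order is assumed.
FEATURE_GROUPS = [
    ("self", 10),
    ("ally", 9),
    ("enemy_1", 9),
    ("enemy_2", 9),
    ("arena", 6),
    ("tactical", 6),
]


def build_slices(groups):
    """Map each group name to its slice of the flat observation vector."""
    slices = {}
    start = 0
    for name, width in groups:
        slices[name] = slice(start, start + width)
        start += width
    return slices, start


SLICES, TOTAL_FEATURES = build_slices(FEATURE_GROUPS)

# Sanity check against the documented observation shape (49,).
assert TOTAL_FEATURES == 49
```

Under that assumption, `observation[SLICES["arena"]]` would select the six arena features.
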
Entity ordering is deterministic: live entities before dead entities, then
increasing distance from the controlled agent, then agent_id. Missing slots are
filled with zero flags, zero relative position, distance 1.0, and zero role
values. Recent damage flags are placeholders set to 0.0 in P7.
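
The ordering rule maps naturally onto a tuple sort key. In this sketch, `EntityView`, `order_entities`, and `fill_slots` are hypothetical names; the real observation builder may represent entities and missing slots differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EntityView:
    """Minimal stand-in for whatever the observation builder sees."""
    agent_id: str
    alive: bool
    distance: float  # distance from the controlled agent


def order_entities(entities):
    """Live before dead, then nearest first, then agent_id as a tie-break."""
    # `not alive` is False (0) for live entities, so they sort first.
    return sorted(entities, key=lambda e: (not e.alive, e.distance, e.agent_id))


def fill_slots(entities, slot_count) -> list[Optional[EntityView]]:
    """Assign ordered entities to fixed slots; None marks a missing slot."""
    ordered = order_entities(entities)[:slot_count]
    return ordered + [None] * (slot_count - len(ordered))
```

Because `agent_id` is the final key, two entities at the same distance always land in the same slots, which keeps observations reproducible across runs.
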
Every RewardBreakdown includes these components:
- `win_bonus`: `+1.0` on controlled-team elimination win
- `loss_penalty`: `-1.0` on controlled-team elimination loss
- `damage_dealt`: controlled damage to enemies divided by `100.0`
- `damage_taken_penalty`: controlled damage received divided by `-150.0`
- `death_penalty`: `-0.5` if the controlled agent dies this step
- `ally_death_penalty`: `-0.25` per controlled ally death this step
- `invalid_action_penalty`: `-0.02` for one invalid RL action
- `time_penalty`: `-0.001` per env step

reward_config scales component values multiplicatively and may be partial.
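
As a worked example, the sketch below computes the components for one step and applies a partial `reward_config`. It assumes the step reward is the sum of the scaled components and that a component missing from `reward_config` keeps a scale of 1.0; neither assumption is confirmed by this README, and `raw_components` and `total_reward` are hypothetical names.

```python
def raw_components(*, won=False, lost=False, damage_dealt=0.0, damage_taken=0.0,
                   controlled_died=False, ally_deaths=0, invalid_action=False):
    """Unscaled component values for one env step, per the list above."""
    return {
        "win_bonus": 1.0 if won else 0.0,
        "loss_penalty": -1.0 if lost else 0.0,
        "damage_dealt": damage_dealt / 100.0,
        "damage_taken_penalty": damage_taken / -150.0,
        "death_penalty": -0.5 if controlled_died else 0.0,
        "ally_death_penalty": -0.25 * ally_deaths,
        "invalid_action_penalty": -0.02 if invalid_action else 0.0,
        "time_penalty": -0.001,
    }


def total_reward(components, reward_config=None):
    """Sum components after multiplicative scaling (assumed aggregation)."""
    scales = reward_config or {}
    # Assumption: a missing scale leaves the component unchanged (1.0).
    return sum(value * scales.get(name, 1.0) for name, value in components.items())


# One step: 20 damage dealt, 30 damage taken, nothing else happens.
step = raw_components(damage_dealt=20.0, damage_taken=30.0)
unscaled = total_reward(step)                      # 0.2 - 0.2 - 0.001
doubled = total_reward(step, {"damage_dealt": 2.0})  # 0.4 - 0.2 - 0.001
```

Under these assumptions the unscaled step reward is about `-0.001`, and doubling the `damage_dealt` scale raises it to about `0.199`.
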

Run one 2v2 episode through the Gym env and save a replay:

    uv run python scripts/run_2v2_env_episode.py --env-config configs/env/gym_2v2_controlled_ranged.yaml --seed 42 --policy random --save-replay

Run a lightweight 2v2 baseline summary:

    uv run python scripts/evaluate_2v2_baseline.py --env-config configs/env/gym_2v2_controlled_ranged.yaml --episodes 10 --seed 42

P8 includes lightweight behavior profiles for scripted policies. It does not
include NLP, frontend/backend, PettingZoo, self-play, opponent pools, shared
team policies, centralized critics, support/healer mechanics, objective-control
mode, full evaluation framework metrics, or simultaneous multi-agent learning.

    uv sync
    uv run pytest
    uv run ruff check .
    uv run ruff format --check .
    uv run mypy src
    uv run python scripts/check_env.py --env-config configs/env/gym_2v2_controlled_ranged.yaml --episodes 3 --seed 42
    uv run python scripts/run_2v2_env_episode.py --env-config configs/env/gym_2v2_controlled_ranged.yaml --seed 42 --policy random --save-replay

Determinism check:

    uv run python scripts/check_env.py --env-config configs/env/gym_2v2_controlled_ranged.yaml --episodes 2 --seed 42
    uv run python scripts/check_env.py --env-config configs/env/gym_2v2_controlled_ranged.yaml --episodes 2 --seed 42

The summaries should match for fixed seeds.