multi-agent v2: Agent, TaskSet, Registry abstractions by Bhoy1 · Pull Request #961 · PrimeIntellect-ai/verifiers

Bhoy1 · 2026-02-26T02:19:49Z

Summary

Rebuilt multi-agent support with clean separation of concerns
Agent: WHO responds — model, client, system_prompt, sampling config
TaskSet: WHAT to solve — game rules, prompts, scoring, reusable across agents
Registry: HOW to wire — connects agents to envs, enables Registry.spawn() for parallel child rollouts
MultiAgentEnv: engine that runs the game loop, supports TaskSet mode and subclass mode
Agent system_prompt auto-prepended to prompts built by TaskSet
Actor → Agent, Protocol → Registry

Environments on new pattern:

twenty_questions, poker, rock_paper_scissors, proposer_solver (TaskSet)
poker_multi (TaskSet, 2-9 players)
proposer_code_solver (TaskSet + tool execution, new)
arc_multiagent (subclass mode with Registry.spawn(), hierarchical pipeline)

Test plan

poker_multi: 6-player games with per-agent models, JSON actions, rewards
twenty_questions: alternating turns, win detection, efficiency scoring
proposer_code_solver: tool execution with code REPL
arc_multiagent: hierarchical spawning pipeline
Training loop integration (GRPO with per-actor states)

Note

Medium Risk
Large new environments add complex orchestration (subprocess sandboxing, parallel spawning, multimodal prompting) and game logic, which may introduce runtime/platform issues, but they are additive with minimal impact on existing core behavior.

Overview
Adds a new docs/multiagent.md guide describing multi-agent environment patterns (turn-taking, simultaneous moves, per-actor rewards/GRPO, and hierarchical spawning).

Introduces a full ARC-AGI multi-agent solver environment (environments/arc_multiagent) that orchestrates parallel solver strategies (shallow/deep, codegen with sandboxed execution, vision, objects/hints) via Registry.spawn(), then selects final answers via a duo-pick judge council with logic/consistency judging fallback; includes packaging metadata.

Adds new poker environments: environments/poker (heads-up) and environments/poker_multi (2–9 players) implemented in the new TaskSet composition style with per-player agents, JSON action parsing, and chip-based per-actor rewards, along with their pyproject.toml configs.

^{Written by Cursor Bugbot for commit 6821879. This will update automatically on new commits. Configure here.}

…bric

…de rollout

…tory step

… environments

…lay different models per actor. Should show in logs in CLI display. Example in twenty question/poker.

… Also, can setup currently to have different models with different system prompts play each other. Still validating environments

…into Bhoy1/Multiagent

…up signatures, fix eval_utils KeyError for child states

Replace Actor/Protocol with Agent/TaskSet/Registry pattern. Agent owns model config + system prompt, TaskSet owns game logic, Registry wires agents to envs and enables parallel spawn.

cursor

Cursor Bugbot has reviewed your changes and found 5 potential issues.

^{Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable autofix in the Cursor dashboard.}

cursor · 2026-02-26T02:29:33Z

environments/arc_multiagent/arc_multiagent.py

+
+    for i, (inp, out) in enumerate(grids):
+        ax_in = axes[i][0] if num_pairs > 1 else axes[0]
+        ax_out = axes[i][1] if num_pairs > 1 else axes[1]


Image generation crashes when only one training pair

High Severity

When num_pairs == 1, plt.subplots(1, 2) returns a 1D array [ax1, ax2]. After wrapping with axes = [axes], the list becomes [[ax1, ax2]]. The else branch then accesses axes[0] (which gives the inner array, not a single axis) and axes[1] (which raises an IndexError since the outer list has only one element). The condition incorrectly bypasses the wrapped indexing axes[i][0] / axes[i][1] which would work correctly after the wrapping.

cursor · 2026-02-26T02:29:33Z

environments/poker/poker.py

+    async def should_stop(self, state: State) -> bool:
+        extras = state.get("extras", {})
+        if len(extras.get("folded", [])) > 0:
+            return True


Fold immediately ends multi-hand poker session

High Severity

The should_stop check if len(extras.get("folded", [])) > 0: return True immediately ends the entire game session on any fold, even though num_hands defaults to 5. Since folds are extremely common in heads-up poker, most games will end after a single hand without ever reaching _resolve_showdown, which is the only code path that starts new hands. The four remaining hands are silently skipped.

cursor · 2026-02-26T02:29:33Z

environments/poker_multi/poker_multi.py

+        if e.get("actions_hand", 0) >= self.max_actions:
+            self._resolve_showdown(state)
+            return True
+        return False


Fold path skips new hand in multi-hand poker

Low Severity

When all but one player fold, should_stop awards the pot and sets phase = "complete" without attempting to start a new hand, unlike _resolve_showdown which checks hands_played < num_hands and calls _start_hand. With num_hands > 1, a hand ending via fold terminates the entire session instead of rotating the dealer and starting the next hand.

cursor · 2026-02-26T02:29:33Z

environments/poker/poker.py

+            dealer = extras["dealer"]
+            non_dealer = "player2" if dealer == "player1" else "player1"
+            if non_dealer not in extras["folded"]:
+                extras["current_actor_id"] = non_dealer


Postflop turn order reversed by _advance_phase override

Medium Severity

_advance_phase sets extras["current_actor_id"] = non_dealer intending for the non-dealer to act first postflop. But the framework's get_next_role returns the opposite of current_actor_id, so this causes the dealer to act first postflop instead. In heads-up poker, the non-dealer acts first on flop/turn/river. When the dealer was the last preflop actor (e.g., dealer called a BB raise), the natural turn order would have been correct, but this override actively breaks it.

cursor · 2026-02-26T02:29:33Z

environments/poker_multi/pyproject.toml

+rollouts_per_example = 1
+num_players = 6
+num_hands = 3
+max_actions_per_hand = 50


Config parameter name mismatches load_environment signature

Medium Severity

The pyproject.toml specifies max_actions_per_hand = 50 under [tool.verifiers.eval], but load_environment expects the parameter max_actions, not max_actions_per_hand. This mismatch means the config value either causes a TypeError (unexpected keyword argument) or is silently ignored, with the function falling back to its default of 50. Users adjusting this config setting would see no effect.

Additional Locations (1)

environments/poker_multi/poker_multi.py#L600-L601

Override run_group() in MultiAgentEnv to split game states into per-actor states before scoring. Each trainable actor gets its own RolloutOutput with correct reward. Non-trainable actors excluded.

…guous parses

…game debug fields

…d fallback actor ID lookups

…p and state_to_output

…louts

wjh581 and others added 25 commits January 25, 2026 05:37

Add multi-agent support: Actor, Protocol, MultiAgentEnv, MultiAgentRu…

1e5f474

…bric

Simplify multi-agent: remove orchestrator, streamline env and rubric

62ab5f8

cleaning up

f097902

cleaning up

7246cc9

cleaning up

929c90f

cleaning up: added some hooks into multiturn so i dont have to overri…

c4d5613

…de rollout

cleaning up: added some hooks into multiturn to add extras for tracje…

fdf4071

…tory step

cleaning up some more

177b228

trying to unify it better with multiturn and multiagent with the game…

b62afdc

… environments

Add proposer_solver and poker environments, update multiagent support

2a663cc

Add model and client to actor so you can use different endpoints to p…

828f80a

…lay different models per actor. Should show in logs in CLI display. Example in twenty question/poker.

adding poker_multi which is poker extension that allows more players.…

c3af35c

… Also, can setup currently to have different models with different system prompts play each other. Still validating environments

adjust protocol spawn rubric.

dda05da

added error handling and working on making sure timing is per actor

5b1141e

adding documentation

842cb9d

ARC-AGI-2

86bfe6f

Merge branch 'main' into Bhoy1/Multiagent

604903b

updated arc-agi-2 as it still needed a few judges

75b634e

Merge branch 'Bhoy1/Multiagent' of https://github.com/Bhoy1/verifiers …

8d95d58

…into Bhoy1/Multiagent

add ARC multiagent overview doc

cae6005

Merge remote-tracking branch 'origin/main' into Bhoy1/Multiagent

ed70983

Fix multiagent compatibility with upstream: align run_group/score_gro…

fb9751f

…up signatures, fix eval_utils KeyError for child states

Add 2nd pick is_correct to pipeline debug output for pass@2 visibility

05a088d

merge upstream main into Bhoy1/Multiagent

ed72edb

multi-agent v2: Agent, TaskSet, Registry abstractions

6821879

Replace Actor/Protocol with Agent/TaskSet/Registry pattern. Agent owns model config + system prompt, TaskSet owns game logic, Registry wires agents to envs and enables parallel spawn.

Bhoy1 marked this pull request as draft February 26, 2026 02:22

cursor bot reviewed Feb 26, 2026

View reviewed changes

wjh581 added 3 commits February 26, 2026 19:56

multi-agent training: per-actor splitting in run_group()

7b31b0f

Override run_group() in MultiAgentEnv to split game states into per-actor states before scoring. Each trainable actor gets its own RolloutOutput with correct reward. Non-trainable actors excluded.

fix prime-rl compat and Windows support

28dbfde

disable thinking in guesser prompt, increase max_tokens to 100

34952be

wjh581 added 30 commits February 27, 2026 19:34

Remove unused outputs_per_input property

9c4cbcb

Remove dead code, outdated docs, and old-style environments

7745579

Revert unnecessary changes to multiturn_env and eval_utils

7a44917

Remove unused multiagent_stateful_tool_env and tool_utils alias

079e046

Restore poker_multi environment for eval

1efbfde

Add Prisoner's Dilemma env with masked actions and asymmetric payoffs

48405c3

PD: strip think tags in extractor, bump max_tokens to 50, handle ambi…

ff98b9b

…guous parses

Move agent.py and taskset.py into envs/ for organization

30adc1e

Strip tool loop, add rollouts_per_example divisibility check, remove …

00ee035

…game debug fields

Simplify MultiAgentRubric: remove group funcs, global reward path, an…

cf0152e

…d fallback actor ID lookups

Clean up environment docstrings

41e33d8

Pass advantage through state_to_output for per-actor GRPO

afed1f6

Add no-op env_response stub to satisfy abstract method

ff5b4a6

Debug log per-actor advantages

4d4b2cc

Use print for advantage debug logging

98fe8a6

Add debug print to trace advantage passthrough in state_to_output

24026e5

Unconditional debug prints to trace advantage loss between score_grou…

afd129c

…p and state_to_output

Debug: print which run_group path calls state_to_output

7128c2a

Debug: print inside attempt() right after score_group with object IDs

2f9e295

Debug: disable timing block to test if it resets advantage

175b359

debug: trap advantage reset with stack trace

de7922e

fix: skip advantage overwrite in metrics-only rubrics

d4c446a

proposer-solver: conflicting incentives for per-actor GRPO demo

7942b47

debug prints for reward/metrics pipeline tracing

d631e8c

both agents 2048 max tokens

df7a65f

fix metrics dilution: only include per-actor metrics for relevant rol…

4276d74

…louts

remove all debug prints from advantage/metrics debugging

5878a6d

twenty questions: use default model, add token reuse print

a20fc4d

add twenty_questions_v2 pyproject.toml

f18ab07

no_think + lower max_tokens for twenty questions

63f09e8

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

multi-agent v2: Agent, TaskSet, Registry abstractions#961

multi-agent v2: Agent, TaskSet, Registry abstractions#961
Bhoy1 wants to merge 67 commits intoPrimeIntellect-ai:mainfrom
Bhoy1:Bhoy1/Multiagent_v2

Bhoy1 commented Feb 26, 2026 •

edited by cursor bot

Loading

Uh oh!

cursor bot left a comment

Uh oh!

cursor bot Feb 26, 2026

Uh oh!

cursor bot Feb 26, 2026

Uh oh!

cursor bot Feb 26, 2026

Uh oh!

cursor bot Feb 26, 2026

Uh oh!

cursor bot Feb 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Bhoy1 commented Feb 26, 2026 • edited by cursor bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

cursor bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor bot Feb 26, 2026

Choose a reason for hiding this comment

Image generation crashes when only one training pair

Uh oh!

cursor bot Feb 26, 2026

Choose a reason for hiding this comment

Fold immediately ends multi-hand poker session

Uh oh!

cursor bot Feb 26, 2026

Choose a reason for hiding this comment

Fold path skips new hand in multi-hand poker

Uh oh!

cursor bot Feb 26, 2026

Choose a reason for hiding this comment

Postflop turn order reversed by _advance_phase override

Uh oh!

cursor bot Feb 26, 2026

Choose a reason for hiding this comment

Config parameter name mismatches load_environment signature

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Bhoy1 commented Feb 26, 2026 •

edited by cursor bot

Loading