Add world-model evaluation harness, learned MCTS action priors, docs, and README/training updates by jeevesh415 · Pull Request #4 · jeevesh415/HRM

jeevesh415 · 2026-05-16T14:58:28Z

User description

Motivation

Improve latent planning by replacing placeholder MCTS priors with a learned, model-informed action-prior mechanism.
Provide a lightweight, reproducible evaluation harness and development protocol to measure rollout stability, action consistency, and prior concentration.
Document repository structure and priorities to aid rigorous phase-1 development and gap analysis.

Description

Added a world-model evaluation script evaluate_world_model.py that computes rollout_drift_l2, trajectory_divergence_l2, max_action_prior, and action_prior_entropy and persists JSON manifests to eval_runs/.
Introduced a learned action-prior interface: policy_query_head in VJEPA (models/vjepa/vjepa_model.py) and updated MCTS to compute priors by dot-product similarity between candidate actions and the policy query (models/vjepa/planning.py).
Updated training entrypoint vjepa_train.py to accept --config, honor training.epochs from config, and generate a synthetic test video if data/ is absent.
Added repository metadata and process artifacts including CODE_ADDRESS_INDEX.md, docs/FRONTIER_GAP_ANALYSIS.md, and docs/RIGOROUS_DEVELOPMENT_PROTOCOL.md, and expanded README.md with usage notes and configuration examples.

Testing

No automated CI test suite was executed as part of this change.
The change includes a dedicated evaluation harness (evaluate_world_model.py) intended as the Phase-1 automated smoke/evaluation test (invokable via python evaluate_world_model.py --config config/vjepa_micro.yaml --seed 42) but running it was not performed in this PR.

Codex Task

Summary by CodeRabbit

Documentation
- Rewrote README with architecture overview and reproducible run commands and roadmap
- Added evaluation spec, rigorous development protocol, frontier gap analysis, post-PR audit, and a repository code index
New Features
- Added perception robustness and world-model dynamics evaluation scripts that emit JSON manifests
- Added an integration smoke-check script
Improvements
- Configurable training via a --config CLI option
- Better handling of synthetic-data generation when ffmpeg is unavailable
- Improved planner action-prior scoring using a learned query-based mechanism

CodeAnt-AI Description

Add phase-1 evaluation scripts, learned planning priors, and config-driven training

What Changed

Added a world-model evaluation script that saves run summaries with rollout drift, future divergence, action-prior confidence, and prior entropy in eval_runs/
Added a perception robustness check for color, brightness/shadow, noise, and spatial shift changes, using latent consistency as the score
Replaced placeholder MCTS action priors with learned, model-based action scoring so planning can favor actions that match the current latent state
Made training accept a config file from the command line, use the configured number of epochs, and avoid failing when ffmpeg is missing while creating fallback sample video data
Updated the project docs and README to describe the new VEM naming, evaluation commands, and phase-1 development protocol

Impact

✅ Reproducible world-model smoke tests
✅ Clearer planning signals during latent search
✅ Fewer training startup failures on new machines

💡 Usage Guide

Checking Your Pull Request

Every time you make a pull request, our system automatically looks through it. We check for security issues, mistakes in how you're setting up your infrastructure, and common code problems. We do this to make sure your changes are solid and won't cause any trouble later.

Talking to CodeAnt AI

Got a question or need a hand with something in your pull request? You can easily get in touch with CodeAnt AI right here. Just type the following in a comment on your pull request, and replace "Your question here" with whatever you want to ask:

@codeant-ai ask: Your question here

This lets you have a chat with CodeAnt AI about your pull request, making it easier to understand and improve your code.

Example

@codeant-ai ask: Can you suggest a safer alternative to storing this secret?

Preserve Org Learnings with CodeAnt

You can record team preferences so CodeAnt AI applies them in future reviews. Reply directly to the specific CodeAnt AI suggestion (in the same thread) and replace "Your feedback here" with your input:

@codeant-ai: Your feedback here

This helps CodeAnt AI learn and adapt to your team's coding style and standards.

Example

@codeant-ai: Do not flag unused imports.

Retrigger review

Ask CodeAnt AI to review the PR again, by typing:

@codeant-ai: review

Check Your Repository Health

To analyze the health of your code repository, visit our dashboard at https://app.codeant.ai. This tool helps you identify potential issues and areas for improvement in your codebase, ensuring your repository maintains high standards of code health.

codeant-ai · 2026-05-16T14:58:32Z

CodeAnt AI is reviewing your PR.

Thanks for using CodeAnt! 🎉

We're free for open-source projects. if you're enjoying it, help us grow by sharing.

Share on X ·
Reddit ·
LinkedIn

coderabbitai · 2026-05-16T14:58:33Z

📝 Walkthrough

Walkthrough

This PR adds a learned policy_query_head to VJEPA and updates MCTS to use policy_query-based action priors, introduces world-model and perception evaluation scripts, improves training CLI/ffmpeg handling, updates integration checks, and adds development/docs artifacts plus a repository code-address index.

Changes

Policy Query Integration and Evaluation

Layer / File(s)	Summary
Policy query head and alias `models/vjepa/vjepa_model.py`	Adds `VJEPA.policy_query_head` (two-layer MLP projecting predictor hidden state to action_dim) and `VisualExecutionModel` alias.
MCTS priors via policy_query `models/vjepa/planning.py`	`_imagine_future` now returns a `policy_query`; `_expand` accepts/derives `policy_query` and computes per-action priors by dot-product with `available_actions` then temperature-scaled softmax, replacing placeholder logits.
World-model evaluation harness `evaluate_world_model.py`	New script seeds RNGs, loads YAML config, instantiates `VJEPA`, runs latent rollouts via the physics engine, and measures rollout drift, trajectory divergence, and action-prior concentration; outputs a JSON EvalManifest with commit/config metadata.
Perception robustness evaluation `evaluate_perception.py`	New script implements perturbations (color jitter, brightness/shadow, Gaussian noise, spatial shift), computes latent_consistency as L2 between context encoder embeddings, runs deterministically with configurable seed, and writes JSON results.
Training CLI and ffmpeg guard `vjepa_train.py`	Adds `--config` CLI arg (default `config/vjepa_micro.yaml`), reads `epochs` from YAML (default 100), and guards synthetic ffmpeg invocation with `shutil.which()`.
Integration check script `check_integrations.py`	New script loads config, instantiates `VisualExecutionModel`, runs a synthetic forward, asserts expected outputs, invokes `MCTS.plan`, and validates chosen action shape.
Adaptive carry & topological updates `models/adaptive_depth.py`, `models/topological.py`	Removes local concrete carry import and constructs carry via runtime `type(carry)(...)`; reshapes connected-component tracking to batch×filtration and adjusts betti_0/betti_1 computations.
Development protocol and eval specs `docs/RIGOROUS_DEVELOPMENT_PROTOCOL.md`, `docs/HUMAN_VISION_EXECUTION_EVAL_SPEC.md`, `docs/FRONTIER_GAP_ANALYSIS.md`	Adds Phase-1 rigor gates, evaluation tracks and promotion rule, and a frontier gap analysis noting the MCTS action-prior upgrade and next steps.
README, audit, and repository index `README.md`, `AUDIT_RECHECK.md`, `CODE_ADDRESS_INDEX.md`	Rewrites README to the Visual Execution Model (VEM) architecture and workflow, adds a post-PR recheck audit, and adds an exhaustive CODE_ADDRESS_INDEX.md with per-file anchors.

Sequence Diagram(s):

sequenceDiagram
  participant User
  participant VJEPA
  participant PhysicsEngine
  participant MCTS
  User->>VJEPA: request planning/eval
  VJEPA->>PhysicsEngine: forward(state, action) -> predicted_next_latent
  PhysicsEngine->>VJEPA: predicted_next_latent
  VJEPA->>VJEPA: policy_query_head(predicted_next_latent) -> policy_query
  VJEPA->>MCTS: policy_query
  MCTS->>MCTS: dot(policy_query, available_actions) -> logits
  MCTS->>MCTS: softmax(logits/temperature) -> priors
  MCTS->>MCTS: selection/expansion/backprop -> chosen action

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs:

jeevesh415/HRM#3: Introduces the same CODE_ADDRESS_INDEX and related models/vjepa policy_query_head / MCTS prior changes.

"🐰 A query head learns the way,
MCTS priors now have their say,
Evaluation harnesses measure the dance,
Of rollouts, robustness, at each advance—
Protocol and specs bring clarity's light!"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 27.78% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The pull request title comprehensively summarizes the main changes: world-model evaluation harness, learned MCTS action priors, documentation additions, and README/training updates. It is specific, clear, and directly reflects the primary modifications across multiple areas.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch codex/review-project-vision-status-1bv5jo

⚔️ Resolve merge conflicts

Resolve merge conflict in branch codex/review-project-vision-status-1bv5jo

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

codeant-ai · 2026-05-16T15:01:15Z

+        # 5b. Policy query head for action-prior scoring in latent MCTS.
+        # Produces an action-space query vector that can be matched against
+        # candidate action vectors via dot-product similarity.
+        self.policy_query_head = nn.Sequential(
+            nn.Linear(predictor_config["hidden_size"], predictor_config["hidden_size"]),
+            nn.SiLU(),
+            nn.Linear(predictor_config["hidden_size"], action_dim)
+        )


🔴 Architect Review — CRITICAL

policy_query_head is used to score MCTS action priors and in the evaluation harness, but it is never included in any training loss, so its weights remain at random initialization and the supposed "learned" priors are in fact random.

Suggestion: Add an explicit training objective that depends on policy_query_head (e.g., behavior-cloning or contrastive action-ranking targets from search) and integrate it into the V-JEPA training loop, which currently only optimizes VICReg over latent states, so that gradients actually update this head.

Fix in Cursor | Fix in VSCode Claude

(Use Cmd/Ctrl + Click for best experience)

Prompt for AI Agent 🤖

This is an **Architect / Logical Review** comment left during a code review. These reviews are first-class, important findings — not optional suggestions. Do NOT dismiss this as a 'big architectural change' just because the title says architect review; most of these can be resolved with a small, localized fix once the intent is understood. **Path:** models/vjepa/vjepa_model.py **Line:** 64:71 **Comment:** *CRITICAL: `policy_query_head` is used to score MCTS action priors and in the evaluation harness, but it is never included in any training loss, so its weights remain at random initialization and the supposed "learned" priors are in fact random. Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise. If a suggested approach is provided above, use it as the authoritative instruction. If no explicit code suggestion is given, you MUST still draft and apply your own minimal, localized fix — do not punt back with 'no suggestion provided, review manually'. Keep the change as small as possible: add a guard clause, gate on a loading state, reorder an await, wrap in a conditional, etc. Do not refactor surrounding code or expand scope beyond the finding. Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0db3f49298

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-16T15:01:44Z

+            # available_actions: (num_actions, action_dim)
+            # policy_query: (1, action_dim)
+            logits = torch.matmul(available_actions, policy_query.squeeze(0))
+            priors = F.softmax(logits / self.temperature, dim=0)


Handle zero-temperature priors without softmax division

When temperature is set to 0 (which plan() explicitly supports for greedy selection), this path still computes softmax(logits / self.temperature), producing NaN priors whenever logits are tied (for example with symmetric or zero action vectors). Those NaN priors then flow into child.prior and can corrupt PUCT scoring/selection, making deterministic planning unstable. Please branch this logic for temperature == 0 (e.g., argmax/one-hot or unscaled logits) before the softmax.

Useful? React with 👍 / 👎.

codeant-ai · 2026-05-16T15:02:18Z

+        available_actions = torch.randn(num_actions, 128, device=device)
+        pooled = traj_a[-1].mean(dim=1)
+        query = model.policy_query_head(pooled).squeeze(0)
+        logits = torch.matmul(available_actions, query)
+        probs = torch.softmax(logits, dim=0)
+        max_prior = float(probs.max().item())


Suggestion: --num-actions is not validated, so num_actions=0 (or negative) creates an empty action tensor and then calls max() on an empty probability vector, raising a runtime error. Enforce num_actions > 0 before metric computation. [incorrect condition logic]

Severity Level: Major ⚠️

- ❌ `evaluate_world_model.py` crashes when num_actions <= 0 is passed. - ⚠️ World-model evaluation harness unusable under misconfigured CLI flags.

Steps of Reproduction ✅

1. From the repo root, run the evaluation harness with an empty action set, e.g. `python evaluate_world_model.py --config config/vjepa_micro.yaml --seed 42 --num-actions 0` (entrypoint `main()` at `evaluate_world_model.py:116-166`). 2. In `main()`, the argparse parser at `evaluate_world_model.py:117-123` parses `--num-actions` as an `int` without validation and passes `args.num_actions` into `evaluate_metrics(model=..., device=..., rollout_steps=args.rollout_steps, num_actions=args.num_actions)` at lines 137-142. 3. Inside `evaluate_metrics()` at `evaluate_world_model.py:76-113`, `available_actions = torch.randn(num_actions, 128, device=device)` at line 100 constructs a tensor of shape `(0, 128)` when `num_actions == 0`, and `logits = torch.matmul(available_actions, query)` at line 103 produces an empty logits tensor. 4. The subsequent `probs = torch.softmax(logits, dim=0)` at line 104 returns an empty tensor, and `max_prior = float(probs.max().item())` at line 105 calls `.max()` on this empty tensor, raising a `RuntimeError: cannot perform reduction function max on tensor with no elements`, aborting the script before the manifest is written at `evaluate_world_model.py:145-159`.

Fix in Cursor | Fix in VSCode Claude

(Use Cmd/Ctrl + Click for best experience)

Prompt for AI Agent 🤖

This is a comment left during a code review. **Path:** evaluate_world_model.py **Line:** 100:105 **Comment:** *Incorrect Condition Logic: `--num-actions` is not validated, so `num_actions=0` (or negative) creates an empty action tensor and then calls `max()` on an empty probability vector, raising a runtime error. Enforce `num_actions > 0` before metric computation. Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise. Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix

👍 | 👎

codeant-ai · 2026-05-16T15:02:18Z

+            # available_actions: (num_actions, action_dim)
+            # policy_query: (1, action_dim)
+            logits = torch.matmul(available_actions, policy_query.squeeze(0))
+            priors = F.softmax(logits / self.temperature, dim=0)


Suggestion: The prior computation divides by self.temperature unconditionally; when temperature is set to 0 (which this class already supports in action selection), this produces invalid logits (inf/nan) and corrupts MCTS priors. Add an explicit zero-temperature branch for prior scoring (e.g., argmax/one-hot or unscaled logits) instead of dividing by zero. [logic error]

Severity Level: Critical 🚨

- ❌ Zero-temperature MCTS yields NaN priors, breaking tree search. - ⚠️ LatentPlannerMCTS users cannot safely enable greedy selection.

Steps of Reproduction ✅

1. In any consumer script or notebook, import `MCTS` from `models.vjepa.planning` and construct it with a zero temperature, e.g. `from models.vjepa.planning import MCTS; mcts = MCTS(model, temperature=0.0)` using a `VJEPA` model instance (constructor at `models/vjepa/planning.py:139-157`). 2. Call `mcts.plan(initial_state, available_actions)` where `initial_state` is a latent tensor and `available_actions` has shape `(num_actions, action_dim)` (method `plan()` at `models/vjepa/planning.py:349-399`), which initializes the root node and invokes `self._expand(root, available_actions)` at line 373. 3. Inside `_expand()` at `models/vjepa/planning.py:222-287`, if the model has a `policy_query_head` (true for `VJEPA` defined in `models/vjepa/vjepa_model.py:64-71`), the code builds `policy_query` and computes `logits = torch.matmul(available_actions, policy_query.squeeze(0))` followed by `priors = F.softmax(logits / self.temperature, dim=0)` at line 262. 4. With `self.temperature == 0.0`, the division `logits / self.temperature` performs a divide-by-zero, producing `inf`/`nan` logits; `F.softmax` then yields invalid prior probabilities (often all `nan`), which are used both for sorting actions at `models/vjepa/planning.py:269-272` and for setting `prior=priors[idx].item()` on child nodes at line 280, corrupting MCTS behavior even though the class explicitly treats `temperature == 0` as a valid greedy mode in `_get_action_probabilities()` at lines 337-345 and `plan()` at lines 392-395.

Fix in Cursor | Fix in VSCode Claude

(Use Cmd/Ctrl + Click for best experience)

Prompt for AI Agent 🤖

This is a comment left during a code review. **Path:** models/vjepa/planning.py **Line:** 262:262 **Comment:** *Logic Error: The prior computation divides by `self.temperature` unconditionally; when temperature is set to 0 (which this class already supports in action selection), this produces invalid logits (`inf`/`nan`) and corrupts MCTS priors. Add an explicit zero-temperature branch for prior scoring (e.g., argmax/one-hot or unscaled logits) instead of dividing by zero. Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise. Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix

👍 | 👎

codeant-ai · 2026-05-16T15:02:18Z

+        self.policy_query_head = nn.Sequential(
+            nn.Linear(predictor_config["hidden_size"], predictor_config["hidden_size"]),
+            nn.SiLU(),
+            nn.Linear(predictor_config["hidden_size"], action_dim)
+        )


Suggestion: This new policy-prior head is introduced as "learned", but it is not connected to any training objective in the training path, so it receives no gradients and remains effectively random at inference. Add a loss term or supervised target path that updates this head during training. [incomplete implementation]

Severity Level: Critical 🚨

- ❌ World-model prior metrics reflect random, not learned, policies. - ⚠️ Latent MCTS planner explores using uninformative random priors.

Steps of Reproduction ✅

1. Start training via the main training script `vjepa_train.py` by running, for example, `python vjepa_train.py --config config/vjepa_micro.yaml`; this calls `train(args.config)` at `vjepa_train.py:208-216`, which constructs a `VJEPA` model at lines 127-134 and builds an optimizer over `model.parameters()` with `build_optimizer()` at lines 18-89. 2. In the training loop at `vjepa_train.py:171-199`, each batch is passed through `outputs = model(batch)` at line 184; the loss is computed solely from `outputs["predicted"]` and `outputs["target"]` via `vicreg_loss(...)` at lines 187-193, and then backpropagated with `loss.backward()` and `optimizer.step()` at lines 195-199. 3. The `VJEPA.forward()` method in `models/vjepa/vjepa_model.py:86-136` constructs and returns a dict with keys `"predicted"`, `"target"`, `"all_context"`, and `"value"`, but it never calls `self.policy_query_head` defined at lines 67-71, so none of the tensors feeding the VICReg loss depend on the policy-prior head's weights. 4. As a result, during training no gradients reach `self.policy_query_head` parameters; despite being part of `model.parameters()` and updated by the optimizer, their gradients remain zero, so the head stays effectively at random initialization while its outputs are later used under `torch.no_grad()` in `evaluate_world_model.evaluate_metrics()` at `evaluate_world_model.py:100-103` and in `MCTS._imagine_future()` at `models/vjepa/planning.py:186-189`, meaning the supposedly "learned" action priors and evaluation metrics are driven by an untrained module.

Fix in Cursor | Fix in VSCode Claude

(Use Cmd/Ctrl + Click for best experience)

Prompt for AI Agent 🤖

This is a comment left during a code review. **Path:** models/vjepa/vjepa_model.py **Line:** 67:71 **Comment:** *Incomplete Implementation: This new policy-prior head is introduced as "learned", but it is not connected to any training objective in the training path, so it receives no gradients and remains effectively random at inference. Add a loss term or supervised target path that updates this head during training. Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise. Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix

👍 | 👎

codeant-ai · 2026-05-16T15:02:23Z

CodeAnt AI finished reviewing your PR.

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

models/vjepa/planning.py (2)

329-335: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Fix action-probability indexing to use real action IDs.

Using child list position as the action index becomes incorrect once children are created in prior-sorted order. Returned action_probs (and sampled action) can map to the wrong action.

Suggested direction

 class MCTSNode:
     __slots__ = [
-        "state", "action", "parent", "children", "visits",
+        "state", "action", "action_index", "parent", "children", "visits",
         "value_sum", "prior", "virtual_loss", "is_terminal",
     ]

     def __init__(
         self,
         state: torch.Tensor,
         action: Optional[torch.Tensor] = None,
+        action_index: Optional[int] = None,
         parent: Optional["MCTSNode"] = None,
         prior: float = 0.0,
     ):
         self.state = state
         self.action = action
+        self.action_index = action_index
         self.parent = parent
@@
-                child = MCTSNode(
+                child = MCTSNode(
                     state=next_state,
                     action=action,
+                    action_index=int(idx.item()),
                     parent=node,
                     prior=priors[idx].item(),
                 )
@@
-                idx = root.children.index(child)
-                if idx < num_actions:
-                    visit_counts[idx] = child.visits
+                if child.action_index is not None and child.action_index < num_actions:
+                    visit_counts[child.action_index] = child.visits

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@models/vjepa/planning.py` around lines 329 - 335, The loop currently uses
root.children.index(child) to set visit_counts which misindexes when children
aren't in action-ID order; instead use the child's real action identifier
(child.action or child.action.id depending on type) as the index: get action_id
= child.action, check action_id < num_actions, and assign
visit_counts[action_id] = child.visits so action_probs and any sampled action
map to the correct action ID.

272-287: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use model-evaluated leaf value for backprop, not unvisited child mean.

_expand computes value from _imagine_future but discards it, then returns node.children[0].mean_value (typically 0 before visits). This weakens search signal and can stall learning of useful Q estimates.

Suggested fix

-        if len(node.children) < max_children:
+        expanded_values = []
+        if len(node.children) < max_children:
             # Sort actions by prior (highest first) for progressive widening
             _, sorted_indices = priors.sort(descending=True)

             for idx in sorted_indices[len(node.children) : max_children]:
                 action = available_actions[idx : idx + 1]
                 next_state, value, _ = self._imagine_future(node.state, action)
+                expanded_values.append(value)

                 child = MCTSNode(
                     state=next_state,
                     action=action,
                     parent=node,
                     prior=priors[idx].item(),
                 )
                 node.children.append(child)

-        # Return value of the first child (for backprop)
-        if node.children:
-            return node.children[0].mean_value
+        if expanded_values:
+            return float(expanded_values[0])
+        if node.children:
+            return node.children[0].mean_value
         return 0.0

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@models/vjepa/planning.py` around lines 272 - 287, In _expand, the value
returned by _imagine_future is discarded and the function returns
node.children[0].mean_value (which is zero for unvisited children); change
_expand to use the model-evaluated leaf value for backprop by either assigning
that returned value to the new child's mean_value (or value field) when
constructing MCTSNode, or by returning the immediate value from _imagine_future
directly; update the logic in _expand (and any MCTSNode attribute used for
backprop such as mean_value) so the freshly-evaluated scalar from
_imagine_future is what gets propagated instead of an unvisited child's default
mean.

🧹 Nitpick comments (3)

vjepa_train.py (1)

153-160: ⚡ Quick win

Consider reporting ffmpeg execution failures to the user.

The current code detects when ffmpeg is missing and prints a helpful message, but if ffmpeg is found and then fails during execution (e.g., due to permissions, disk space, or invalid arguments), the failure is silent because check=False and capture_output=True suppress errors.
💬 Suggested improvement
         else:
-            subprocess.run([
+            result = subprocess.run([
                 ffmpeg_bin, '-f', 'lavfi', '-i', 'testsrc=duration=5:size=224x224:rate=15',
                 os.path.join(video_dir, 'test_video.mp4'), '-y'
             ], capture_output=True, check=False)
+            if result.returncode != 0:
+                print(f"ffmpeg failed with return code {result.returncode}. Add videos manually to data/.")
Note on static analysis warning S603: The warning about untrusted input is a false positive. shutil.which("ffmpeg") returns a trusted system path (or None), and all other arguments are hardcoded string literals, so there is no execution risk from user input.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@vjepa_train.py` around lines 153 - 160, If ffmpeg is found (ffmpeg_bin) we
currently call subprocess.run(...) with check=False which suppresses failures;
update the call to either use check=True or capture the CompletedProcess result
and inspect result.returncode and result.stderr, then surface errors to the user
(print or logger) including stderr and a brief context mentioning the target
path (os.path.join(video_dir, 'test_video.mp4')) so failures
(permissions/disk/args) are visible; keep capture_output=True to obtain stderr
for the message.

CODE_ADDRESS_INDEX.md (1)

59-121: ⚖️ Poor tradeoff

Consider excluding self-reference or generating this file programmatically.

The file documents itself (lines 59–121), which creates a maintenance challenge: every edit shifts line numbers and requires updating the self-referential metadata. This circular dependency is fragile and easy to desync.

Options:

Exclude CODE_ADDRESS_INDEX.md from the inventory

Generate this file programmatically (e.g., via a script that can recompute its own entry)

Accept the maintenance cost
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@CODE_ADDRESS_INDEX.md` around lines 59 - 121, The CODE_ADDRESS_INDEX.md
currently lists itself under the "File Inventory" which creates a brittle
self-reference; remove that self-entry or make the index auto-generated. Fix by
updating the index generation logic or the document: either (A) exclude
CODE_ADDRESS_INDEX.md from the generated inventory (ensure the generator/filter
excludes the filename "CODE_ADDRESS_INDEX.md" or the anchor "###
`CODE_ADDRESS_INDEX.md`"), or (B) make the generator produce
CODE_ADDRESS_INDEX.md dynamically and have the generator insert a stable
placeholder for its own entry instead of a live self-listing; change the code
that emits the "File Inventory" section to skip or synthesize the self-entry so
edits no longer require manual updates.

evaluate_world_model.py (1)

84-85: ⚡ Quick win

Avoid hardcoded 128 for action dimension in the harness.

Use one source of truth for action dimension (from config or model) so eval scripts stay aligned with planner/model changes.

Suggested refactor

-    model = VJEPA(
+    action_dim = int(config.get("action_dim", 128))
+    model = VJEPA(
         config["encoder"],
         config["predictor"],
         config["training"]["ema_momentum"],
-        action_dim=128,
+        action_dim=action_dim,
     ).to(device)

-        actions_a = [torch.randn(bs, seq_len, 128, device=device) for _ in range(rollout_steps)]
-        actions_b = [torch.randn(bs, seq_len, 128, device=device) for _ in range(rollout_steps)]
+        actions_a = [torch.randn(bs, seq_len, model.policy_query_head[-1].out_features, device=device) for _ in range(rollout_steps)]
+        actions_b = [torch.randn(bs, seq_len, model.policy_query_head[-1].out_features, device=device) for _ in range(rollout_steps)]
@@
-        available_actions = torch.randn(num_actions, 128, device=device)
+        available_actions = torch.randn(num_actions, model.policy_query_head[-1].out_features, device=device)

Also applies to: 100-103, 134-135

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evaluate_world_model.py` around lines 84 - 85, The harness currently
hardcodes the action dimension (128) when creating actions_a and actions_b (and
in other spots at lines referenced) which will drift from the planner/model;
replace the literal with a single source-of-truth action_dim (e.g., pull from
config.get("action_dim") or from the model/planner attribute like
planner.action_dim or model.action_dim) and use that variable when constructing
the random tensors for actions (for symbols: actions_a, actions_b,
rollout_steps, bs, seq_len, device); update the other instances mentioned
(around the other creation sites) to use the same action_dim variable so the
eval stays aligned with model/planner changes.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@CODE_ADDRESS_INDEX.md`:
- Around line 68-121: The Markdown headings in CODE_ADDRESS_INDEX.md use nested
backticks (e.g., headings like `### `.github/workflows/sync-from-upstream.yml``)
which produces invalid inline code spans; update each affected heading (all
entries that currently look like backticked-heading-within-backticks such as the
entries for `.github/workflows/sync-from-upstream.yml`, `.gitignore`,
`.vscode/launch.json`, `CODE_ADDRESS_INDEX.md`, etc.) by removing the outer/back
wrapper so the filename retains a single code span (e.g., change from ``###
`.file/name` `` to `### `.file/name`` or simply `### .file/name` as
appropriate), apply this pattern consistently across all ~50 affected lines to
resolve MD038.

In `@evaluate_world_model.py`:
- Around line 121-122: Validate that the parsed num_actions is a positive
integer before it is used: after the ArgumentParser definition (the add_argument
for "--num-actions") check args.num_actions > 0 and raise
argparse.ArgumentTypeError or call parser.error if it's not, so downstream code
(the prior-metric computation that calls max()/entropy) never receives 0;
alternatively supply a custom type function to
parser.add_argument("--num-actions", type=...) that enforces >0 and raises
argparse.ArgumentTypeError on invalid values.

In `@models/vjepa/planning.py`:
- Around line 257-263: policy_query/available_actions path divides logits by
self.temperature causing NaNs when temperature == 0; change the priors
computation in that block to check for zero (or effectively zero) temperature
and handle it by producing a deterministic one-hot prior at logits.argmax(dim=0)
(matching the action-selection semantics) otherwise compute priors =
F.softmax(logits / self.temperature, dim=0); use torch.zeros_like(priors) and
scatter_ to build the one-hot prior to preserve dtype/device and shape.

---

Outside diff comments:
In `@models/vjepa/planning.py`:
- Around line 329-335: The loop currently uses root.children.index(child) to set
visit_counts which misindexes when children aren't in action-ID order; instead
use the child's real action identifier (child.action or child.action.id
depending on type) as the index: get action_id = child.action, check action_id <
num_actions, and assign visit_counts[action_id] = child.visits so action_probs
and any sampled action map to the correct action ID.
- Around line 272-287: In _expand, the value returned by _imagine_future is
discarded and the function returns node.children[0].mean_value (which is zero
for unvisited children); change _expand to use the model-evaluated leaf value
for backprop by either assigning that returned value to the new child's
mean_value (or value field) when constructing MCTSNode, or by returning the
immediate value from _imagine_future directly; update the logic in _expand (and
any MCTSNode attribute used for backprop such as mean_value) so the
freshly-evaluated scalar from _imagine_future is what gets propagated instead of
an unvisited child's default mean.

---

Nitpick comments:
In `@CODE_ADDRESS_INDEX.md`:
- Around line 59-121: The CODE_ADDRESS_INDEX.md currently lists itself under the
"File Inventory" which creates a brittle self-reference; remove that self-entry
or make the index auto-generated. Fix by updating the index generation logic or
the document: either (A) exclude CODE_ADDRESS_INDEX.md from the generated
inventory (ensure the generator/filter excludes the filename
"CODE_ADDRESS_INDEX.md" or the anchor "### `CODE_ADDRESS_INDEX.md`"), or (B)
make the generator produce CODE_ADDRESS_INDEX.md dynamically and have the
generator insert a stable placeholder for its own entry instead of a live
self-listing; change the code that emits the "File Inventory" section to skip or
synthesize the self-entry so edits no longer require manual updates.

In `@evaluate_world_model.py`:
- Around line 84-85: The harness currently hardcodes the action dimension (128)
when creating actions_a and actions_b (and in other spots at lines referenced)
which will drift from the planner/model; replace the literal with a single
source-of-truth action_dim (e.g., pull from config.get("action_dim") or from the
model/planner attribute like planner.action_dim or model.action_dim) and use
that variable when constructing the random tensors for actions (for symbols:
actions_a, actions_b, rollout_steps, bs, seq_len, device); update the other
instances mentioned (around the other creation sites) to use the same action_dim
variable so the eval stays aligned with model/planner changes.

In `@vjepa_train.py`:
- Around line 153-160: If ffmpeg is found (ffmpeg_bin) we currently call
subprocess.run(...) with check=False which suppresses failures; update the call
to either use check=True or capture the CompletedProcess result and inspect
result.returncode and result.stderr, then surface errors to the user (print or
logger) including stderr and a brief context mentioning the target path
(os.path.join(video_dir, 'test_video.mp4')) so failures (permissions/disk/args)
are visible; keep capture_output=True to obtain stderr for the message.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f9688f9d-f2d9-4046-bc2f-61fb598c4c74

📥 Commits

Reviewing files that changed from the base of the PR and between 2c66f89 and 07b8481.

📒 Files selected for processing (10)

CODE_ADDRESS_INDEX.md
README.md
docs/FRONTIER_GAP_ANALYSIS.md
docs/HUMAN_VISION_EXECUTION_EVAL_SPEC.md
docs/RIGOROUS_DEVELOPMENT_PROTOCOL.md
evaluate_perception.py
evaluate_world_model.py
models/vjepa/planning.py
models/vjepa/vjepa_model.py
vjepa_train.py

coderabbitai · 2026-05-16T16:13:14Z

+    parser.add_argument("--num-actions", type=int, default=32)
+    parser.add_argument("--save-dir", default="eval_runs")


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate --num-actions as a positive integer.

When num_actions is 0, prior-metric computation calls max()/entropy on an empty tensor and fails at runtime.

Suggested fix

def main() -> None: + def positive_int(value: str) -> int: + ivalue = int(value) + if ivalue <= 0: + raise argparse.ArgumentTypeError("must be > 0") + return ivalue + parser = argparse.ArgumentParser(description="Evaluate V-JEPA/HRM world-model metrics") @@ - parser.add_argument("--num-actions", type=int, default=32) + parser.add_argument("--num-actions", type=positive_int, default=32)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@evaluate_world_model.py` around lines 121 - 122, Validate that the parsed num_actions is a positive integer before it is used: after the ArgumentParser definition (the add_argument for "--num-actions") check args.num_actions > 0 and raise argparse.ArgumentTypeError or call parser.error if it's not, so downstream code (the prior-metric computation that calls max()/entropy) never receives 0; alternatively supply a custom type function to parser.add_argument("--num-actions", type=...) that enforces >0 and raises argparse.ArgumentTypeError on invalid values.

coderabbitai · 2026-05-16T16:13:14Z

+        if policy_query is not None and policy_query.numel() > 0:
+            # Similarity(action_i, query) -> prior logit
+            # available_actions: (num_actions, action_dim)
+            # policy_query: (1, action_dim)
+            logits = torch.matmul(available_actions, policy_query.squeeze(0))
+            priors = F.softmax(logits / self.temperature, dim=0)
        else:


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard prior softmax against zero temperature.

temperature == 0 is explicitly supported in action selection, but this path divides by zero during expansion and can produce invalid priors.

Suggested fix

- logits = torch.matmul(available_actions, policy_query.squeeze(0)) - priors = F.softmax(logits / self.temperature, dim=0) + logits = torch.matmul(available_actions, policy_query.squeeze(0)) + if self.temperature == 0: + priors = torch.zeros_like(logits) + priors[torch.argmax(logits)] = 1.0 + else: + priors = F.softmax(logits / self.temperature, dim=0)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@models/vjepa/planning.py` around lines 257 - 263, policy_query/available_actions path divides logits by self.temperature causing NaNs when temperature == 0; change the priors computation in that block to check for zero (or effectively zero) temperature and handle it by producing a deterministic one-hot prior at logits.argmax(dim=0) (matching the action-selection semantics) otherwise compute priors = F.softmax(logits / self.temperature, dim=0); use torch.zeros_like(priors) and scatter_ to build the one-hot prior to preserve dtype/device and shape.

codeant-ai · 2026-05-17T10:32:18Z

CodeAnt AI is running Incremental review

Thanks for using CodeAnt! 🎉

We're free for open-source projects. if you're enjoying it, help us grow by sharing.

Share on X ·
Reddit ·
LinkedIn

codeant-ai · 2026-05-17T10:33:25Z

CodeAnt AI Incremental review completed.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

models/vjepa/vjepa_model.py (2)

139-141: 💤 Low value

Clarify the alias docstring.

The docstring says "backward-compatible alias" but this appears to provide the new preferred name (VisualExecutionModel) as an alias to the existing VJEPA class. Consider rewording for clarity.

📝 Proposed docstring clarification

 class VisualExecutionModel(VJEPA):
-    """Backward-compatible alias for the unified Visual Execution Model name."""
+    """Alias providing the preferred Visual Execution Model (VEM) naming."""
     pass

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@models/vjepa/vjepa_model.py` around lines 139 - 141, The docstring for class
VisualExecutionModel is ambiguous about which name is preferred; update the
docstring on VisualExecutionModel (aliasing VJEPA) to clearly state that
VisualExecutionModel is a backward-compatible alias for the existing VJEPA class
(or vice versa depending on intended preference). Edit the docstring on the
VisualExecutionModel class to explicitly say e.g. "Alias of VJEPA for backward
compatibility; use VJEPA as the canonical class name" (or "Alias of VJEPA
retained for backward compatibility; prefer VisualExecutionModel as the
canonical name") so consumers understand which name is preferred while keeping
the alias semantics intact.

64-71: ⚡ Quick win

Consider L2-normalizing the policy query for stable dot-product similarity.

For dot-product similarity scoring, normalizing the query vector to unit length ensures stable cosine similarity and consistent scale across training. Without normalization, the magnitude can drift, affecting the learned prior distributions even with temperature scaling.

🔄 Proposed refactor to add L2 normalization

 # 5b. Policy query head for action-prior scoring in latent MCTS.
 # Produces an action-space query vector that can be matched against
 # candidate action vectors via dot-product similarity.
-self.policy_query_head = nn.Sequential(
-    nn.Linear(predictor_config["hidden_size"], predictor_config["hidden_size"]),
-    nn.SiLU(),
-    nn.Linear(predictor_config["hidden_size"], action_dim)
-)
+self.policy_query_head = nn.Sequential(
+    nn.Linear(predictor_config["hidden_size"], predictor_config["hidden_size"]),
+    nn.SiLU(),
+    nn.Linear(predictor_config["hidden_size"], action_dim),
+    nn.Lambda(lambda x: F.normalize(x, p=2, dim=-1))
+)

Alternatively, apply F.normalize(policy_query, p=2, dim=-1) at the call site in planning.py.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@models/vjepa/vjepa_model.py` around lines 64 - 71, The policy_query produced
by the policy_query_head should be L2-normalized to stabilize dot-product
similarity; modify the code so the output of policy_query_head (symbol:
policy_query_head) is normalized to unit length (e.g., via F.normalize(..., p=2,
dim=-1) or an nn.functional call) before being used for scoring, or
alternatively normalize right where it is consumed in planning.py (the call site
that computes dot-product similarities) to ensure the query vectors are
unit-length for consistent cosine-like scoring.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@README.md`:
- Line 37: Update the compound modifiers in the README by hyphenating them for
readability: change the phrase "Stiefel-manifold style orthogonality
constraints/projections" to "Stiefel-manifold-style orthogonality
constraints/projections" and similarly change "Large scale" (or "Large-scale"
variant at the other location) to "Large-scale" so both compound modifiers are
fully hyphenated.

---

Nitpick comments:
In `@models/vjepa/vjepa_model.py`:
- Around line 139-141: The docstring for class VisualExecutionModel is ambiguous
about which name is preferred; update the docstring on VisualExecutionModel
(aliasing VJEPA) to clearly state that VisualExecutionModel is a
backward-compatible alias for the existing VJEPA class (or vice versa depending
on intended preference). Edit the docstring on the VisualExecutionModel class to
explicitly say e.g. "Alias of VJEPA for backward compatibility; use VJEPA as the
canonical class name" (or "Alias of VJEPA retained for backward compatibility;
prefer VisualExecutionModel as the canonical name") so consumers understand
which name is preferred while keeping the alias semantics intact.
- Around line 64-71: The policy_query produced by the policy_query_head should
be L2-normalized to stabilize dot-product similarity; modify the code so the
output of policy_query_head (symbol: policy_query_head) is normalized to unit
length (e.g., via F.normalize(..., p=2, dim=-1) or an nn.functional call) before
being used for scoring, or alternatively normalize right where it is consumed in
planning.py (the call site that computes dot-product similarities) to ensure the
query vectors are unit-length for consistent cosine-like scoring.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 393db8d8-b351-42cb-bd83-31866f7f12e0

📥 Commits

Reviewing files that changed from the base of the PR and between 07b8481 and 52e879b.

📒 Files selected for processing (2)

README.md
models/vjepa/vjepa_model.py

coderabbitai · 2026-05-17T10:35:20Z

-*   **10B Scale ViT**: A massive Vision Transformer designed to capture high-density visual information.
+### 2) Geometric Inductive Biases
+- **Lie-group / equivariance-oriented layers** for transformation-aware latent features.
+- **Stiefel-manifold style orthogonality constraints/projections** to stabilize relational geometry.


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Hyphenate compound modifiers for readability.

At Line 37 and Line 74, compound modifiers read more cleanly with full hyphenation (e.g., Stiefel-manifold-style and Large-scale).

✏️ Proposed doc wording patch

-- **Stiefel-manifold style orthogonality constraints/projections** to stabilize relational geometry. +- **Stiefel-manifold-style orthogonality constraints/projections** to stabilize relational geometry. -- **Large scale profile (same model, 10B-class size target)**: `config/vjepa_10b.yaml` +- **Large-scale profile (same model, 10B-class size target)**: `config/vjepa_10b.yaml`

Also applies to: 74-74

🧰 Tools

🪛 LanguageTool

[grammar] ~37-~37: Use a hyphen to join words.
Context: ...re latent features. - **Stiefel-manifold style orthogonality constraints/projecti...

(QB_NEW_EN_HYPHEN)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@README.md` at line 37, Update the compound modifiers in the README by hyphenating them for readability: change the phrase "Stiefel-manifold style orthogonality constraints/projections" to "Stiefel-manifold-style orthogonality constraints/projections" and similarly change "Large scale" (or "Large-scale" variant at the other location) to "Large-scale" so both compound modifiers are fully hyphenated.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 52e879b0ad

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-17T10:36:08Z

+            # available_actions: (num_actions, action_dim)
+            # policy_query: (1, action_dim)
+            logits = torch.matmul(available_actions, policy_query.squeeze(0))
+            priors = F.softmax(logits / self.temperature, dim=0)


Preserve action index mapping when applying learned priors

After this change computes non-uniform priors (softmax(logits / temperature)), expansion starts appending children in prior-sorted order rather than original available_actions order. Later in _get_action_probabilities, visit counts are written by child list position and then best_idx is used to index available_actions, so the probability mass can be assigned to the wrong action whenever priors reorder candidates. This makes plan() return a different action than the one actually most visited under MCTS whenever the learned prior is not uniform.

Useful? React with 👍 / 👎.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a3e1075ebd

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-17T11:08:44Z

-        policy_logits = torch.zeros(1, device=state.device)  # placeholder
+        # Estimate action priors from a learned policy-query head.
+        pooled_next_state = next_state.mean(dim=1) if next_state.ndim > 2 else next_state
+        policy_query = self.model.policy_query_head(pooled_next_state)


Guard policy-query call for models without prior head

_expand() still has a uniform-prior fallback when policy_query_head is missing, but _imagine_future() now unconditionally calls self.model.policy_query_head(...), which raises AttributeError before that fallback can ever be used. This regresses compatibility with any MCTS caller that provides a model implementing only the documented predictor/value_head interface (or older checkpoints/test doubles), so planning now hard-fails instead of degrading to uniform priors.

Useful? React with 👍 / 👎.

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

models/adaptive_depth.py (1)

184-193: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Batch halting logic is tied to sample 0’s step count.

Line [185] uses new_steps[0].item() for all samples, so mixed-step batches can halt/continue incorrectly.

Suggested guard to prevent silent misbehavior

         with torch.no_grad():
             new_steps = new_steps + 1

             # Adaptive depth decision
+            if not torch.equal(new_steps, new_steps[:1].expand_as(new_steps)):
+                raise RuntimeError(
+                    "AdaptiveDepthWrapper expects synchronized `steps` across batch."
+                )
             halted, depth_info = self.depth_controller.should_continue(
-                step=new_steps[0].item(),  # use first sample's step as reference
+                step=int(new_steps[0].item()),  # safe only when synchronized
                 q_halt_logits=q_halt_logits,
                 q_continue_logits=q_continue_logits,
             )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@models/adaptive_depth.py` around lines 184 - 193, The batch halting call to
depth_controller.should_continue incorrectly uses a single scalar step from
new_steps[0].item(), causing all samples to share sample-0’s step; change the
call to pass per-sample step counts (e.g., the tensor/array of steps for the
batch) or loop per-sample so should_continue receives matching batch-shaped step
input. Update the invocation in the surrounding method (the call to
depth_controller.should_continue and any code that consumes
outputs["depth_info"]) to accept and propagate a batch-sized step input instead
of a single item so halting decisions are computed per-sample.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@AUDIT_RECHECK.md`:
- Around line 12-13: Update the audit steps that list bare script names by
invoking them explicitly with Python: replace "evaluate_world_model.py" and
"evaluate_perception.py" with "python evaluate_world_model.py" and "python
evaluate_perception.py" in AUDIT_RECHECK.md so the commands are reproducible
across environments without relying on executable bits or shebangs.

---

Outside diff comments:
In `@models/adaptive_depth.py`:
- Around line 184-193: The batch halting call to
depth_controller.should_continue incorrectly uses a single scalar step from
new_steps[0].item(), causing all samples to share sample-0’s step; change the
call to pass per-sample step counts (e.g., the tensor/array of steps for the
batch) or loop per-sample so should_continue receives matching batch-shaped step
input. Update the invocation in the surrounding method (the call to
depth_controller.should_continue and any code that consumes
outputs["depth_info"]) to accept and propagate a batch-sized step input instead
of a single item so halting decisions are computed per-sample.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: db573fc6-cf7a-4a64-84e0-5c4610ac96b1

📥 Commits

Reviewing files that changed from the base of the PR and between 52e879b and 4f6bdc8.

📒 Files selected for processing (5)

AUDIT_RECHECK.md
CODE_ADDRESS_INDEX.md
check_integrations.py
models/adaptive_depth.py
models/topological.py

✅ Files skipped from review due to trivial changes (1)

CODE_ADDRESS_INDEX.md

coderabbitai · 2026-05-17T11:46:08Z

+4. Ran world-model eval smoke (`evaluate_world_model.py`).
+5. Ran perception eval smoke (`evaluate_perception.py`).


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use explicit Python invocation for reproducible audit steps (Line 12, Line 13).

These two commands are documented as bare script names, unlike Line 11. That can fail in many environments unless executable permissions/shebangs are set. Use explicit python ... for consistency and reproducibility.

Suggested doc fix

-4. Ran world-model eval smoke (`evaluate_world_model.py`). -5. Ran perception eval smoke (`evaluate_perception.py`). +4. Ran world-model eval smoke (`python evaluate_world_model.py`). +5. Ran perception eval smoke (`python evaluate_perception.py`).

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

4. Ran world-model eval smoke (`evaluate_world_model.py`).

5. Ran perception eval smoke (`evaluate_perception.py`).

4. Ran world-model eval smoke (`python evaluate_world_model.py`).

5. Ran perception eval smoke (`python evaluate_perception.py`).

🧰 Tools

🪛 LanguageTool

[style] ~12-~12: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ke (python check_integrations.py). 4. Ran world-model eval smoke (`evaluate_world...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

[style] ~13-~13: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...l smoke (evaluate_world_model.py). 5. Ran perception eval smoke (`evaluate_percep...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@AUDIT_RECHECK.md` around lines 12 - 13, Update the audit steps that list bare script names by invoking them explicitly with Python: replace "evaluate_world_model.py" and "evaluate_perception.py" with "python evaluate_world_model.py" and "python evaluate_perception.py" in AUDIT_RECHECK.md so the commands are reproducible across environments without relying on executable bits or shebangs.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4f6bdc8d15

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-17T11:48:13Z

+    pt, ph, pw = cfg["encoder"]["patch_size"]
+    seq_len = (t // pt) * (h // ph) * (w // pw)
+    num_mask = max(1, seq_len // 4)
+    mask = torch.randperm(seq_len)[:num_mask]


Build a boolean patch mask before integration forward

apply_mask() expects mask to be boolean (True = masked), but this integration check builds mask as index positions via torch.randperm(...). With an integer mask, ~mask becomes negative indices instead of logical negation, so the model receives the wrong visible/masked token split while the script still reports success. This makes the check unreliable: masking bugs in the real training path can be missed because the harness is not exercising the same mask semantics.

Useful? React with 👍 / 👎.

Add phase-1 rigorous world-model evaluation harness and protocol

0db3f49

jeevesh415 added the codex label May 16, 2026 — with ChatGPT Codex Connector

codeant-ai Bot added the size:XXL This PR changes 1000+ lines, ignoring generated files label May 16, 2026

codeant-ai Bot reviewed May 16, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed May 16, 2026

View reviewed changes

codeant-ai Bot reviewed May 16, 2026

View reviewed changes

Add perception evaluation track and harden training bootstrap fallback

07b8481

coderabbitai Bot reviewed May 16, 2026

View reviewed changes

Unify naming under Visual Execution Model framework

52e879b

codeant-ai Bot added size:XXL This PR changes 1000+ lines, ignoring generated files and removed size:XXL This PR changes 1000+ lines, ignoring generated files labels May 17, 2026

coderabbitai Bot reviewed May 17, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed May 17, 2026

View reviewed changes

jeevesh415 added 3 commits May 17, 2026 16:20

Remove puzzle-era components and harden VEM integrations

9429fdf

Regenerate full line-level CODE_ADDRESS_INDEX after repo changes

a6414ad

Restore previously removed legacy modules and resync CODE_ADDRESS_INDEX

a3e1075

chatgpt-codex-connector Bot reviewed May 17, 2026

View reviewed changes

Add explicit post-PR audit record for conflict/integration recheck

4f6bdc8

coderabbitai Bot reviewed May 17, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed May 17, 2026

View reviewed changes

		parser.add_argument("--num-actions", type=int, default=32)
		parser.add_argument("--save-dir", default="eval_runs")

		4. Ran world-model eval smoke (`evaluate_world_model.py`).
		5. Ran perception eval smoke (`evaluate_perception.py`).

Conversation

jeevesh415 commented May 16, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

User description

Motivation

Description

Testing

Summary by CodeRabbit

CodeAnt-AI Description

What Changed

Impact

Checking Your Pull Request

Talking to CodeAnt AI

Example

Preserve Org Learnings with CodeAnt

Example

Retrigger review

Check Your Repository Health

Uh oh!

codeant-ai Bot commented May 16, 2026

Thanks for using CodeAnt! 🎉

Uh oh!

coderabbitai Bot commented May 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

codeant-ai Bot May 16, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 16, 2026

Choose a reason for hiding this comment

Uh oh!

codeant-ai Bot May 16, 2026

Choose a reason for hiding this comment

Uh oh!

codeant-ai Bot May 16, 2026

Choose a reason for hiding this comment

Uh oh!

codeant-ai Bot May 16, 2026

Choose a reason for hiding this comment

Uh oh!

codeant-ai Bot commented May 16, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot May 16, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 16, 2026

Choose a reason for hiding this comment

Uh oh!

codeant-ai Bot commented May 17, 2026

Thanks for using CodeAnt! 🎉

Uh oh!

codeant-ai Bot commented May 17, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 17, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 17, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

jeevesh415 commented May 16, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 16, 2026 •

edited

Loading