
Integrate Visual Execution Model (VEM): planning priors, eval harness, docs, and repo index #5

Open
jeevesh415 wants to merge 1 commit into main from codex/review-project-vision-status-xbvh3n

Conversation

@jeevesh415
Owner

@jeevesh415 jeevesh415 commented May 17, 2026

User description

Motivation

  • Unify V-JEPA + HRM into a single Visual Execution Model (VEM) stack and restore important repository artifacts removed accidentally.
  • Improve latent planning by replacing placeholder priors with a learned action-prior query to yield model-informed MCTS search.
  • Provide lightweight, reproducible evaluation and integration smoke checks to satisfy Phase-1 rigorous development gates.

Description

  • Renamed / reworked top-level README to present the repository as the "Visual Execution Model (VEM)" and document training/eval workflows and configs.
  • Added repository index and audit artifacts: CODE_ADDRESS_INDEX.md, AUDIT_RECHECK.md, and RECTIFICATION_STATUS.md to record line-level mapping, recheck steps, and rectification summary.
  • Introduced evaluation and integration harnesses: evaluate_world_model.py, evaluate_perception.py, and check_integrations.py for smoke metrics and wiring checks.
  • Implemented planning/action-prior changes: added policy_query_head to models/vjepa/vjepa_model.py, updated models/vjepa/planning.py to return/use policy-query vectors and compute action priors by similarity, and preserved backward-compatible VisualExecutionModel alias.
  • Restored and/or synchronized important training and dataset components and added practical fixes: vjepa_train.py now accepts --config, detects ffmpeg before generating synthetic videos, reads the epoch count from training.epochs (defaulting to 100), and has improved data-directory handling.
  • Small model fixes and robustness changes: models/adaptive_depth.py now constructs returned carry using type(carry) instead of a concrete class, models/topological.py fixes shapes and pooling logic for Betti estimates, and other wiring updates across models/* referenced by docs and the index.
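
The ffmpeg detection mentioned above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in vjepa_train.py: the helper names and the `generate` callable are hypothetical, and only `shutil.which` is a real API.

```python
import shutil


def ffmpeg_available() -> bool:
    """Return True if an `ffmpeg` executable is found on PATH."""
    return shutil.which("ffmpeg") is not None


def maybe_generate_synthetic_videos(generate, data_dir: str) -> bool:
    """Call `generate(data_dir)` only when ffmpeg is present.

    `generate` is a hypothetical callable standing in for whatever
    synthetic-video routine the training script uses.
    Returns True if generation ran, False if it was skipped.
    """
    if not ffmpeg_available():
        print(f"ffmpeg not found; skipping synthetic video generation in {data_dir}")
        return False
    generate(data_dir)
    return True
```

Skipping with a message, rather than raising, matches the stated goal of fewer setup failures on machines without ffmpeg.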

Testing

  • Ran repository sanity checks: python -m compileall -q . completed successfully with no compile errors.
  • Executed integration smoke via python check_integrations.py which performed a forward pass and a small MCTS plan and printed an "Integration check passed" message.
  • Executed world-model smoke via python evaluate_world_model.py --config config/vjepa_micro.yaml --seed 42 which produced and saved JSON metrics (rollout drift, trajectory divergence, action-prior stats).
  • Executed perception smoke via python evaluate_perception.py --config config/vjepa_micro.yaml --seed 42 which produced and saved latent-consistency metrics; all scripted smoke checks completed without runtime failures.

Codex Task


CodeAnt-AI Description

Add VEM documentation, evaluation smoke checks, and learned planning priors

What Changed

  • Renamed the project to Visual Execution Model (VEM) in the main README and added clearer training, evaluation, and repo workflow guidance
  • Added lightweight world-model and perception evaluation scripts that save run reports and measure rollout drift, action consistency, and robustness to color, brightness, noise, and shifts
  • Updated latent planning to use a learned action-prior query instead of placeholder priors, so search can score candidate actions from model output
  • Improved training startup by allowing a config file to be passed in, using the epoch count from config, and skipping synthetic video generation when ffmpeg is unavailable
  • Added audit and rectification documents to record the recheck status and restored repository files
  • Fixed a couple of shape and carry-handling issues in supporting model code so model state is preserved correctly during execution
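
The carry-handling fix (building the returned carry with `type(carry)` instead of a concrete class) can be illustrated with a small sketch. The `Carry` and `ExtendedCarry` dataclasses and the `step` function are hypothetical stand-ins; the real carry type in models/adaptive_depth.py is not shown in this PR page.

```python
from dataclasses import dataclass


@dataclass
class Carry:
    hidden: float
    steps: int = 0


@dataclass
class ExtendedCarry(Carry):
    halted: bool = False


def step(carry: Carry) -> Carry:
    """Advance one step and rebuild the carry with the caller's own class."""
    fields = dict(vars(carry))
    fields["steps"] = carry.steps + 1
    # type(carry)(...) keeps subclasses intact; a hard-coded Carry(...)
    # would reject or drop the extra `halted` field.
    return type(carry)(**fields)
```

Passing an `ExtendedCarry` through `step` returns an `ExtendedCarry` with its extra state preserved, which is the behavior the fix is described as restoring.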

Impact

✅ Clearer model setup and training steps
✅ More reliable planning action ranking
✅ Fewer setup failures when ffmpeg is missing


@codeant-ai

codeant-ai Bot commented May 17, 2026

CodeAnt AI is reviewing your PR.



@coderabbitai

coderabbitai Bot commented May 17, 2026

Warning

Rate limit exceeded

@jeevesh415 has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 50 minutes and 36 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 60608cef-1dc7-4ab7-ad46-309c4e42d687

📥 Commits

Reviewing files that changed from the base of the PR and between 2c66f89 and 057e256.

📒 Files selected for processing (15)
  • AUDIT_RECHECK.md
  • CODE_ADDRESS_INDEX.md
  • README.md
  • RECTIFICATION_STATUS.md
  • check_integrations.py
  • docs/FRONTIER_GAP_ANALYSIS.md
  • docs/HUMAN_VISION_EXECUTION_EVAL_SPEC.md
  • docs/RIGOROUS_DEVELOPMENT_PROTOCOL.md
  • evaluate_perception.py
  • evaluate_world_model.py
  • models/adaptive_depth.py
  • models/topological.py
  • models/vjepa/planning.py
  • models/vjepa/vjepa_model.py
  • vjepa_train.py

@codeant-ai codeant-ai Bot added the size:XXL label (This PR changes 1000+ lines, ignoring generated files) May 17, 2026
Comment thread models/vjepa/planning.py
policy_logits = torch.zeros(1, device=state.device) # placeholder
# Estimate action priors from a learned policy-query head.
pooled_next_state = next_state.mean(dim=1) if next_state.ndim > 2 else next_state
policy_query = self.model.policy_query_head(pooled_next_state)

Suggestion: _imagine_future now unconditionally calls policy_query_head, but MCTS's model contract/documentation only requires a predictor and value head, and _expand still contains fallback logic for models without policy priors. This unconditional call will raise AttributeError for compatible models that do not define policy_query_head; gate this call with hasattr and return an empty/None query when unavailable. [api mismatch]

Severity Level: Major ⚠️
- ❌ MCTS planning crashes with user models lacking policy_query_head.
- ⚠️ Breaks compatibility with models that follow documented MCTS interface.
- ⚠️ Future integrations must add unused heads just to satisfy planner.
Steps of Reproduction ✅
1. Inspect the documented model contract in `MCTS.__init__` at
`models/vjepa/planning.py:120-137`, which states the `model` argument "must have predictor
and value_head" but does not mention a policy prior head.

2. Note that `_expand` is already defensive: it only uses `self.model.policy_query_head`
when `hasattr(self.model, "policy_query_head")` (see `models/vjepa/planning.py:251-255`),
and otherwise falls back to uniform priors (lines 263-266), meaning the planner is
intended to work even when the model lacks a policy query head.

3. Observe that `_imagine_future` now unconditionally calls
`self.model.policy_query_head(pooled_next_state)` at `models/vjepa/planning.py:186-188`,
and returns `(next_state, value, policy_query)` at `line 190`. This helper is used inside
`_expand` when creating children: `next_state, value, _ = self._imagine_future(node.state,
action)` at `lines 15-16` of the second chunk (`models/vjepa/planning.py:260-279`).

4. If a caller instantiates `MCTS` with a custom model that follows the documented
contract (exposes `.predictor.physics_engine` and `.value_head` but no
`.policy_query_head`), then on the first expansion `_imagine_future` will execute
`policy_query = self.model.policy_query_head(pooled_next_state)` (`line 188`), raising
`AttributeError: 'CustomModel' object has no attribute 'policy_query_head'` and causing
any planning call (e.g., `mcts.plan(...)` as used in `check_integrations.py:44-47`) to
fail for otherwise compatible models.
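
The `hasattr` gate suggested here can be sketched as follows. This is a framework-free illustration, not the planner's code: `imagine_future` is written as a free function, `model.predictor(state, action)` is a simplified stand-in for the real predictor call, and both model classes are hypothetical.

```python
from typing import Any, Optional, Tuple


def imagine_future(model: Any, state: Any, action: Any) -> Tuple[Any, Any, Optional[Any]]:
    """Predict next state and value; return a policy query only if the model has one."""
    next_state = model.predictor(state, action)
    value = model.value_head(next_state)
    policy_query = None
    if hasattr(model, "policy_query_head"):
        policy_query = model.policy_query_head(next_state)
    return next_state, value, policy_query


class MinimalModel:
    """Hypothetical model exposing only the documented predictor and value head."""

    def predictor(self, state, action):
        return state + action

    def value_head(self, state):
        return 0.5 * state


class ModelWithPrior(MinimalModel):
    """Hypothetical model that additionally provides a policy-query head."""

    def policy_query_head(self, state):
        return [state, -state]
```

Returning `None` lets the caller fall back to uniform priors, so a model that follows only the documented contract still plans without raising `AttributeError`.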


Comment thread models/vjepa/planning.py
Comment on lines +261 to +262
logits = torch.matmul(available_actions, policy_query.squeeze(0))
priors = F.softmax(logits / self.temperature, dim=0)

Suggestion: self.temperature can be zero (the planner already supports temperature == 0 in action selection), but this new prior computation divides by self.temperature unconditionally. That will produce inf/nan logits and unstable priors during expansion. Add the same zero-temperature handling here (e.g., skip scaling or use an argmax prior path) before calling softmax. [falsy zero check]

Severity Level: Major ⚠️
- ❌ MCTS planning with greedy (temperature 0) yields NaN priors.
- ⚠️ Deterministic evaluation runs can produce unstable, non-reproducible plans.
- ⚠️ Any future use of temperature=0 silently corrupts tree search behavior.
Steps of Reproduction ✅
1. Instantiate the MCTS planner with zero temperature using the VisualExecutionModel world
model (as in `check_integrations.py:15-20`):

   `mcts = MCTS(model=model, n_simulations=4, temperature=0.0)` (constructor at
   `models/vjepa/planning.py:139-157`).

2. Prepare a non-empty action set as done in `check_integrations.py:44-47` (e.g., `actions
= torch.randn(8, cfg.get("action_dim", 128))`) and a latent root state tensor with shape
`(1, D)` or `(1, seq, D)` (see `check_integrations.py:45`).

3. Call `mcts.plan(root_state, actions)` (`models/vjepa/planning.py:90-140`). Inside
`plan`, the root node is initialized (`line 111`) and `_expand(root, available_actions)`
is invoked for the initial expansion (`line 114`).

4. `_expand` at `models/vjepa/planning.py:222-262` computes the policy-query-based priors.
With a model that has `policy_query_head` (VJEPA at `models/vjepa/vjepa_model.py:64-71`),
it obtains `policy_query` (lines 252-255) and then executes `logits =
torch.matmul(available_actions, policy_query.squeeze(0))` and `priors = F.softmax(logits /
self.temperature, dim=0)` at `lines 261-262`. Since `self.temperature == 0.0`, `logits /
self.temperature` produces `inf`/`nan`, and PyTorch's numerically stable softmax over
`[inf, ...]` (after subtracting `max`) yields `nan` probabilities, giving invalid priors
and destabilizing the subsequent search (e.g., `priors.sort()` and `priors[idx].item()` at
`lines 11 and 21` in the same region operate on NaNs).
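
A zero-temperature guard for the prior computation might look like the sketch below. It uses plain Python lists in place of tensors so it runs without PyTorch; the function name and the choice of a one-hot (argmax) prior at `temperature <= 0` are assumptions, since the review only suggests "skip scaling or use an argmax prior path".

```python
import math
from typing import List


def action_priors(logits: List[float], temperature: float) -> List[float]:
    """Softmax over logits / temperature, with a one-hot fallback at temperature <= 0."""
    if not logits:
        raise ValueError("logits must be non-empty")
    if temperature <= 0.0:
        # Greedy limit: all prior mass goes to the highest-scoring action.
        best = max(range(len(logits)), key=logits.__getitem__)
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

The one-hot branch is the limit of the softmax as the temperature approaches zero from above, so it stays consistent with the greedy behavior the planner already has in action selection.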


Comment thread vjepa_train.py
default="config/vjepa_micro.yaml",
help="Path to YAML config file (e.g., config/vjepa_micro.yaml or config/vjepa_10b.yaml)",
)
args = parser.parse_args()

Suggestion: Using strict parse_args() can terminate the training script when launched via distributed runners that inject extra CLI flags (commonly --local-rank), which is likely in this file since it includes distributed/FSDP paths. Accept known launcher args or ignore unknown args to keep distributed startup working. [api mismatch]

Severity Level: Critical 🚨
- ❌ Multi-GPU FSDP training fails under torchrun/launch invocations.
- ⚠️ Users must avoid standard distributed launchers or patch script.
- ⚠️ CI or cluster jobs using launcher tooling will abort at startup.
Steps of Reproduction ✅
1. Confirm that `vjepa_train.py` is designed for distributed/FSDP training: it imports
`torch.distributed` and FSDP (`lines 7-9`), conditionally wraps the model in `FSDP` when
`dist.is_initialized()` and `world_size > 1` (`lines 137-143`), and uses distributed-aware
optimizers (`build_optimizer` at `lines 19-90`).

2. At the bottom of `vjepa_train.py`, note the CLI setup: the script defines only a
`--config` argument (`parser.add_argument("--config", ...)` at `lines 215-218`) and parses
arguments strictly with `args = parser.parse_args()` at `line 220`.

3. Launch training via a launcher that appends extra arguments to the script's command
line. The legacy `python -m torch.distributed.launch` does this with `--local-rank`
(`--local_rank` in older PyTorch releases); `torchrun` instead passes the rank through the
`LOCAL_RANK` environment variable and adds no flags. For example:

   `python -m torch.distributed.launch --nproc_per_node=2 vjepa_train.py --config config/vjepa_micro.yaml`.

4. When `vjepa_train.py` executes under such a launcher, `argparse.ArgumentParser.parse_args()`
at `line 220` receives a flag (e.g., `--local-rank=0`) that is not declared on the parser.
Argparse treats this as an error, prints an "unrecognized arguments" message, and exits
with `SystemExit`, preventing `train()` from running and effectively blocking distributed
training despite the FSDP logic being wired up.
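
One way to keep startup tolerant of extra launcher flags is `argparse`'s `parse_known_args`, sketched below. The `--config` option mirrors the snippet under review; the `parse_cli` wrapper and the warning message are illustrative and not the script's actual code.

```python
import argparse
from typing import List, Optional, Tuple


def parse_cli(argv: Optional[List[str]] = None) -> Tuple[argparse.Namespace, List[str]]:
    """Parse known options and return any unrecognized ones instead of exiting."""
    parser = argparse.ArgumentParser(description="VEM training (sketch)")
    parser.add_argument(
        "--config",
        default="config/vjepa_micro.yaml",
        help="Path to YAML config file",
    )
    # parse_known_args returns (namespace, leftovers) rather than raising
    # SystemExit on flags the parser does not declare.
    args, unknown = parser.parse_known_args(argv)
    if unknown:
        print(f"Ignoring unrecognized launcher arguments: {unknown}")
    return args, unknown
```

The trade-off is that misspelled options are also ignored rather than rejected, so logging the leftovers (as above) or declaring `--local-rank` explicitly may be preferable.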



@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 057e2568ad

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread models/vjepa/planning.py
Comment on lines +261 to +262
logits = torch.matmul(available_actions, policy_query.squeeze(0))
priors = F.softmax(logits / self.temperature, dim=0)

P1: Preserve action indices when using policy-query priors

Once priors are computed from policy_query, child expansion is no longer in the original available_actions order; however, _get_action_probabilities still maps visits back by child creation position (root.children.index(child)), not by the original action index. This can make action_probs (and therefore best_action) point to the wrong action whenever priors reorder candidates, which directly degrades planning correctness.
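
A minimal sketch of one possible fix is to store each child's original action index at expansion time and use it when mapping visits back. The `Child` dataclass and both helpers are hypothetical simplifications; the planner's real node class and `_get_action_probabilities` are not shown on this page.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Child:
    action_index: int  # position in the original available_actions
    prior: float
    visits: int = 0


def expand(priors: List[float]) -> List[Child]:
    """Create children in descending-prior order, remembering each original index."""
    order = sorted(range(len(priors)), key=lambda i: priors[i], reverse=True)
    return [Child(action_index=i, prior=priors[i]) for i in order]


def action_probabilities(children: List[Child], num_actions: int) -> List[float]:
    """Map visit counts back by stored action_index, not by child position."""
    total = sum(c.visits for c in children)
    probs = [0.0] * num_actions
    if total == 0:
        return probs
    for c in children:
        probs[c.action_index] = c.visits / total
    return probs
```

With priors `[0.1, 0.7, 0.2]`, the children are created in the order 1, 2, 0, yet the returned probabilities stay aligned with the caller's action order.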


Comment thread check_integrations.py
pt, ph, pw = cfg["encoder"]["patch_size"]
seq_len = (t // pt) * (h // ph) * (w // pw)
num_mask = max(1, seq_len // 4)
mask = torch.randperm(seq_len)[:num_mask]

P2: Construct a boolean full-length mask in integration check

apply_mask expects a boolean mask over all seq_len patches, but this script passes a shortened LongTensor of selected indices. In that case, ~mask becomes negative integer indexing and the model sees arbitrary patch selections instead of true visible/masked partitions, so this smoke test can report success even if masking integration is broken.
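
The difference between an index list and a full-length boolean mask can be sketched without PyTorch. The helpers below are hypothetical and use Python lists; in the script itself the equivalent would be along the lines of creating `torch.zeros(seq_len, dtype=torch.bool)` and setting the sampled indices to `True`.

```python
import random
from typing import List, Tuple


def make_boolean_mask(seq_len: int, num_mask: int, seed: int = 0) -> List[bool]:
    """Return a length-seq_len mask with exactly num_mask True (masked) entries."""
    rng = random.Random(seed)
    masked = set(rng.sample(range(seq_len), num_mask))
    return [i in masked for i in range(seq_len)]


def split_patches(patches: list, mask: List[bool]) -> Tuple[list, list]:
    """Partition patches into (visible, masked) using the mask and its logical negation."""
    visible = [p for p, m in zip(patches, mask) if not m]
    hidden = [p for p, m in zip(patches, mask) if m]
    return visible, hidden


# Bitwise NOT on an integer index is not a logical negation: ~3 == -4,
# which is why inverting an index tensor yields negative indices.
```

A boolean mask guarantees that the visible and masked sets are disjoint and together cover every patch, which is the property the smoke test is meant to exercise.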


Comment thread evaluate_perception.py
Comment on lines +58 to +63
model = VJEPA(
cfg["encoder"],
cfg["predictor"],
cfg["training"]["ema_momentum"],
action_dim=128,
).to(device).eval()

Suggestion: This evaluation also runs on a newly initialized network and never restores trained parameters, so robustness metrics will reflect random weights rather than actual model invariance. Load a checkpoint before computing latent-consistency scores. [incomplete implementation]

Severity Level: Major ⚠️
- ❌ Perception robustness eval script measures randomly initialized VJEPA weights.
- ⚠️ Phase-2 robustness gating cannot assess actual trained model invariance.
Steps of Reproduction ✅
1. From the repository root, run the perception eval harness: `python
evaluate_perception.py --config config/vjepa_micro.yaml` (entrypoint guarded by `if
__name__ == "__main__":` at evaluate_perception.py:88-89).

2. In `main()` (evaluate_perception.py:46-51), the script parses `--config`, `--seed`, and
`--save-dir`, then loads the YAML config into `cfg` at lines 55-56.

3. The model is instantiated at lines 58-63: `model = VJEPA(cfg["encoder"],
cfg["predictor"], cfg["training"]["ema_momentum"], action_dim=128).to(device).eval()`,
with no call to `load_state_dict()`, no checkpoint path argument, and no other
weight-loading logic anywhere in evaluate_perception.py.

4. A synthetic random `video` tensor is generated at lines 65-67 and passed to
`model.context_encoder` inside `latent_consistency()` (evaluate_perception.py:39-43), so
all latent consistency metrics written at lines 69-82 and printed at line 85 are computed
using a freshly initialized VJEPA with random weights rather than any trained checkpoint.


Comment thread evaluate_world_model.py
Comment on lines +100 to +105
available_actions = torch.randn(num_actions, 128, device=device)
pooled = traj_a[-1].mean(dim=1)
query = model.policy_query_head(pooled).squeeze(0)
logits = torch.matmul(available_actions, query)
probs = torch.softmax(logits, dim=0)
max_prior = float(probs.max().item())

Suggestion: --num-actions is used without validation, so passing 0 (or a negative value) will produce an empty/invalid action tensor and then fail at probs.max()/softmax path. Enforce num_actions > 0 before metric computation to avoid runtime crashes. [incorrect condition logic]

Severity Level: Major ⚠️
- ❌ World-model eval crashes when `--num-actions` is zero.
- ⚠️ Automated evaluation pipelines fail under invalid CLI input.
Steps of Reproduction ✅
1. From the repository root, run the world-model eval harness with zero actions: `python
evaluate_world_model.py --config config/vjepa_micro.yaml --num-actions 0` (CLI defined in
`main()` at evaluate_world_model.py:116-123).

2. `argparse` parses `--num-actions` into `args.num_actions` at lines 117-123, and
`main()` passes this value directly into `evaluate_metrics()` at lines 137-142 without any
validation or clamping.

3. Inside `evaluate_metrics()` (evaluate_world_model.py:76-107), `available_actions =
torch.randn(num_actions, 128, device=device)` at line 100 creates a tensor of shape `(0,
128)` when `num_actions == 0`, and `logits = torch.matmul(available_actions, query)` at
line 103 yields an empty logits tensor.

4. `probs = torch.softmax(logits, dim=0)` at line 104 returns an empty probability tensor,
and `probs.max().item()` at line 105 then raises a runtime error because max-reduction on
an empty tensor is undefined, causing the evaluation script to crash instead of producing
metrics.
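
Validation can happen at parse time with a custom `argparse` type, as in this sketch. The `positive_int` helper and the default of 8 are assumptions for illustration; the script's actual default for `--num-actions` is not shown here.

```python
import argparse


def positive_int(text: str) -> int:
    """argparse type: accept only integers >= 1."""
    value = int(text)
    if value < 1:
        raise argparse.ArgumentTypeError(f"expected a positive integer, got {value}")
    return value


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose --num-actions rejects zero and negative values."""
    parser = argparse.ArgumentParser(description="world-model eval (sketch)")
    parser.add_argument("--num-actions", type=positive_int, default=8)
    return parser
```

With this in place, `--num-actions 0` fails with a usage error before any tensors are created, instead of crashing later at `probs.max()`.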


Comment thread evaluate_world_model.py
Comment on lines +130 to +135
model = VJEPA(
config["encoder"],
config["predictor"],
config["training"]["ema_momentum"],
action_dim=128,
).to(device)

Suggestion: The script reports world-model evaluation metrics from a freshly initialized model because it never loads trained weights; this makes the output non-representative for regression gating or model quality checks. Add a checkpoint argument and load state dict before running metrics. [incomplete implementation]

Severity Level: Major ⚠️
- ❌ World-model metrics reflect untrained random network, not production.
- ⚠️ Regression gates misjudge model quality and temporal consistency.
Steps of Reproduction ✅
1. Run the world-model evaluation harness from the project root: `python
evaluate_world_model.py --config config/vjepa_micro.yaml` (entrypoint at
evaluate_world_model.py:165-166).

2. In `main()` (evaluate_world_model.py:116-123), the script parses CLI arguments and
loads the YAML config into `config` at lines 127-128, but it does not parse or accept any
checkpoint path parameter.

3. The model is constructed at lines 130-135: `model = VJEPA(config["encoder"],
config["predictor"], config["training"]["ema_momentum"], action_dim=128).to(device)`, with
no corresponding `load_state_dict()`, `torch.load()`, or other weight-restore call
anywhere in evaluate_world_model.py.

4. `evaluate_metrics()` (evaluate_world_model.py:76-113) then uses a random latent `z0`
(lines 80-82) and random action sequences (lines 84-85) together with this freshly
initialized VJEPA to compute rollout drift, trajectory divergence, and action prior
metrics, so the JSON manifest written at lines 144-159 and printed at 161-162 reports
metrics for a random, untrained network rather than the trained model that
evaluation/regression gating is intended to check.
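
An optional checkpoint argument could be wired in roughly as sketched below. Everything here is an assumption for illustration: the `--checkpoint` flag name, the helper names, and the checkpoint being a bare state dict. `load_fn` stands in for a loader such as `torch.load`, injected so the sketch runs without PyTorch.

```python
import argparse
from typing import Any, Callable, Optional


def add_checkpoint_arg(parser: argparse.ArgumentParser) -> None:
    """Register an optional --checkpoint flag on an existing parser."""
    parser.add_argument(
        "--checkpoint",
        default=None,
        help="Path to trained weights; omit to run a wiring-only smoke test",
    )


def restore_weights(
    model: Any, checkpoint: Optional[str], load_fn: Callable[[str], dict]
) -> str:
    """Load a state dict into `model` when a checkpoint is given.

    Returns a label describing which weights the metrics were computed
    with, so it can be written into the saved report.
    """
    if checkpoint is None:
        return "random_init"
    model.load_state_dict(load_fn(checkpoint))
    return checkpoint
```

Recording the returned label in the JSON report would make it explicit whether a given set of metrics came from trained weights or from a randomly initialized smoke run.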


@codeant-ai

codeant-ai Bot commented May 17, 2026

CodeAnt AI finished reviewing your PR.
