
Audit: Safety and Uncertainty - Himanshu Gupta #76

Open

himanshugupta1009 wants to merge 9 commits into main from audit/himanshugupta1009-safety_and_uncertainty

Conversation

@himanshugupta1009 (Contributor):

This audit focuses on safety and validation for VLAs.
It has an in-depth discussion of three papers:

  1. Foundation Models for Rapid Autonomy Validation
  2. SAFE: Multitask Failure Detection for Vision-Language-Action Models
  3. Counterfactual VLA: Self-Reflective Vision-Language-Action Model with Adaptive Reasoning


github-actions bot commented Mar 18, 2026

❌ Engineering Standards Check Failed

Your audit does not meet the required "Senior Staff" engineering standards. Please address the following issues before requesting instructor review:

Common Issues:

1. Required Frontmatter Fields

  • Every audit MDX file must include these fields:
    • title: Paper title
    • author: Paper author(s)
    • topic: Research topic/category
    • paper: Link to paper or citation
  • All fields must have non-empty values (no placeholders like "TBD" or "TODO")

2. Semantic Line Breaks

  • Each sentence should be on its own line
  • This makes PR commenting and reviewing much easier
  • Example:
    - This is a very long sentence with multiple ideas. It continues on the same line. This makes PR review difficult.
    + This is a sentence on its own line.
    + Each idea gets its own line.
    + This makes PR review much easier.

3. Clean Git History

  • No "Merge branch 'main'" commits allowed
  • Use git rebase main instead of git merge main
  • Keep your commit history linear and clean

How to Fix:

For semantic line breaks:

  1. Edit your MDX file to put each sentence on a new line
  2. Commit and push the changes

For git history:

  1. Run: git rebase main (or git rebase staging)
  2. Resolve any conflicts if needed
  3. Run: git push --force-with-lease

The linter will run automatically on your next push. Once all checks pass, your preview will deploy.

himanshugupta1009 and others added 8 commits March 17, 2026 23:53
Revised the audit document on Safety and Uncertainty in VLA systems, adding detailed sections on problem statement, safety decomposition, types of uncertainty, and analysis of three papers related to VLA safety mechanisms.
Expanded the section on SAFE — Multitask Failure Detection, detailing its methodology, strengths, limitations, and overall verdict on its effectiveness as a failure detection mechanism.
…soning

This paper introduces a framework for incorporating counterfactual reasoning into Vision-Language-Action (VLA) policies. It details a method for self-reflective correction in decision-making processes, emphasizing the importance of meta-actions and their refinement through counterfactual supervision.

At its core, the paper addresses autonomy validation as a **computationally constrained estimation problem**.

Let $D = \{x_1, x_2, \ldots, x_N\}$ denote a large dataset of real-world driving logs. For each scenario $x_i$, evaluating a policy requires running a high-fidelity simulation to obtain an outcome $y_i = f(x_i)$, where $f(\cdot)$ captures metrics such as collision occurrence or task success. The true objective is to estimate a global performance metric over the dataset, for example:

$$
\mu = \frac{1}{N} \sum_{i=1}^{N} f(x_i).
$$
Contributor:

Does f capture a single metric at a time or does it capture multiple metrics at once? In the latter case, how are those metrics broken up, weighted, or scored within f?

This reframing reveals that the method is fundamentally a form of **biased Monte Carlo estimation**, where sampling is no longer uniform but guided by learned representations and difficulty scores. The quality of the final estimate $\hat{\mu}$ therefore depends critically on two factors:
(i) whether the embedding and clustering preserve the structure of the scenario distribution relevant to safety, and
(ii) whether the sampling and weighting scheme correctly compensates for the induced bias.
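The second factor can be made concrete with a minimal sketch (all data synthetic and difficulty scores hypothetical; this illustrates biased Monte Carlo with importance weights, not the paper's implementation):

```python
import random

random.seed(0)

# Hypothetical stand-in for f(x): 1 if a scenario ends in failure, else 0.
# Failure probability is loosely tied to a per-scenario difficulty score d(x).
N = 10_000
d = [random.random() for _ in range(N)]
f = [1 if random.random() < 0.1 * di else 0 for di in d]

mu_true = sum(f) / N  # exhaustive validation: the quantity we want to avoid computing

# Difficulty-guided (biased) sampling: p_i proportional to d_i, corrected by
# importance weights w_i = 1 / (N * p_i) so the estimator stays unbiased.
total_d = sum(d)
p = [di / total_d for di in d]
n = 500
idx = random.choices(range(N), weights=p, k=n)
mu_hat = sum(f[i] / (N * p[i]) for i in idx) / n

# An unweighted average over the same biased sample overrepresents hard
# scenarios and therefore tends to overestimate the failure rate.
mu_naive = sum(f[i] for i in idx) / n
print(mu_true, mu_hat, mu_naive)
```

With the weights in place the estimator targets the full-dataset mean; dropping them recovers exactly the bias discussed in (ii).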

Contributor:

How does the subset selection strategy behave under policy iteration, given that the sampling distribution is influenced by the current policy and may shift over time?


> **Implicit Assumption:** A reconstruction-trained embedding preserves the aspects of a scenario that are relevant for validation and safety.

However, masked autoencoding optimizes for **reconstruction fidelity**, not **causal or safety-critical features**. As a result, the embedding may emphasize visually or statistically dominant patterns while underrepresenting rare but safety-critical configurations. Two scenarios that are visually similar but causally distinct (e.g., different intent of nearby agents) may be mapped close together, while causally similar but visually different scenarios may be separated.
Contributor:

How can a reader begin to understand the geometric make-up of this embedding space? Are there safety guarantees if the geometry of the embedding space looks one way versus a different way? How do we detect the issue mentioned in this paragraph?

- **Coverage Sufficiency:** Sampling across clusters, combined with difficulty-aware prioritization, is sufficient to capture both typical and rare safety-critical scenarios.

> None of these assumptions can be independently verified without performing the very exhaustive validation that the method seeks to avoid.

Contributor:

It might be useful to distinguish between assumptions that are empirically testable and those that are fundamentally unverifiable without full-scale evaluation. This would clarify which risks can realistically be mitigated in practice.

$$
d(x_i) \approx \mathbb{P}(\text{failure} \mid x_i).
$$

This score is used as a proxy for scenario importance, enabling the system to prioritize scenarios that are more likely to expose policy weaknesses.
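As a toy illustration of fitting such a score from simulated outcomes, a logistic model on a synthetic 1-D feature can stand in for the paper's MLP over learned embeddings (everything here is hypothetical):

```python
import math
import random

random.seed(1)

# Toy stand-in: a 1-D scenario feature x with binary failure labels whose true
# failure probability increases with x. A logistic model (substituting for the
# paper's MLP over learned embeddings) is fit by full-batch gradient descent.
xs = [random.uniform(-3, 3) for _ in range(2000)]
ys = [1 if random.random() < 1 / (1 + math.exp(-(1.5 * x - 1))) else 0 for x in xs]

w, b, lr = 0.0, 0.0, 0.1
for _ in range(300):
    gw = gb = 0.0
    for x, y in zip(xs, ys):
        p = 1 / (1 + math.exp(-(w * x + b)))
        gw += (p - y) * x
        gb += (p - y)
    w -= lr * gw / len(xs)
    b -= lr * gb / len(xs)

def d(x):
    """Estimated difficulty score: d(x) ~ P(failure | x)."""
    return 1 / (1 + math.exp(-(w * x + b)))

# Harder scenarios (larger x) should receive higher difficulty scores.
print(d(-2.0), d(2.0))
```

The scorer only captures whatever failure signal exists in the training labels, which is exactly why the choice of label (collisions vs. task failures) matters.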
Contributor:

I'm a little confused as to how just an MLP can understand the likelihood of failure. Are these task failures or just collisions? If a vehicle was headed for a failure but recovered at the last moment, how well would this translate to the difficulty score?

Collaborator:

good q, the signal is ONLY "simulated collisions". Other surrogate metrics for collisions (one of the most popular was "post encroachment time") are not well correlated with collisions.


This assumption breaks down in precisely the regimes that matter most:

- rare edge cases may be diluted within clusters,
Contributor:

Some might argue that if a rare edge case is diluted within clusters, then there aren't enough clusters to capture the important information. Is there any intuition on how many clusters to use or how to measure if this is a problem?

Collaborator @crheckman left a comment:

first pass through rapid validation paper


However, unlike classical robotics pipelines, these models:

* lack explicit state estimation guarantees,
Collaborator:

state estimation guarantees in my experience haven't been too useful. "guarantee" on state estimate is conditioned on (at best) exponential noise distributions and ad-hoc outlier removal techniques. many steps of modern state estimation processes are still heuristic (eg loop closures). might be better to say "lack explicit metric state estimates, reducing introspectability" or something like this?


* lack explicit state estimation guarantees,
* rely on learned representations under distribution shift,
* and produce decisions that are difficult to verify.
Collaborator:

verify through formal methods? are not guaranteed to be correct by construction?

"difficult to verify" seems like it could apply to any decisions that aren't produced through a very specifically crafted system that has the property of verifiable.


This work addresses a real and important bottleneck: the prohibitive cost of large-scale autonomy validation. By reframing validation as a subset selection problem, it provides a practical mechanism for scaling evaluation under limited computational budgets. However, the approach replaces exhaustive validation with a **structured approximation whose reliability is fundamentally assumption-dependent**. It provides no mechanism for ensuring coverage of rare or worst-case events, does not preserve causal structure, and introduces policy-dependent biases into the validation process.

> **I would not trust this as a primary validation pipeline for safety-critical systems.**
Collaborator @crheckman commented Mar 24, 2026:

It is always easy to decide not to save money and just test everything. However it is an executive's decision whether the costs outweigh the risks, or vice versa. In this case, one can save tens of millions of dollars a year by not testing miles that are essentially the vehicle idling or driving on an empty road. We can redirect those tens of millions of dollars into more validation analysts who create and run targeted tests before release.

Inevitably, your decision is a judgment call that you are not in the seat of making, because it comes down to dollars and cents, or priorities among a broader validation scheme.

The most important question you need to ask yourself if you say "don't do this" is "what can we do to make it where I would be more comfortable doing this?" What is missing from this framework to have it achieve the extrapolative capability you're seeking?


## 6. Verdict

This work addresses a real and important bottleneck: the prohibitive cost of large-scale autonomy validation. By reframing validation as a subset selection problem, it provides a practical mechanism for scaling evaluation under limited computational budgets. However, the approach replaces exhaustive validation with a **structured approximation whose reliability is fundamentally assumption-dependent**. It provides no mechanism for ensuring coverage of rare or worst-case events, does not preserve causal structure, and introduces policy-dependent biases into the validation process.
Contributor:

This is interesting - to tackle the problem of safety validation in models which are considered black-box in their outputs (i.e. we can't statistically characterize their failure modes), we make statistical assumptions during the validation step so we can effectively validate more rigorously and get a better safety metric...? This seems a bit unintuitive to me as I understand it, at least. Which dataset was used in particular, and has there been testing with external / new datasets? How does this estimator perform in terms of generalization to driving scenarios?


As a result:
- the guarantees are valid only with respect to the distribution of $s_t$ on the calibration data,
- they do not account for errors introduced by embedding, temporal compression, or scalar collapse,
Contributor:

I think you state this more accurately below, where CP is reliant on the quality of the representation. I think it's misleading to state that CP doesn't take into account all the things you list here. Rather, the guarantee holds regardless, BUT a poor representation forces CP to generate overly conservative or useless safety bounds. (If that distinction makes sense?)

To provide statistical guarantees, the method incorporates a **functional conformal prediction (CP)** framework. Using a held-out calibration dataset, a threshold $\tau$ is computed such that, for a chosen significance level $\alpha$, the probability of failing to detect a failure is bounded. At inference time, a state is flagged as unsafe whenever:

$$
s_t \geq \tau.
$$
Contributor:

Are these two things easily comparable? Like is $\tau$ a vector with the same variables as the state and each vector in that variable has to be greater than or equal to that of the state vector variable? I'm a little confused about how this comparison holds.

Contributor:

Oh wait, later on it shows that s_t is a scalar score, not the state vector at time t...

Collaborator @cKohl10 left a comment:

Audit feedback and questions. Great job Himanshu!

All three papers can be mapped to:

$$
\text{Perception} \rightarrow \text{Latent Representation} \rightarrow \text{Reasoning} \rightarrow \text{Action} \rightarrow \text{Safety Layer}
$$
Collaborator:

Is this a unifying architecture? Can the safety constraints be baked into the latent representation somehow or is this not practical because the latent space is less interpretable?


> **Implicit Assumption:** A reconstruction-trained embedding preserves the aspects of a scenario that are relevant for validation and safety.

However, masked autoencoding optimizes for **reconstruction fidelity**, not **causal or safety-critical features**. As a result, the embedding may emphasize visually or statistically dominant patterns while underrepresenting rare but safety-critical configurations. Two scenarios that are visually similar but causally distinct (e.g., different intent of nearby agents) may be mapped close together, while causally similar but visually different scenarios may be separated.
Collaborator:

Is this assuming that intent cannot be inferred from temporal sequences of data? Is this a verifiable limitation of the masked autoencoding formulation?

Collaborator @crheckman left a comment:

through paper 2

$$
d(x_i) \approx \mathbb{P}(\text{failure} \mid x_i).
$$

This score is used as a proxy for scenario importance, enabling the system to prioritize scenarios that are more likely to expose policy weaknesses.
Collaborator:

good q, the signal is ONLY "simulated collisions". Other surrogate metrics for collisions (one of the most popular was "post encroachment time") are not well correlated with collisions.


where $f_\theta$ is a neural network trained to predict the likelihood of failure (e.g., task termination or unsafe outcome). This scalar score serves as a unified signal for failure detection across diverse environments.

To provide statistical guarantees, the method incorporates a **functional conformal prediction (CP)** framework. Using a held-out calibration dataset, a threshold $\tau$ is computed such that, for a chosen significance level $\alpha$, the probability of failing to detect a failure is bounded. At inference time, a state is flagged as unsafe whenever:
Collaborator:

So this method is also data-derived and only true for in-distribution (the held-out calibration dataset in this case).

$$
\mathbb{P}\big(s_t < \tau \mid \text{failure}\big) \leq \alpha
$$

under standard exchangeability assumptions.

Collaborator:

They may cluster in the learned representation space, but they are not isolated to that space. What fraction of failures are in a given fractional polyhedral closure of the latent space? Does it ever get above 0.99 for a fractional closure less than 0.5, say?

$$
\text{decision}(s_t) =
\begin{cases}
\text{flag unsafe}, & s_t \geq \tau \\
\text{continue}, & s_t < \tau
\end{cases}
$$

where $\tau$ is obtained via conformal calibration to enforce a desired error rate.
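For intuition, a split-conformal calibration of such a $\tau$ can be sketched as follows (synthetic calibration scores; the flagging direction follows the $s_t \geq \tau$ rule above, and the paper's functional variant is more involved):

```python
import math
import random

random.seed(2)

# Split-conformal calibration of a scalar threshold tau (synthetic scores).
# Calibration set: failure scores s_t observed on held-out failing rollouts.
# Goal: P(s_t < tau | failure) <= alpha, i.e. at most an alpha fraction of
# failures are missed, under exchangeability of calibration and test scores.
cal_scores = sorted(random.gauss(2.0, 0.5) for _ in range(999))

alpha = 0.1
# Conformal quantile: the floor((n + 1) * alpha)-th smallest calibration score
# (a lower quantile, since low scores on failing rollouts are the misses).
k = math.floor((len(cal_scores) + 1) * alpha)
tau = cal_scores[k - 1]

def flag_unsafe(s):
    return s >= tau

miss_rate = sum(not flag_unsafe(s) for s in cal_scores) / len(cal_scores)
print(tau, miss_rate)
```

The guarantee is distribution-free but only relative to the calibration distribution, which is exactly the in-distribution caveat raised in the comments.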
Contributor:

Does this formulation come from other engineering safety analysis literature/industry best practice? For example, do carmakers use this for safety system validation? I'm just curious what the roots of this method are.


The pipeline reduces a large validation dataset to a small subset through a sequence of learned transformations. While each step appears reasonable in isolation, their composition introduces a systematic loss of information about the underlying data distribution, particularly along dimensions that are critical for safety.

### Loss of Causal Structure
Contributor:

lack of causality is a limitation of deep learning in general. To make this critique more actionable, I'd make it more explicit how this limits the statistical estimator or suggest somewhere to integrate causality in the architecture

* Reasoning attempts to avoid failures

Each layer addresses a different point of failure, but none provides complete coverage in isolation.

Contributor:

There is an interesting connection to recent work on representation stability in neural architectures (e.g., attention/residual dynamics), which suggests that some safety issues may originate not only at the system level, but also from instability in how information is encoded and propagated within the model.


The method relies on embeddings $e_t = \phi(x_t)$ extracted from a pretrained VLA model. These embeddings are optimized for perception and control, not for identifying failure modes.

As a result:
Contributor:

I'm curious, do these bullet points mean that we should look for new methods and scrap this methodology entirely? Or is it just that we need to adjust these methods and build on them to improve these existing issues?

$$
\delta_t = \mu_t + h_t.
$$

Thus, failure is detected whenever the observed score exceeds the upper bound of the conformal band.
Contributor:

Do they ever use any failed trajectories for learning? If you only use successful trajectories, I feel that you pick up a specific way of getting to a successful state.


> **Failure detection depends on structured, temporal, and potentially multimodal dynamics, yet the method ultimately reduces this information to a thresholded scalar prediction.**

The effectiveness of the approach therefore hinges on whether these successive reductions, from raw state to embedding, from sequence to summary, and from summary to scalar, preserve the aspects of the trajectory that are critical for identifying safety-relevant outcomes.
Contributor @jt7347 commented Mar 24, 2026:

Just to check my understanding: the model only sees the trajectory and resultant outcome? No knowledge of the decision-making process (policy)? It seems this then becomes a problem of learning rich enough representations; one trajectory-and-outcome pair could fail for more than one reason, and could also be rolled out (i.e., the robot chose this trajectory) for multiple reasons, unless the planner is somewhat deterministic. The results showed some clustering of states to failure outcomes, but what do you do if a set of states leads to multiple failure outcomes? Or is it purely states = failure, and we don't look much into the type of failure? (I know you mentioned "systematic loss of structure"; I'm curious about this as well.)


This paper addresses the problem of **failure detection in Vision-Language-Action (VLA) models**, where the objective is to identify, at runtime, whether a policy is likely to fail given its current state and observations. Rather than relying on task-specific heuristics or hand-engineered safety rules, the authors propose a **learned, model-agnostic framework** for detecting failures across multiple tasks.

The core approach maps the agent’s state at time $t$, denoted by $x_t$, to a latent embedding $e_t = \phi(x_t)$ derived from a pretrained VLA model, and then computes a scalar failure score:

$$
s_t = f_\theta(e_t).
$$
Contributor:

Does the embedding encode uncertainty or temporal dependencies?


---

- **Scalar Adequacy:**
Contributor:

This is just a note on expediency ;)

You make this comment about collapsing failure mechanisms into a single 1-D signal several times throughout the SAFE audit. Might be some editing down opportunities here

Collaborator @crheckman left a comment:

partway through third paper


Under this formulation, the problem becomes:

> **How can a model learn to generate better decisions by observing how its reasoning should have been different?**
Collaborator:

Seems similar to DeepSeek-R1 (kitchen sink unmitigated garbage version) --- where "Oh!" and "Wait." were strewn about the intermediate outputs to indicate that there was an error made by following the autoregressor's priors too far.


- the model only learns corrections that are represented in the data,
- rare or novel failure modes may not have corresponding corrective examples,
- reasoning generalization is limited to patterns seen during training.
Contributor:

Maybe I'm just vigorously agreeing with you, but I don't see how this issue is surmountable. If correction examples are available (specifically for originally OOD scenarios), why not include them in the original training data? What's the point of trying to correct the reasoning this way (especially given how bizarre reasoning traces can be)?


under standard exchangeability assumptions.

A key empirical observation motivating the approach is that failure states tend to **cluster in the learned representation space**, even across different tasks. The authors visualize this using t-SNE, suggesting that failures share common structure that can be captured by a unified scoring model.
Contributor:

Is clustering correlational?


This paper reframes autonomy validation as a **subset selection problem under severe computational constraints**. Evaluating the performance of a VLA policy across all testing scenarios is both computationally and monetarily expensive, making it impractical to perform exhaustive validation over large-scale driving logs after every policy change.

To address this, the authors propose selecting a small subset of scenarios that can stand in for the full dataset. Their approach clusters scenarios in a learned embedding space to preserve diversity and prioritizes those with higher estimated failure likelihood to increase evaluation efficiency. The resulting pipeline trades exhaustive validation for **statistical estimation**, aiming to approximate full-dataset safety metrics from a carefully chosen subset of simulations.
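A minimal end-to-end sketch of that pipeline on synthetic data (bucketing a scalar "embedding" stands in for k-means, and the cluster-proportional weights are one common stratified-sampling choice, not necessarily the paper's):

```python
import random

random.seed(4)

# 1) "embed" scenarios, 2) cluster the embedding space, 3) draw a few scenarios
# per cluster, 4) reweight so the subset estimate targets the full-data metric.
# Bucketing a scalar embedding stands in for k-means; f is a simulated
# pass/fail outcome correlated with the embedding.
N, K, m = 5000, 5, 40
emb = [random.random() for _ in range(N)]
f = [1 if random.random() < 0.1 * e else 0 for e in emb]

clusters = {k: [] for k in range(K)}
for i, e in enumerate(emb):
    clusters[min(int(e * K), K - 1)].append(i)

# Stratified estimate: within-cluster sample means weighted by cluster size,
# i.e. per-scenario weights w_i = |C_k| / (N * m_k) for i sampled from C_k.
mu_hat = 0.0
for members in clusters.values():
    if not members:
        continue
    sample = random.sample(members, min(m, len(members)))
    mu_hat += (len(members) / N) * (sum(f[i] for i in sample) / len(sample))

mu_true = sum(f) / N
print(mu_true, mu_hat)
```

Under these assumptions each sampled scenario effectively carries weight $w_i = |C_k| / (N \cdot m_k)$, which is one concrete possibility for what the $w_i$ in the estimator could be.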
Contributor:

No mention of minimum dataset size or subset size; this information is valuable to anyone trying to use this approach. Also, is it always simulation, or is the policy ever rolled out on real hardware? It is not yet clear how the paper approaches this.


> **Implicit Assumption:** A reconstruction-trained embedding preserves the aspects of a scenario that are relevant for validation and safety.

However, masked autoencoding optimizes for **reconstruction fidelity**, not **causal or safety-critical features**. As a result, the embedding may emphasize visually or statistically dominant patterns while underrepresenting rare but safety-critical configurations. Two scenarios that are visually similar but causally distinct (e.g., different intent of nearby agents) may be mapped close together, while causally similar but visually different scenarios may be separated.
Contributor:

I suggest a small example to cap off this component


> **Implicit Assumption:** Failure likelihood under the current policy is a reliable indicator of scenario importance.

This introduces a strong **policy dependence**. The difficulty model reflects the behavior of a specific policy, biasing the selection process toward known failure modes while potentially neglecting scenarios that are not currently problematic but may become critical after future policy updates.
Contributor:

Could you give one example of the failure modes they focus on?


However, clustering operates on learned representations rather than causal structure. As a result:

- embedding-only clustering may group scenarios that look similar but fail for different reasons or split scenarios that share the same underlying risk factors,
Contributor:

Again, I would suggest an example if the researchers clearly motivate these components with examples.

Comment on lines +218 to +223
The validity of the proposed approach depends on several critical assumptions. These are not peripheral, but structural. If any of them fail, the reliability of the validation pipeline is compromised.

- **Embedding Fidelity:** The learned representation preserves the aspects of scenarios that determine safety-relevant outcomes.
- **Cluster Representativeness:** Scenarios within a cluster are sufficiently homogeneous such that a small number of samples can represent the entire cluster.
- **Difficulty Generalization:** The learned difficulty score accurately reflects intrinsic scenario risk, rather than being tied to the current policy’s behavior.
- **Coverage Sufficiency:** Sampling across clusters, combined with difficulty-aware prioritization, is sufficient to capture both typical and rare safety-critical scenarios.
Contributor:

Which of these assumptions seems the most dangerous? I feel like Coverage Sufficiency is the closest to breaking the system.


Collectively, these failure points are not independent. Each stage of the pipeline reduces the structure of the original trajectory, and the final decision operates on a highly compressed signal.

> **The same reductions that enable efficient, general-purpose failure detection also define what failures are visible to the system—and which ones are systematically ignored.**
Contributor:

Do they discuss any methods of determining or exploring systematically ignored samples via changing the parameters in their components?


As a result, the policy does not evaluate “what would happen if I take action $a$” at inference time. Instead, it attempts to internalize patterns of **how reasoning should change** to produce better outcomes.

In this sense, the method replaces explicit counterfactual evaluation with a form of **amortized self-correction**, where the model learns to generate improved meta-actions without ever directly observing the consequences of its own interventions.
Contributor:

This is very interesting. How can the model evaluate its ability to improve reasoning if it never sees the rollout of its modified reasoning traces?


> **Implicit Assumption:** Imitating refined meta-actions is sufficient to internalize the reasoning process.

The model is optimized to reproduce corrected outputs, but is not explicitly trained to ensure that the intermediate reasoning $r_t$ or the refinement process is causally valid or consistent across environments.
Contributor:

What problems can this induce? Do they discuss this at all in the paper?


There is no external mechanism that evaluates whether the reasoning or refinement is correct with respect to the environment. The model can only adjust its outputs based on patterns learned from data, not by verifying the consequences of its intermediate decisions.

As a result, correction is limited by the model’s ability to override its own prior tokens, rather than by any explicit grounding in causal outcomes.
Contributor:

I tend to agree that we want our models to ground their reasoning in causal outcomes. However, there still might be merit in creating proper reasoning in the first place and this method may be usable on a model that already understands how to generate causal reasoning (such as Alpamayo).


## 3. Method Dissection

The proposed method augments a VLA policy by introducing counterfactual supervision over a **multi-stage decision pipeline**. Rather than directly mapping observations to actions, the model generates and refines an intermediate meta-action through an explicit reasoning process before producing the final behavior.
Contributor:

There is no mention of the time aspect required to perform counter factual reasoning. Timing results are required to determine if this method is usable on my robotic system.

$$
\hat{\mu} = \frac{1}{|S|} \sum_{x_i \in S} w_i f(x_i),
$$

where $w_i$ are weights used to approximate the full-dataset metric.
Contributor:

What exactly are these $w_i$, and how are they computed? I only see it mentioned here twice, not sure how relevant it actually is, though, so you might not need to specify/expand
