
Audit: Safety and Uncertainty - Himanshu Gupta #76

Open

himanshugupta1009 wants to merge 9 commits into main from audit/himanshugupta1009-safety_and_uncertainty

Conversation

@himanshugupta1009 (Contributor):

This audit focuses on safety and validation for VLAs.
It has an in-depth discussion of three papers:

  1. Foundation Models for Rapid Autonomy Validation
  2. SAFE: Multitask Failure Detection for Vision-Language-Action Models
  3. Counterfactual VLA: Self-Reflective Vision-Language-Action Model with Adaptive Reasoning


github-actions bot commented Mar 18, 2026

❌ Engineering Standards Check Failed

Your audit does not meet the required "Senior Staff" engineering standards. Please address the following issues before requesting instructor review:

Common Issues:

1. Required Frontmatter Fields

  • Every audit MDX file must include these fields:
    • title: Paper title
    • author: Paper author(s)
    • topic: Research topic/category
    • paper: Link to paper or citation
  • All fields must have non-empty values (no placeholders like "TBD" or "TODO")

2. Semantic Line Breaks

  • Each sentence should be on its own line
  • This makes PR commenting and reviewing much easier
  • Example:
    - This is a very long sentence with multiple ideas. It continues on the same line. This makes PR review difficult.
    + This is a sentence on its own line.
    + Each idea gets its own line.
    + This makes PR review much easier.

3. Clean Git History

  • No "Merge branch 'main'" commits allowed
  • Use git rebase main instead of git merge main
  • Keep your commit history linear and clean

How to Fix:

For semantic line breaks:

  1. Edit your MDX file to put each sentence on a new line
  2. Commit and push the changes

For git history:

  1. Run: git rebase main (or git rebase staging)
  2. Resolve any conflicts if needed
  3. Run: git push --force-with-lease

The linter will run automatically on your next push. Once all checks pass, your preview will deploy.

himanshugupta1009 and others added 8 commits March 17, 2026 23:53
Revised the audit document on Safety and Uncertainty in VLA systems, adding detailed sections on problem statement, safety decomposition, types of uncertainty, and analysis of three papers related to VLA safety mechanisms.
Expanded the section on SAFE — Multitask Failure Detection, detailing its methodology, strengths, limitations, and overall verdict on its effectiveness as a failure detection mechanism.
…soning

This paper introduces a framework for incorporating counterfactual reasoning into Vision-Language-Action (VLA) policies. It details a method for self-reflective correction in decision-making processes, emphasizing the importance of meta-actions and their refinement through counterfactual supervision.

At its core, the paper addresses autonomy validation as a **computationally constrained estimation problem**.

Let $D = \{x_1, x_2, \ldots, x_N\}$ denote a large dataset of real-world driving logs. For each scenario $x_i$, evaluating a policy requires running a high-fidelity simulation to obtain an outcome $y_i = f(x_i)$, where $f(\cdot)$ captures metrics such as collision occurrence or task success. The true objective is to estimate a global performance metric over the dataset, for example:

$$
\mu = \frac{1}{N} \sum_{i=1}^{N} f(x_i).
$$
Contributor:

Does f capture a single metric at a time or does it capture multiple metrics at once? In the latter case, how are those metrics broken up, weighted, or scored within f?

This reframing reveals that the method is fundamentally a form of **biased Monte Carlo estimation**, where sampling is no longer uniform but guided by learned representations and difficulty scores. The quality of the final estimate $\hat{\mu}$ therefore depends critically on two factors:
(i) whether the embedding and clustering preserve the structure of the scenario distribution relevant to safety, and
(ii) whether the sampling and weighting scheme correctly compensates for the induced bias.
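The second factor can be made concrete with a minimal sketch (all data synthetic and difficulty scores hypothetical; this illustrates biased Monte Carlo with importance weights, not the paper's implementation):

```python
import random

random.seed(0)

# Hypothetical stand-in for f(x): 1 if a scenario ends in failure, else 0.
# Failure probability is loosely tied to a per-scenario difficulty score d(x).
N = 10_000
d = [random.random() for _ in range(N)]
f = [1 if random.random() < 0.1 * di else 0 for di in d]

mu_true = sum(f) / N  # exhaustive validation: the quantity we want to avoid computing

# Difficulty-guided (biased) sampling: p_i proportional to d_i, corrected by
# importance weights w_i = 1 / (N * p_i) so the estimator stays unbiased.
total_d = sum(d)
p = [di / total_d for di in d]
n = 500
idx = random.choices(range(N), weights=p, k=n)
mu_hat = sum(f[i] / (N * p[i]) for i in idx) / n

# An unweighted average over the same biased sample overrepresents hard
# scenarios and therefore tends to overestimate the failure rate.
mu_naive = sum(f[i] for i in idx) / n
print(mu_true, mu_hat, mu_naive)
```

With the weights in place the estimator targets the full-dataset mean; dropping them recovers exactly the bias discussed in (ii).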

Contributor:

How does the subset selection strategy behave under policy iteration, given that the sampling distribution is influenced by the current policy and may shift over time?


> **Implicit Assumption:** A reconstruction-trained embedding preserves the aspects of a scenario that are relevant for validation and safety.

However, masked autoencoding optimizes for **reconstruction fidelity**, not **causal or safety-critical features**. As a result, the embedding may emphasize visually or statistically dominant patterns while underrepresenting rare but safety-critical configurations. Two scenarios that are visually similar but causally distinct (e.g., different intent of nearby agents) may be mapped close together, while causally similar but visually different scenarios may be separated.
Contributor:

How can a reader begin to understand the geometric make-up of this embedding space? Are there safety guarantees if the geometry of the embedding space looks one way versus a different way? How do we detect the issue mentioned in this paragraph?

- **Coverage Sufficiency:** Sampling across clusters, combined with difficulty-aware prioritization, is sufficient to capture both typical and rare safety-critical scenarios.

> None of these assumptions can be independently verified without performing the very exhaustive validation that the method seeks to avoid.

Contributor:

It might be useful to distinguish between assumptions that are empirically testable and those that are fundamentally unverifiable without full-scale evaluation. This would clarify which risks can realistically be mitigated in practice.

$$
d(x_i) \approx \mathbb{P}(\text{failure} \mid x_i).
$$

This score is used as a proxy for scenario importance, enabling the system to prioritize scenarios that are more likely to expose policy weaknesses.
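As a toy illustration of fitting such a score from simulated outcomes, a logistic model on a synthetic 1-D feature can stand in for the paper's MLP over learned embeddings (everything here is hypothetical):

```python
import math
import random

random.seed(1)

# Toy stand-in: a 1-D scenario feature x with binary failure labels whose true
# failure probability increases with x. A logistic model (substituting for the
# paper's MLP over learned embeddings) is fit by full-batch gradient descent.
xs = [random.uniform(-3, 3) for _ in range(2000)]
ys = [1 if random.random() < 1 / (1 + math.exp(-(1.5 * x - 1))) else 0 for x in xs]

w, b, lr = 0.0, 0.0, 0.1
for _ in range(300):
    gw = gb = 0.0
    for x, y in zip(xs, ys):
        p = 1 / (1 + math.exp(-(w * x + b)))
        gw += (p - y) * x
        gb += (p - y)
    w -= lr * gw / len(xs)
    b -= lr * gb / len(xs)

def d(x):
    """Estimated difficulty score: d(x) ~ P(failure | x)."""
    return 1 / (1 + math.exp(-(w * x + b)))

# Harder scenarios (larger x) should receive higher difficulty scores.
print(d(-2.0), d(2.0))
```

The scorer only captures whatever failure signal exists in the training labels, which is exactly why the choice of label (collisions vs. task failures) matters.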
Contributor:

I'm a little confused as to how just an MLP can understand the likelihood of failure. Are these task failures or just collisions? If a vehicle was headed for a failure but recovered at the last moment, how well would this translate to the difficulty score?

Collaborator:

good q, the signal is ONLY "simulated collisions". Other surrogate metrics for collisions (one of the most popular was "post encroachment time") are not well correlated with collisions.


This assumption breaks down in precisely the regimes that matter most:

- rare edge cases may be diluted within clusters,
Contributor:

Some might argue that if a rare edge case is diluted within clusters, then there aren't enough clusters to capture the important information. Is there any intuition on how many clusters to use or how to measure if this is a problem?

Collaborator @crheckman left a comment:

first pass through rapid validation paper


However, unlike classical robotics pipelines, these models:

* lack explicit state estimation guarantees,
Collaborator:

state estimation guarantees in my experience haven't been too useful. "guarantee" on state estimate is conditioned on (at best) exponential noise distributions and ad-hoc outlier removal techniques. many steps of modern state estimation processes are still heuristic (eg loop closures). might be better to say "lack explicit metric state estimates, reducing introspectability" or something like this?


* lack explicit state estimation guarantees,
* rely on learned representations under distribution shift,
* and produce decisions that are difficult to verify.
Collaborator:

verify through formal methods? are not guaranteed to be correct by construction?

"difficult to verify" seems like it could apply to any decisions that aren't produced through a very specifically crafted system that has the property of verifiable.


This work addresses a real and important bottleneck: the prohibitive cost of large-scale autonomy validation. By reframing validation as a subset selection problem, it provides a practical mechanism for scaling evaluation under limited computational budgets. However, the approach replaces exhaustive validation with a **structured approximation whose reliability is fundamentally assumption-dependent**. It provides no mechanism for ensuring coverage of rare or worst-case events, does not preserve causal structure, and introduces policy-dependent biases into the validation process.

> **I would not trust this as a primary validation pipeline for safety-critical systems.**
Collaborator @crheckman commented Mar 24, 2026:

It is always easy to decide not to save money and just test everything. However it is an executive's decision whether the costs outweigh the risks, or vice versa. In this case, one can save tens of millions of dollars a year by not testing miles that are essentially the vehicle idling or driving on an empty road. We can redirect those tens of millions of dollars into more validation analysts who create and run targeted tests before release.

Inevitably, your decision is a judgment call that you are not in the seat of making, because it comes down to dollars and cents, or priorities among a broader validation scheme.

The most important question you need to ask yourself if you say "don't do this" is "what can we do to make it where I would be more comfortable doing this?" What is missing from this framework to have it achieve the extrapolative capability you're seeking?


## 6. Verdict

This work addresses a real and important bottleneck: the prohibitive cost of large-scale autonomy validation. By reframing validation as a subset selection problem, it provides a practical mechanism for scaling evaluation under limited computational budgets. However, the approach replaces exhaustive validation with a **structured approximation whose reliability is fundamentally assumption-dependent**. It provides no mechanism for ensuring coverage of rare or worst-case events, does not preserve causal structure, and introduces policy-dependent biases into the validation process.
Contributor:

This is interesting - to tackle the problem of safety validation in models which are considered black-box in their outputs (i.e. we can't statistically characterize their failure modes), we make statistical assumptions during the validation step so we can effectively validate more rigorously and get a better safety metric...? This seems a bit unintuitive to me as I understand it, at least. Which dataset was used in particular, and has there been testing with external / new datasets? How does this estimator perform in terms of generalization to driving scenarios?


As a result:
- the guarantees are valid only with respect to the distribution of $s_t$ on the calibration data,
- they do not account for errors introduced by embedding, temporal compression, or scalar collapse,
Contributor:

I think you state this more accurately below, where CP is reliant on the quality of the representation. I think it's misleading to state that CP doesn't take into account all the things you list here. Rather, the guarantee holds regardless, BUT a poor representation forces CP to generate overly conservative or useless safety bounds. (If that distinction makes sense?)

To provide statistical guarantees, the method incorporates a **functional conformal prediction (CP)** framework. Using a held-out calibration dataset, a threshold $\tau$ is computed such that, for a chosen significance level $\alpha$, the probability of failing to detect a failure is bounded. At inference time, a state is flagged as unsafe whenever:

$$
s_t \geq \tau.
$$
Contributor:

Are these two things easily comparable? Like is $\tau$ a vector with the same variables as the state and each vector in that variable has to be greater than or equal to that of the state vector variable? I'm a little confused about how this comparison holds.

Contributor:

Oh wait, later on it shows that s_t is a scalar score, not the state vector at time t...

Collaborator @cKohl10 left a comment:

Audit feedback and questions. Great job Himanshu!

All three papers can be mapped to:

$$
\text{Perception} \rightarrow \text{Latent Representation} \rightarrow \text{Reasoning} \rightarrow \text{Action} \rightarrow \text{Safety Layer}
$$
Collaborator:

Is this a unifying architecture? Can the safety constraints be baked into the latent representation somehow or is this not practical because the latent space is less interpretable?


> **Implicit Assumption:** A reconstruction-trained embedding preserves the aspects of a scenario that are relevant for validation and safety.

However, masked autoencoding optimizes for **reconstruction fidelity**, not **causal or safety-critical features**. As a result, the embedding may emphasize visually or statistically dominant patterns while underrepresenting rare but safety-critical configurations. Two scenarios that are visually similar but causally distinct (e.g., different intent of nearby agents) may be mapped close together, while causally similar but visually different scenarios may be separated.
Collaborator:

Is this assuming that intent cannot be inferred from temporal sequences of data? Is this a verifiable limitation of the masked autoencoding formulation?

Collaborator @crheckman left a comment:

through paper 2

$$
d(x_i) \approx \mathbb{P}(\text{failure} \mid x_i).
$$

This score is used as a proxy for scenario importance, enabling the system to prioritize scenarios that are more likely to expose policy weaknesses.
Collaborator:

good q, the signal is ONLY "simulated collisions". Other surrogate metrics for collisions (one of the most popular was "post encroachment time") are not well correlated with collisions.


where $f_\theta$ is a neural network trained to predict the likelihood of failure (e.g., task termination or unsafe outcome). This scalar score serves as a unified signal for failure detection across diverse environments.

To provide statistical guarantees, the method incorporates a **functional conformal prediction (CP)** framework. Using a held-out calibration dataset, a threshold $\tau$ is computed such that, for a chosen significance level $\alpha$, the probability of failing to detect a failure is bounded. At inference time, a state is flagged as unsafe whenever:
Collaborator:

So this method is also data-derived and only true for in-distribution (the held-out calibration dataset in this case).

$$
\mathbb{P}\big(s_t < \tau \mid \text{failure}\big) \leq \alpha
$$

under standard exchangeability assumptions.

Collaborator:

They may cluster in the learned representation space, but they are not isolated to that space. What fraction of failures are in a given fractional polyhedral closure of the latent space? Does it ever get above 0.99 for a fractional closure less than 0.5, say?

$$
\text{decision}(s_t) =
\begin{cases}
\text{flag unsafe}, & s_t \geq \tau \\
\text{continue}, & s_t < \tau
\end{cases}
$$

where $\tau$ is obtained via conformal calibration to enforce a desired error rate.
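For intuition, a split-conformal calibration of such a $\tau$ can be sketched as follows (synthetic calibration scores; the flagging direction follows the $s_t \geq \tau$ rule above, and the paper's functional variant is more involved):

```python
import math
import random

random.seed(2)

# Split-conformal calibration of a scalar threshold tau (synthetic scores).
# Calibration set: failure scores s_t observed on held-out failing rollouts.
# Goal: P(s_t < tau | failure) <= alpha, i.e. at most an alpha fraction of
# failures are missed, under exchangeability of calibration and test scores.
cal_scores = sorted(random.gauss(2.0, 0.5) for _ in range(999))

alpha = 0.1
# Conformal quantile: the floor((n + 1) * alpha)-th smallest calibration score
# (a lower quantile, since low scores on failing rollouts are the misses).
k = math.floor((len(cal_scores) + 1) * alpha)
tau = cal_scores[k - 1]

def flag_unsafe(s):
    return s >= tau

miss_rate = sum(not flag_unsafe(s) for s in cal_scores) / len(cal_scores)
print(tau, miss_rate)
```

The guarantee is distribution-free but only relative to the calibration distribution, which is exactly the in-distribution caveat raised in the comments.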
Contributor:

Does this formulation come from other engineering safety analysis literature/industry best practice? For example, do carmakers use this for safety system validation? I'm just curious what the roots of this method are.


The pipeline reduces a large validation dataset to a small subset through a sequence of learned transformations. While each step appears reasonable in isolation, their composition introduces a systematic loss of information about the underlying data distribution, particularly along dimensions that are critical for safety.

### Loss of Causal Structure
Contributor:

lack of causality is a limitation of deep learning in general. To make this critique more actionable, I'd make it more explicit how this limits the statistical estimator or suggest somewhere to integrate causality in the architecture

* Reasoning attempts to avoid failures

Each layer addresses a different point of failure, but none provides complete coverage in isolation.

Contributor:

There is an interesting connection to recent work on representation stability in neural architectures (e.g., attention/residual dynamics), which suggests that some safety issues may originate not only at the system level, but also from instability in how information is encoded and propagated within the model.


The method relies on embeddings $e_t = \phi(x_t)$ extracted from a pretrained VLA model. These embeddings are optimized for perception and control, not for identifying failure modes.

As a result:
Contributor:

I'm curious, do these bullet points mean that we should look for new methods and scrap this methodology entirely? Or is it just that we need to adjust these methods and build on them to improve these existing issues?

$$
\delta_t = \mu_t + h_t.
$$

Thus, failure is detected whenever the observed score exceeds the upper bound of the conformal band.
Contributor:

Do they ever use any failed trajectories for learning? If you only use successful trajectories, I feel that you pick up a specific way of getting to a successful state.


> **Failure detection depends on structured, temporal, and potentially multimodal dynamics, yet the method ultimately reduces this information to a thresholded scalar prediction.**

The effectiveness of the approach therefore hinges on whether these successive reductions, from raw state to embedding, from sequence to summary, and from summary to scalar, preserve the aspects of the trajectory that are critical for identifying safety-relevant outcomes.
Contributor @jt7347 commented Mar 24, 2026:

Just to check my understanding: the model only sees the trajectory and resultant outcome? No knowledge of the decision-making process (policy)? It seems this then becomes a problem of learning rich enough representations; one trajectory-and-outcome pair could fail for more than one reason, and could also be rolled out (i.e., the robot chose this trajectory) for multiple reasons, unless the planner is somewhat deterministic. The results showed some clustering of states to failure outcomes, but what do you do if a set of states leads to multiple failure outcomes? Or is it purely states = failure, and we don't look much into the type of failure? (I know you mentioned "systematic loss of structure"; I'm curious about this as well.)


This paper addresses the problem of **failure detection in Vision-Language-Action (VLA) models**, where the objective is to identify, at runtime, whether a policy is likely to fail given its current state and observations. Rather than relying on task-specific heuristics or hand-engineered safety rules, the authors propose a **learned, model-agnostic framework** for detecting failures across multiple tasks.

The core approach maps the agent’s state at time $t$, denoted by $x_t$, to a latent embedding $e_t = \phi(x_t)$ derived from a pretrained VLA model, and then computes a scalar failure score:

$$
s_t = f_\theta(e_t).
$$
Contributor:

Does the embedding encode uncertainty or temporal dependencies?


---

- **Scalar Adequacy:**
Contributor:

This is just a note on expediency ;)

You make this comment about collapsing failure mechanisms into a single 1-D signal several times throughout the SAFE audit. Might be some editing down opportunities here

Collaborator @crheckman left a comment:

partway through third paper


Under this formulation, the problem becomes:

> **How can a model learn to generate better decisions by observing how its reasoning should have been different?**
Collaborator:

Seems similar to DeepSeek-R1 (kitchen sink unmitigated garbage version) --- where "Oh!" and "Wait." were strewn about the intermediate outputs to indicate that there was an error made by following the autoregressor's priors too far.


- the model only learns corrections that are represented in the data,
- rare or novel failure modes may not have corresponding corrective examples,
- reasoning generalization is limited to patterns seen during training.
Contributor:

Maybe I'm just vigorously agreeing with you, but I don't see how this issue is surmountable. If correction examples are available (specifically for originally OOD scenarios), why not include them in the original training data? What's the point of trying to correct the reasoning this way (especially given how bizarre reasoning traces can be)?


under standard exchangeability assumptions.

A key empirical observation motivating the approach is that failure states tend to **cluster in the learned representation space**, even across different tasks. The authors visualize this using t-SNE, suggesting that failures share common structure that can be captured by a unified scoring model.
Contributor:

Is clustering correlational?


This paper reframes autonomy validation as a **subset selection problem under severe computational constraints**. Evaluating the performance of a VLA policy across all testing scenarios is both computationally and monetarily expensive, making it impractical to perform exhaustive validation over large-scale driving logs after every policy change.

To address this, the authors propose selecting a small subset of scenarios that can stand in for the full dataset. Their approach clusters scenarios in a learned embedding space to preserve diversity and prioritizes those with higher estimated failure likelihood to increase evaluation efficiency. The resulting pipeline trades exhaustive validation for **statistical estimation**, aiming to approximate full-dataset safety metrics from a carefully chosen subset of simulations.
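A minimal end-to-end sketch of that pipeline on synthetic data (bucketing a scalar "embedding" stands in for k-means, and the cluster-proportional weights are one common stratified-sampling choice, not necessarily the paper's):

```python
import random

random.seed(4)

# 1) "embed" scenarios, 2) cluster the embedding space, 3) draw a few scenarios
# per cluster, 4) reweight so the subset estimate targets the full-data metric.
# Bucketing a scalar embedding stands in for k-means; f is a simulated
# pass/fail outcome correlated with the embedding.
N, K, m = 5000, 5, 40
emb = [random.random() for _ in range(N)]
f = [1 if random.random() < 0.1 * e else 0 for e in emb]

clusters = {k: [] for k in range(K)}
for i, e in enumerate(emb):
    clusters[min(int(e * K), K - 1)].append(i)

# Stratified estimate: within-cluster sample means weighted by cluster size,
# i.e. per-scenario weights w_i = |C_k| / (N * m_k) for i sampled from C_k.
mu_hat = 0.0
for members in clusters.values():
    if not members:
        continue
    sample = random.sample(members, min(m, len(members)))
    mu_hat += (len(members) / N) * (sum(f[i] for i in sample) / len(sample))

mu_true = sum(f) / N
print(mu_true, mu_hat)
```

Under these assumptions each sampled scenario effectively carries weight $w_i = |C_k| / (N \cdot m_k)$, which is one concrete possibility for what the $w_i$ in the estimator could be.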
Contributor:

No mention of minimum dataset size or subset size; this information is valuable to anyone trying to use this approach. Also, is it always simulation, or is the policy ever rolled out on real hardware? It is not yet clear how the paper approaches this.


> **Implicit Assumption:** A reconstruction-trained embedding preserves the aspects of a scenario that are relevant for validation and safety.

However, masked autoencoding optimizes for **reconstruction fidelity**, not **causal or safety-critical features**. As a result, the embedding may emphasize visually or statistically dominant patterns while underrepresenting rare but safety-critical configurations. Two scenarios that are visually similar but causally distinct (e.g., different intent of nearby agents) may be mapped close together, while causally similar but visually different scenarios may be separated.
Contributor:

I suggest a small example to cap off this component


> **Implicit Assumption:** Failure likelihood under the current policy is a reliable indicator of scenario importance.

This introduces a strong **policy dependence**. The difficulty model reflects the behavior of a specific policy, biasing the selection process toward known failure modes while potentially neglecting scenarios that are not currently problematic but may become critical after future policy updates.
Contributor:

Could you give one example of the failure modes they focus on?


However, clustering operates on learned representations rather than causal structure. As a result:

- embedding-only clustering may group scenarios that look similar but fail for different reasons or split scenarios that share the same underlying risk factors,
Contributor:

Again, I would suggest an example if the researchers clearly motivate these components with examples.

Comment on lines +218 to +223
The validity of the proposed approach depends on several critical assumptions. These are not peripheral, but structural. If any of them fail, the reliability of the validation pipeline is compromised.

- **Embedding Fidelity:** The learned representation preserves the aspects of scenarios that determine safety-relevant outcomes.
- **Cluster Representativeness:** Scenarios within a cluster are sufficiently homogeneous such that a small number of samples can represent the entire cluster.
- **Difficulty Generalization:** The learned difficulty score accurately reflects intrinsic scenario risk, rather than being tied to the current policy’s behavior.
- **Coverage Sufficiency:** Sampling across clusters, combined with difficulty-aware prioritization, is sufficient to capture both typical and rare safety-critical scenarios.
Contributor:

Which of these assumptions seems the most dangerous? I feel like Coverage Sufficiency is the closest to breaking the system.


Collectively, these failure points are not independent. Each stage of the pipeline reduces the structure of the original trajectory, and the final decision operates on a highly compressed signal.

> **The same reductions that enable efficient, general-purpose failure detection also define what failures are visible to the system—and which ones are systematically ignored.**
Contributor:

Do they discuss any methods of determining or exploring systematically ignored samples via changing the parameters in their components?


As a result, the policy does not evaluate “what would happen if I take action $a$” at inference time. Instead, it attempts to internalize patterns of **how reasoning should change** to produce better outcomes.

In this sense, the method replaces explicit counterfactual evaluation with a form of **amortized self-correction**, where the model learns to generate improved meta-actions without ever directly observing the consequences of its own interventions.
Contributor:

This is very interesting. How can the model evaluate its ability to improve reasoning if it never sees the rollout of its modified reasoning traces?


> **Implicit Assumption:** Imitating refined meta-actions is sufficient to internalize the reasoning process.

The model is optimized to reproduce corrected outputs, but is not explicitly trained to ensure that the intermediate reasoning $r_t$ or the refinement process is causally valid or consistent across environments.
Contributor:

What problems can this induce? Do they discuss this at all in the paper?


There is no external mechanism that evaluates whether the reasoning or refinement is correct with respect to the environment. The model can only adjust its outputs based on patterns learned from data, not by verifying the consequences of its intermediate decisions.

As a result, correction is limited by the model’s ability to override its own prior tokens, rather than by any explicit grounding in causal outcomes.
Contributor:

I tend to agree that we want our models to ground their reasoning in causal outcomes. However, there still might be merit in creating proper reasoning in the first place and this method may be usable on a model that already understands how to generate causal reasoning (such as Alpamayo).


## 3. Method Dissection

The proposed method augments a VLA policy by introducing counterfactual supervision over a **multi-stage decision pipeline**. Rather than directly mapping observations to actions, the model generates and refines an intermediate meta-action through an explicit reasoning process before producing the final behavior.
Contributor:

There is no mention of the time aspect required to perform counter factual reasoning. Timing results are required to determine if this method is usable on my robotic system.

$$
\hat{\mu} = \frac{1}{|S|} \sum_{x_i \in S} w_i f(x_i),
$$

where $w_i$ are weights used to approximate the full-dataset metric.
Contributor:

What exactly are these $w_i$, and how are they computed? I only see it mentioned here twice, not sure how relevant it actually is, though, so you might not need to specify/expand
