AUDIT: Mixture of Experts — Thanushraam Suresh Kumar#63
Conversation
Refactor text for clarity and improve structure in the Mixture-of-Experts audit document.
Updated image paths to be relative in Tr0612.mdx.
@crheckman Ready for your review!
> Instead of activating every parameter for every input, MoE dynamically selects a small number of experts through a learned routing or gating mechanism.
> This enables models to increase representational capacity without incurring proportional increases in computational cost.
>
> At a high level, an MoE architecture consists of:
**Reviewer:** I think it would help in this high-level part to talk about whether the "experts" are trained separately and then stitched together, or if they're trained together but somehow only focus on a piece of the overall data.
**Author:** Experts aren't trained separately; they are trained jointly within a single model. Their specialization emerges because the routing mechanism sends different inputs or tokens to different experts during training, which allows each expert to gradually focus on a particular part of the data distribution!
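To make that concrete, here is a minimal sketch of joint training with a learned router, using the same top-2-of-8 configuration the paper reports (illustrative PyTorch, not the ChatVLA-2 code; all names here are mine):

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    """Minimal sparse MoE layer: a learned gate routes each token to top-k experts.

    The gate and all experts sit in one model and receive gradients from the
    same loss, so specialization emerges from routing rather than being
    hand-designed.
    """
    def __init__(self, d_model=256, d_hidden=1024, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts)  # the learned router
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                # x: (tokens, d_model)
        logits = self.gate(x)                            # (tokens, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)   # keep only top-k experts
        weights = weights.softmax(dim=-1)                # renormalize their scores
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                 # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = SparseMoE()
y = moe(torch.randn(16, 256))  # only 2 of 8 experts run per token
```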
> The central claim is that tightly coupling reasoning and control degrades either:
>
> - reasoning generalization (if fine-tuned too aggressively), or
> - control stability (if reasoning dominates action layers).
**Reviewer:** Not sure I'm understanding why reasoning dominating action layers is not a good thing. Don't we want reasoning-based actions?
**Author:** My reading is that excessive reasoning influence in the action layers may hurt control stability, because high-level reasoning and low-level motor execution serve different roles. However, the paper does not clearly explain this mechanism, so this appears to be more of an architectural assumption.
> This raises one central question:
>
> > Does their MoE architecture meaningfully stabilize the reasoning–action interface, or does it just add a smarter-looking middle layer that hides it?
**Reviewer:** This is a great question. How can we validate/evaluate that there is meaningful stability?
> > Modify only layers that minimally disturb control primitives.
>
> This raises a couple of questions:
**Reviewer:** I really like these discussion questions. Do we get answers to them somewhere, or are they purely critical-thinking questions?
> ### Sign-Off Criteria
>
> I would not sign this system off for production deployment without additional validation on long-horizon tasks, stability of expert utilization, and clearly defined safety and recovery procedures.
**Reviewer:** What are some of the consequences of this? Is it okay if the system doesn't use all the experts if it still gives the correct answer?
> * **127/156** task success (**81.4%**)
> * Represents a **3.52× improvement** over DexVLA [8]
>
> **Experimental Conclusion**
**Reviewer:** One of the primary points of MoE is the reduced computation cost. I think it would be valuable to discuss that here; otherwise, why would I use this?

**Author:** I feel like the paper lacks a cohesive conclusion.
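(As a quick sanity check on the quoted numbers, using my own arithmetic rather than anything from the paper: 127/156 ≈ 0.814, consistent with the reported 81.4%; and if the 3.52× improvement is measured on success rate, the DexVLA baseline would be roughly 81.4% / 3.52 ≈ 23.1%.)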
> However, the important design decision is not the number of experts; it is **where intervention occurs**.
>
> Rather than restructuring the pretrained VLM backbone, ChatVLA-2 preserves it and injects MoE primarily within the action pathway.
**Reviewer:** This is a good spot for an image showing the architecture. This seems like the key architectural design choice by the researchers.
> ChatVLA-2 integrates a **Dynamic Mixture-of-Experts layer** within the VLA architecture, selectively activating experts based on multimodal input representations.
>
> Eight experts are instantiated, and two are selected per input during inference.
**Reviewer:** Something I didn't quite get a grasp on from the rest of the audit: were these experts split across modalities (e.g. vision vs. action) or tasks, and did the domain-split choice affect the performance in your opinion? Does the paper happen to state what each expert is an "expert" in?
**Author:** Nope, the paper doesn't mention what each expert specializes in.
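Even without knowing what each expert specializes in, the top-2-of-8 routing has a clear compute story, which also speaks to the cost question above (standard MoE accounting, not a figure from the paper): only 2/8 = 25% of the expert parameters are active per token, so per-token FLOPs stay close to those of a dense model with two expert-sized FFNs, while total capacity scales with all eight experts.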
**crheckman** left a comment:

First pass of comments
> This broke the tight coupling between parameter count and computation, allowing language models to scale to billions or trillions of parameters efficiently.
> Subsequent works such as Switch Transformers [3] and systems like DeepSpeed-MoE [5] improved training stability, routing simplicity, and deployment efficiency, establishing MoE as a core scaling strategy in large language models.
> The paradigm later extended to vision through Vision Mixture-of-Experts (V-MoE) [4], which integrated sparse expert routing into the Vision Transformer architecture,
**Reviewer:** This is interesting. To date in this class we have looked at very few "extremely large", i.e. deeply scaled, vision transformers. Did the vision transformer scaling show any relationship with the language expert scaling?
> However, this introduces a new dependency:
>
> - The gating network becomes a bottleneck.
**Reviewer:** Designing the sparsity-enforcement mechanism also seems like a potential challenge.
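For background on what that sparsity-and-balancing machinery usually looks like, here is a sketch in the style of the Switch Transformers auxiliary loss (generic, not from ChatVLA-2, whose balancing scheme the paper does not detail):

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, top_k: int = 2) -> torch.Tensor:
    """Auxiliary loss pushing the router toward uniform expert utilization.

    router_logits: (tokens, num_experts). Without a term like this, a few
    experts tend to win early and starve the rest ("router collapse").
    """
    num_experts = router_logits.shape[-1]
    probs = router_logits.softmax(dim=-1)            # soft routing probabilities
    _, idx = router_logits.topk(top_k, dim=-1)       # hard top-k assignments
    # fraction of tokens dispatched to each expert (normalized to sum to 1)
    dispatch = F.one_hot(idx, num_experts).float().sum(dim=1).mean(dim=0) / top_k
    # mean routing probability assigned to each expert
    importance = probs.mean(dim=0)
    # equals 1.0 when both distributions are uniform; grows when imbalanced
    return num_experts * torch.sum(dispatch * importance)
```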
> **Aligning Reasoning with Action**
>
> A defining contribution of ChatVLA-2 is the claim that actions must not merely follow instructions, but follow **generated reasoning traces**.
**Reviewer:** This is directly responsive to the "reasoning" element of the model, obviously, but is there a particular connection between the MoE element and reasoning?
> The architectural philosophy here is:
>
> > Modify only layers that minimally disturb control primitives.
**Reviewer:** How are ablations performed on these? When the models are fine-tuned, are they also validated at intermediate training steps to ensure their responses are coherent and reasonable, or is it end-to-end fine-tuning and evaluation with the correct data mix to preserve reasoning?
**Author:** The paper didn't validate reasoning quality at intermediate training steps. Instead, it evaluates the training design through stage-level ablations: Stage 1 and Stage 2 are removed separately, and the final model performance is compared to show how each stage contributes to reasoning and action.
> 1. Pretrained reasoning backbone
> 2. Dynamic expert selection layer
> 3. Action-conditioned decoding
**Reviewer:** How does one action-condition the decoder structurally? Or is it just autoregressively generated / fed into the prompt?
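Since the thread leaves this open, here are the two standard patterns side by side (illustrative only; the paper does not say which one ChatVLA-2 uses):

```python
import torch
import torch.nn as nn

class ActionDecoder(nn.Module):
    """Option A (structural): cross-attend action queries to the generated
    reasoning trace, so actions are explicitly tied to reasoning tokens."""
    def __init__(self, d_model=256, action_dim=7):
        super().__init__()
        self.xattn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.head = nn.Linear(d_model, action_dim)

    def forward(self, action_queries, reasoning_tokens):
        attended, _ = self.xattn(action_queries, reasoning_tokens, reasoning_tokens)
        return self.head(attended)

# Option B (prompt-level): concatenate the reasoning tokens into the decoder's
# input sequence and decode actions autoregressively after them; no new
# modules, but the conditioning is implicit rather than structural.
```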
> **Training Setup**
>
> - Image–text to robot data ratio: 1:3
**Reviewer:** Is there a curriculum schedule on this, or is it all just thrown together (randomly or with uniform enforcement)?
**Author:** The paper does not describe any curriculum schedule. My understanding is that the different datasets are mixed together during training, but the exact sampling policy is not specified.
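If it is plain ratio-based mixing, the mechanics would be as simple as the following (my guess at the sampling policy, since the paper leaves it unspecified; names are illustrative):

```python
import random

def sample_batch(image_text_data, robot_data, batch_size=32, robot_ratio=0.75):
    """Mix two datasets at a fixed 1:3 image-text to robot-data ratio."""
    batch = []
    for _ in range(batch_size):
        source = robot_data if random.random() < robot_ratio else image_text_data
        batch.append(random.choice(source))
    return batch
```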
> Stage 2 runs for 50k steps, with a 3k-step warm-up and cosine decay from 2e-5 to 2e-6.
> The total reported training cost is approximately **340 GPU hours**.
>
> TODO: Add the GPU model/VRAM
**Reviewer:** Also, how was the routing trained? This section of the write-up seems to focus mostly on the VLA design and performance, but not on the specific interaction between the MoE and the VLA's action-decoding design, or on why the MoE is employed at all.
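On the schedule itself, the quoted hyperparameters pin the curve down exactly; here is a reconstruction of the Stage 2 warm-up-plus-cosine-decay schedule (my reimplementation from the stated numbers, not the authors' code):

```python
import math

def lr_at_step(step, warmup=3_000, total=50_000, lr_max=2e-5, lr_min=2e-6):
    """Linear warm-up to lr_max, then cosine decay to lr_min."""
    if step < warmup:
        return lr_max * step / warmup
    progress = (step - warmup) / (total - warmup)   # 0 -> 1 over the decay phase
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# lr_at_step(0) == 0.0, lr_at_step(3_000) == 2e-5, lr_at_step(50_000) == 2e-6
```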
Ping: your writeup requires addressing outstanding blocking comments from @crh
Refactor text for clarity and emphasis on key points regarding MoE and routing.
Reformat text for clarity and readability by breaking long sentences into shorter ones.
This audit evaluates Mixture-of-Experts (MoE) architectures in Vision-Language-Action (VLA) models, with an initial focus on generalization, reasoning, and the trade-offs introduced by the architecture. It also compares MoE's use in VLA models with its established role as a scaling strategy in large language models and Vision Transformers.