AUDIT: Mixture of Experts — Thanushraam Suresh Kumar#63
Conversation
Refactor text for clarity and improve structure in the Mixture-of-Experts audit document.
Updated image paths to be relative in Tr0612.mdx.
@crheckman Ready for your review!
> Instead of activating every parameter for every input, MoE dynamically selects a small number of experts through a learned routing or gating mechanism.
> This enables models to increase representational capacity without incurring proportional increases in computational cost.
>
> At a high level, an MoE architecture consists of:
**Reviewer:** I think it would help in this high-level part to talk about whether the "experts" are trained separately and then stitched together, or if they're trained together but somehow only focus on a piece of the overall data.
**Author:** Experts aren't trained separately; they are trained jointly within a single model. Their specialization emerges because the routing mechanism sends different inputs or tokens to different experts during training, which allows each expert to gradually focus on a particular part of the data distribution!
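To make that concrete, here is a minimal sketch of joint training with a learned router, using the same top-2-of-8 configuration the paper reports (illustrative PyTorch, not the ChatVLA-2 code; all names here are mine):

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    """Minimal sparse MoE layer: a learned gate routes each token to top-k experts.

    The gate and all experts sit in one model and receive gradients from the
    same loss, so specialization emerges from routing rather than being
    hand-designed.
    """
    def __init__(self, d_model=256, d_hidden=1024, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts)  # the learned router
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                # x: (tokens, d_model)
        logits = self.gate(x)                            # (tokens, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)   # keep only top-k experts
        weights = weights.softmax(dim=-1)                # renormalize their scores
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                 # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = SparseMoE()
y = moe(torch.randn(16, 256))  # only 2 of 8 experts run per token
```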
> The central claim is that tightly coupling reasoning and control degrades either:
>
> - reasoning generalization (if fine-tuned too aggressively), or
> - control stability (if reasoning dominates action layers).
**Reviewer:** Not sure I'm understanding why reasoning dominating action layers is not a good thing. Don't we want reasoning-based actions?
**Author:** My reading is that excessive reasoning influence in the action layers may hurt control stability, because high-level reasoning and low-level motor execution serve different roles. However, the paper does not clearly explain this mechanism, so this appears to be more of an architectural assumption.
> This raises one central question:
>
> > Does their MoE architecture meaningfully stabilize the reasoning–action interface, or does it just add a smarter-looking middle layer that hides it?
**Reviewer:** This is a great question. How can we validate/evaluate that there is meaningful stability?
> > Modify only layers that minimally disturb control primitives.
>
> This raises a couple of questions:
**Reviewer:** I really like these discussion questions. Do we get answers to them somewhere, or are they purely critical-thinking questions?
> ### Sign-Off Criteria
>
> I would not sign this system off for production deployment without additional validation on long-horizon tasks, stability of expert utilization, and clearly defined safety and recovery procedures.
**Reviewer:** What are some of the consequences of this? Is it okay if the system doesn't use all the experts if it still gives the correct answer?
> * **127/156** task success (**81.4%**)
> * Represents a **3.52× improvement** over DexVLA [8]
>
> **Experimental Conclusion**
**Reviewer:** One of the primary points of MoE is the reduced computation cost. I think it would be valuable to discuss that here; otherwise, why would I use this?

**Author:** I feel like the paper lacks a cohesive conclusion.
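(As a quick sanity check on the quoted numbers, using my own arithmetic rather than anything from the paper: 127/156 ≈ 0.814, consistent with the reported 81.4%; and if the 3.52× improvement is measured on success rate, the DexVLA baseline would be roughly 81.4% / 3.52 ≈ 23.1%.)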
> However, the important design decision is not the number of experts; it is **where intervention occurs**.
>
> Rather than restructuring the pretrained VLM backbone, ChatVLA-2 preserves it and injects MoE primarily within the action pathway.
**Reviewer:** This is a good spot for an image showing the architecture. This seems like the key architectural design choice by the researchers.
> ChatVLA-2 integrates a **Dynamic Mixture-of-Experts layer** within the VLA architecture, selectively activating experts based on multimodal input representations.
>
> Eight experts are instantiated, and two are selected per input during inference.
**Reviewer:** Something I didn't quite get a grasp on from the rest of the audit: were these experts split across modalities (e.g. vision vs. action) or tasks, and did the domain-split choice affect the performance in your opinion? Does the paper happen to state what each expert is an "expert" in?
**Author:** Nope, the paper doesn't mention what each expert specializes in.
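Even without knowing what each expert specializes in, the top-2-of-8 routing has a clear compute story, which also speaks to the cost question above (standard MoE accounting, not a figure from the paper): only 2/8 = 25% of the expert parameters are active per token, so per-token FLOPs stay close to those of a dense model with two expert-sized FFNs, while total capacity scales with all eight experts.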
**crheckman** left a comment:

First pass of comments
> This broke the tight coupling between parameter count and computation, allowing language models to scale to billions or trillions of parameters efficiently.
> Subsequent works such as Switch Transformers [3] and systems like DeepSpeed-MoE [5] improved training stability, routing simplicity, and deployment efficiency, establishing MoE as a core scaling strategy in large language models.
> The paradigm later extended to vision through Vision Mixture-of-Experts (V-MoE) [4], which integrated sparse expert routing into the Vision Transformer architecture,
**Reviewer:** This is interesting. To date in this class we have looked at very few "extremely large", i.e. deeply scaled, vision transformers. Did the vision transformer scaling show any relationship with the language expert scaling?
> However, this introduces a new dependency:
>
> - The gating network becomes a bottleneck.
**Reviewer:** Designing the sparsity-enforcement mechanism also seems like a potential challenge.
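For background on what that sparsity-and-balancing machinery usually looks like, here is a sketch in the style of the Switch Transformers auxiliary loss (generic, not from ChatVLA-2, whose balancing scheme the paper does not detail):

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, top_k: int = 2) -> torch.Tensor:
    """Auxiliary loss pushing the router toward uniform expert utilization.

    router_logits: (tokens, num_experts). Without a term like this, a few
    experts tend to win early and starve the rest ("router collapse").
    """
    num_experts = router_logits.shape[-1]
    probs = router_logits.softmax(dim=-1)            # soft routing probabilities
    _, idx = router_logits.topk(top_k, dim=-1)       # hard top-k assignments
    # fraction of tokens dispatched to each expert (normalized to sum to 1)
    dispatch = F.one_hot(idx, num_experts).float().sum(dim=1).mean(dim=0) / top_k
    # mean routing probability assigned to each expert
    importance = probs.mean(dim=0)
    # equals 1.0 when both distributions are uniform; grows when imbalanced
    return num_experts * torch.sum(dispatch * importance)
```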
> **Aligning Reasoning with Action**
>
> A defining contribution of ChatVLA-2 is the claim that actions must not merely follow instructions, but follow **generated reasoning traces**.
**Reviewer:** This is directly responsive to the "reasoning" element of the model, obviously, but is there a particular connection between the MoE element and reasoning?
> The architectural philosophy here is:
>
> > Modify only layers that minimally disturb control primitives.
**Reviewer:** How are ablations performed on these? When the models are fine-tuned, are they also validated at intermediate training steps to ensure their responses are coherent and reasonable, or is it end-to-end fine-tuning and evaluation with the correct data mix to preserve reasoning?
**Author:** The paper didn't validate reasoning quality at intermediate training steps. Instead, it evaluates the training design through stage-level ablations: Stage 1 and Stage 2 are removed separately, and the final model performance is compared to show how each stage contributes to reasoning and action.
> 1. Pretrained reasoning backbone
> 2. Dynamic expert selection layer
> 3. Action-conditioned decoding
**Reviewer:** How does one action-condition the decoder structurally? Or is it just autoregressively generated / fed into the prompt?
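Since the thread leaves this open, here are the two standard patterns side by side (illustrative only; the paper does not say which one ChatVLA-2 uses):

```python
import torch
import torch.nn as nn

class ActionDecoder(nn.Module):
    """Option A (structural): cross-attend action queries to the generated
    reasoning trace, so actions are explicitly tied to reasoning tokens."""
    def __init__(self, d_model=256, action_dim=7):
        super().__init__()
        self.xattn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.head = nn.Linear(d_model, action_dim)

    def forward(self, action_queries, reasoning_tokens):
        attended, _ = self.xattn(action_queries, reasoning_tokens, reasoning_tokens)
        return self.head(attended)

# Option B (prompt-level): concatenate the reasoning tokens into the decoder's
# input sequence and decode actions autoregressively after them; no new
# modules, but the conditioning is implicit rather than structural.
```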
> **Training Setup**
>
> - Image–text to robot data ratio: 1:3
**Reviewer:** Is there a curriculum schedule on this, or is it all just thrown together (randomly or with uniform enforcement)?
**Author:** The paper does not describe any curriculum schedule. My understanding is that the different datasets are mixed together during training, but the exact sampling policy is not specified.
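If it is plain ratio-based mixing, the mechanics would be as simple as the following (my guess at the sampling policy, since the paper leaves it unspecified; names are illustrative):

```python
import random

def sample_batch(image_text_data, robot_data, batch_size=32, robot_ratio=0.75):
    """Mix two datasets at a fixed 1:3 image-text to robot-data ratio."""
    batch = []
    for _ in range(batch_size):
        source = robot_data if random.random() < robot_ratio else image_text_data
        batch.append(random.choice(source))
    return batch
```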
> Stage 2 runs for 50k steps, with a 3k-step warm-up and cosine decay from 2e-5 to 2e-6.
> The total reported training cost is approximately **340 GPU hours**.
>
> TODO: Add the GPU model/VRAM
**Reviewer:** Also, how was the routing trained? This section of the write-up seems to focus mostly on the VLA design and performance, but not on the specific interaction between the MoE and the VLA's action-decoding design, or on why the MoE is employed at all.
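On the schedule itself, the quoted hyperparameters pin the curve down exactly; here is a reconstruction of the Stage 2 warm-up-plus-cosine-decay schedule (my reimplementation from the stated numbers, not the authors' code):

```python
import math

def lr_at_step(step, warmup=3_000, total=50_000, lr_max=2e-5, lr_min=2e-6):
    """Linear warm-up to lr_max, then cosine decay to lr_min."""
    if step < warmup:
        return lr_max * step / warmup
    progress = (step - warmup) / (total - warmup)   # 0 -> 1 over the decay phase
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# lr_at_step(0) == 0.0, lr_at_step(3_000) == 2e-5, lr_at_step(50_000) == 2e-6
```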
Ping: your writeup requires addressing outstanding blocking comments from @crh
Refactor text for clarity and emphasis on key points regarding MoE and routing.
Reformat text for clarity and readability by breaking long sentences into shorter ones.
This audit evaluates Mixture-of-Experts (MoE) architectures in Vision-Language-Action (VLA) models, with an initial focus on generalization, reasoning, and the trade-offs introduced by the architecture. It also compares MoE's use in VLA models with its established role as a scaling strategy in large language models and Vision Transformers.