
AUDIT: Mixture of Experts — Thanushraam Suresh Kumar#63

Open
Tr0612 wants to merge 16 commits into main from audit/Tr0612-mixtureofexpert

Conversation

@Tr0612 (Contributor) commented Mar 4, 2026

This audit evaluates Mixture-of-Experts (MoE) architectures in Vision-Language-Action (VLA) models, with initial focus on generalization, reasoning, and the trade-offs introduced by the architecture. It also compares the use of MoE in VLA models with its role in large language models, Vision Transformers, and scaling large-scale models.

github-actions bot commented Mar 4, 2026

🚀 Preview Deployed

Your preview is ready for review!

🔗 Preview URL: https://arpg.github.io/vla-foundations/staging/pulls/63/textbook/audits/staging/ChatVLAExperimentSetup.png/

Review Checklist

  • LaTeX equations render correctly
  • All sections are complete per the template
  • References are formatted properly
  • Figures/diagrams display correctly

Next Steps

  1. Review your rendered content using the preview link above
  2. Tag @crheckman when ready for instructor review
  3. Push updates to auto-refresh the preview

This preview will be removed when the PR is closed.

@Tr0612 (Contributor Author) commented Mar 4, 2026

@crheckman Ready for your review!

Instead of activating every parameter for every input, MoE dynamically selects a small number of experts through a learned routing or gating mechanism.
This enables models to increase representational capacity without incurring proportional increases in computational cost.

At a high level, an MoE architecture consists of:
Contributor

I think it would help in this high-level part to talk about whether the "experts" are trained separately and then stitched together or if they're trained together but somehow only focus on a piece of the overall data.

Contributor Author

Experts aren't trained separately; they are trained jointly within a single model. Their specialization emerges because the routing mechanism sends different inputs or tokens to different experts during training, which allows each expert to gradually focus on a particular part of the data distribution!

The central claim is that tightly coupling reasoning and control degrades either:

- reasoning generalization (if fine-tuned too aggressively), or
- control stability (if reasoning dominates action layers).
Contributor

Not sure I'm understanding why reasoning dominating action layers is not a good thing. Don't we want reasoning-based actions?

Contributor Author

My reading is that excessive reasoning influence in the action layers may hurt control stability, because high-level reasoning and low-level motor execution serve different roles. However, the paper does not clearly explain this mechanism, so this appears to be more of an architectural assumption.


This raises one central question:

> Does their MoE architecture meaningfully stabilize the reasoning–action interface, or does it just add a smarter-looking middle layer that hides it?
Contributor

This is a great question. How can we validate/evaluate that there is meaningful stability?


> Modify only layers that minimally disturb control primitives.

This raises a couple of questions:
Contributor

I really like these discussion questions. Do we get answers to them somewhere or are they purely critical thinking questions?


### Sign-Off Criteria

I would not sign this system off for production deployment without additional validation on long-horizon tasks, stability of expert utilization, and clearly defined safety and recovery procedures.
Contributor

What are some of the consequences of this? Is it okay if the system doesn't use all the experts if it still gives the correct answer?

* **127/156** task success (**81.4%**)
* Represents a **3.52× improvement** over DexVLA [8]

**Experimental Conclusion**
Contributor

One of the primary points of MoE is the reduced computation cost. I think it would be valuable to discuss that here; otherwise, why would I use this?

Contributor

Feel like the paper lacks a cohesive conclusion


However, the important design decision is not the number of experts; it is **where intervention occurs**.

Rather than restructuring the pretrained VLM backbone, ChatVLA-2 preserves it and injects MoE primarily within the action pathway.
Contributor

This is a good spot for an image showing the architecture. This seems like the key architectural design choice by the researchers.


ChatVLA-2 integrates a **Dynamic Mixture-of-Experts layer** within the VLA architecture, selectively activating experts based on multimodal input representations.

Eight experts are instantiated, and two are selected per input during inference.
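As a concrete illustration of this top-2-of-8 selection, here is a minimal sketch of the standard top-k gating computation. This is not the paper's implementation; the gate logits below are hypothetical placeholders, and only the shape of the routing step is being shown.

```python
import math

NUM_EXPERTS = 8   # experts instantiated in ChatVLA-2
TOP_K = 2         # experts selected per input

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(gate_logits, k=TOP_K):
    """Pick the top-k experts and renormalize their gate weights.

    Returns (indices, weights). Only the selected experts run a
    forward pass, so compute scales with k rather than NUM_EXPERTS.
    """
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return top, [probs[i] / mass for i in top]

# Hypothetical gate logits for one input token.
logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.2]
indices, weights = route(logits)
# Experts 1 and 4 win; their renormalized weights sum to 1.
```

The expert outputs would then be combined as a weighted sum using `weights`, which is what lets capacity grow with the number of experts while per-input compute stays roughly constant.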
Contributor

something I didn't quite get a grasp on from the rest of the audit: were these experts split across modalities (e.g. vision v action) or tasks and did the domain split choice affect the performance in your opinion? Does the paper happen to state what each expert is an "expert" in?

Contributor Author

Nope, the paper doesn't mention what each expert represents.

@crheckman (Collaborator) left a comment

First pass of comments

This broke the tight coupling between parameter count and computation, allowing language models to scale to billions or trillions of parameters efficiently.
Subsequent works such as Switch Transformers [3] and systems like DeepSpeed-MoE [5] improved training stability, routing simplicity, and deployment efficiency,
establishing MoE as a core scaling strategy in large language models.
The paradigm later extended to vision through Vision Mixture-of-Experts (V-MoE) [4], which integrated sparse expert routing into the Vision Transformer architecture,
Collaborator

This is interesting. To date in this class we have looked at very few "extremely large" ie deeply scaled vision transformers. Did the vision transformer scaling show any relationship with the language expert scaling?


However, this introduces a new dependency:

- The gating network becomes a bottleneck.
Collaborator

Designing the sparsity enforcement mechanism also seems like a potential challenge.
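The audit does not describe how ChatVLA-2 enforces balanced sparse routing, but the standard answer to this concern in the MoE literature is an auxiliary load-balancing loss of the kind popularized by Switch Transformers [3]. A minimal sketch, using top-1 routing and plain Python lists for clarity:

```python
def load_balancing_loss(router_probs, expert_indices, num_experts):
    """Switch-Transformer-style auxiliary loss.

    router_probs:   per-token softmax over experts, shape [tokens][experts]
    expert_indices: the expert chosen for each token (top-1 routing here)

    Computes num_experts * sum_e(fraction_routed_e * mean_prob_e).
    The value is minimized (at 1.0) when routing is perfectly uniform,
    so adding it to the task loss discourages expert collapse.
    """
    n = len(router_probs)
    frac = [0.0] * num_experts    # fraction of tokens sent to each expert
    mean_p = [0.0] * num_experts  # mean router probability per expert
    for probs, idx in zip(router_probs, expert_indices):
        frac[idx] += 1.0 / n
        for e in range(num_experts):
            mean_p[e] += probs[e] / n
    return num_experts * sum(f * p for f, p in zip(frac, mean_p))

# Perfectly balanced routing over 2 experts hits the minimum, 1.0.
balanced = load_balancing_loss(
    [[0.5, 0.5], [0.5, 0.5]], [0, 1], num_experts=2)
# Collapsed routing (every token to expert 0) scores strictly higher.
collapsed = load_balancing_loss(
    [[0.9, 0.1], [0.8, 0.2]], [0, 0], num_experts=2)
```

Whether ChatVLA-2 uses this particular loss is not stated in the audit; the sketch is only meant to show that sparsity enforcement is a solved (if delicate) design problem.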


**Aligning Reasoning with Action**

A defining contribution of ChatVLA-2 is the claim that actions must not merely follow instructions, but follow **generated reasoning traces**.
Collaborator

This is directly responsive to the "reasoning" element of the mode obviously but is there a particular connection between the MoE element and reasoning?


The architectural philosophy here is:

> Modify only layers that minimally disturb control primitives.
Collaborator

How are ablations performed on these? When the models are fine-tuned, are they also validated at the intermediate training steps to ensure their responses are coherent and reasonable, or is it end-to-end fine-tuning and evaluation with the correct data mix to preserve reasoning?

Contributor Author

The paper didn't validate reasoning quality at intermediate training steps. Instead, it evaluates the training design through stage-level ablations: Stage 1 and Stage 2 are removed separately, and the final model performance is compared to show how each stage contributes to reasoning and action.


1. Pretrained reasoning backbone
2. Dynamic expert selection layer
3. Action-conditioned decoding
Collaborator

How does one action-condition the decoder structurally? Or is it just autoregressively generated / fed into the prompt?


**Training Setup**

- Image–text to robot data ratio: 1:3
Collaborator

Is there a curriculum schedule on this or is it all just thrown together (randomly or uniform enforcement)?

Contributor Author

The paper does not describe any curriculum schedule. My understanding is that the different datasets are mixed together during training, but the exact sampling policy is not specified.

Stage 2 runs for 50k steps, with a 3k-step warm-up and cosine decay from 2e-5 to 2e-6.
The total reported training cost is approximately **340 GPU hours**.
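The quoted schedule (50k steps, 3k warm-up, cosine decay from 2e-5 to 2e-6) can be sketched directly. The linear warm-up shape is an assumption, since the audit only states the warm-up length, not its form:

```python
import math

PEAK_LR, FINAL_LR = 2e-5, 2e-6
TOTAL_STEPS, WARMUP_STEPS = 50_000, 3_000

def lr_at(step):
    """Linear warm-up to PEAK_LR, then cosine decay to FINAL_LR."""
    if step < WARMUP_STEPS:
        # Warm-up shape assumed linear; the audit only gives its length.
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine
```

At step 3,000 this yields exactly 2e-5 and at step 50,000 exactly 2e-6, matching the reported endpoints.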

TODO : Add the GPU model/Vram
Collaborator

Also, how was the routing trained? This section of the write-up seems to focus mostly on the VLA's design and performance, but not on its specific interactions with the action-decoding design, and not specifically on why the MoE is employed.

@crh-bot (Collaborator) commented Apr 16, 2026

Ping: your writeup requires addressing outstanding blocking comments from @crh

@crheckman crheckman changed the title Audit: Mixture of Experts - Thanushraam Suresh Kumar AUDIT: Mixture of Experts — Thanushraam Suresh Kumar Apr 17, 2026
Tr0612 and others added 3 commits April 18, 2026 02:39

- Refactor text for clarity and emphasis on key points regarding MoE and routing.
- Reformat text for clarity and readability by breaking long sentences into shorter ones.