Date: January 26, 2026
Focus: Implementing efficient MoE architectures for models under 3B parameters, targeting resource-constrained environments (edge devices, mobile, laptops).
- Motivation
- Architecture design
- Implementation approach
- Training
- Memory optimization
- Tools and frameworks
- Step-by-step implementation
- Experimental design
- Expected results
- References
Most MoE research targets large models (7B+). There is little work on MoE for sub-3B models designed for mobile deployment. This guide addresses that gap.
Target architecture:
Base model: Qwen3-0.6B
MoE variant: TinyMoE-Qwen-0.6B-8E
Total parameters: ~1.2B (2x base)
Active parameters: ~600M at top-1, ~750M at top-2 (1.0-1.25x base)
Expert count: 8
Top-k: 1-2 (adaptive)
Shared experts: 1
Research questions:
- What is the optimal expert count for sub-3B models? Hypothesis: 4-8.
- How does top-k selection affect performance? Hypothesis: top-1 is sufficient for tiny models.
- Can upcycled MoE match dense model quality at 1.5x parameters? Hypothesis: yes, with proper router training.
- What memory optimizations are needed for resource-constrained deployment? Hypothesis: expert swapping and cache-aware routing.
How this differs from existing work:
| Aspect | Existing work | This work |
|---|---|---|
| Model size | 7B-70B | 600M-3B |
| Platform | Server-side GPU | Edge, mobile |
| Expert count | 16-128 | 4-8 |
| Training | From scratch | Upcycle dense to MoE |
| Routing | Top-2, fixed | Adaptive top-k |
| Memory | Abundant GPU memory | Memory-efficient design |
For models under 3B parameters, the optimal expert count is lower than for large models:
| Model size | Large MoE experts | Tiny MoE experts | Reason |
|---|---|---|---|
| 50B-70B | 64-128 | N/A | Sufficient capacity for specialization |
| 7B-15B | 16-32 | N/A | Balance specialization and efficiency |
| 1B-3B | N/A | 4-8 | Limited capacity, memory constraints |
| <1B | N/A | 2-4 | Minimal viable specialization |
Recommendation: 8 experts for 600M-1B models (6-7 routed, 1-2 shared). Total parameters land around 1.2-1.5B.
Sources: OLMoE-1B-7B, MoE Scaling Laws.
For tiny models, fewer active experts are better:
| Top-k | Large models (7B+) | Tiny models (<3B) | Reason |
|---|---|---|---|
| 1 | Quality loss | Optimal | Faster, less memory |
| 2 | Standard | Acceptable | Balance quality and speed |
| 4 | High quality | Overkill | Memory overflow |
Recommendation: adaptive top-k. Use k=1 for easy prompts (factual, creative), k=2 for hard prompts (reasoning, math). Force k=1 when battery is low or memory is constrained.
Sources: Expert Choice Routing, LExI.
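The adaptive top-k recommendation above can be sketched as a small decision function. This is a minimal illustration rather than a tested policy: the `choose_top_k` name, the category labels, and the fallback to k=1 for unknown categories are assumptions made for the example, and how a prompt gets its category is left open.

```python
# Hypothetical helper; the category names mirror the recommendation above.
HARD_CATEGORIES = {"reasoning", "math"}


def choose_top_k(category, battery_low=False, memory_constrained=False):
    """Return the number of active experts (1 or 2) for one prompt."""
    # Device constraints override prompt difficulty.
    if battery_low or memory_constrained:
        return 1
    if category in HARD_CATEGORIES:
        return 2
    # Easy (factual, creative) and unknown categories use the cheaper setting (assumption).
    return 1
```

Keeping the device check first means a low battery always wins over prompt difficulty, which matches the "force k=1" rule.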
Each expert should be smaller than the original dense FFN. For Qwen3-0.6B:
- Base FFN dimension (intermediate_size): 3072
- 8 experts, expansion factor 1.75
- Expert FFN size: ~672-768 (base FFN dimension × expansion factor / expert count = 3072 × 1.75 / 8 = 672)
When upcycling, keep the FFN dimension similar to the base model and distribute across experts via routing.
Sources: DeepSeekMoE, Switch Transformer.
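The expert size above is consistent with the rule expert size = base FFN dimension × expansion factor / expert count. The sketch below applies that rule and counts the FFN weights of one MoE layer. It rests on stated assumptions: the function names are hypothetical, `num_experts` counts routed experts only, shared experts are assumed to have the same width as routed ones, and biases are ignored.

```python
import math


def expert_ffn_size(base_ffn_dim, num_experts, expansion_factor, multiple_of=1):
    """Per-expert FFN width: base * expansion / experts, rounded up to `multiple_of`."""
    raw = base_ffn_dim * expansion_factor / num_experts
    return math.ceil(raw / multiple_of) * multiple_of


def moe_ffn_params(hidden_size, expert_dim, num_experts, top_k, num_shared=0, matrices=2):
    """Total and per-token active FFN weights for one MoE layer.

    `matrices` is the number of hidden_size x expert_dim projections per expert:
    2 for a plain up/down FFN, 3 for a gated (SwiGLU-style) FFN.
    """
    per_expert = matrices * hidden_size * expert_dim
    router = hidden_size * num_experts
    total = (num_experts + num_shared) * per_expert + router
    active = (top_k + num_shared) * per_expert + router
    return total, active


# Illustrative numbers only, not the dimensions of any particular model.
size = expert_ffn_size(4096, num_experts=8, expansion_factor=1.5, multiple_of=128)
total, active = moe_ffn_params(1024, size, num_experts=8, top_k=1, num_shared=1, matrices=3)
print(size, total, active)
```

Multiplying the per-layer totals by the number of converted layers and adding the attention and embedding weights gives a rough check on the total and active parameter budget.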
Shared experts are always active and capture common linguistic patterns. This reduces redundancy between routed experts by 30-50%.
Output = Shared_Experts(Input) + sum(Routed_Experts(Input))
Recommendation: 1-2 shared experts (10-20% of capacity), 6-7 routed experts (80-90%).
Source: DeepSeekMoE: Shared Experts.
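A minimal PyTorch sketch of the combination above, assuming tokens are already flattened to shape (num_tokens, hidden_size). The `SharedExpertMoE` and `make_ffn` names, the two-layer SiLU experts, and the choice to weight routed outputs by the full-softmax router probability are illustrative assumptions, not the DeepSeekMoE implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def make_ffn(hidden_size, ffn_size):
    # Plain two-layer FFN used for both shared and routed experts (assumption).
    return nn.Sequential(
        nn.Linear(hidden_size, ffn_size), nn.SiLU(), nn.Linear(ffn_size, hidden_size)
    )


class SharedExpertMoE(nn.Module):
    """Hypothetical layer: always-on shared experts plus top-k routed experts."""

    def __init__(self, hidden_size, ffn_size, num_routed=6, num_shared=2, top_k=1):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_size, num_routed, bias=False)
        self.routed = nn.ModuleList([make_ffn(hidden_size, ffn_size) for _ in range(num_routed)])
        self.shared = nn.ModuleList([make_ffn(hidden_size, ffn_size) for _ in range(num_shared)])

    def forward(self, x):
        # x: (num_tokens, hidden_size)
        out = torch.zeros_like(x)
        # Shared experts process every token, with no gating.
        for expert in self.shared:
            out = out + expert(x)
        # Each token is sent to its top-k routed experts, weighted by router probability.
        probs = F.softmax(self.gate(x), dim=-1)
        weights, indices = probs.topk(self.top_k, dim=-1)
        for expert_id, expert in enumerate(self.routed):
            token_idx, slot = (indices == expert_id).nonzero(as_tuple=True)
            if token_idx.numel() > 0:
                routed_out = weights[token_idx, slot].unsqueeze(-1) * expert(x[token_idx])
                out = out.index_add(0, token_idx, routed_out)
        return out
```

Because the shared experts bypass the router, they receive gradient from every token, which is what lets them absorb patterns common to all inputs.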
| Approach | Pros | Cons | Feasibility | Time |
|---|---|---|---|---|
| Train from scratch | Full control | Expensive, needs data | Low | 6-8 months |
| Upcycle dense to MoE | Reuses weights, minimal data | Sub-optimal init | High | 2-3 months |
| Merge existing models | Fast, no training | Quality mismatch | Medium | 1 month |
| Adapter-based MoE | Very fast, parameter-efficient | Limited capacity | High | 1-2 months |
Upcycling reuses pre-trained dense weights, requires minimal training data, and converges faster than training from scratch.
Step 1: Analyze dense model (identify FFN layers, extract dimensions)
Step 2: Create N expert copies of each FFN, initialize router, add shared experts
Step 3: Train router only (experts frozen), monitor load balancing
Step 4: Fine-tune experts (optional, low learning rate, 1-2 epochs)
Step 5: Quantize and optimize for mobile
Source: Upcycling LLMs to MoE.
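Without regularization, the router collapses to using 1-2 experts. An auxiliary load balancing loss encourages uniform expert utilization. The Switch Transformer form multiplies each expert's share of dispatched tokens by its mean router probability and reaches its minimum of 1.0 under uniform routing:

    import torch
    import torch.nn.functional as F

    def load_balance_loss(gate_scores, num_tokens, top_k=1):
        # gate_scores holds raw router logits with shape (num_tokens, num_experts)
        num_experts = gate_scores.size(1)
        probs = F.softmax(gate_scores, dim=-1)
        chosen = probs.topk(top_k, dim=-1).indices
        # f_i: fraction of routing slots sent to each expert (hard counts, no gradient)
        tokens_per_expert = F.one_hot(chosen, num_experts).float().sum(dim=(0, 1)) / (num_tokens * top_k)
        # P_i: mean router probability per expert (this term carries the gradient)
        prob_per_expert = probs.sum(dim=0) / num_tokens
        return num_experts * torch.sum(tokens_per_expert * prob_per_expert)

    def total_loss(model_loss, gate_scores, num_tokens, aux_loss_weight=0.01):
        return model_loss + aux_loss_weight * load_balance_loss(gate_scores, num_tokens)

Source: Switch Transformer.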
A router z-loss penalizes large router logits to prevent numerical instability:

    def router_z_loss(router_logits):
        # Mean over tokens of the squared log-sum-exp of each token's logits
        return torch.logsumexp(router_logits, dim=-1).square().mean()

Combined loss: model_loss + 0.01 * lb_loss + 0.001 * z_loss.
Source: Expert Choice Routing.
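Phase 1: Router training (experts frozen)

    learning_rate = 1e-3
    batch_size = 32
    epochs = 5  # typical range: 5-10

    # Train only the router: its weights are named "...mlp.gate.weight" in MoEFeedForward
    for name, param in model.named_parameters():
        param.requires_grad = name.endswith("mlp.gate.weight")

Phase 2: Full fine-tuning (optional)

    learning_rate = 1e-5
    batch_size = 16
    epochs = 1  # typical range: 1-2
    weight_decay = 0.01

    for param in model.parameters():
        param.requires_grad = True

    # Inside the training loop, after loss.backward() and before optimizer.step():
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

The full set of experts may not fit in device memory. Load them on demand from storage using an LRU cache:

    from collections import OrderedDict
    import torch

    class ExpertManager:
        def __init__(self, expert_paths, cache_size=2):
            # Assumes expert_paths[expert_id] is a file written with torch.save(expert_module, path)
            self.expert_paths = expert_paths
            self.cache_size = cache_size
            self.cache = OrderedDict()  # ordered from least to most recently used

        def get_expert(self, expert_id):
            if expert_id in self.cache:
                self.cache.move_to_end(expert_id)  # mark as most recently used
                return self.cache[expert_id]
            if len(self.cache) >= self.cache_size:
                # Inference weights are read-only, so the evicted expert needs no write-back
                self.cache.popitem(last=False)
            expert = torch.load(self.expert_paths[expert_id], map_location="cpu", weights_only=False)
            self.cache[expert_id] = expert
            return expert

Sources: Mixture of Cache-Conditional Experts, EdgeMoE.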
The standard capacity factor is 1.25-1.5. For mobile, use 1.0-1.2 (stricter) because memory is limited and batch size is typically 1.

    import math

    def calculate_capacity(num_tokens, num_experts, capacity_factor=1.0):
        # Round up and keep at least one slot so single-token decoding never gets capacity 0
        return max(1, math.ceil(num_tokens * capacity_factor / num_experts))

Gradient checkpointing reduces training memory by 40-60% by recomputing activations during the backward pass instead of storing them:

    from torch.utils.checkpoint import checkpoint

    # expert_forward(expert, x) is any callable that runs one expert on x
    output = checkpoint(expert_forward, expert, x, use_reentrant=False)

Source: MoEtion.
| Repository | Features | Model size | License |
|---|---|---|---|
| OLMoE | Upcycling, training, inference | 1B-7B | Apache 2.0 |
| MoE in PyTorch | Clean implementation, aux losses | Any | MIT |
| DeepSeek-MoE | Shared experts, production | 16B | MIT |
| Parameter-Efficient-MoE | Upcycling, EMNLP 2024 | Any | Academic |
Use Hugging Face Transformers for general inference and vLLM for fast batch inference testing.
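Hardware: GPU with 16GB+ VRAM (training), CPU with 8GB+ RAM (inference)
Software: Python 3.9+, PyTorch 2.0+, Transformers 4.51+ (earlier releases do not include the Qwen3 architecture)
Data: 10K-100K prompts for training

Load the base model and tokenizer:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "Qwen/Qwen3-0.6B"
    model = AutoModelForCausalLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

Qwen3-0.6B architecture: hidden size 1024, FFN dimension 3072, 28 layers, 16 attention heads. Read the exact values from model.config (hidden_size, intermediate_size, num_hidden_layers, num_attention_heads) rather than hard-coding them.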
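Define the MoE feed-forward layer that replaces a dense FFN:

    import copy

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MoEFeedForward(nn.Module):
        def __init__(self, hidden_size, ffn_size, num_experts, top_k=1):
            super().__init__()
            self.num_experts = num_experts
            self.top_k = top_k
            self.gate = nn.Linear(hidden_size, num_experts, bias=False)
            self.experts = nn.ModuleList([
                nn.Sequential(
                    nn.Linear(hidden_size, ffn_size),
                    nn.SiLU(),
                    nn.Linear(ffn_size, hidden_size)
                ) for _ in range(num_experts)
            ])
            self.last_gate_scores = None  # router logits of the latest forward pass

        def forward(self, x):
            batch_size, seq_len, hidden_size = x.shape
            x_flat = x.reshape(-1, hidden_size)
            gate_scores = self.gate(x_flat)
            self.last_gate_scores = gate_scores  # kept for load_balance_loss
            # Softmax over all experts before top-k: a softmax over a single selected
            # logit is always 1.0 and would leave the router without gradient at top_k=1
            probs = F.softmax(gate_scores, dim=-1)
            top_k_scores, top_k_indices = torch.topk(probs, self.top_k, dim=-1)
            if self.top_k > 1:
                # Renormalize so the selected weights sum to 1 per token
                top_k_scores = top_k_scores / top_k_scores.sum(dim=-1, keepdim=True)
            output = torch.zeros_like(x_flat)
            for k in range(self.top_k):
                expert_idx = top_k_indices[:, k]
                score = top_k_scores[:, k:k+1]
                for expert_id in range(self.num_experts):
                    mask = (expert_idx == expert_id)
                    if mask.any():
                        output[mask] += score[mask] * self.experts[expert_id](x_flat[mask])
            return output.view(batch_size, seq_len, hidden_size)

        def load_balance_loss(self, gate_scores=None):
            # Switch-style loss: num_experts * sum_i(f_i * P_i), equal to 1.0 under uniform routing
            gate_scores = self.last_gate_scores if gate_scores is None else gate_scores
            probs = F.softmax(gate_scores, dim=-1)
            chosen = probs.topk(self.top_k, dim=-1).indices
            # f_i: fraction of routing slots per expert; P_i: mean router probability per expert
            tokens_per_expert = F.one_hot(chosen, self.num_experts).float().mean(dim=(0, 1))
            return self.num_experts * torch.sum(tokens_per_expert * probs.mean(dim=0))

Convert the dense model by swapping selected FFN layers for MoE layers:

    def convert_dense_to_moe(dense_model, num_experts=8, top_k=1, layers_to_convert="all"):
        config = dense_model.config
        hidden_size = config.hidden_size
        expert_ffn_size = config.intermediate_size // 2
        layer_indices = (
            range(len(dense_model.model.layers))
            if layers_to_convert == "all"
            else layers_to_convert
        )
        for layer_idx in layer_indices:
            layer = dense_model.model.layers[layer_idx]
            old_ffn = layer.mlp
            new_moe = MoEFeedForward(hidden_size, expert_ffn_size, num_experts, top_k)
            # Xavier-initialize the router and expert weight matrices and zero the
            # biases (xavier_uniform_ only accepts tensors with 2 or more dimensions)
            for param in new_moe.parameters():
                if param.dim() > 1:
                    nn.init.xavier_uniform_(param)
                else:
                    nn.init.zeros_(param)
            # Expert 0 reuses the dense FFN unchanged: its gated MLP (gate_proj,
            # up_proj, down_proj) cannot be copied into a two-layer nn.Sequential
            new_moe.experts[0] = copy.deepcopy(old_ffn)
            # Match the device and dtype of the layer being replaced
            ref = old_ffn.down_proj.weight
            layer.mlp = new_moe.to(device=ref.device, dtype=ref.dtype)
        return dense_model

Convert a subset of layers, for example:

    moe_model = convert_dense_to_moe(
        model, num_experts=8, top_k=1, layers_to_convert=[6, 12, 18, 23]
    )

Train the router on the prompt dataset:

    import json

    import torch.optim as optim
    from torch.utils.data import Dataset, DataLoader

    class PromptDataset(Dataset):
        def __init__(self, prompts_file):
            with open(prompts_file) as f:
                self.prompts = json.load(f)["prompts"]

        def __len__(self):
            return len(self.prompts)

        def __getitem__(self, idx):
            encoded = tokenizer(
                self.prompts[idx]["prompt"],
                max_length=512, padding="max_length", truncation=True, return_tensors="pt"
            )
            input_ids = encoded["input_ids"].squeeze(0)
            attention_mask = encoded["attention_mask"].squeeze(0)
            # Label -100 excludes padding positions from the language-modeling loss
            labels = input_ids.masked_fill(attention_mask == 0, -100)
            return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    moe_model.to(device)
    dataset = PromptDataset("prompts.json")
    dataloader = DataLoader(dataset, batch_size=4, shuffle=True)

    # Freeze everything except the routers. A plain "gate" substring match
    # would also select the gate_proj weights of the dense FFNs.
    for name, param in moe_model.named_parameters():
        param.requires_grad = name.endswith("mlp.gate.weight")
    router_params = [p for p in moe_model.parameters() if p.requires_grad]
    optimizer = optim.AdamW(router_params, lr=1e-3, weight_decay=0.01)
    moe_layers = [m for m in moe_model.modules() if isinstance(m, MoEFeedForward)]

    moe_model.train()
    for batch in dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = moe_model(**batch)
        # Add the load balancing loss of every MoE layer (aux loss weight 0.01)
        aux_loss = sum(layer.load_balance_loss() for layer in moe_layers)
        loss = outputs.loss + 0.01 * aux_loss
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(router_params, max_norm=1.0)
        optimizer.step()

Save the model and tokenizer. The saved config still describes the dense architecture, so reloading requires running convert_dense_to_moe on a fresh base model with the same arguments before loading the saved weights:

    moe_model.save_pretrained("./tinymoe-qwen-0.6b-8e")
    tokenizer.save_pretrained("./tinymoe-qwen-0.6b-8e")

Measure latency and throughput per prompt:

    import time
    from tqdm import tqdm

    def evaluate(model, tokenizer, prompts):
        model.eval()
        results = []
        for item in tqdm(prompts):
            inputs = tokenizer(item["prompt"], return_tensors="pt", truncation=True, max_length=512)
            inputs = inputs.to(model.device)
            start = time.perf_counter()
            with torch.no_grad():
                outputs = model.generate(**inputs, max_new_tokens=150, do_sample=False)
            latency_ms = (time.perf_counter() - start) * 1000
            # Decode only the newly generated tokens, not the prompt
            completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
            num_tokens = outputs[0].shape[0] - inputs["input_ids"].shape[1]
            results.append({
                "category": item["category"],
                "completion": completion,  # needed to score accuracy
                "latency_ms": latency_ms,
                "tokens": num_tokens,
                "tps": num_tokens / (latency_ms / 1000),
            })
        return results

| Model | Parameters | Active | Architecture |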
|---|---|---|---|
| Qwen3-0.6B-q4f32_1 (baseline) | 600M | 600M | Dense, quantized |
| Qwen3-0.6B-q0f32 (baseline) | 600M | 600M | Dense, full precision |
| TinyMoE-Qwen-0.6B-4E | ~900M | ~450M | 4 experts, top-1 |
| TinyMoE-Qwen-0.6B-8E | ~1.2B | ~600M | 8 experts, top-1 |
| TinyMoE-Qwen-0.6B-8E-k2 | ~1.2B | ~750M | 8 experts, top-2 |
| Metric | Description | Target |
|---|---|---|
| Accuracy | % correct completions | >90% of q0f32 |
| Latency | Avg generation time (ms) | <30ms per prompt |
| Memory | Peak RSS | <3GB |
| Energy | Per 100 prompts (mJ) | <400 mJ |
| TPS | Tokens per second | >15 |
| Expert utilization | % experts used | >80% |
| Load balance | Gini index | <0.3 |
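The last two metrics in the table can be computed from per-expert token counts collected during evaluation. The sketch below is one way to do it, assuming the counts are available as a plain list with one entry per expert; the function names are hypothetical, and the Gini index uses the standard rank-weighted formula.

```python
def expert_utilization(token_counts):
    """Fraction of experts that received at least one token."""
    return sum(1 for c in token_counts if c > 0) / len(token_counts)


def gini_index(token_counts):
    """Gini index of per-expert token counts: 0.0 means perfectly balanced."""
    n = len(token_counts)
    total = sum(token_counts)
    if total == 0:
        return 0.0
    ordered = sorted(token_counts)
    # Rank-weighted form: G = sum_i (2i - n - 1) * x_i / (n * sum(x)), with i = 1..n
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(ordered, start=1))
    return weighted / (n * total)


counts = [120, 95, 110, 0, 130, 105, 90, 150]  # made-up counts for 8 experts
print(expert_utilization(counts), gini_index(counts))
```

With n experts the index tops out at (n - 1) / n when a single expert receives every token, so the <0.3 target leaves room for moderate imbalance.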
- Expert count: 4 vs 8 vs 16
- Top-k: k=1 vs k=2 vs adaptive
- Initialization: random vs copied from dense
- Shared experts: 0 vs 1 vs 2
- Quantization: q4f32_1 on routed experts vs all experts
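The ablation axes above multiply quickly, so it helps to enumerate the runs before committing compute. The sketch below builds both the full grid and a one-factor-at-a-time set around a baseline. The dictionary keys, value labels, and function names are hypothetical; only the values themselves come from the list above.

```python
from itertools import product

# Values copied from the ablation list; key names and labels are illustrative.
ABLATION_AXES = {
    "num_experts": [4, 8, 16],
    "top_k": [1, 2, "adaptive"],
    "init": ["random", "copied_from_dense"],
    "num_shared": [0, 1, 2],
    "quantize": ["routed_only", "all_experts"],
}


def ablation_configs(axes=ABLATION_AXES):
    """Yield one dict per combination of ablation settings (full grid)."""
    keys = list(axes)
    for values in product(*(axes[k] for k in keys)):
        yield dict(zip(keys, values))


def one_factor_configs(baseline, axes=ABLATION_AXES):
    """Vary one axis at a time around a baseline, a cheaper alternative to the full grid."""
    configs = [dict(baseline)]  # the baseline run comes first
    for key, values in axes.items():
        for value in values:
            cfg = dict(baseline, **{key: value})
            if cfg not in configs:
                configs.append(cfg)
    return configs


baseline = {"num_experts": 8, "top_k": 1, "init": "copied_from_dense",
            "num_shared": 1, "quantize": "routed_only"}
print(len(list(ablation_configs())), len(one_factor_configs(baseline)))
```

The one-factor set cannot reveal interactions between axes (for example expert count and top-k), so it suits a first pass rather than a final comparison.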
| Model | Accuracy | Latency | Memory | Energy |
|---|---|---|---|---|
| Qwen3-0.6B-q4f32_1 | 75% | 20ms | 1.5GB | 300mJ |
| Qwen3-0.6B-q0f32 | 95% | 45ms | 3GB | 800mJ |
| TinyMoE-8E-k1 | 88% | 25ms | 2GB | 400mJ |
| TinyMoE-8E-k2 | 92% | 35ms | 2.5GB | 550mJ |
Success criteria for TinyMoE-8E-k1:
- Accuracy within 5% of q0f32 baseline (target: >90%)
- Latency within 10ms of q4f32_1 baseline (target: <30ms)
- Memory <3GB (iOS compatible)
- Energy <500mJ per 100 prompts
Optimal configuration for 600M-1B base models:
Expert count: 8 (6-7 routed, 1-2 shared)
Top-k: 1 (adaptive to 2 for hard prompts)
Capacity factor: 1.0-1.2
Expert FFN size: 50-75% of dense FFN
Learning rate: 1e-3 (router only, experts frozen), 1e-5 (full fine-tuning)
Aux loss weight: 0.01
Z-loss weight: 0.001
Batch size: 4-16 (training), 1 (mobile inference)
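One way to keep these settings in a single place is a small validated config object. This is a sketch only: the `TinyMoEConfig` class and its field names are hypothetical, the defaults pick one point from each recommended range, and the validation rules are assumptions about which combinations make sense.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TinyMoEConfig:
    """Hypothetical container for the settings listed above."""
    num_routed_experts: int = 7
    num_shared_experts: int = 1
    top_k: int = 1
    capacity_factor: float = 1.0
    expert_ffn_ratio: float = 0.5   # expert FFN size as a fraction of the dense FFN
    router_lr: float = 1e-3         # experts frozen
    finetune_lr: float = 1e-5       # all parameters trainable
    aux_loss_weight: float = 0.01
    z_loss_weight: float = 0.001

    def __post_init__(self):
        # Reject combinations that cannot be routed or sized sensibly.
        if not 1 <= self.top_k <= self.num_routed_experts:
            raise ValueError("top_k must be between 1 and num_routed_experts")
        if not 0.0 < self.expert_ffn_ratio <= 1.0:
            raise ValueError("expert_ffn_ratio must be in (0, 1]")
        if self.capacity_factor < 1.0:
            raise ValueError("capacity_factor should be at least 1.0")

    @property
    def num_experts(self):
        """Total expert count, routed plus shared."""
        return self.num_routed_experts + self.num_shared_experts
```

Making the dataclass frozen keeps a run's settings from changing silently between the router phase and the fine-tuning phase.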
- Expert collapse (only 1-2 experts used): Increase aux_loss_weight to 0.05-0.1, add weight_decay=0.01.
- CUDA OOM: Use gradient checkpointing, reduce batch size to 2-4, set capacity_factor=1.0.
- Poor accuracy after upcycling: Train router longer (10-15 epochs), fine-tune all parameters at LR 1e-5.
- Slow inference on iOS: Reduce num_experts to 4, use top-k=1, implement aggressive expert caching.
- Switch Transformers - Fedus et al., JMLR 2022
- Outrageously Large Neural Networks - Shazeer et al., 2017
- Upcycling LLMs to MoE - 2024
- Parameter-Efficient Sparsity Crafting - EMNLP 2024
- Scaling Laws for Upcycling MoE
- OLMoE - 2024
- Phi-4-Mini Technical Report - 2025
- Expert Choice Routing - NeurIPS 2022
- LExI: Layer-Adaptive Active Experts - 2025
- Optimizing MoE Routers - 2025
- Mixture of Cache-Conditional Experts - 2024
- EdgeMoE - IEEE TMC 2025
- MoEtion: Sparse Checkpointing - 2024
- On Implementing Load Balancing Loss - ACL 2025
- MegaScale-MoE - 2025
- PyTorch: Training MoEs - 2024
- A Comprehensive Survey of MoE - 2025
- MoE for LLMs Survey - 2024