afftab/tinyMoE-Qwen

TinyMoE: Sub-3B Mixture-of-Experts for Mobile Devices

Date: January 26, 2026

Focus: Implementing efficient MoE architectures for models under 3B parameters, targeting resource-constrained environments (edge devices, mobile, laptops).


Contents

  1. Motivation
  2. Architecture design
  3. Implementation approach
  4. Training
  5. Memory optimization
  6. Tools and frameworks
  7. Step-by-step implementation
  8. Experimental design
  9. Expected results
  10. References

1. Motivation

Most MoE research targets large models (7B+). There is little work on MoE for sub-3B models designed for mobile deployment. This guide addresses that gap.

Target architecture:

Base model: Qwen3-0.6B
MoE variant: TinyMoE-Qwen-0.6B-8E
Total parameters: ~1.2B (2x base)
Active parameters: ~750M (1.25x base)
Expert count: 8
Top-k: 1-2 (adaptive)
Shared experts: 1

Research questions:

  1. What is the optimal expert count for sub-3B models? Hypothesis: 4-8.
  2. How does top-k selection affect performance? Hypothesis: top-1 is sufficient for tiny models.
  3. Can upcycled MoE match dense model quality at 1.5x parameters? Hypothesis: yes, with proper router training.
  4. What memory optimizations are needed for resource-constrained deployment? Hypothesis: expert swapping and cache-aware routing.

How this differs from existing work:

| Aspect | Existing work | This work |
| --- | --- | --- |
| Model size | 7B-70B | 600M-3B |
| Platform | Server-side GPU | Edge, mobile |
| Expert count | 16-128 | 4-8 |
| Training | From scratch | Upcycle dense to MoE |
| Routing | Top-2, fixed | Adaptive top-k |
| Memory | Abundant GPU memory | Memory-efficient design |

2. Architecture design

2.1 Expert count

For models under 3B parameters, the optimal expert count is lower than for large models:

| Model size | Large MoE experts | Tiny MoE experts | Reason |
| --- | --- | --- | --- |
| 50B-70B | 64-128 | N/A | Sufficient capacity for specialization |
| 7B-15B | 16-32 | N/A | Balance specialization and efficiency |
| 1B-3B | N/A | 4-8 | Limited capacity, memory constraints |
| <1B | N/A | 2-4 | Minimal viable specialization |

Recommendation: 8 experts for 600M-1B models (6-7 routed, 1-2 shared). Total parameters land around 1.2-1.5B.

Sources: OLMoE-1B-7B, MoE Scaling Laws.

2.2 Top-k routing

For tiny models, fewer active experts are better:

| Top-k | Large models (7B+) | Tiny models (<3B) | Reason |
| --- | --- | --- | --- |
| 1 | Quality loss | Optimal | Faster, less memory |
| 2 | Standard | Acceptable | Balance quality and speed |
| 4 | High quality | Overkill | Memory overflow |

Recommendation: adaptive top-k. Use k=1 for easy prompts (factual, creative), k=2 for hard prompts (reasoning, math). Force k=1 when battery is low or memory is constrained.
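
A minimal sketch of that policy, assuming each prompt already carries a category label and that the device reports battery and memory state. The function name, the category strings, and the two flags are hypothetical; the source does not say how prompts are classified.

```python
# Categories treated as "hard" in the recommendation above (assumed label strings).
HARD_CATEGORIES = {"reasoning", "math"}

def choose_top_k(category, battery_low=False, memory_constrained=False):
    """Return how many routed experts to activate for one prompt."""
    # Device constraints override prompt difficulty.
    if battery_low or memory_constrained:
        return 1
    return 2 if category in HARD_CATEGORIES else 1
```

For example, `choose_top_k("math")` selects two experts, while `choose_top_k("math", battery_low=True)` falls back to one.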

Sources: Expert Choice Routing, LExI.

2.3 Expert size

Each expert should be smaller than the original dense FFN:

For Qwen3-0.6B:
- Base FFN dimension (intermediate_size): 3072
- 8 experts, expansion factor 1.75
- Expert FFN size: ~672 (3072 × 1.75 / 8)

When upcycling, keep the FFN dimension similar to the base model and distribute across experts via routing.
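
The sizing figures above appear to follow expert width = base FFN width × expansion factor / number of experts. A small sketch of that arithmetic, under that reading; the function name and the optional rounding up to a hardware-friendly multiple are assumptions, not something the source specifies.

```python
import math

def expert_ffn_size(base_ffn_size, num_experts, expansion_factor, multiple_of=1):
    """Width of one expert when the expanded FFN budget is split evenly across experts."""
    raw = base_ffn_size * expansion_factor / num_experts
    # Optionally round up so the width is a multiple of `multiple_of` (e.g. 64 or 128).
    return int(math.ceil(raw / multiple_of) * multiple_of)

# Illustrative values only, not the Qwen3-0.6B configuration:
print(expert_ffn_size(2048, 8, 1.75))                   # 448
print(expert_ffn_size(2048, 8, 1.75, multiple_of=128))  # 512
```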

Sources: DeepSeekMoE, Switch Transformer.

2.4 Shared experts

Shared experts are always active and capture common linguistic patterns. This reduces redundancy between routed experts by 30-50%.

Output = Shared_Experts(Input) + sum(Routed_Experts(Input))

Recommendation: 1-2 shared experts (10-20% of capacity), 6-7 routed experts (80-90%).

Source: DeepSeekMoE: Shared Experts.
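
The output formula above can be sketched as a small PyTorch module: the shared expert runs on every token, and the routed experts add their contribution on top. This is an illustration under assumptions, not the project's layer. The class and helper names are hypothetical, the experts are plain two-layer MLPs, and routed outputs are weighted by their gate probability, which the formula above leaves implicit.

```python
import torch
import torch.nn as nn

def make_expert(hidden_size, ffn_size):
    # Simple two-layer MLP standing in for one expert.
    return nn.Sequential(nn.Linear(hidden_size, ffn_size), nn.SiLU(), nn.Linear(ffn_size, hidden_size))

class SharedPlusRouted(nn.Module):
    def __init__(self, hidden_size, ffn_size, num_routed=6, top_k=1):
        super().__init__()
        self.shared = make_expert(hidden_size, ffn_size)
        self.routed = nn.ModuleList([make_expert(hidden_size, ffn_size) for _ in range(num_routed)])
        self.gate = nn.Linear(hidden_size, num_routed, bias=False)
        self.top_k = top_k

    def forward(self, x):  # x: (num_tokens, hidden_size)
        probs = torch.softmax(self.gate(x), dim=-1)
        scores, indices = probs.topk(self.top_k, dim=-1)
        out = self.shared(x)  # shared path: every token, no routing
        for expert_id, expert in enumerate(self.routed):
            hit = indices == expert_id                        # (num_tokens, top_k)
            rows = hit.any(dim=-1).nonzero(as_tuple=True)[0]  # tokens routed to this expert
            if rows.numel() == 0:
                continue
            weight = (scores * hit).sum(dim=-1, keepdim=True)[rows]
            out = out.index_add(0, rows, weight * expert(x[rows]))
        return out
```

Because the shared expert sees every token, it always stays resident in memory; only the routed experts are candidates for swapping.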


3. Implementation approach

3.1 Approach comparison

| Approach | Pros | Cons | Feasibility | Time |
| --- | --- | --- | --- | --- |
| Train from scratch | Full control | Expensive, needs data | Low | 6-8 months |
| Upcycle dense to MoE | Reuses weights, minimal data | Sub-optimal init | High | 2-3 months |
| Merge existing models | Fast, no training | Quality mismatch | Medium | 1 month |
| Adapter-based MoE | Very fast, parameter-efficient | Limited capacity | High | 1-2 months |

3.2 Upcycling process

Upcycling reuses pre-trained dense weights, requires minimal training data, and converges faster than training from scratch.

Step 1: Analyze dense model (identify FFN layers, extract dimensions)
Step 2: Create N expert copies of each FFN, initialize router, add shared experts
Step 3: Train router only (experts frozen), monitor load balancing
Step 4: Fine-tune experts (optional, low learning rate, 1-2 epochs)
Step 5: Quantize and optimize for mobile

Source: Upcycling LLMs to MoE.


4. Training

4.1 Load balancing loss

Without regularization, the router collapses to using 1-2 experts. An auxiliary load balancing loss encourages uniform expert utilization:

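import torch
import torch.nn.functional as F

def load_balance_loss(gate_scores, num_tokens):
    # gate_scores: raw router logits, shape (num_tokens, num_experts)
    num_experts = gate_scores.size(1)
    probs = F.softmax(gate_scores, dim=-1)
    # f_i: fraction of tokens whose top-1 expert is i (hard count, carries no gradient)
    assigned = F.one_hot(probs.argmax(dim=-1), num_experts).float()
    tokens_per_expert = assigned.sum(dim=0) / num_tokens
    # P_i: mean router probability of expert i (carries the gradient)
    # Switch Transformer loss N * sum_i(f_i * P_i); equals 1.0 under uniform routing
    return num_experts * torch.sum(tokens_per_expert * probs.mean(dim=0))

def total_loss(model_loss, gate_scores, num_tokens, aux_loss_weight=0.01):
    return model_loss + aux_loss_weight * load_balance_loss(gate_scores, num_tokens)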

Source: Switch Transformer.
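
For reference, the Switch Transformer paper writes the auxiliary loss as N · Σ f_i · P_i, where N is the number of experts, f_i is the fraction of tokens dispatched to expert i, and P_i is the mean router probability for expert i. A worked example in plain Python shows how the value grows as routing collapses; the function name and the input numbers are illustrative.

```python
def switch_aux_loss(token_fractions, mean_router_probs):
    """N * sum_i(f_i * P_i) for per-expert token fractions f and mean probabilities P."""
    n = len(token_fractions)
    return n * sum(f * p for f, p in zip(token_fractions, mean_router_probs))

# Four experts, perfectly balanced routing: the loss sits at its minimum of 1.0.
uniform = switch_aux_loss([0.25] * 4, [0.25] * 4)

# Four experts, nearly all tokens and probability mass on expert 0: about 3.88.
collapsed = switch_aux_loss([1.0, 0.0, 0.0, 0.0], [0.97, 0.01, 0.01, 0.01])
```

The loss ranges from 1 (balanced) toward N (fully collapsed), which is why a small weight such as 0.01 is enough to steer the router without dominating the language-modeling loss.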

4.2 Router z-loss

Penalizes large router logits to prevent numerical instability:

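def router_z_loss(router_logits):
    # Mean over tokens of the squared log-sum-exp of the logits (ST-MoE formulation).
    # The 0.001 weight is applied once, in the combined loss below.
    return torch.logsumexp(router_logits, dim=-1).square().mean()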

Combined loss: model_loss + 0.01 * lb_loss + 0.001 * z_loss.

Source: Expert Choice Routing.

4.3 Training schedule

Phase 1: Router training (experts frozen)

learning_rate = 1e-3
batch_size = 32
epochs = 5  # 5-10

# Train only the routers. ".mlp.gate." matches the MoE router weights
# but not the gate_proj weights of unconverted dense layers.
for name, param in model.named_parameters():
    param.requires_grad = ".mlp.gate." in name

Phase 2: Full fine-tuning (optional)

learning_rate = 1e-5
batch_size = 16
epochs = 1  # 1-2
weight_decay = 0.01

for param in model.parameters():
    param.requires_grad = True

# Inside the training loop, after loss.backward() and before optimizer.step():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

5. Memory optimization

5.1 Expert swapping

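Not all experts may fit in device memory. Load them on demand from storage and keep the most recently used ones in an LRU cache:

from collections import OrderedDict
import torch

class ExpertManager:
    def __init__(self, expert_paths, cache_size=2, load_fn=torch.load):
        self.expert_paths = expert_paths   # expert_id -> path of the saved expert weights
        self.cache_size = cache_size
        self.load_fn = load_fn             # assumed hook: reads one expert from storage
        self.cache = OrderedDict()         # expert_id -> expert, least recently used first

    def get_expert(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)  # mark as most recently used
            return self.cache[expert_id]
        if len(self.cache) >= self.cache_size:
            # Inference weights are read-only, so the evicted expert is dropped, not written back
            self.cache.popitem(last=False)
        expert = self.load_fn(self.expert_paths[expert_id])
        self.cache[expert_id] = expert
        return expert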

Sources: Mixture of Cache-Conditional Experts, EdgeMoE.

5.2 Capacity factor

Standard capacity factor is 1.25-1.5. For mobile, use 1.0-1.2 (stricter) because memory is limited and batch size is typically 1.

import math

def calculate_capacity(num_tokens, num_experts, capacity_factor=1.0):
    # Round up, minimum 1: int() truncation gives 0 slots when num_tokens < num_experts
    return max(1, math.ceil(num_tokens * capacity_factor / num_experts))
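
To make the effect of the capacity limit concrete, here is a sketch of what happens to tokens once an expert is full. Dropping overflow tokens (they skip the expert layer and pass through the residual connection) is the usual Switch-style behavior and is assumed here; the source does not say how this project handles overflow. The function name is hypothetical.

```python
def dispatch_with_capacity(assignments, num_experts, capacity):
    """assignments[t] is the expert chosen for token t.

    Returns (kept, dropped): token indices that fit within each expert's
    capacity, and token indices that overflowed.
    """
    load = [0] * num_experts
    kept, dropped = [], []
    for token, expert in enumerate(assignments):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append(token)
        else:
            dropped.append(token)  # expert is full: this token overflows
    return kept, dropped

# 8 tokens, 4 experts, capacity 2 per expert (capacity factor 1.0).
kept, dropped = dispatch_with_capacity([0, 0, 0, 1, 2, 0, 3, 1], num_experts=4, capacity=2)
# kept == [0, 1, 3, 4, 6, 7], dropped == [2, 5]
```

Expert 0 was chosen four times but has only two slots, so two tokens overflow. A lower capacity factor saves memory at the cost of more dropped tokens when routing is unbalanced.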

5.3 Gradient checkpointing

Reduces training memory by 40-60% by recomputing activations during backward pass instead of storing them:

from torch.utils.checkpoint import checkpoint

# expert is an nn.Module; its activations are recomputed during the backward pass
output = checkpoint(expert, x, use_reentrant=False)

Source: MoEtion.


6. Tools and frameworks

Open source implementations

| Repository | Features | Model size | License |
| --- | --- | --- | --- |
| OLMoE | Upcycling, training, inference | 1B-7B | Apache 2.0 |
| MoE in PyTorch | Clean implementation, aux losses | Any | MIT |
| DeepSeek-MoE | Shared experts, production | 16B | MIT |
| Parameter-Efficient-MoE | Upcycling, EMNLP 2024 | Any | Academic |

Inference

HuggingFace Transformers for general inference. vLLM for fast batch inference testing.


7. Step-by-step implementation

7.1 Prerequisites

Hardware: GPU with 16GB+ VRAM (training), CPU with 8GB+ RAM (inference)
Software: Python 3.9+, PyTorch 2.0+, Transformers 4.35+
Data: 10K-100K prompts for training

7.2 Load base model

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-0.6B"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Qwen3-0.6B architecture: hidden size 1024, FFN dimension (intermediate_size) 3072, 28 layers, 16 attention heads (8 key-value heads). These values can be read from model.config.

7.3 Define MoE layer

import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """SwiGLU feed-forward block with the same layout as the dense Qwen MLP (no biases)."""
    def __init__(self, hidden_size, ffn_size):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, ffn_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, ffn_size, bias=False)
        self.down_proj = nn.Linear(ffn_size, hidden_size, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

class MoEFeedForward(nn.Module):
    def __init__(self, hidden_size, ffn_size, num_experts, top_k=1):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k

        self.gate = nn.Linear(hidden_size, num_experts, bias=False)  # router
        self.experts = nn.ModuleList([
            Expert(hidden_size, ffn_size) for _ in range(num_experts)
        ])
        self.last_gate_scores = None  # router logits of the latest forward pass

    def forward(self, x):
        batch_size, seq_len, hidden_size = x.shape
        x_flat = x.reshape(-1, hidden_size)

        gate_scores = self.gate(x_flat)
        self.last_gate_scores = gate_scores
        # Softmax over all experts before top-k. A softmax over the selected scores
        # alone is constant 1.0 when top_k=1 and gives the router no gradient.
        probs = F.softmax(gate_scores, dim=-1)
        top_k_scores, top_k_indices = torch.topk(probs, self.top_k, dim=-1)

        output = torch.zeros_like(x_flat)
        for k in range(self.top_k):
            expert_idx = top_k_indices[:, k]
            score = top_k_scores[:, k:k+1]
            for expert_id in range(self.num_experts):
                mask = (expert_idx == expert_id)
                if mask.any():
                    output[mask] += score[mask] * self.experts[expert_id](x_flat[mask])

        return output.view(batch_size, seq_len, hidden_size)

    def load_balance_loss(self, gate_scores=None):
        # Switch Transformer auxiliary loss: N * sum_i(f_i * P_i)
        if gate_scores is None:
            gate_scores = self.last_gate_scores
        probs = F.softmax(gate_scores, dim=-1)
        assigned = F.one_hot(probs.argmax(dim=-1), self.num_experts).float()
        return self.num_experts * torch.sum(assigned.mean(dim=0) * probs.mean(dim=0))

7.4 Convert dense model to MoE

def convert_dense_to_moe(dense_model, num_experts=8, top_k=1, layers_to_convert="all"):
    config = dense_model.config
    hidden_size = config.hidden_size
    expert_ffn_size = config.intermediate_size // 2

    layer_indices = (
        range(len(dense_model.model.layers))
        if layers_to_convert == "all"
        else layers_to_convert
    )

    for layer_idx in layer_indices:
        layer = dense_model.model.layers[layer_idx]
        old_ffn = layer.mlp
        ref = old_ffn.gate_proj.weight  # reference for device and dtype

        new_moe = MoEFeedForward(hidden_size, expert_ffn_size, num_experts, top_k)

        # Randomly initialize the router and all experts (every parameter is a 2-D weight)
        for param in new_moe.parameters():
            nn.init.xavier_uniform_(param)

        # Copy dense FFN weights to the first expert. Each expert is narrower than the
        # dense FFN, so only the first expert_ffn_size intermediate units are copied.
        with torch.no_grad():
            first = new_moe.experts[0]
            first.gate_proj.weight.copy_(old_ffn.gate_proj.weight[:expert_ffn_size])
            first.up_proj.weight.copy_(old_ffn.up_proj.weight[:expert_ffn_size])
            first.down_proj.weight.copy_(old_ffn.down_proj.weight[:, :expert_ffn_size])

        layer.mlp = new_moe.to(device=ref.device, dtype=ref.dtype)

    return dense_model

moe_model = convert_dense_to_moe(
    model, num_experts=8, top_k=1, layers_to_convert=[6, 12, 18, 23]
)

7.5 Training

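import json

import torch
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

device = "cuda" if torch.cuda.is_available() else "cpu"
moe_model.to(device)

class PromptDataset(Dataset):
    def __init__(self, prompts_file):
        with open(prompts_file) as f:
            self.prompts = json.load(f)["prompts"]

    def __len__(self):
        return len(self.prompts)

    def __getitem__(self, idx):
        encoded = tokenizer(
            self.prompts[idx]["prompt"],
            max_length=512, padding="max_length", truncation=True, return_tensors="pt"
        )
        return {
            "input_ids": encoded["input_ids"].squeeze(0),
            "attention_mask": encoded["attention_mask"].squeeze(0),
        }

dataset = PromptDataset("prompts.json")
dataloader = DataLoader(dataset, batch_size=4, shuffle=True)

# Router weights only. A plain "gate" substring test would also match the
# gate_proj weights of unconverted dense layers, so match the router name exactly.
router_params = [p for n, p in moe_model.named_parameters() if n.endswith("mlp.gate.weight")]
optimizer = optim.AdamW(router_params, lr=1e-3, weight_decay=0.01)

moe_model.train()
for batch in dataloader:
    input_ids = batch["input_ids"].to(device)
    attention_mask = batch["attention_mask"].to(device)
    # Exclude padding positions from the language-modeling loss
    labels = input_ids.masked_fill(attention_mask == 0, -100)
    outputs = moe_model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
    loss = outputs.loss  # + aux_loss from MoE layers
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(router_params, max_norm=1.0)
    optimizer.step()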

7.6 Save and export

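moe_model.save_pretrained("./tinymoe-qwen-0.6b-8e")
tokenizer.save_pretrained("./tinymoe-qwen-0.6b-8e")
# Note: the saved config still describes dense Qwen3, so from_pretrained alone will not rebuild the MoE layers.
# To reload, apply convert_dense_to_moe to the base model first, then load these weights into it.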

7.7 Evaluation

import time

import torch
from tqdm import tqdm

def evaluate(model, tokenizer, prompts):
    model.eval()
    results = []
    for item in tqdm(prompts):
        inputs = tokenizer(item["prompt"], return_tensors="pt", truncation=True, max_length=512)
        inputs = inputs.to(model.device)  # keep inputs on the same device as the model
        start = time.perf_counter()
        with torch.no_grad():
            outputs = model.generate(**inputs, max_new_tokens=150, do_sample=False)
        latency_ms = (time.perf_counter() - start) * 1000
        # Strip the prompt tokens so only the generated continuation is decoded
        completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        num_tokens = outputs[0].shape[0] - inputs["input_ids"].shape[1]
        results.append({
            "category": item["category"],
            "completion": completion,  # kept for accuracy scoring
            "latency_ms": latency_ms,
            "tokens": num_tokens,
            "tps": num_tokens / (latency_ms / 1000),
        })
    return results
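
evaluate() returns one record per prompt. A small helper can roll those records up per category, which is the granularity the easy-versus-hard prompt comparison needs. This is a sketch; the function name and the output keys are hypothetical, and it relies only on the "category", "latency_ms", and "tps" fields produced above.

```python
from collections import defaultdict
from statistics import mean

def summarize(results):
    """Average latency and tokens per second for each prompt category."""
    by_category = defaultdict(list)
    for record in results:
        by_category[record["category"]].append(record)
    return {
        category: {
            "n": len(rows),
            "mean_latency_ms": mean(r["latency_ms"] for r in rows),
            "mean_tps": mean(r["tps"] for r in rows),
        }
        for category, rows in by_category.items()
    }
```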

8. Experimental design

Models to compare

| Model | Parameters | Active | Architecture |
| --- | --- | --- | --- |
| Qwen3-0.6B-q4f32_1 (baseline) | 600M | 600M | Dense, quantized |
| Qwen3-0.6B-q0f32 (baseline) | 600M | 600M | Dense, full precision |
| TinyMoE-Qwen-0.6B-4E | ~900M | ~450M | 4 experts, top-1 |
| TinyMoE-Qwen-0.6B-8E | ~1.2B | ~600M | 8 experts, top-1 |
| TinyMoE-Qwen-0.6B-8E-k2 | ~1.2B | ~750M | 8 experts, top-2 |

Metrics

| Metric | Description | Target |
| --- | --- | --- |
| Accuracy | % correct completions | >90% of q0f32 |
| Latency | Avg generation time (ms) | <30ms per prompt |
| Memory | Peak RSS | <3GB |
| Energy | Per 100 prompts (mJ) | <400 mJ |
| TPS | Tokens per second | >15 |
| Expert utilization | % experts used | >80% |
| Load balance | Gini index | <0.3 |
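
The two routing metrics can be computed from a list of per-expert token counts collected during evaluation. A minimal sketch, assuming those counts are already available; the function names are hypothetical. The Gini index is 0 for perfectly even routing and approaches (n-1)/n when one of n experts receives every token.

```python
def expert_utilization(token_counts):
    """Fraction of experts that received at least one token."""
    return sum(1 for count in token_counts if count > 0) / len(token_counts)

def gini(token_counts):
    """Gini index of per-expert token counts (0 = perfectly balanced)."""
    n = len(token_counts)
    total = sum(token_counts)
    if total == 0:
        return 0.0
    ordered = sorted(token_counts)
    # Formula on ascending values x_1..x_n: G = 2 * sum(i * x_i) / (n * total) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(ordered, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n
```

With four experts, counts of [1, 1, 1, 1] give a Gini index of 0 and full utilization, while [0, 0, 0, 4] gives 0.75 and 25% utilization.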

Ablation studies

  1. Expert count: 4 vs 8 vs 16
  2. Top-k: k=1 vs k=2 vs adaptive
  3. Initialization: random vs copied from dense
  4. Shared experts: 0 vs 1 vs 2
  5. Quantization: q4f32_1 on routed experts vs all experts

9. Expected results

| Model | Accuracy | Latency | Memory | Energy |
| --- | --- | --- | --- | --- |
| Qwen3-0.6B-q4f32_1 | 75% | 20ms | 1.5GB | 300mJ |
| Qwen3-0.6B-q0f32 | 95% | 45ms | 3GB | 800mJ |
| TinyMoE-8E-k1 | 88% | 25ms | 2GB | 400mJ |
| TinyMoE-8E-k2 | 92% | 35ms | 2.5GB | 550mJ |

Success criteria for TinyMoE-8E-k1:

  • Accuracy within 5% of q0f32 baseline (target: >90%)
  • Latency within 10ms of q4f32_1 baseline (target: <30ms)
  • Memory <3GB (iOS compatible)
  • Energy <500mJ per 100 prompts

Quick reference

Optimal configuration for 600M-1B base models:

Expert count: 8 (6-7 routed, 1-2 shared)
Top-k: 1 (adaptive to 2 for hard prompts)
Capacity factor: 1.0-1.2
Expert FFN size: 50-75% of dense FFN
Router LR: 1e-3 (frozen experts), 1e-5 (fine-tuning all)
Aux loss weight: 0.01
Z-loss weight: 0.001
Batch size: 4-16 (training), 1 (mobile inference)

Troubleshooting

  • Expert collapse (only 1-2 experts used): Increase aux_loss_weight to 0.05-0.1, add weight_decay=0.01.
  • CUDA OOM: Use gradient checkpointing, reduce batch size to 2-4, set capacity_factor=1.0.
  • Poor accuracy after upcycling: Train router longer (10-15 epochs), fine-tune all parameters at LR 1e-5.
  • Slow inference on iOS: Reduce num_experts to 4, use top-k=1, implement aggressive expert caching.

References

MoE fundamentals

  1. Switch Transformers - Fedus et al., JMLR 2022
  2. Outrageously Large Neural Networks - Shazeer et al., 2017

Upcycling

  1. Upcycling LLMs to MoE - 2024
  2. Parameter-Efficient Sparsity Crafting - EMNLP 2024
  3. Scaling Laws for Upcycling MoE

Tiny MoE models

  1. OLMoE - 2024
  2. Phi-4-Mini Technical Report - 2025

Routing

  1. Expert Choice Routing - NeurIPS 2022
  2. LExI: Layer-Adaptive Active Experts - 2025
  3. Optimizing MoE Routers - 2025

Memory optimization

  1. Mixture of Cache-Conditional Experts - 2024
  2. EdgeMoE - IEEE TMC 2025
  3. MoEtion: Sparse Checkpointing - 2024

Training

  1. On Implementing Load Balancing Loss - ACL 2025
  2. MegaScale-MoE - 2025
  3. PyTorch: Training MoEs - 2024

Surveys

  1. A Comprehensive Survey of MoE - 2025
  2. MoE for LLMs Survey - 2024

Tutorials

  1. nanoMoE: Building MoE from Scratch
  2. HuggingFace: MoE Explained
  3. A Visual Guide to MoE
