TinyMoE: Sub-3B Mixture-of-Experts for Mobile Devices

Date: January 26, 2026

Focus: Implementing efficient MoE architectures for models under 3B parameters, targeting resource-constrained environments (edge devices, mobile, laptops).

1. Motivation

Most MoE research targets large models (7B+). There is little work on MoE for sub-3B models designed for mobile deployment. This guide addresses that gap.

Target architecture:

Base model: Qwen3-0.6B
MoE variant: TinyMoE-Qwen-0.6B-8E
Total parameters: ~1.2B (2x base)
Active parameters: ~600M at top-1, ~750M at top-2 (1.0-1.25x base)
Expert count: 8
Top-k: 1-2 (adaptive)
Shared experts: 1

Research questions:

What is the optimal expert count for sub-3B models? Hypothesis: 4-8.
How does top-k selection affect performance? Hypothesis: top-1 is sufficient for tiny models.
Can upcycled MoE match dense model quality at 1.5x parameters? Hypothesis: yes, with proper router training.
What memory optimizations are needed for resource-constrained deployment? Hypothesis: expert swapping and cache-aware routing.

How this differs from existing work:

Aspect	Existing work	This work
Model size	7B-70B	600M-3B
Platform	Server-side GPU	Edge, mobile
Expert count	16-128	4-8
Training	From scratch	Upcycle dense to MoE
Routing	Top-2, fixed	Adaptive top-k
Memory	Abundant GPU memory	Memory-efficient design

2. Architecture design

2.1 Expert count

For models under 3B parameters, the optimal expert count is lower than for large models:

Model size	Large MoE experts	Tiny MoE experts	Reason
50B-70B	64-128	N/A	Sufficient capacity for specialization
7B-15B	16-32	N/A	Balance specialization and efficiency
1B-3B	N/A	4-8	Limited capacity, memory constraints
<1B	N/A	2-4	Minimal viable specialization

Recommendation: 8 experts for 600M-1B models (6-7 routed, 1-2 shared). Total parameters land around 1.2-1.5B.

Sources: OLMoE-1B-7B, MoE Scaling Laws.

2.2 Top-k routing

For tiny models, fewer active experts are better:

Top-k	Large models (7B+)	Tiny models (<3B)	Reason
1	Quality loss	Optimal	Faster, less memory
2	Standard	Acceptable	Balance quality and speed
4	High quality	Overkill	Memory overflow

Recommendation: adaptive top-k. Use k=1 for easy prompts (factual, creative), k=2 for hard prompts (reasoning, math). Force k=1 when battery is low or memory is constrained.

Sources: Expert Choice Routing, LExI.
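
The adaptive policy above can be sketched as a small selection function. This is a minimal illustration rather than the project's implementation: the name `choose_top_k`, the prompt-type labels, and the 20% battery threshold are assumptions introduced here. Only the k=1/k=2 split and the low-battery and low-memory overrides come from the recommendation.

```python
# Hypothetical sketch of the adaptive top-k rule; names and thresholds are assumptions.
HARD_PROMPT_TYPES = {"reasoning", "math"}  # prompts that get k=2


def choose_top_k(prompt_type, battery_level=1.0, memory_constrained=False,
                 low_battery_threshold=0.2):
    """Return the number of routed experts to activate for one prompt."""
    # Device constraints override prompt difficulty.
    if memory_constrained or battery_level < low_battery_threshold:
        return 1
    if prompt_type in HARD_PROMPT_TYPES:
        return 2
    # Easy (factual, creative) and unknown prompt types use the cheaper setting.
    return 1


print(choose_top_k("math"))                      # 2
print(choose_top_k("math", battery_level=0.1))   # 1
print(choose_top_k("creative"))                  # 1
```

How a prompt gets its type label (a classifier, a heuristic, or user input) is left open here; the sketch only covers the decision once that label is known.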

2.3 Expert size

Each expert should be smaller than the original dense FFN:

For Qwen3-0.6B:
- Base FFN dimension (intermediate_size): 3072
- 8 experts, expansion factor 1.75
- Expert FFN size: ~672-768 (3072 x 1.75 / 8 = 672; 768 at expansion factor 2.0)

When upcycling, keep the FFN dimension similar to the base model and distribute across experts via routing.

Sources: DeepSeekMoE, Switch Transformer.

2.4 Shared experts

Shared experts are always active and capture common linguistic patterns. This reduces redundancy between routed experts by 30-50%.

Output = Shared_Experts(Input) + sum_i gate_i * Routed_Expert_i(Input), where i runs over the top-k routed experts selected for the token

Recommendation: 1-2 shared experts (10-20% of capacity), 6-7 routed experts (80-90%).

Source: DeepSeekMoE: Shared Experts.
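
The combination rule can be illustrated with a toy example in which each "expert" is a plain function on a float. This is only a sketch: real experts are FFN modules, and the function name `moe_output` and all numbers below are made up for illustration. Routed experts are weighted by their gate score, as in the MoE layer of section 7.3.

```python
# Toy illustration with scalar "experts"; real experts are FFN modules.
def moe_output(x, shared_experts, routed_experts, gate_weights, top_k=1):
    """Sum of all shared experts plus the gate-weighted top-k routed experts."""
    shared = sum(expert(x) for expert in shared_experts)
    # Indices of the top_k largest gate weights.
    selected = sorted(range(len(gate_weights)),
                      key=lambda i: gate_weights[i], reverse=True)[:top_k]
    routed = sum(gate_weights[i] * routed_experts[i](x) for i in selected)
    return shared + routed


shared = [lambda x: 0.5 * x]                                   # always active
routed = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x]  # selected by the router
gates = [0.1, 0.7, 0.2]                                        # router scores for one token

# top-1: shared gives 1.0, the best routed expert gives 0.7 * 4.0
print(moe_output(2.0, shared, routed, gates, top_k=1))
# top-2: additionally adds 0.2 * (-2.0)
print(moe_output(2.0, shared, routed, gates, top_k=2))
```

The shared term does not depend on the gate scores, which is why a shared expert sees every token regardless of routing.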

3. Implementation approach

3.1 Approach comparison

Approach	Pros	Cons	Feasibility	Time
Train from scratch	Full control	Expensive, needs data	Low	6-8 months
Upcycle dense to MoE	Reuses weights, minimal data	Sub-optimal init	High	2-3 months
Merge existing models	Fast, no training	Quality mismatch	Medium	1 month
Adapter-based MoE	Very fast, parameter-efficient	Limited capacity	High	1-2 months

3.2 Upcycling process

Upcycling reuses pre-trained dense weights, requires minimal training data, and converges faster than training from scratch.

Step 1: Analyze dense model (identify FFN layers, extract dimensions)
Step 2: Create N expert copies of each FFN, initialize router, add shared experts
Step 3: Train router only (experts frozen), monitor load balancing
Step 4: Fine-tune experts (optional, low learning rate, 1-2 epochs)
Step 5: Quantize and optimize for mobile

Source: Upcycling LLMs to MoE.

4. Training

4.1 Load balancing loss

Without regularization, the router collapses to using 1-2 experts. An auxiliary load balancing loss encourages uniform expert utilization:

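import torch
import torch.nn.functional as F

def load_balance_loss(gate_scores, num_tokens):
    # Switch Transformer auxiliary loss: num_experts * sum_i(f_i * P_i).
    # gate_scores: router logits of shape (num_tokens, num_experts).
    num_experts = gate_scores.size(1)
    probs = F.softmax(gate_scores, dim=-1)
    # f_i: fraction of tokens whose top-1 expert is i (not differentiable).
    f = F.one_hot(probs.argmax(dim=-1), num_experts).float().sum(dim=0) / num_tokens
    # P_i: mean router probability for expert i (this term carries the gradient).
    p = probs.sum(dim=0) / num_tokens
    return num_experts * torch.sum(f * p)

def total_loss(model_loss, gate_scores, num_tokens, aux_loss_weight=0.01):
    return model_loss + aux_loss_weight * load_balance_loss(gate_scores, num_tokens)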

Source: Switch Transformer.
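
To see what the auxiliary loss rewards, here is a worked example of the Switch Transformer formulation, N * sum_i(f_i * P_i), where f_i is the fraction of tokens dispatched to expert i and P_i is the mean router probability for expert i. It uses plain Python lists and hand-written probabilities. The function name and the inputs are illustrative assumptions, not part of the training code.

```python
# Worked example of N * sum_i(f_i * P_i) on hand-written router probabilities.
def switch_load_balance_loss(router_probs):
    """router_probs: one list of per-expert probabilities per token."""
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    # f_i: fraction of tokens whose top-1 expert is i.
    f = [0.0] * num_experts
    for probs in router_probs:
        f[probs.index(max(probs))] += 1.0 / num_tokens
    # P_i: mean router probability assigned to expert i.
    p = [sum(probs[i] for probs in router_probs) / num_tokens
         for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))


# Four tokens, four experts: each token prefers a different expert.
balanced = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1],
            [0.1, 0.1, 0.7, 0.1], [0.1, 0.1, 0.1, 0.7]]
# Four tokens that all prefer expert 0.
collapsed = [[0.7, 0.1, 0.1, 0.1]] * 4

print(switch_load_balance_loss(balanced))   # about 1.0, the minimum under uniform routing
print(switch_load_balance_loss(collapsed))  # about 2.8, penalizing the collapse
```

With uniform routing the loss sits at 1.0 regardless of the expert count, which makes the 0.01 weight easier to reason about across configurations.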

4.2 Router z-loss

Penalizes large router logits to prevent numerical instability:

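def router_z_loss(router_logits):
    # Mean over tokens of the squared log-sum-exp of the router logits.
    # Returned unweighted; the 0.001 coefficient is applied in the combined loss.
    return torch.logsumexp(router_logits, dim=-1).square().mean()

Combined loss: model_loss + 0.01 * lb_loss + 0.001 * z_loss.

Source: ST-MoE (Zoph et al., 2022).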
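
One common formulation of the z-loss, used in ST-MoE, penalizes the squared log-sum-exp of each token's router logits. The pure-Python sketch below shows how the penalty grows with the logit scale even when the ranking of experts is unchanged. The function and the sample logits are illustrative assumptions.

```python
import math


def router_z_loss(router_logits):
    """Mean over tokens of logsumexp(logits)**2, for a list of per-token logit lists."""
    total = 0.0
    for logits in router_logits:
        m = max(logits)  # subtract the max for numerical stability
        lse = m + math.log(sum(math.exp(v - m) for v in logits))
        total += lse ** 2
    return total / len(router_logits)


small = [[0.1, -0.2, 0.0, 0.3]]
large = [[10.0, -20.0, 0.0, 30.0]]  # same expert ranking, 100x the scale

print(router_z_loss(small))
print(router_z_loss(large))
print(router_z_loss(small) < router_z_loss(large))  # True
```

Because the top-1 choice is identical in both cases, the penalty discourages large logits without directly changing which expert is picked.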

4.3 Training schedule

Phase 1: Router training (experts frozen)

learning_rate = 1e-3
batch_size = 32
epochs = 5  # typical range: 5-10

# Freeze everything except the routers (the `gate` linear layers from section 7.3).
for name, param in model.named_parameters():
    param.requires_grad = name.endswith(".mlp.gate.weight")

Phase 2: Full fine-tuning (optional)

learning_rate = 1e-5
batch_size = 16
epochs = 1  # typical range: 1-2
weight_decay = 0.01

for param in model.parameters():
    param.requires_grad = True

# Inside the training loop, after loss.backward() and before optimizer.step():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

5. Memory optimization

5.1 Expert swapping

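Not all experts may fit in device memory at once. Load them on demand from storage and keep the most recently used ones in an LRU cache:

from collections import OrderedDict
import torch

class ExpertManager:
    def __init__(self, expert_paths, cache_size=2):
        # expert_paths: expert_id -> file written with torch.save(expert_module, path)
        self.expert_paths = expert_paths
        self.cache_size = cache_size
        self.cache = OrderedDict()  # least recently used entry first

    def get_expert(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)  # mark as most recently used
            return self.cache[expert_id]
        if len(self.cache) >= self.cache_size:
            # Inference weights are read-only, so eviction only drops the module.
            self.cache.popitem(last=False)
        expert = torch.load(self.expert_paths[expert_id], map_location="cpu", weights_only=False)
        self.cache[expert_id] = expert
        return expert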

Sources: Mixture of Cache-Conditional Experts, EdgeMoE.
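
The effect of the cache size can be estimated offline by replaying a routing trace and counting how often an expert would have to be loaded from storage. The sketch below does this with a plain LRU policy. The trace is a made-up sequence of top-1 routing decisions, and `count_expert_loads` is a hypothetical helper, not part of the code above.

```python
from collections import OrderedDict


def count_expert_loads(routing_trace, cache_size):
    """Count storage loads for a sequence of expert ids under an LRU cache."""
    cache = OrderedDict()
    loads = 0
    for expert_id in routing_trace:
        if expert_id in cache:
            cache.move_to_end(expert_id)   # hit: mark as most recently used
            continue
        loads += 1                         # miss: load from storage
        if len(cache) >= cache_size:
            cache.popitem(last=False)      # evict the least recently used
        cache[expert_id] = True
    return loads


# Hypothetical top-1 routing decisions for ten consecutive tokens.
trace = [0, 0, 3, 0, 3, 3, 5, 0, 3, 5]
for size in (1, 2, 3):
    print(size, count_expert_loads(trace, size))
# 1 8
# 2 6
# 3 3
```

On this trace a cache of three experts removes all reloads after warm-up, while a cache of two still thrashes once a third expert enters the rotation. That gap is the motivation for cache-aware routing.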

5.2 Capacity factor

Standard capacity factor is 1.25-1.5. For mobile, use 1.0-1.2 (stricter) because memory is limited and batch size is typically 1.

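import math

def calculate_capacity(num_tokens, num_experts, capacity_factor=1.0):
    # Round up, minimum 1: truncation gives 0 slots when num_tokens < num_experts.
    return max(1, math.ceil(num_tokens * capacity_factor / num_experts))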
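
A stricter capacity factor saves memory at the cost of dropped tokens when routing is uneven. The sketch below counts how many tokens overflow for a given assignment. It rounds the capacity up and enforces a minimum of one slot, which is an assumption of this sketch. The assignments are made up for illustration.

```python
import math


def dropped_tokens(expert_assignments, num_experts, capacity_factor):
    """Return (capacity, number of tokens that exceed their expert's capacity)."""
    num_tokens = len(expert_assignments)
    capacity = max(1, math.ceil(num_tokens * capacity_factor / num_experts))
    used = [0] * num_experts
    dropped = 0
    for expert_id in expert_assignments:
        if used[expert_id] < capacity:
            used[expert_id] += 1
        else:
            dropped += 1   # expert is full: this token would be skipped
    return capacity, dropped


# Hypothetical top-1 assignments for 8 tokens over 4 experts (expert 0 is popular).
assignments = [0, 0, 0, 1, 2, 0, 3, 1]
print(dropped_tokens(assignments, num_experts=4, capacity_factor=1.0))  # (2, 2)
print(dropped_tokens(assignments, num_experts=4, capacity_factor=1.5))  # (3, 1)
```

With well-balanced routing the two settings behave the same, so the lower factor is only safe once the load balancing loss has done its job.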

5.3 Gradient checkpointing

Reduces training memory by 40-60% by recomputing activations during backward pass instead of storing them:

from torch.utils.checkpoint import checkpoint

# Recompute the expert's activations in the backward pass instead of storing them.
output = checkpoint(expert, x, use_reentrant=False)

Source: MoEtion.

6. Tools and frameworks

Open source implementations

Repository	Features	Model size	License
OLMoE	Upcycling, training, inference	1B-7B	Apache 2.0
MoE in PyTorch	Clean implementation, aux losses	Any	MIT
DeepSeek-MoE	Shared experts, production	16B	MIT
Parameter-Efficient-MoE	Upcycling, EMNLP 2024	Any	Academic

Inference

HuggingFace Transformers for general inference. vLLM for fast batch inference testing.

7. Step-by-step implementation

7.1 Prerequisites

Hardware: GPU with 16GB+ VRAM (training), CPU with 8GB+ RAM (inference)
Software: Python 3.9+, PyTorch 2.0+, Transformers 4.35+
Data: 10K-100K prompts for training

7.2 Load base model

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-0.6B"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

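Qwen3-0.6B architecture (from the published config): hidden size 1024, FFN dimension 3072, 28 layers, 16 attention heads. The same values are available at runtime as model.config.hidden_size, model.config.intermediate_size, model.config.num_hidden_layers, and model.config.num_attention_heads.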

7.3 Define MoE layer

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    # Same layout as the Qwen MLP (gate_proj, up_proj, down_proj, no biases),
    # so dense FFN weights can be copied into an expert directly.
    def __init__(self, hidden_size, ffn_size):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, ffn_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, ffn_size, bias=False)
        self.down_proj = nn.Linear(ffn_size, hidden_size, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

class MoEFeedForward(nn.Module):
    def __init__(self, hidden_size, ffn_size, num_experts, top_k=1):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k

        self.gate = nn.Linear(hidden_size, num_experts, bias=False)  # router
        self.experts = nn.ModuleList(
            [SwiGLUExpert(hidden_size, ffn_size) for _ in range(num_experts)]
        )

    def forward(self, x):
        batch_size, seq_len, hidden_size = x.shape
        x_flat = x.reshape(-1, hidden_size)

        gate_scores = self.gate(x_flat)
        # Softmax over all experts before top-k (Switch-style gating). A softmax over
        # only the selected scores is always 1 for top_k=1 and gives the router no gradient.
        probs = F.softmax(gate_scores, dim=-1, dtype=torch.float32)
        top_k_scores, top_k_indices = torch.topk(probs, self.top_k, dim=-1)
        top_k_scores = top_k_scores.to(x.dtype)

        output = torch.zeros_like(x_flat)
        for k in range(self.top_k):
            expert_idx = top_k_indices[:, k]
            score = top_k_scores[:, k:k+1]
            for expert_id in range(self.num_experts):
                mask = (expert_idx == expert_id)
                if mask.any():
                    output[mask] += score[mask] * self.experts[expert_id](x_flat[mask])

        return output.reshape(batch_size, seq_len, hidden_size)

    def load_balance_loss(self, gate_scores):
        # Switch Transformer auxiliary loss: num_experts * sum_i(f_i * P_i).
        # gate_scores: router logits of shape (num_tokens, num_experts).
        probs = F.softmax(gate_scores, dim=-1)
        f = F.one_hot(probs.argmax(dim=-1), self.num_experts).float().mean(dim=0)
        return self.num_experts * torch.sum(f * probs.mean(dim=0))

7.4 Convert dense model to MoE

def convert_dense_to_moe(dense_model, num_experts=8, top_k=1, layers_to_convert="all",
                         copy_to_all_experts=True):
    config = dense_model.config
    hidden_size = config.hidden_size
    # Experts keep the dense FFN width (section 2.3) so the pre-trained weights
    # can be copied without reshaping.
    expert_ffn_size = config.intermediate_size

    layer_indices = (
        range(len(dense_model.model.layers))
        if layers_to_convert == "all"
        else layers_to_convert
    )

    for layer_idx in layer_indices:
        layer = dense_model.model.layers[layer_idx]
        old_ffn = layer.mlp

        new_moe = MoEFeedForward(hidden_size, expert_ffn_size, num_experts, top_k)

        # Randomly initialize the router and all experts first.
        nn.init.xavier_uniform_(new_moe.gate.weight)
        for param in new_moe.experts.parameters():
            nn.init.xavier_uniform_(param)

        # Copy the dense FFN into every expert (section 3.2, step 2), or only into
        # the first expert for the random-initialization ablation.
        targets = new_moe.experts if copy_to_all_experts else new_moe.experts[:1]
        for expert in targets:
            expert.load_state_dict(old_ffn.state_dict())

        # Match the dtype and device of the layer being replaced.
        ref = old_ffn.down_proj.weight
        layer.mlp = new_moe.to(device=ref.device, dtype=ref.dtype)

    return dense_model

moe_model = convert_dense_to_moe(
    model, num_experts=8, top_k=1, layers_to_convert=[6, 12, 18, 23]
)

7.5 Training

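import json
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

class PromptDataset(Dataset):
    def __init__(self, prompts_file):
        with open(prompts_file) as f:
            self.prompts = json.load(f)["prompts"]

    def __len__(self):
        return len(self.prompts)

    def __getitem__(self, idx):
        encoded = tokenizer(
            self.prompts[idx]["prompt"],
            max_length=512, padding="max_length", truncation=True, return_tensors="pt"
        )
        input_ids = encoded["input_ids"].squeeze(0)
        attention_mask = encoded["attention_mask"].squeeze(0)
        labels = input_ids.clone()
        labels[attention_mask == 0] = -100  # exclude padding from the loss
        return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

device = "cuda" if torch.cuda.is_available() else "cpu"
moe_model.to(device).train()

dataset = PromptDataset("prompts.json")
dataloader = DataLoader(dataset, batch_size=4, shuffle=True)

# Train only the MoE routers. Matching on "gate" alone would also select the
# gate_proj weights of the dense layers that were not converted.
for n, p in moe_model.named_parameters():
    p.requires_grad = n.endswith(".mlp.gate.weight")
router_params = [p for p in moe_model.parameters() if p.requires_grad]
optimizer = optim.AdamW(router_params, lr=1e-3, weight_decay=0.01)

# Record each router's logits during the forward pass for the auxiliary loss.
moe_layers = [m for m in moe_model.modules() if isinstance(m, MoEFeedForward)]
router_logits = []
for m in moe_layers:
    m.gate.register_forward_hook(lambda module, args, output: router_logits.append(output))

for batch in dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    router_logits.clear()
    outputs = moe_model(**batch)
    aux_loss = sum(m.load_balance_loss(s) for m, s in zip(moe_layers, router_logits))
    loss = outputs.loss + 0.01 * aux_loss
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(router_params, max_norm=1.0)
    optimizer.step()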

7.6 Save and export

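# The saved config still describes the dense architecture, so from_pretrained will not
# rebuild the MoE layers: re-run convert_dense_to_moe on the base model before loading these weights.
moe_model.save_pretrained("./tinymoe-qwen-0.6b-8e")
tokenizer.save_pretrained("./tinymoe-qwen-0.6b-8e")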

7.7 Evaluation

import time
from tqdm import tqdm

def evaluate(model, tokenizer, prompts):
    model.eval()
    results = []
    for item in tqdm(prompts):
        inputs = tokenizer(
            item["prompt"], return_tensors="pt", truncation=True, max_length=512
        ).to(model.device)  # keep inputs on the same device as the model
        start = time.perf_counter()
        with torch.no_grad():
            outputs = model.generate(**inputs, max_new_tokens=150, do_sample=False)
        latency_ms = (time.perf_counter() - start) * 1000
        prompt_len = inputs["input_ids"].shape[1]
        completion = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)
        num_tokens = outputs[0].shape[0] - prompt_len
        results.append({
            "category": item["category"],
            "completion": completion,  # kept so accuracy can be scored afterwards
            "latency_ms": latency_ms,
            "tokens": num_tokens,
            "tps": num_tokens / (latency_ms / 1000),
        })
    return results
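
The per-prompt rows returned by evaluate can be aggregated by category before comparing models. The helper below is a minimal sketch: `summarize_by_category` is a hypothetical name, and the example rows are made up to show the expected shape, not measured values.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_category(results):
    """Average latency and throughput per prompt category."""
    grouped = defaultdict(list)
    for row in results:
        grouped[row["category"]].append(row)
    return {
        category: {
            "n": len(rows),
            "mean_latency_ms": mean(r["latency_ms"] for r in rows),
            "mean_tps": mean(r["tps"] for r in rows),
        }
        for category, rows in grouped.items()
    }


# Made-up rows in the shape returned by evaluate(); not measured values.
example = [
    {"category": "math", "latency_ms": 400.0, "tokens": 20, "tps": 50.0},
    {"category": "math", "latency_ms": 600.0, "tokens": 30, "tps": 50.0},
    {"category": "factual", "latency_ms": 200.0, "tokens": 10, "tps": 50.0},
]
summary = summarize_by_category(example)
print(summary["math"])
print(summary["factual"])
```

Splitting by category matters for the adaptive top-k comparison, since k=2 is only expected to help on the harder categories.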

8. Experimental design

Models to compare

Model	Parameters	Active	Architecture
Qwen3-0.6B-q4f32_1 (baseline)	600M	600M	Dense, quantized
Qwen3-0.6B-q0f32 (baseline)	600M	600M	Dense, full precision
TinyMoE-Qwen-0.6B-4E	~900M	~450M	4 experts, top-1
TinyMoE-Qwen-0.6B-8E	~1.2B	~600M	8 experts, top-1
TinyMoE-Qwen-0.6B-8E-k2	~1.2B	~750M	8 experts, top-2

Metrics

Metric	Description	Target
Accuracy	% correct completions	>90% of q0f32
Latency	Avg generation time (ms)	<30ms per prompt
Memory	Peak RSS	<3GB
Energy	Per 100 prompts (mJ)	<400 mJ
TPS	Tokens per second	>15
Expert utilization	% experts used	>80%
Load balance	Gini index	<0.3
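
The two routing metrics in the table can be computed from per-expert token counts collected during evaluation. The sketch below assumes such counts are available as a list with one entry per expert. The function names and sample counts are illustrative.

```python
def expert_utilization(token_counts):
    """Percentage of experts that received at least one token."""
    return 100.0 * sum(1 for c in token_counts if c > 0) / len(token_counts)


def gini_index(token_counts):
    """Gini index of per-expert token counts: 0 means perfectly balanced."""
    n = len(token_counts)
    total = sum(token_counts)
    if total == 0:
        return 0.0
    ordered = sorted(token_counts)
    # Formula on ascending values x_1..x_n: sum_i (2i - n - 1) * x_i / (n * total).
    weighted = sum((2 * (i + 1) - n - 1) * x for i, x in enumerate(ordered))
    return weighted / (n * total)


balanced = [125] * 8                      # hypothetical counts for 8 experts
collapsed = [1000, 0, 0, 0, 0, 0, 0, 0]   # every token routed to one expert

print(expert_utilization(balanced), gini_index(balanced))    # 100.0 0.0
print(expert_utilization(collapsed), gini_index(collapsed))  # 12.5 0.875
```

With 8 experts the Gini index tops out at 0.875 under full collapse, so the <0.3 target leaves room for moderate imbalance.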

Ablation studies

Expert count: 4 vs 8 vs 16
Top-k: k=1 vs k=2 vs adaptive
Initialization: random vs copied from dense
Shared experts: 0 vs 1 vs 2
Quantization: q4f32_1 on routed experts vs all experts

9. Expected results

Model	Accuracy	Latency	Memory	Energy
Qwen3-0.6B-q4f32_1	75%	20ms	1.5GB	300mJ
Qwen3-0.6B-q0f32	95%	45ms	3GB	800mJ
TinyMoE-8E-k1	88%	25ms	2GB	400mJ
TinyMoE-8E-k2	92%	35ms	2.5GB	550mJ

Success criteria for TinyMoE-8E-k1:

Accuracy within 5 percentage points of the q0f32 baseline (target: >90%)
Latency within 10ms of q4f32_1 baseline (target: <30ms)
Memory <3GB (iOS compatible)
Energy <500mJ per 100 prompts

Quick reference

Optimal configuration for 600M-1B base models:

Expert count: 8 (6-7 routed, 1-2 shared)
Top-k: 1 (adaptive to 2 for hard prompts)
Capacity factor: 1.0-1.2
Expert FFN size: 50-75% of dense FFN
Router LR: 1e-3 (frozen experts), 1e-5 (fine-tuning all)
Aux loss weight: 0.01
Z-loss weight: 0.001
Batch size: 4-16 (training), 1 (mobile inference)

Troubleshooting

Expert collapse (only 1-2 experts used): Increase aux_loss_weight to 0.05-0.1, add weight_decay=0.01.
CUDA OOM: Use gradient checkpointing, reduce batch size to 2-4, set capacity_factor=1.0.
Poor accuracy after upcycling: Train router longer (10-15 epochs), fine-tune all parameters at LR 1e-5.
Slow inference on iOS: Reduce num_experts to 4, use top-k=1, implement aggressive expert caching.

References

MoE fundamentals

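Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity - Fedus et al., JMLR 2022
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer - Shazeer et al., ICLR 2017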
