SmallDoge models feature innovative architectural components designed for efficiency and performance:
- Dynamic Mask Attention (DMA) - Efficient attention mechanism for long sequences
- Cross Domain Mixture of Experts (CDMoE) - Sparse experts with dense-to-sparse continuation training
- WSD Scheduler - Warmup-Stable-Decay learning-rate schedule for seamless checkpoint resumption (a minimal schedule sketch follows this list)
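The WSD schedule holds the learning rate constant after warmup, so pre-training can be stopped at any stable-phase checkpoint and resumed without re-warming; only the final decay phase depends on the total step budget. Below is a minimal sketch of such a schedule as a PyTorch `LambdaLR`; the function name, step counts, and cosine-shaped decay are illustrative assumptions, not the project's exact implementation.

```python
import math
import torch
from torch.optim.lr_scheduler import LambdaLR

def get_wsd_schedule(optimizer, warmup_steps, stable_steps, decay_steps, min_lr_ratio=0.0):
    """Warmup-Stable-Decay: linear warmup, constant plateau, cosine decay."""
    def lr_lambda(step):
        if step < warmup_steps:                        # linear warmup
            return step / max(1, warmup_steps)
        if step < warmup_steps + stable_steps:         # stable plateau
            return 1.0
        progress = (step - warmup_steps - stable_steps) / max(1, decay_steps)
        cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
        return min_lr_ratio + (1.0 - min_lr_ratio) * cosine
    return LambdaLR(optimizer, lr_lambda)

# Example: 1k warmup, 8k stable, 1k decay steps on a dummy parameter
optimizer = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=8e-3)
scheduler = get_wsd_schedule(optimizer, warmup_steps=1_000, stable_steps=8_000, decay_steps=1_000)
```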
Pre-trained foundation models for general-purpose language understanding:
| Model | Parameters | Speed (i7-11 CPU) | MMLU | HuggingFace |
|---|---|---|---|---|
| Doge-20M | 20M | 142 tok/s | 25.4 | 🤗 View Card |
| Doge-60M | 60M | 62 tok/s | 26.4 | 🤗 View Card |
| Doge-160M | 160M | 28 tok/s | 29.2 | 🤗 View Card |
| Doge-320M | 320M | 16 tok/s | 33.8 | 🤗 View Card |
Chat-optimized models fine-tuned for conversation and instruction following (a chat-template usage example follows the table):
| Model | Base Model | Training | HuggingFace |
|---|---|---|---|
| Doge-20M-Instruct | Doge-20M | SFT + DPO | 🤗 View Card |
| Doge-60M-Instruct | Doge-60M | SFT + DPO | 🤗 View Card |
| Doge-160M-Instruct | Doge-160M | SFT + DPO | 🤗 View Card |
| Doge-320M-Instruct | Doge-320M | SFT + DPO | 🤗 View Card |
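The `-Instruct` models expect chat-formatted prompts, so it is safest to build inputs with the tokenizer's chat template. A minimal sketch follows; the message content is a placeholder, and the presence of a chat template on these checkpoints is an assumption to verify on the model card.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "SmallDoge/Doge-60M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Format the conversation with the model's chat template
messages = [{"role": "user", "content": "Summarize what a neural network is."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)

outputs = model.generate(input_ids, max_new_tokens=128, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```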
Instruct models from the SFT-only stage of training (before DPO):
| Model | Training Stage | Base Model | HuggingFace |
|---|---|---|---|
| Doge-20M-Instruct-SFT | SFT Only | Doge-20M | 🤗 View Card |
| Doge-60M-Instruct-SFT | SFT Only | Doge-60M | 🤗 View Card |
| Doge-160M-Instruct-SFT | SFT Only | Doge-160M | 🤗 View Card |
| Doge-320M-Instruct-SFT | SFT Only | Doge-320M | 🤗 View Card |
Intermediate checkpoints for continued training with stable learning rates (a resumption sketch follows the table):
| Model | Recommended LR | Scheduler | HuggingFace |
|---|---|---|---|
| Doge-20M-checkpoint | 8e-3 | wsd_scheduler | 🤗 View Card |
| Doge-60M-checkpoint | 6e-3 | wsd_scheduler | 🤗 View Card |
| Doge-160M-checkpoint | 4e-3 | wsd_scheduler | 🤗 View Card |
| Doge-320M-checkpoint | 2e-3 | wsd_scheduler | 🤗 View Card |
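To continue pre-training from one of these checkpoints, load it as a causal LM and keep the optimizer at the learning rate recommended in the table, which corresponds to the stable phase of the original WSD schedule. A minimal setup sketch follows; the step counts and the linear-decay tail are placeholders, and the full pipeline is covered in the Training Guide.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR
from transformers import AutoModelForCausalLM

# Resume from the released stable-phase checkpoint
checkpoint = "SmallDoge/Doge-60M-checkpoint"
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True)

# Use the table's recommended learning rate (6e-3 for Doge-60M)
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-3)

# Hold the stable plateau, then decay linearly at the end of the run
stable_steps, decay_steps = 8_000, 2_000
scheduler = LambdaLR(
    optimizer,
    lambda step: 1.0 if step < stable_steps
    else max(0.0, 1.0 - (step - stable_steps) / decay_steps),
)
```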
Advanced models enhanced with reasoning capabilities through knowledge distillation:
| Model | Training Method | Capabilities | HuggingFace |
|---|---|---|---|
| Doge-160M-Reason-Distill | Knowledge Distillation + GRPO | Chain-of-thought reasoning | 🤗 View Card |
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load any model (example: instruction model)
model_name = "SmallDoge/Doge-60M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Generate text
prompt = "Explain machine learning in simple terms:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For detailed usage examples, see the individual model cards above.
- 🔬 Research & Experimentation: Start with Doge-20M for fast iteration
- 💻 Development & Prototyping: Use Doge-60M for balanced performance
- 🎯 Production Applications: Deploy Doge-160M or Doge-320M for best quality
- 💬 Chat Applications: Use `-Instruct` variants for conversation
- 🧠 Reasoning Tasks: Try Doge-160M-Reason-Distill for complex problems
- 📚 Continued Training: Use `-checkpoint` models with the specified learning rates
- CPU-only: Doge-20M (142 tok/s) or Doge-60M (62 tok/s)
- Mobile/Edge: Doge-20M with quantization (see the quantization sketch after this list)
- GPU Available: Any model, Doge-320M recommended for best results
- Memory Constrained: Doge-20M (0.5GB) or Doge-60M (1.2GB)
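For CPU-only or edge deployments, post-training quantization can cut memory use further. A minimal sketch using PyTorch dynamic int8 quantization of the linear layers follows; whether the custom Doge modules quantize cleanly this way is an assumption, so verify output quality after quantizing.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "SmallDoge/Doge-20M"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
model.eval()

# Dynamic int8 quantization: weights of nn.Linear layers are stored in int8
# and dequantized on the fly during CPU inference
quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = quantized.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```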
- Training Guide - Complete training pipeline
- Quick Start - Get started in 5 minutes
- Installation - Setup instructions
- WebUI Guide - Web interface usage
