ASR-Doge is a parameter-efficient automatic speech recognition (ASR) model that combines IBM's Granite Speech encoder with the SmallDoge language model through a lightweight MLP adapter. This project demonstrates that competitive ASR performance can be achieved by training only 0.05% of the total model parameters.
| Metric | Score |
|---|---|
| Word Error Rate (WER) | 4.70% |
| Character Error Rate (CER) | 2.75% |
| Perfect Match Rate | 46.0% |
| Real-Time Factor | 0.91x (faster than real-time!) |
| Trainable Parameters | 1.57M / 3.36B (0.05%) |
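The two headline metrics above have simple definitions. Below is a minimal, self-contained illustration of how WER and the real-time factor are computed; this is a sketch for intuition, not the project's benchmark code (which uses `jiwer`):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    r, h = ref.split(), hyp.split()
    # Standard Levenshtein dynamic programme over words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

def real_time_factor(processing_s: float, audio_s: float) -> float:
    """RTF = processing time / audio duration; < 1.0 means faster than real time."""
    return processing_s / audio_s

print(wer("the quick brown fox", "the quick brown box"))  # 0.25
print(real_time_factor(91.0, 100.0))  # 0.91
```

An RTF of 0.91 therefore means the model transcribes 100 seconds of audio in about 91 seconds.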
```text
┌─────────────────────────────────────────────────────────────┐
│                    ASR-Doge Architecture                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   Audio Input (16kHz)                                       │
│                    │                                        │
│                    ▼                                        │
│   ┌────────────────────────────────────┐                    │
│   │      Speech Encoder (FROZEN)       │                    │
│   │     IBM Granite Speech 3.3-2B      │                    │
│   │        Parameters: ~3.04B          │                    │
│   └────────────────┬───────────────────┘                    │
│                    │ [B, T, 2048]                           │
│                    ▼                                        │
│   ┌────────────────────────────────────┐                    │
│   │      MLP Adapter (TRAINABLE)       │                    │
│   │         2048 → 512 → 1024          │                    │
│   │         Parameters: 1.57M          │                    │
│   └────────────────┬───────────────────┘                    │
│                    │ [B, T, 1024]                           │
│                    ▼                                        │
│   ┌────────────────────────────────────┐                    │
│   │      Language Model (FROZEN)       │                    │
│   │           SmallDoge-320M           │                    │
│   │          Parameters: 320M          │                    │
│   └────────────────┬───────────────────┘                    │
│                    │                                        │
│                    ▼                                        │
│   Text Transcription                                        │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
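The adapter in the diagram can be sketched as a two-layer MLP. This is a hedged reconstruction from the stated dimensions only; the activation choice (GELU) and exact layout are assumptions, and the real definition lives in `src/training/train.py`:

```python
import torch
import torch.nn as nn

class SpeechAdapter(nn.Module):
    """2048 -> 512 -> 1024 projection from encoder features into the
    LM embedding space. GELU is an assumed activation choice."""
    def __init__(self, in_dim: int = 2048, hidden: int = 512, out_dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [B, T, 2048] encoder features -> [B, T, 1024] LM inputs
        return self.net(x)

adapter = SpeechAdapter()
print(sum(p.numel() for p in adapter.parameters()))  # 1574400, i.e. ~1.57M
```

The parameter count falls out of the dimensions alone: 2048×512 + 512 + 512×1024 + 1024 = 1,574,400, matching the 1.57M figure in the table above.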
- Python 3.8+
- PyTorch 2.0+
- CUDA 11.8+ (for GPU training)
- 24GB+ VRAM (40GB recommended)
```bash
# Clone the repository
git clone https://github.com/SmallDoge/asr-doge.git
cd asr-doge

# Create conda environment
conda create -n asr-doge python=3.10
conda activate asr-doge

# Install dependencies
pip install -r requirements.txt
```

```text
# requirements.txt
torch>=2.0.0
torchaudio>=2.0.0
transformers>=4.35.0
jiwer>=3.0.0
wandb>=0.15.0
tqdm>=4.65.0
numpy>=1.24.0
```

```bash
# Download LibriSpeech (train-clean-100, dev-clean, test-clean)
python scripts/download_librispeech.py --output_dir ./data
```

```bash
python src/data/data_processor.py \
    --dataset librispeech \
    --data_dir ./data \
    --output_dir ./processed \
    --train_split train-clean-100 \
    --dev_split dev-clean \
    --test_split test-clean
```

```bash
python src/training/train.py \
    --data_dir ./processed \
    --output_dir ./checkpoints \
    --batch_size 16 \
    --learning_rate 2e-3 \
    --epochs 1 \
    --patience 2 \
    --wandb_project asr-doge
```

```bash
python src/benchmark/benchmark.py \
    --checkpoint_dir ./checkpoints/best \
    --data_dir ./processed \
    --output_dir ./benchmark_results \
    --max_samples 200
```

To reproduce the exact results from our paper/TCC:

```bash
# Full training and evaluation pipeline
./scripts/reproduce.sh
```

Or step by step:

```bash
# 1. Process LibriSpeech data
python src/data/data_processor.py \
    --data_dir /path/to/librispeech \
    --output_dir ./processed

# 2. Train with our exact hyperparameters
python src/training/train.py \
    --data_dir ./processed \
    --output_dir ./checkpoints \
    --batch_size 16 \
    --learning_rate 0.002 \
    --epochs 1 \
    --patience 2 \
    --eval_steps 100 \
    --speech_encoder ibm-granite/granite-speech-3.3-2b \
    --language_model SmallDoge/Doge-320M-Checkpoint

# 3. Run comprehensive benchmark
python src/benchmark/benchmark.py \
    --checkpoint_dir ./checkpoints/best \
    --data_dir ./processed \
    --max_samples 200 \
    --num_speed_runs 10
```

```text
asr-doge/
├── README.md                     # This file
├── LICENSE                       # Apache 2.0 License
├── requirements.txt              # Python dependencies
├── src/
│   ├── data/
│   │   ├── __init__.py
│   │   └── data_processor.py     # Dataset processing utilities
│   ├── training/
│   │   ├── __init__.py
│   │   └── train.py              # Training script and model definition
│   ├── benchmark/
│   │   ├── __init__.py
│   │   └── benchmark.py          # Comprehensive benchmark script
│   ├── models/                   # ⚠️ LEGACY - see note below
│   │   └── ...
│   └── modules/                  # ⚠️ LEGACY - see note below
│       └── ...
├── scripts/
│   ├── download_librispeech.py   # Dataset download script
│   └── reproduce.sh              # Full reproduction script
├── configs/
│   └── default.yaml              # Default configuration
└── examples/
    └── inference.py              # Example inference script
```
⚠️ Note on Legacy Directories

The following directories contain experimental/legacy code from early research explorations and should be ignored:

- `src/models/` - Early Doge model experiments (`configuration_singer_doge.py`, `modeling_singer_doge.py`)
- `src/modules/` - Neural audio encoder experiments (SEANet, conv modules, LSTM variants)

These are kept for historical reference but are NOT used in the main ASR-Doge implementation. The actual model architecture is defined in `src/training/train.py`, which uses pre-trained models from HuggingFace (Granite Speech + SmallDoge).
| Parameter | Default | Description |
|---|---|---|
| `speech_encoder` | `ibm-granite/granite-speech-3.3-2b` | Pre-trained speech encoder |
| `language_model` | `SmallDoge/Doge-320M-Checkpoint` | Language model for text generation |
| `adapter_input_dim` | 2048 | Input dimension from speech encoder |
| `adapter_hidden_dim` | 512 | Hidden dimension in MLP adapter |
| `adapter_output_dim` | 1024 | Output dimension matching LM |
| Parameter | Default | Description |
|---|---|---|
| `batch_size` | 16 | Training batch size |
| `learning_rate` | 2e-3 | Peak learning rate |
| `epochs` | 1 | Number of training epochs |
| `patience` | 2 | Early stopping patience |
| `eval_steps` | 100 | Steps between evaluations |
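The tables above imply a simple parameter-efficient setup: freeze the pre-trained pieces and hand only the still-trainable parameters to the optimizer at the table's peak learning rate. A minimal sketch with stand-in modules (the module shapes and the use of AdamW are illustrative assumptions, not the project's actual `train.py`):

```python
import torch
import torch.nn as nn

# Stand-ins for the real components; shapes are illustrative only.
encoder = nn.Linear(80, 2048)    # frozen Granite Speech encoder
adapter = nn.Linear(2048, 1024)  # trainable MLP adapter
lm = nn.Linear(1024, 100)        # frozen SmallDoge LM

# Freeze everything except the adapter.
for module in (encoder, lm):
    for p in module.parameters():
        p.requires_grad = False

# The optimizer only ever sees the adapter's parameters.
trainable = [p for m in (encoder, adapter, lm)
             for p in m.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=2e-3)
print(len(trainable))  # 2: only the adapter's weight and bias
```

This is what makes the trainable-parameter ratio so small: gradients and optimizer state exist only for the adapter.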
Our model converges quickly due to the pre-trained components:
- Training Loss: 3.5 → 0.24 (over ~1,800 steps)
- Validation CER: improves steadily, with early stopping triggered after 2 evaluations without improvement (patience=2)
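The ~1,800-step figure is consistent with the dataset and batch size: LibriSpeech train-clean-100 contains roughly 28.5k utterances, so one epoch at batch size 16 works out to about 1.8k optimizer steps (the utterance count below is the commonly cited figure, not taken from this repository):

```python
# Sanity check on the step count reported above.
utterances = 28_539   # approximate size of train-clean-100
batch_size = 16
steps_per_epoch = -(-utterances // batch_size)  # ceiling division
print(steps_per_epoch)  # 1784
```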
```python
from src.benchmark import ASRDogeBenchmark, BenchmarkConfig

config = BenchmarkConfig(checkpoint_dir="./checkpoints/best")
benchmark = ASRDogeBenchmark(config)
benchmark.load_models()

# Run on test set
result = benchmark.run_full_benchmark(test_samples)
print(f"WER: {result.accuracy.wer:.2%}")
print(f"CER: {result.accuracy.cer:.2%}")

# Measure inference speed
speed_result = benchmark.benchmark_speed(audio_path)
print(f"Latency: {speed_result['total_latency_ms']:.2f}ms")
print(f"Real-Time Factor: {speed_result['real_time_factor']:.2f}x")
```

We welcome contributions! Please see our Contributing Guide for details.
- Model Improvements: Better adapter architectures, LoRA integration
- Dataset Support: Additional datasets beyond LibriSpeech
- Multilingual: Leverage Granite's multilingual capabilities
- Streaming: Real-time streaming ASR implementation
- Quantization: INT8/INT4 quantization for edge deployment
If you use ASR-Doge in your research, please cite:
```bibtex
@misc{asrdoge2026,
  title={ASR-Doge: Parameter-Efficient Speech Recognition with SmallDoge},
  author={Julio Hsu and SmallDoge Team},
  year={2026},
  howpublished={\url{https://github.com/SmallDoge/asr-doge}},
}
```

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- SmallDoge Team for the efficient Doge language model family
- IBM for the Granite Speech encoder
- HuggingFace for the transformers library
- LibriSpeech creators for the benchmark dataset
- GitHub Issues: For bugs and feature requests
- Discussions: For questions and community support
- Email: [your-email@example.com]
Made with ❤️ by the SmallDoge Team