ASR-Doge is a parameter-efficient automatic speech recognition (ASR) model that combines IBM's Granite Speech encoder with the SmallDoge language model through a lightweight MLP adapter. This project demonstrates that competitive ASR performance can be achieved by training only 0.05% of the total model parameters.
| Metric | Score |
|---|---|
| Word Error Rate (WER) | 4.70% |
| Character Error Rate (CER) | 2.75% |
| Perfect Match Rate | 46.0% |
| Real-Time Factor | 0.91x (faster than real-time!) |
| Trainable Parameters | 1.57M / 3.36B (0.05%) |
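The two headline metrics above have simple definitions. Below is a minimal, self-contained illustration of how WER and the real-time factor are computed; this is a sketch for intuition, not the project's benchmark code (which uses `jiwer`):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    r, h = ref.split(), hyp.split()
    # Standard Levenshtein dynamic programme over words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

def real_time_factor(processing_s: float, audio_s: float) -> float:
    """RTF = processing time / audio duration; < 1.0 means faster than real time."""
    return processing_s / audio_s

print(wer("the quick brown fox", "the quick brown box"))  # 0.25
print(real_time_factor(91.0, 100.0))  # 0.91
```

An RTF of 0.91 therefore means the model transcribes 100 seconds of audio in about 91 seconds.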
```text
┌─────────────────────────────────────────────────────────────┐
│                    ASR-Doge Architecture                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   Audio Input (16kHz)                                       │
│                    │                                        │
│                    ▼                                        │
│   ┌────────────────────────────────────┐                    │
│   │      Speech Encoder (FROZEN)       │                    │
│   │     IBM Granite Speech 3.3-2B      │                    │
│   │        Parameters: ~3.04B          │                    │
│   └────────────────┬───────────────────┘                    │
│                    │ [B, T, 2048]                           │
│                    ▼                                        │
│   ┌────────────────────────────────────┐                    │
│   │      MLP Adapter (TRAINABLE)       │                    │
│   │         2048 → 512 → 1024          │                    │
│   │         Parameters: 1.57M          │                    │
│   └────────────────┬───────────────────┘                    │
│                    │ [B, T, 1024]                           │
│                    ▼                                        │
│   ┌────────────────────────────────────┐                    │
│   │      Language Model (FROZEN)       │                    │
│   │           SmallDoge-320M           │                    │
│   │          Parameters: 320M          │                    │
│   └────────────────┬───────────────────┘                    │
│                    │                                        │
│                    ▼                                        │
│   Text Transcription                                        │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
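The adapter in the diagram can be sketched as a two-layer MLP. This is a hedged reconstruction from the stated dimensions only; the activation choice (GELU) and exact layout are assumptions, and the real definition lives in `src/training/train.py`:

```python
import torch
import torch.nn as nn

class SpeechAdapter(nn.Module):
    """2048 -> 512 -> 1024 projection from encoder features into the
    LM embedding space. GELU is an assumed activation choice."""
    def __init__(self, in_dim: int = 2048, hidden: int = 512, out_dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [B, T, 2048] encoder features -> [B, T, 1024] LM inputs
        return self.net(x)

adapter = SpeechAdapter()
print(sum(p.numel() for p in adapter.parameters()))  # 1574400, i.e. ~1.57M
```

The parameter count falls out of the dimensions alone: 2048×512 + 512 + 512×1024 + 1024 = 1,574,400, matching the 1.57M figure in the table above.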
- Python 3.8+
- PyTorch 2.0+
- CUDA 11.8+ (for GPU training)
- 24GB+ VRAM (40GB recommended)
```bash
# Clone the repository
git clone https://github.com/SmallDoge/asr-doge.git
cd asr-doge

# Create conda environment
conda create -n asr-doge python=3.10
conda activate asr-doge

# Install dependencies
pip install -r requirements.txt
```

```text
# requirements.txt
torch>=2.0.0
torchaudio>=2.0.0
transformers>=4.35.0
jiwer>=3.0.0
wandb>=0.15.0
tqdm>=4.65.0
numpy>=1.24.0
```

```bash
# Download LibriSpeech (train-clean-100, dev-clean, test-clean)
python scripts/download_librispeech.py --output_dir ./data
```

```bash
python src/data/data_processor.py \
    --dataset librispeech \
    --data_dir ./data \
    --output_dir ./processed \
    --train_split train-clean-100 \
    --dev_split dev-clean \
    --test_split test-clean
```

```bash
python src/training/train.py \
    --data_dir ./processed \
    --output_dir ./checkpoints \
    --batch_size 16 \
    --learning_rate 2e-3 \
    --epochs 1 \
    --patience 2 \
    --wandb_project asr-doge
```

```bash
python src/benchmark/benchmark.py \
    --checkpoint_dir ./checkpoints/best \
    --data_dir ./processed \
    --output_dir ./benchmark_results \
    --max_samples 200
```

To reproduce the exact results from our paper/TCC:

```bash
# Full training and evaluation pipeline
./scripts/reproduce.sh
```

Or step by step:

```bash
# 1. Process LibriSpeech data
python src/data/data_processor.py \
    --data_dir /path/to/librispeech \
    --output_dir ./processed

# 2. Train with our exact hyperparameters
python src/training/train.py \
    --data_dir ./processed \
    --output_dir ./checkpoints \
    --batch_size 16 \
    --learning_rate 0.002 \
    --epochs 1 \
    --patience 2 \
    --eval_steps 100 \
    --speech_encoder ibm-granite/granite-speech-3.3-2b \
    --language_model SmallDoge/Doge-320M-Checkpoint

# 3. Run comprehensive benchmark
python src/benchmark/benchmark.py \
    --checkpoint_dir ./checkpoints/best \
    --data_dir ./processed \
    --max_samples 200 \
    --num_speed_runs 10
```

```text
asr-doge/
├── README.md                     # This file
├── LICENSE                       # Apache 2.0 License
├── requirements.txt              # Python dependencies
├── src/
│   ├── data/
│   │   ├── __init__.py
│   │   └── data_processor.py     # Dataset processing utilities
│   ├── training/
│   │   ├── __init__.py
│   │   └── train.py              # Training script and model definition
│   ├── benchmark/
│   │   ├── __init__.py
│   │   └── benchmark.py          # Comprehensive benchmark script
│   ├── models/                   # ⚠️ LEGACY - see note below
│   │   └── ...
│   └── modules/                  # ⚠️ LEGACY - see note below
│       └── ...
├── scripts/
│   ├── download_librispeech.py   # Dataset download script
│   └── reproduce.sh              # Full reproduction script
├── configs/
│   └── default.yaml              # Default configuration
└── examples/
    └── inference.py              # Example inference script
```
⚠️ Note on Legacy Directories

The following directories contain experimental/legacy code from early research explorations and should be ignored:

- `src/models/` - Early Doge model experiments (`configuration_singer_doge.py`, `modeling_singer_doge.py`)
- `src/modules/` - Neural audio encoder experiments (SEANet, conv modules, LSTM variants)

These are kept for historical reference but are NOT used in the main ASR-Doge implementation. The actual model architecture is defined in `src/training/train.py`, which uses pre-trained models from HuggingFace (Granite Speech + SmallDoge).
| Parameter | Default | Description |
|---|---|---|
| `speech_encoder` | `ibm-granite/granite-speech-3.3-2b` | Pre-trained speech encoder |
| `language_model` | `SmallDoge/Doge-320M-Checkpoint` | Language model for text generation |
| `adapter_input_dim` | 2048 | Input dimension from speech encoder |
| `adapter_hidden_dim` | 512 | Hidden dimension in MLP adapter |
| `adapter_output_dim` | 1024 | Output dimension matching LM |
| Parameter | Default | Description |
|---|---|---|
| `batch_size` | 16 | Training batch size |
| `learning_rate` | 2e-3 | Peak learning rate |
| `epochs` | 1 | Number of training epochs |
| `patience` | 2 | Early stopping patience |
| `eval_steps` | 100 | Steps between evaluations |
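The tables above imply a simple parameter-efficient setup: freeze the pre-trained pieces and hand only the still-trainable parameters to the optimizer at the table's peak learning rate. A minimal sketch with stand-in modules (the module shapes and the use of AdamW are illustrative assumptions, not the project's actual `train.py`):

```python
import torch
import torch.nn as nn

# Stand-ins for the real components; shapes are illustrative only.
encoder = nn.Linear(80, 2048)    # frozen Granite Speech encoder
adapter = nn.Linear(2048, 1024)  # trainable MLP adapter
lm = nn.Linear(1024, 100)        # frozen SmallDoge LM

# Freeze everything except the adapter.
for module in (encoder, lm):
    for p in module.parameters():
        p.requires_grad = False

# The optimizer only ever sees the adapter's parameters.
trainable = [p for m in (encoder, adapter, lm)
             for p in m.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=2e-3)
print(len(trainable))  # 2: only the adapter's weight and bias
```

This is what makes the trainable-parameter ratio so small: gradients and optimizer state exist only for the adapter.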
Our model converges quickly due to the pre-trained components:
- Training Loss: 3.5 → 0.24 (over ~1,800 steps)
- Validation CER: improves steadily, with early stopping triggered after 2 evaluations without improvement (patience=2)
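The ~1,800-step figure is consistent with the dataset and batch size: LibriSpeech train-clean-100 contains roughly 28.5k utterances, so one epoch at batch size 16 works out to about 1.8k optimizer steps (the utterance count below is the commonly cited figure, not taken from this repository):

```python
# Sanity check on the step count reported above.
utterances = 28_539   # approximate size of train-clean-100
batch_size = 16
steps_per_epoch = -(-utterances // batch_size)  # ceiling division
print(steps_per_epoch)  # 1784
```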
```python
from src.benchmark import ASRDogeBenchmark, BenchmarkConfig

config = BenchmarkConfig(checkpoint_dir="./checkpoints/best")
benchmark = ASRDogeBenchmark(config)
benchmark.load_models()

# Run on test set
result = benchmark.run_full_benchmark(test_samples)
print(f"WER: {result.accuracy.wer:.2%}")
print(f"CER: {result.accuracy.cer:.2%}")

# Measure inference speed
speed_result = benchmark.benchmark_speed(audio_path)
print(f"Latency: {speed_result['total_latency_ms']:.2f}ms")
print(f"Real-Time Factor: {speed_result['real_time_factor']:.2f}x")
```

We welcome contributions! Please see our Contributing Guide for details.
- Model Improvements: Better adapter architectures, LoRA integration
- Dataset Support: Additional datasets beyond LibriSpeech
- Multilingual: Leverage Granite's multilingual capabilities
- Streaming: Real-time streaming ASR implementation
- Quantization: INT8/INT4 quantization for edge deployment
If you use ASR-Doge in your research, please cite:
```bibtex
@misc{asrdoge2026,
  title={ASR-Doge: Parameter-Efficient Speech Recognition with SmallDoge},
  author={Julio Hsu and SmallDoge Team},
  year={2026},
  howpublished={\url{https://github.com/SmallDoge/asr-doge}},
}
```

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- SmallDoge Team for the efficient Doge language model family
- IBM for the Granite Speech encoder
- HuggingFace for the transformers library
- LibriSpeech creators for the benchmark dataset
- GitHub Issues: For bugs and feature requests
- Discussions: For questions and community support
- Email: [your-email@example.com]
Made with ❤️ by the SmallDoge Team