Update Chapter 5 ASR content with latest datasets and models #217

Deep-unlearning wants to merge 3 commits into main
Conversation
- Update Common Voice dataset from v13 to v17 (latest available)
- Update language count from 108 to 124 languages in Common Voice 17
- Update all dataset URLs and references throughout Chapter 5 files
- Update Whisper model reference from whisper-large-v2 to whisper-large-v3
- Update training examples and code snippets to use the latest dataset version
- Maintain educational content structure while using current resources

Files updated:
- chapters/en/chapter5/choosing_dataset.mdx
- chapters/en/chapter5/evaluation.mdx
- chapters/en/chapter5/fine-tuning.mdx
- chapters/en/chapter5/asr_models.mdx
- chapters/en/_toctree.yml (minor formatting fix)
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
- Add detailed section on Moonshine ASR: edge-optimized, 5x faster for short audio
- Add detailed section on Kyutai STT: real-time streaming capabilities
- Include architecture comparison table with performance characteristics
- Add code examples for using Moonshine and Kyutai models
- Update model selection table with new ASR alternatives
- Add model-specific dataset recommendations in choosing_dataset.mdx
- Provide guidance on when to choose each model architecture
- Update summary to reflect expanded ASR landscape

This addresses the Whisper-centric nature of Chapter 5 by providing comprehensive coverage of modern ASR alternatives with different optimization focuses.
ebezzam
left a comment
@Deep-unlearning thanks for the updates! In short, I think it could also be good to mention Parakeet and Voxtral
| small | 244 M | 2.3 | 6 | [✓](https://huggingface.co/openai/whisper-small.en) | [✓](https://huggingface.co/openai/whisper-small) |
| medium | 769 M | 4.2 | 2 | [✓](https://huggingface.co/openai/whisper-medium.en) | [✓](https://huggingface.co/openai/whisper-medium) |
| large | 1550 M | 7.5 | 1 | x | [✓](https://huggingface.co/openai/whisper-large-v2) |
| large | 1550 M | 7.5 | 1 | x | [✓](https://huggingface.co/openai/whisper-large-v3) |
How about adding Turbo as well? https://huggingface.co/openai/whisper-large-v3-turbo
It should be faster, as it has fewer decoder layers.
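As a quick sanity check of that claim, the decoder depths are visible in the model configs on the Hub. A minimal sketch, assuming `transformers` is installed and the Hub is reachable:

```python
from transformers import AutoConfig

# large-v3 and large-v3-turbo share the same encoder; turbo prunes the
# decoder from 32 layers down to 4, which is where the speed-up comes from.
turbo = AutoConfig.from_pretrained("openai/whisper-large-v3-turbo")
large = AutoConfig.from_pretrained("openai/whisper-large-v3")
print(large.decoder_layers, turbo.decoder_layers)  # 32 4
```

Only the small `config.json` files are downloaded here, not the weights, so this check is cheap to run.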
| Moonshine Tiny | 27 M | 0.5 | 5x faster for short audio | English | [✓](https://huggingface.co/UsefulSensors/moonshine-tiny) |
| Moonshine Base | 61 M | 1.0 | Edge-optimized | English | [✓](https://huggingface.co/UsefulSensors/moonshine-base) |
| Kyutai STT 1B | 1000 M | 3.0 | Real-time streaming | English, French | [✓](https://huggingface.co/kyutai/stt-1b-en_fr) |
| Kyutai STT 2.6B | 2600 M | 6.0 | Low-latency streaming | English | [✓](https://huggingface.co/kyutai/stt-2.6b-en) |
How about adding others supported by transformers:
- Parakeet (judging by some LinkedIn reactions, it seems this is quite popular!): https://huggingface.co/nvidia/parakeet-ctc-1.1b
- Voxtral: https://huggingface.co/mistralai/Voxtral-Mini-3B-2507
- Audio Flamingo: https://huggingface.co/nvidia/audio-flamingo-3-hf

NOTE: the last two are audio LLMs with built-in audio understanding.

We could also mention the ASR leaderboard so people have a convenient comparison.
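For context, such leaderboards rank models by word error rate (WER). The chapter uses the `evaluate` library's metric; the sketch below is only a self-contained illustration of what that number measures (word-level edit distance over the reference length), not the leaderboard's exact implementation, which also normalizes text first:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein distance over word tokens via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words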
### When to Choose Each Model

**Choose Whisper when:**

- You need multilingual support (96+ languages)
Equivalent, but I think most docs I see say 99+?
While Whisper has been a game-changer for speech recognition, the field continues to evolve with new architectures designed to address specific limitations and use cases. Let's explore two notable recent developments: **Moonshine** and **Kyutai STT**, which offer different approaches to improving upon Whisper's capabilities.

### Moonshine: Efficient Edge Computing ASR
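A minimal usage sketch that a section like this could include, assuming a recent `transformers` release with Moonshine support (roughly v4.48+) and a 16 kHz mono audio array; the silent clip here is just a placeholder input:

```python
import numpy as np
from transformers import pipeline

# Moonshine tiny: ~27M params, English-only, tuned for short clips on edge devices.
asr = pipeline("automatic-speech-recognition", model="UsefulSensors/moonshine-tiny")

# Any 16 kHz mono float array works; here, one second of silence as a stand-in.
audio = np.zeros(16000, dtype=np.float32)
result = asr({"array": audio, "sampling_rate": 16000})
print(result["text"])
```

In practice you would pass a path to an audio file or a decoded array from a `datasets` audio column instead of the placeholder.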
How about adding such sections for Parakeet and Voxtral? (They are both quite popular.)
- **Diverse domains** benefit from Whisper's robust pre-training
- **Punctuation and casing** are recommended as Whisper handles them well

### For Moonshine Models
Similar to the above: could we add such sections for Voxtral and Parakeet?
Summary
Changes Made
Dataset and Model Updates
- `common_voice_13_0` → `common_voice_17_0`

New ASR Architecture Coverage
Files Modified
- `chapters/en/chapter5/asr_models.mdx` - Added modern ASR section, comparison table, code examples
- `chapters/en/chapter5/choosing_dataset.mdx` - Added model-specific dataset recommendations
- `chapters/en/chapter5/evaluation.mdx` - Updated dataset references
- `chapters/en/chapter5/fine-tuning.mdx` - Updated training examples
- `chapters/en/_toctree.yml` - Minor formatting fix

Key Features Added
Architecture Comparison Table
Model Selection Guidelines
Test Plan