Update Chapter 5 ASR content with latest datasets and models #217

Deep-unlearning wants to merge 3 commits into main
Conversation
- Update Common Voice dataset from v13 to v17 (latest available)
- Update language count from 108 to 124 languages in Common Voice 17
- Update all dataset URLs and references throughout Chapter 5 files
- Update Whisper model reference from whisper-large-v2 to whisper-large-v3
- Update training examples and code snippets to use the latest dataset version
- Maintain educational content structure while using current resources

Files updated:
- chapters/en/chapter5/choosing_dataset.mdx
- chapters/en/chapter5/evaluation.mdx
- chapters/en/chapter5/fine-tuning.mdx
- chapters/en/chapter5/asr_models.mdx
- chapters/en/_toctree.yml (minor formatting fix)
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
- Add detailed section on Moonshine ASR: edge-optimized, 5x faster for short audio
- Add detailed section on Kyutai STT: real-time streaming capabilities
- Include architecture comparison table with performance characteristics
- Add code examples for using Moonshine and Kyutai models
- Update model selection table with new ASR alternatives
- Add model-specific dataset recommendations in choosing_dataset.mdx
- Provide guidance on when to choose each model architecture
- Update summary to reflect expanded ASR landscape

This addresses the Whisper-centric nature of Chapter 5 by providing comprehensive coverage of modern ASR alternatives with different optimization focuses.
ebezzam
left a comment
@Deep-unlearning thanks for the updates! In short, I think it could also be good to mention Parakeet and Voxtral
| small | 244 M | 2.3 | 6 | [✓](https://huggingface.co/openai/whisper-small.en) | [✓](https://huggingface.co/openai/whisper-small) |
| medium | 769 M | 4.2 | 2 | [✓](https://huggingface.co/openai/whisper-medium.en) | [✓](https://huggingface.co/openai/whisper-medium) |
| large | 1550 M | 7.5 | 1 | x | [✓](https://huggingface.co/openai/whisper-large-v2) |
| large | 1550 M | 7.5 | 1 | x | [✓](https://huggingface.co/openai/whisper-large-v3) |
How about adding Turbo as well? https://huggingface.co/openai/whisper-large-v3-turbo
It should be faster, as it has fewer decoder layers.
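As a quick sanity check of that claim, the decoder depths are visible in the model configs on the Hub. A minimal sketch, assuming `transformers` is installed and the Hub is reachable:

```python
from transformers import AutoConfig

# large-v3 and large-v3-turbo share the same encoder; turbo prunes the
# decoder from 32 layers down to 4, which is where the speed-up comes from.
turbo = AutoConfig.from_pretrained("openai/whisper-large-v3-turbo")
large = AutoConfig.from_pretrained("openai/whisper-large-v3")
print(large.decoder_layers, turbo.decoder_layers)  # 32 4
```

Only the small `config.json` files are downloaded here, not the weights, so this check is cheap to run.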
| Moonshine Tiny | 27 M | 0.5 | 5x faster for short audio | English | [✓](https://huggingface.co/UsefulSensors/moonshine-tiny) |
| Moonshine Base | 61 M | 1.0 | Edge-optimized | English | [✓](https://huggingface.co/UsefulSensors/moonshine-base) |
| Kyutai STT 1B | 1000 M | 3.0 | Real-time streaming | English, French | [✓](https://huggingface.co/kyutai/stt-1b-en_fr) |
| Kyutai STT 2.6B | 2600 M | 6.0 | Low-latency streaming | English | [✓](https://huggingface.co/kyutai/stt-2.6b-en) |
How about adding others supported by transformers:
- Parakeet (judging by some LinkedIn reactions, it seems this is quite popular!): https://huggingface.co/nvidia/parakeet-ctc-1.1b
- Voxtral: https://huggingface.co/mistralai/Voxtral-Mini-3B-2507
- Audio Flamingo: https://huggingface.co/nvidia/audio-flamingo-3-hf

NOTE: the last two are audio LLMs with built-in audio understanding.

We could also mention the ASR leaderboard so people have a convenient comparison.
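For context, such leaderboards rank models by word error rate (WER). The chapter uses the `evaluate` library's metric; the sketch below is only a self-contained illustration of what that number measures (word-level edit distance over the reference length), not the leaderboard's exact implementation, which also normalizes text first:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein distance over word tokens via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words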
### When to Choose Each Model

**Choose Whisper when:**

- You need multilingual support (96+ languages)
Equivalent, but I think most docs I see say 99+?
While Whisper has been a game-changer for speech recognition, the field continues to evolve with new architectures designed to address specific limitations and use cases. Let's explore two notable recent developments: **Moonshine** and **Kyutai STT**, which offer different approaches to improving upon Whisper's capabilities.

### Moonshine: Efficient Edge Computing ASR
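A minimal usage sketch that a section like this could include, assuming a recent `transformers` release with Moonshine support (roughly v4.48+) and a 16 kHz mono audio array; the silent clip here is just a placeholder input:

```python
import numpy as np
from transformers import pipeline

# Moonshine tiny: ~27M params, English-only, tuned for short clips on edge devices.
asr = pipeline("automatic-speech-recognition", model="UsefulSensors/moonshine-tiny")

# Any 16 kHz mono float array works; here, one second of silence as a stand-in.
audio = np.zeros(16000, dtype=np.float32)
result = asr({"array": audio, "sampling_rate": 16000})
print(result["text"])
```

In practice you would pass a path to an audio file or a decoded array from a `datasets` audio column instead of the placeholder.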
How about adding such sections for Parakeet and Voxtral? (They are both quite popular.)
- **Diverse domains** benefit from Whisper's robust pre-training
- **Punctuation and casing** are recommended as Whisper handles them well

### For Moonshine Models
Similar to the above: could we add such sections for Voxtral and Parakeet?
Summary
Changes Made
Dataset and Model Updates
- `common_voice_13_0` → `common_voice_17_0`

New ASR Architecture Coverage
Files Modified
- `chapters/en/chapter5/asr_models.mdx` - Added modern ASR section, comparison table, code examples
- `chapters/en/chapter5/choosing_dataset.mdx` - Added model-specific dataset recommendations
- `chapters/en/chapter5/evaluation.mdx` - Updated dataset references
- `chapters/en/chapter5/fine-tuning.mdx` - Updated training examples
- `chapters/en/_toctree.yml` - Minor formatting fix

Key Features Added
Architecture Comparison Table
Model Selection Guidelines
Test Plan