huggingface · EngineerGraduate · Jan 29, 2026
diff --git a/chapters/en/chapter1/5.mdx b/chapters/en/chapter1/5.mdx
@@ -153,7 +153,7 @@ Transformers are not limited to text. They can also be applied to other modaliti
 
 Let's start by exploring how Transformer models handle speech and audio data, which presents unique challenges compared to text or images.
 
-[Whisper](https://huggingface.co/docs/transformers/main/en/model_doc/whisper) is a encoder-decoder (sequence-to-sequence) transformer pretrained on 680,000 hours of labeled audio data. This amount of pretraining data enables zero-shot performance on audio tasks in English and many other languages. The decoder allows Whisper to map the encoders learned speech representations to useful outputs, such as text, without additional fine-tuning. Whisper just works out of the box.
+[Whisper](https://huggingface.co/docs/transformers/main/en/model_doc/whisper) is an encoder-decoder (sequence-to-sequence) transformer pretrained on 680,000 hours of labeled audio data. This amount of pretraining data enables zero-shot performance on audio tasks in English and many other languages. The decoder allows Whisper to map the encoder's learned speech representations to useful outputs, such as text, without additional fine-tuning. Whisper just works out of the box.
 
 <div class="flex justify-center">
     <img src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/whisper_architecture.png"/>