Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion chapters/en/chapter1/5.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -153,7 +153,7 @@ Transformers are not limited to text. They can also be applied to other modaliti

Let's start by exploring how Transformer models handle speech and audio data, which presents unique challenges compared to text or images.

[Whisper](https://huggingface.co/docs/transformers/main/en/model_doc/whisper) is a encoder-decoder (sequence-to-sequence) transformer pretrained on 680,000 hours of labeled audio data. This amount of pretraining data enables zero-shot performance on audio tasks in English and many other languages. The decoder allows Whisper to map the encoders learned speech representations to useful outputs, such as text, without additional fine-tuning. Whisper just works out of the box.
[Whisper](https://huggingface.co/docs/transformers/main/en/model_doc/whisper) is a encoder-decoder (sequence-to-sequence) transformer pretrained on 680,000 hours of labeled audio data. This amount of pretraining data enables zero-shot performance on audio tasks in English and many other languages. The decoder allows Whisper to map the encoder's learned speech representations to useful outputs, such as text, without additional fine-tuning. Whisper just works out of the box.

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/whisper_architecture.png"/>
Expand Down