Describe the bug
The audio-to-text pipeline is not returning word-level timestamps.
@RUFFY-369 is there a way to switch to sdpa when word-level timestamps are requested, without reloading the pipeline onto the GPU?

Reproduction steps
- Download a new audio-to-text pipeline with flash attention 2 enabled
- Send a request to the pipeline including `return_timestamps=word`:
  ```shell
  curl -X POST http://172.17.0.1:6666/audio-to-text -F "audio=@test-audio.mp4" -F "model_id=openai/whisper-large-v3" -F "return_timestamps=word"
  ```
- See the error returned:
  ```json
  {"error":{"message":": Error during model execution: WhisperFlashAttention2 attention does not support output_attentions."}}
  ```
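For context, a minimal sketch of the workaround: word-level timestamps need attention weights (`output_attentions`), which the Flash Attention 2 backend cannot return, so the model has to be loaded with `sdpa` (or `eager`) attention instead. This assumes the Hugging Face `transformers` pipeline API (`attn_implementation` is a real `from_pretrained` kwarg, forwarded via `model_kwargs`); the helper name `load_asr_pipeline` is hypothetical, and whether the backend can be swapped in place without a reload is exactly the open question above.

```python
def load_asr_pipeline(attn_implementation: str = "sdpa"):
    """Build a Whisper ASR pipeline with a chosen attention backend.

    Flash Attention 2 cannot emit the attention weights that word-level
    timestamp alignment relies on, hence the WhisperFlashAttention2 /
    output_attentions error above. Loading with "sdpa" or "eager" avoids it.
    """
    # Imports kept local so the sketch can be read without the
    # heavyweight dependencies installed.
    import torch
    from transformers import pipeline

    return pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-large-v3",
        torch_dtype=torch.float16,
        device=0,  # first GPU
        # attn_implementation is forwarded to from_pretrained
        model_kwargs={"attn_implementation": attn_implementation},
    )


if __name__ == "__main__":
    asr = load_asr_pipeline("sdpa")
    result = asr("test-audio.mp4", return_timestamps="word")
    print(result["chunks"])  # per-word start/end timestamps
```

With this loading path, the same `return_timestamps="word"` request should succeed instead of raising the error shown in the reproduction steps.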
Expected behaviour
Return word-level timestamps.
Severity
None
Screenshots / Live demo link
No response
OS
None
Running on
None
AI-worker version
No response
Additional context
No response