[Inference] fal-ai: audio-to-audio fix + text-to-audio + NeMo ASR & PersonaPlex #2234

Open
jetjodh wants to merge 8 commits into huggingface:main from jetjodh:jetjodh/review-audio-model-support

jetjodh commented Jun 15, 2026

Summary

Adds and fixes audio model support for the fal-ai inference provider.

  1. Fix audio-to-audio for fal-ai. audioToAudio() now resolves url/headers/signal via makeRequestOptions and forwards them to getResponse (mirroring imageSegmentation), so the fal queue task can poll for results instead of throwing "URL and headers are required for audio-to-audio task". Input data-URL MIME handling reuses FAL_AI_AUDIO_MIME_MAP (consistent with the ASR fix), so audio/wav, audio/webm, etc. aren't rejected by fal's data-URL decoder.

  2. Add text-to-audio support. New queue-based FalAITextToAudioTask (handles both audio_file and audio result shapes), a textToAudio() task function (auto-exposed on InferenceClient via the tasks barrel), a widened TextToAudioTaskHelper.getResponse signature, and export of the (previously defined-but-unexported) text-to-audio inference types from @huggingface/tasks.

  3. ASR: handle NeMo/nemotron output shape + timestamps. fal's nemotron ASR endpoint returns the transcript under output (with a partial flag), not text like whisper — so the existing helper would reject it. getResponse now parses both text and output, and normalizes timestamps from chunks (whisper) or segments to HF's AutomaticSpeechRecognitionOutput.chunks.
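
A minimal sketch of the ASR normalization described in item 3, not the code from this PR. The function name `normalizeAsrResponse` is hypothetical, and the per-item shapes of `chunks` (`{ text, timestamp: [start, end] }`) and `segments` (`{ text, start, end }`) are assumptions; only the top-level field names come from the description above.

```typescript
// Output shape modeled on AutomaticSpeechRecognitionOutput (text plus optional chunks).
interface AsrChunk {
  text: string;
  timestamp: number[];
}
interface AsrOutput {
  text: string;
  chunks?: AsrChunk[];
}

// Assumed fal response shape. The inner layout of chunks/segments is a guess.
interface FalAsrResponse {
  text?: unknown; // whisper-style transcript
  output?: unknown; // NeMo/nemotron-style transcript
  partial?: boolean;
  chunks?: Array<{ text: string; timestamp: [number, number] }>;
  segments?: Array<{ text: string; start: number; end: number }>;
}

function normalizeAsrResponse(res: FalAsrResponse): AsrOutput {
  // Prefer `text`, fall back to `output`.
  const text =
    typeof res.text === "string" ? res.text : typeof res.output === "string" ? res.output : undefined;
  if (text === undefined) {
    throw new Error("Expected a string transcript under `text` or `output`");
  }

  // Map either timestamp source onto the same chunk layout.
  let chunks: AsrChunk[] | undefined;
  if (Array.isArray(res.chunks)) {
    chunks = res.chunks.map((c) => ({ text: c.text, timestamp: [...c.timestamp] }));
  } else if (Array.isArray(res.segments)) {
    chunks = res.segments.map((s) => ({ text: s.text, timestamp: [s.start, s.end] }));
  }
  return chunks ? { text, chunks } : { text };
}
```

Omitting `chunks` when no timestamps are present keeps the result identical to the plain-transcript case that existing callers already handle.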

Testing

  • New unit tests: fal-ai audio-to-audio, text-to-audio, and automatic-speech-recognition (text/output/chunks/segments parsing).
  • Verified end-to-end against the live fal API: nemotron ASR (correct transcript) and PersonaPlex batch audio-to-audio (audio returned).
  • tsc --noEmit, eslint, and the new vitest specs all pass.

Note

Medium Risk
Touches live inference request/response paths for fal queue audio and ASR parsing; behavior changes for audio-to-audio callers but scope is provider-specific with new tests.

Overview
Expands fal-ai audio support: fixes audio-to-audio so queue jobs get url, headers, and signal (via makeRequestOptions + provider preparePayloadAsync), adds queue-based text-to-audio, and hardens ASR parsing for NeMo-style { output } plus optional chunks / segments timestamps.

Shared buildFalAiAudioDataUrl centralizes MIME remapping for fal data URLs (ASR and audio-to-audio). The public textToAudio() task is wired through the tasks barrel; @huggingface/tasks now exports text-to-audio inference types. Provider helper interfaces gain AbortSignal on text-to-audio / audio-to-audio getResponse and preparePayloadAsync on audio-to-audio (including hf-inference).
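
To illustrate the MIME remapping idea behind a shared data-URL builder, here is a rough sketch under stated assumptions. `buildAudioDataUrl` and `EXAMPLE_MIME_MAP` are hypothetical names, the map entries are placeholders (the real contents of FAL_AI_AUDIO_MIME_MAP are not shown in this description), and the payload is assumed to be base64-encoded already.

```typescript
// Placeholder alias table, NOT the real FAL_AI_AUDIO_MIME_MAP.
const EXAMPLE_MIME_MAP: Record<string, string> = {
  "audio/mp3": "audio/mpeg",
  "audio/wave": "audio/wav",
};

function buildAudioDataUrl(
  mimeType: string,
  base64Payload: string,
  mimeMap: Record<string, string> = EXAMPLE_MIME_MAP
): string {
  // Strip parameters such as ";codecs=opus" and normalize case before lookup.
  const semi = mimeType.indexOf(";");
  const bare = (semi === -1 ? mimeType : mimeType.slice(0, semi)).trim().toLowerCase();

  // Assumption: anything outside audio/* is rejected up front.
  if (!bare.startsWith("audio/")) {
    throw new Error(`Unsupported blob type for an audio data URL: ${mimeType}`);
  }

  // Remap known aliases; pass every other audio type through unchanged.
  const mapped = mimeMap[bare] ?? bare;
  return `data:${mapped};base64,${base64Payload}`;
}
```

Keeping the remap in one function means the ASR and audio-to-audio payload builders cannot drift apart on which MIME types they accept.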

Unit tests cover fal queue polling, payload mapping, and ASR response variants.

Reviewed by Cursor Bugbot for commit 6ad8df7.

jetjodh and others added 6 commits June 15, 2026 10:37
Adds an audio-to-audio task handler for the fal-ai provider so that HF partner mappings with that task (e.g. nvidia/personaplex-7b-v1 -> fal-ai/personaplex) can be promoted to live via the partner API.

- AudioToAudioTaskHelper now requires preparePayloadAsync (mirrors the ASR helper) so providers that need to async-encode the input blob can hook in there.
- audioToAudio.ts now calls providerHelper.preparePayloadAsync(args) instead of the sync preparePayload util from audio/utils.ts.
- HFInferenceAudioToAudioTask gets a passthrough preparePayloadAsync that returns { data: Blob, ... }, preserving the existing raw binary body behavior for hf-inference.
- New FalAIAudioToAudioTask extends FalAiQueueTask: preparePayloadAsync validates blob type against FAL_AI_SUPPORTED_BLOB_TYPES and base64-encodes the audio into audio_url: data:audio/...;base64,... for the fal queue payload. getResponse polls the queue, fetches the result audio URL, and returns [{ blob, content-type, label }] where label is the generated transcript when the fal app returns one, else "speech".
- Wires audio-to-audio into the fal-ai entry of PROVIDERS in getProviderHelper.ts.
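
The label fallback in the getResponse bullet above can be sketched as follows. This is an illustration only: `parseAudioToAudioResult` is a hypothetical helper, and the `{ audio: { url }, text }` result shape is taken from the PersonaPlex commit message rather than from fal's documentation.

```typescript
// Assumed completed-queue result shape: { audio: { url }, text? }.
interface ParsedAudioResult {
  audioUrl: string;
  label: string;
}

function parseAudioToAudioResult(result: unknown): ParsedAudioResult {
  const r = result as { audio?: { url?: unknown }; text?: unknown } | null | undefined;

  // A missing or non-string audio URL is treated as a malformed response.
  const audioUrl = r?.audio?.url;
  if (typeof audioUrl !== "string" || audioUrl.length === 0) {
    throw new Error("Malformed fal audio-to-audio response: expected { audio: { url: string } }");
  }

  // Use the transcript as the label when one is returned, else "speech".
  const text = r?.text;
  const label = typeof text === "string" && text.trim() !== "" ? text : "speech";
  return { audioUrl, label };
}
```

The caller would then fetch `audioUrl` and pair the resulting blob with `label`.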

Made-with: Cursor
Ensure audio-to-audio resolves providers with the same model and endpoint context as other binary audio tasks, and allow common audio MIME types that fal endpoints can normalize.

Made-with: Cursor
…nse + tests

The fal-ai audio-to-audio task is a queue task, so getResponse needs url and
headers to poll the status/result endpoints. Wire them through audioToAudio()
via makeRequestOptions (mirroring imageSegmentation), widen the
AudioToAudioTaskHelper.getResponse signature accordingly, and add unit tests
covering the queue happy-path, malformed response, and MIME remap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add FalAITextToAudioTask (queue-based, handles `audio_file`/`audio` result
shapes) and register fal-ai for the text-to-audio task. Add a textToAudio()
task function — which is auto-exposed on InferenceClient via the tasks barrel —
widen TextToAudioTaskHelper.getResponse to forward outputType/signal, and export
the text-to-audio inference types from @huggingface/tasks (previously defined
but not re-exported). Includes unit tests for the queue happy-path, the `audio`
fallback, malformed responses, and prompt/parameter payload mapping.
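
As a rough illustration of handling both result shapes, the sketch below picks the audio URL from `audio_file` first and falls back to `audio`. The helper name `extractTextToAudioUrl` is hypothetical, and the assumption that each field is an object with a string `url` is not confirmed by this message.

```typescript
function extractTextToAudioUrl(result: unknown): string {
  // Assumed shapes: { audio_file: { url } } or { audio: { url } }.
  const r = (result ?? {}) as {
    audio_file?: { url?: unknown };
    audio?: { url?: unknown };
  };

  // Prefer `audio_file`, then fall back to `audio`.
  const candidate = r.audio_file?.url ?? r.audio?.url;
  if (typeof candidate !== "string" || candidate.length === 0) {
    throw new Error("Malformed fal text-to-audio response: expected audio_file.url or audio.url");
  }
  return candidate;
}
```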

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…amps

fal's nemotron ASR endpoint (nvidia/nemotron-asr-multilingual/asr) returns the
transcript under `output` (with a `partial` flag), not `text` like fal whisper —
so the existing helper would reject it. Parse both `text` and `output`, and
normalize timestamps from `chunks` (whisper) or `segments` to HF's
AutomaticSpeechRecognitionOutput.chunks. Add a dev hardcoded mapping for
nvidia/nemotron-3.5-asr-streaming-0.6b -> the fal slug until it's registered for
fal-ai on huggingface.co. Verified end-to-end against the live fal API.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
nvidia/personaplex-7b-v1 (pipeline_tag: audio-to-audio) is served by fal's batch
endpoint fal-ai/personaplex, which returns { audio: { url }, text } — already
handled by FalAIAudioToAudioTask. Add a dev hardcoded mapping until it's
registered for fal-ai on huggingface.co. This covers the one-shot speech-to-speech
turn; the real-time full-duplex mode is WebSocket-only and out of scope for this
HTTP/queue client. Verified end-to-end against the live fal API.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
jetjodh and others added 2 commits June 15, 2026 12:11
Drop the nemotron / personaplex entries from HARDCODED_MODEL_INFERENCE_MAPPING;
these models should be wired up via the HF partner mapping for fal-ai instead of
hardcoded dev stopgaps. The provider helpers already handle their request/response
shapes, so no code change is needed once the models are registered on huggingface.co.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>