[Inference] fal-ai: audio-to-audio fix + text-to-audio + NeMo ASR & PersonaPlex by jetjodh · Pull Request #2234 · huggingface/huggingface.js

jetjodh · 2026-06-15T19:05:42Z

Summary

Adds and fixes audio model support for the fal-ai inference provider.

Fix audio-to-audio for fal-ai. audioToAudio() now resolves url/headers/signal via makeRequestOptions and forwards them to getResponse (mirroring imageSegmentation), so the fal queue task can poll for results instead of throwing "URL and headers are required for audio-to-audio task". Input data-URL MIME handling reuses FAL_AI_AUDIO_MIME_MAP (consistent with the ASR fix), so audio/wav, audio/webm, etc. aren't rejected by fal's data-URL decoder.
Add text-to-audio support. New queue-based FalAITextToAudioTask (handles both audio_file and audio result shapes), a textToAudio() task function (auto-exposed on InferenceClient via the tasks barrel), a widened TextToAudioTaskHelper.getResponse signature, and export of the (previously defined-but-unexported) text-to-audio inference types from @huggingface/tasks.
ASR: handle NeMo/nemotron output shape + timestamps. fal's nemotron ASR endpoint returns the transcript under output (with a partial flag), not text like whisper — so the existing helper would reject it. getResponse now parses both text and output, and normalizes timestamps from chunks (whisper) or segments to HF's AutomaticSpeechRecognitionOutput.chunks.
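
As a rough illustration of the ASR normalization described above, here is a minimal sketch. It is not the code from this PR: the field names `text`, `output`, `partial`, `chunks`, and `segments` come from the description, while the `timestamp` tuple, the `start`/`end` fields on segments, and the function names are assumptions made for the example.

```typescript
// Sketch only. Span field layouts below are assumed, not confirmed by the PR.
interface TimedSpan {
	text: string;
	timestamp?: [number, number]; // assumed whisper-style tuple
	start?: number; // assumed segment-style fields
	end?: number;
}

interface FalAsrResponse {
	text?: unknown;
	output?: unknown;
	partial?: boolean;
	chunks?: TimedSpan[];
	segments?: TimedSpan[];
}

// Simplified local stand-in for HF's AutomaticSpeechRecognitionOutput.
interface AsrChunk {
	text: string;
	timestamp: [number, number];
}
interface AsrOutput {
	text: string;
	chunks?: AsrChunk[];
}

function toTimestamp(span: TimedSpan): [number, number] | undefined {
	if (span.timestamp) return span.timestamp;
	if (typeof span.start === "number" && typeof span.end === "number") {
		return [span.start, span.end];
	}
	return undefined;
}

function normalizeAsrResponse(res: FalAsrResponse): AsrOutput {
	// Accept the transcript under `text` (whisper) or `output` (nemotron).
	const text =
		typeof res.text === "string" ? res.text : typeof res.output === "string" ? res.output : undefined;
	if (text === undefined) {
		throw new TypeError("Expected a string transcript under `text` or `output`");
	}
	// Prefer `chunks`, fall back to `segments`; spans without timing are skipped.
	const spans = res.chunks ?? res.segments;
	if (!spans) return { text };
	const chunks: AsrChunk[] = [];
	for (const span of spans) {
		const timestamp = toTimestamp(span);
		if (timestamp) chunks.push({ text: span.text, timestamp });
	}
	return chunks.length > 0 ? { text, chunks } : { text };
}
```

Both response variants collapse to the same `{ text, chunks? }` shape, which is why callers should not need to know which fal endpoint produced the transcript.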

Testing

New unit tests: fal-ai audio-to-audio, text-to-audio, and automatic-speech-recognition (text/output/chunks/segments parsing).
Verified end-to-end against the live fal API: nemotron ASR (correct transcript) and PersonaPlex batch audio-to-audio (audio returned).
tsc --noEmit, eslint, and the new vitest specs all pass.

Note

Medium Risk
Touches live inference request/response paths for fal queue audio and ASR parsing; behavior changes for audio-to-audio callers but scope is provider-specific with new tests.

Overview
Expands fal-ai audio support: fixes audio-to-audio so queue jobs get url, headers, and signal (via makeRequestOptions + provider preparePayloadAsync), adds queue-based text-to-audio, and hardens ASR parsing for NeMo-style { output } plus optional chunks / segments timestamps.

Shared buildFalAiAudioDataUrl centralizes MIME remapping for fal data URLs (ASR and audio-to-audio). The public textToAudio() task is wired through the tasks barrel; @huggingface/tasks now exports text-to-audio inference types. Provider helper interfaces gain AbortSignal on text-to-audio / audio-to-audio getResponse and preparePayloadAsync on audio-to-audio (including hf-inference).
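
To make the MIME remapping idea concrete, here is a hedged sketch of what a data-URL builder of this kind could look like. The actual buildFalAiAudioDataUrl and the entries of FAL_AI_AUDIO_MIME_MAP are not shown in this PR text, so the map contents, the function name `buildAudioDataUrl`, and its signature are all illustrative assumptions.

```typescript
// Hypothetical remap table: these entries are placeholders, not the real
// FAL_AI_AUDIO_MIME_MAP from the PR.
const AUDIO_MIME_MAP: Record<string, string> = {
	"audio/x-wav": "audio/wav",
	"audio/wave": "audio/wav",
	"audio/mp3": "audio/mpeg",
};

function buildAudioDataUrl(bytes: Uint8Array, mimeType: string): string {
	// Drop parameters such as "; codecs=opus" and normalize case before lookup.
	const base = (mimeType.split(";")[0] ?? "").trim().toLowerCase();
	if (!base.startsWith("audio/")) {
		throw new TypeError(`Expected an audio MIME type, got "${mimeType}"`);
	}
	// Unmapped audio types pass through unchanged.
	const mapped = AUDIO_MIME_MAP[base] ?? base;

	// Build a binary string byte by byte so btoa can base64-encode it.
	let binary = "";
	bytes.forEach((byte) => {
		binary += String.fromCharCode(byte);
	});
	return `data:${mapped};base64,${btoa(binary)}`;
}
```

Keeping the remap in one helper means the ASR and audio-to-audio paths cannot drift apart on which MIME strings they emit.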

Unit tests cover fal queue polling, payload mapping, and ASR response variants.

Reviewed by Cursor Bugbot for commit 6ad8df7. Bugbot is set up for automated code reviews on this repo.

Adds an audio-to-audio task handler for the fal-ai provider so that HF partner mappings with that task (e.g. nvidia/personaplex-7b-v1 -> fal-ai/personaplex) can be promoted to live via the partner API.

- AudioToAudioTaskHelper now requires preparePayloadAsync (mirrors the ASR helper) so providers that need to async-encode the input blob can hook in there.
- audioToAudio.ts now calls providerHelper.preparePayloadAsync(args) instead of the sync preparePayload util from audio/utils.ts.
- HFInferenceAudioToAudioTask gets a passthrough preparePayloadAsync that returns { data: Blob, ... }, preserving the existing raw binary body behavior for hf-inference.
- New FalAIAudioToAudioTask extends FalAiQueueTask: preparePayloadAsync validates blob type against FAL_AI_SUPPORTED_BLOB_TYPES and base64-encodes the audio into audio_url: data:audio/...;base64,... for the fal queue payload. getResponse polls the queue, fetches the result audio URL, and returns [{ blob, content-type, label }] where label is the generated transcript when the fal app returns one, else "speech".
- Wires audio-to-audio into the fal-ai entry of PROVIDERS in getProviderHelper.ts.

Made-with: Cursor
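
The result shaping described for getResponse can be sketched roughly as follows. This assumes the fal result has already been fetched from the queue and the audio already downloaded into a Blob; the helper names (`isFalAudioResult`, `labelFor`, `toAudioToAudioOutput`) are hypothetical, and treating an empty or whitespace-only transcript as missing is an assumption of this example rather than something the commit states.

```typescript
// Result shape taken from the PR text: { audio: { url }, text }.
interface FalAudioResult {
	audio: { url: string };
	text?: string;
}

// Simplified local stand-in for one entry of the audio-to-audio output.
interface AudioToAudioItem {
	blob: Blob;
	"content-type": string;
	label: string;
}

function isFalAudioResult(value: unknown): value is FalAudioResult {
	if (typeof value !== "object" || value === null) return false;
	const audio = (value as { audio?: unknown }).audio;
	if (typeof audio !== "object" || audio === null) return false;
	return typeof (audio as { url?: unknown }).url === "string";
}

function labelFor(result: FalAudioResult): string {
	// Assumption: an empty transcript is treated the same as no transcript.
	const transcript = typeof result.text === "string" ? result.text.trim() : "";
	return transcript.length > 0 ? transcript : "speech";
}

function toAudioToAudioOutput(result: unknown, audio: Blob): AudioToAudioItem[] {
	if (!isFalAudioResult(result)) {
		throw new TypeError("Malformed fal response: expected { audio: { url } }");
	}
	return [{ blob: audio, "content-type": audio.type, label: labelFor(result) }];
}
```

Validating the shape before building the output is what lets a malformed queue response fail with a clear error instead of an undefined property access.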

Ensure audio-to-audio resolves providers with the same model and endpoint context as other binary audio tasks, and allow common audio MIME types that fal endpoints can normalize.

Made-with: Cursor

…nse + tests

The fal-ai audio-to-audio task is a queue task, so getResponse needs url and headers to poll the status/result endpoints. Wire them through audioToAudio() via makeRequestOptions (mirroring imageSegmentation), widen the AudioToAudioTaskHelper.getResponse signature accordingly, and add unit tests covering the queue happy-path, malformed response, and MIME remap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add FalAITextToAudioTask (queue-based, handles `audio_file`/`audio` result shapes) and register fal-ai for the text-to-audio task. Add a textToAudio() task function (auto-exposed on InferenceClient via the tasks barrel), widen TextToAudioTaskHelper.getResponse to forward outputType/signal, and export the text-to-audio inference types from @huggingface/tasks (previously defined but not re-exported). Includes unit tests for the queue happy-path, the `audio` fallback, malformed responses, and prompt/parameter payload mapping.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
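
Handling the two result shapes might look something like the sketch below. The keys `audio_file` and `audio` come from the commit message, and trying `audio_file` first follows its mention of an "`audio` fallback". That each key holds an object with a `url` string is an assumption, and `extractAudioUrl` is a hypothetical name, not the PR's helper.

```typescript
// Keys in lookup order: `audio_file` first, `audio` as the fallback.
const AUDIO_RESULT_KEYS = ["audio_file", "audio"] as const;

function extractAudioUrl(result: unknown): string {
	if (typeof result !== "object" || result === null) {
		throw new TypeError("Malformed fal response: expected an object");
	}
	const record = result as Record<string, unknown>;
	for (const key of AUDIO_RESULT_KEYS) {
		const candidate = record[key];
		// Assumption: each shape nests the downloadable location under `url`.
		if (typeof candidate === "object" && candidate !== null) {
			const url = (candidate as { url?: unknown }).url;
			if (typeof url === "string") return url;
		}
	}
	throw new TypeError("Malformed fal response: no `audio_file.url` or `audio.url` found");
}
```

A single lookup loop keeps the precedence between the two shapes explicit and makes it cheap to add another key if a future fal app uses a third one.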

…amps

fal's nemotron ASR endpoint (nvidia/nemotron-asr-multilingual/asr) returns the transcript under `output` (with a `partial` flag), not `text` like fal whisper, so the existing helper would reject it. Parse both `text` and `output`, and normalize timestamps from `chunks` (whisper) or `segments` to HF's AutomaticSpeechRecognitionOutput.chunks. Add a dev hardcoded mapping for nvidia/nemotron-3.5-asr-streaming-0.6b -> the fal slug until it's registered for fal-ai on huggingface.co. Verified end-to-end against the live fal API.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

nvidia/personaplex-7b-v1 (pipeline_tag: audio-to-audio) is served by fal's batch endpoint fal-ai/personaplex, which returns { audio: { url }, text }, a shape already handled by FalAIAudioToAudioTask. Add a dev hardcoded mapping until it's registered for fal-ai on huggingface.co. This covers the one-shot speech-to-speech turn; the real-time full-duplex mode is WebSocket-only and out of scope for this HTTP/queue client. Verified end-to-end against the live fal API.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Drop the nemotron / personaplex entries from HARDCODED_MODEL_INFERENCE_MAPPING; these models should be wired up via the HF partner mapping for fal-ai instead of hardcoded dev stopgaps. The provider helpers already handle their request/response shapes, so no code change is needed once the models are registered on huggingface.co.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jetjodh and others added 6 commits June 15, 2026 10:37

[Inference] Fix fal-ai audio-to-audio request handling

d9169d5


jetjodh requested review from SBrandeis, Wauplin, gary149, hanouticelina, julien-c, ngxson and pcuenca as code owners June 15, 2026 19:05

jetjodh and others added 2 commits June 15, 2026 12:11

[Inference] Trim redundant comments in fal-ai audio helpers

6ad8df7

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jetjodh wants to merge 8 commits into huggingface:main from jetjodh:jetjodh/review-audio-model-support
