feat: CUDA backend swap via binary download and restart #252
Conversation
📝 Walkthrough

Adds end-to-end CUDA backend support: CI to build/split CUDA server binaries, backend endpoints and download/verify/streaming logic, frontend UI and API hooks to manage CUDA binaries, and Tauri commands to restart and launch the CUDA variant.
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant UI as GpuAcceleration<br/>Component
    participant Client as Frontend<br/>API Client
    participant Backend as Backend<br/>API
    participant Storage as Release/R2<br/>Storage
    participant Tauri as Tauri<br/>Runtime
    User->>UI: Click "Download CUDA"
    UI->>Client: downloadCudaBackend()
    Client->>Backend: POST /backend/download-cuda
    Backend->>Client: { progress_key }
    Client->>Backend: EventSource /backend/cuda-progress
    Storage-->>Backend: Serve parts / manifest
    Backend->>Backend: Assemble & verify binary
    Backend-->>Client: SSE progress events / complete
    User->>UI: Click "Restart Server"
    UI->>Tauri: restartServer()
    Tauri->>Tauri: Stop current backend, wait for port release
    Tauri->>Tauri: Check data_dir/backends for CUDA binary
    Tauri->>Tauri: Verify version (--version)
    Tauri->>Tauri: Launch CUDA binary (or fallback)
    Tauri-->>UI: Return server URL/result
    Client->>Backend: Poll /health until ready
    Backend-->>Client: { backend_variant: "cuda", gpu_type: "..." }
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Add the ability to download a CUDA-enabled backend binary (~2.4 GB) and swap it in via a backend-only restart, solving the #1 user pain point (19 open 'GPU not detected' issues caused by GitHub's 2 GB asset limit).

Backend:
- cuda_download.py: download from R2 (primary) or GitHub split-parts (fallback), SHA-256 verification, atomic writes, progress via SSE
- 4 new endpoints: GET/POST/DELETE /backend/cuda-*, GET cuda-progress
- server.py: --version flag, auto-detect variant from binary name
- build_binary.py: --cuda flag for CUDA PyInstaller builds
- split_binary.py: split large binaries into <2GB GitHub Release assets
- CI workflow for building CUDA binary

Tauri:
- restart_server command (stop -> wait -> start)
- start_server prefers CUDA binary from {data_dir}/backends/ if present
- Version mismatch check: runs --version before launching CUDA binary

Frontend:
- GpuAcceleration component: download, progress, restart, switch, delete
- API client + types for CUDA status and management
- Platform lifecycle: restartServer() on Tauri/Web
- Aggressive 1s health polling during restart for fast reconnection
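The verify-then-swap step described above (SHA-256 check, atomic write) can be sketched as follows; this is a minimal illustration, not the PR's actual cuda_download.py code, and `verify_and_commit` is a hypothetical helper name:

```python
import hashlib
import os
from pathlib import Path


def verify_and_commit(temp_path: Path, final_path: Path, expected_sha: str) -> None:
    """Verify a downloaded file's SHA-256, then atomically move it into place."""
    sha = hashlib.sha256()
    with temp_path.open("rb") as f:
        # Hash in 1 MB blocks so a multi-GB binary never sits fully in memory.
        for block in iter(lambda: f.read(1024 * 1024), b""):
            sha.update(block)
    if sha.hexdigest() != expected_sha:
        temp_path.unlink()
        raise ValueError("checksum mismatch; partial file removed")
    # os.replace is atomic on the same filesystem, so a crash never
    # leaves a half-written binary at final_path.
    os.replace(temp_path, final_path)
```

Because the rename is the only step that touches `final_path`, a failed or interrupted download can never corrupt an existing working binary.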
Force-pushed from 44df09a to dccded4
Actionable comments posted: 7
🧹 Nitpick comments (12)
web/src/platform/lifecycle.ts (1)
18-21: Mirror `startServer()` callback semantics here.

`startServer()` invokes `onServerReady`, but `restartServer()` currently just returns the URL. Keeping both methods consistent avoids platform-specific behavior if shared UI ever relies on the restart path to mark the server as ready again.

🩹 Suggested fix

```diff
   async restartServer(): Promise<string> {
     // No-op for web - server is managed externally
-    return import.meta.env.VITE_SERVER_URL || 'http://localhost:17493';
+    const serverUrl = import.meta.env.VITE_SERVER_URL || 'http://localhost:17493';
+    this.onServerReady?.();
+    return serverUrl;
   }
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@web/src/platform/lifecycle.ts` around lines 18 - 21, The restartServer() implementation should mirror startServer() by invoking the onServerReady callback instead of just returning the URL; update the async restartServer() in lifecycle.ts to call the same onServerReady handler (the one used by startServer()) with the resolved URL (import.meta.env.VITE_SERVER_URL || 'http://localhost:17493') before returning it so both startServer() and restartServer() have consistent callback semantics.

docs/plans/PR33_CUDA_PROVIDER_REVIEW.md (1)
25-37: Add language specifier to fenced code block.

The ASCII diagram block lacks a language identifier. While this is cosmetic for documentation, adding a specifier improves markdown lint compliance and rendering consistency.

Suggested fix: change the diagram's opening fence from "```" to "```text" (the block starting with the "Main App (~150MB Win/Lin, ~300 Mac)" box).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/plans/PR33_CUDA_PROVIDER_REVIEW.md` around lines 25 - 37, The fenced ASCII diagram block in the doc (the triple-backtick block containing the box diagram) is missing a language specifier; update the opening fence from ``` to ```text so the block becomes ```text followed by the diagram (e.g., the block starting with "┌──────────────────────────────────────┐" and the lines describing "Main App", "HTTP (localhost)", and "Provider Binary"); this will satisfy markdown linting and ensure consistent rendering.

app/src/lib/api/client.ts (1)
320-331: Remove debug logging before merge.

The added `console.log` statements appear to be debugging artifacts. Production code should not include verbose logging with timestamps for routine API calls.

Remove debug logs

```diff
 async triggerModelDownload(modelName: string): Promise<{ message: string }> {
-  console.log(
-    '[API] triggerModelDownload called for:',
-    modelName,
-    'at',
-    new Date().toISOString(),
-  );
   const result = await this.request<{ message: string }>('/models/download', {
     method: 'POST',
     body: JSON.stringify({ model_name: modelName } as ModelDownloadRequest),
   });
-  console.log('[API] triggerModelDownload response:', result);
   return result;
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/src/lib/api/client.ts` around lines 320 - 331, The debug console.log calls inside the triggerModelDownload flow should be removed: delete the two console.log statements that log "[API] triggerModelDownload called for:" with the timestamp and "[API] triggerModelDownload response:" so the method only calls this.request and returns the result; ensure the surrounding code in the same function/method (the POST to '/models/download' and return of result) remains unchanged and check for any other stray console.log debug lines in the same file.

docs/plans/CUDA_BACKEND_SWAP_FINAL.md (1)
11-19: Add language specifier to fenced code blocks.

The workflow diagram and pseudocode blocks at lines 11 and 98 lack language identifiers. Adding `text` or `plaintext` improves linter compliance.

Suggested fix for line 11: change the opening fence from "```" to "```text" for the block beginning "User clicks \"Download CUDA Backend\" in Settings".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/plans/CUDA_BACKEND_SWAP_FINAL.md` around lines 11 - 19, The fenced code blocks for the workflow ("User clicks \"Download CUDA Backend\" in Settings → Backend downloads binary...") and the later pseudocode block are missing language specifiers; update each opening fence to include a plaintext language (e.g., replace ``` with ```text or ```plaintext) so the linter recognizes them and formatting is preserved; apply this change to the block starting with 'User clicks "Download CUDA Backend"' and the corresponding pseudocode block further down.

app/src/components/ServerSettings/GpuAcceleration.tsx (3)
92-115: Health polling lacks cleanup for in-flight requests.

The interval-based polling creates a new `apiClient.getHealth()` request every second. If requests are slow, multiple requests can be in flight simultaneously. When the server comes back, the first successful response clears the interval, but other in-flight requests may still resolve and trigger duplicate state updates.

This is unlikely to cause visible bugs but could lead to redundant `invalidateQueries()` calls.

Add request deduplication

```diff
 const startHealthPolling = useCallback(() => {
   if (healthPollRef.current) return;
+  let isPolling = true;
   healthPollRef.current = setInterval(async () => {
+    if (!isPolling) return;
     try {
       const result = await apiClient.getHealth();
-      if (result.status === 'healthy') {
+      if (result.status === 'healthy' && isPolling) {
+        isPolling = false;
         // Server is back up
         if (healthPollRef.current) {
           clearInterval(healthPollRef.current);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/src/components/ServerSettings/GpuAcceleration.tsx` around lines 92 - 115, The polling loop in startHealthPolling spawns concurrent apiClient.getHealth() calls and doesn't cancel or dedupe in-flight requests; add a mechanism (e.g., an inFlightRequest ref or AbortController per request) to skip starting a new request if one is pending and to abort any pending request when you clear the interval; update the healthPollRef cleanup to both clearInterval(healthPollRef.current) and cancel the current in-flight request, and ensure the apiClient.getHealth call respects cancellation so that subsequent resolving requests cannot cause duplicate setRestartPhase or queryClient.invalidateQueries calls.
32-38: Clarify refetchInterval logic.

The condition `cudaStatusLoading ? false : 10000` disables refetch while loading. However, `cudaStatusLoading` is only `true` during the initial load, not during background refetches. After the first successful fetch, `cudaStatusLoading` becomes `false` permanently, so this condition effectively always evaluates to `10000`.

If the intent is to pause polling during active operations, consider using a dedicated state flag.

Clarification

```diff
 const {
   data: cudaStatus,
   isLoading: cudaStatusLoading,
   refetch: refetchCudaStatus,
 } = useQuery({
   queryKey: ['cuda-status', serverUrl],
   queryFn: () => apiClient.getCudaStatus(),
-  refetchInterval: cudaStatusLoading ? false : 10000,
+  refetchInterval: 10000, // Poll every 10s when enabled
   retry: 1,
-  enabled: !!health, // Only fetch when backend is reachable
+  enabled: !!health && restartPhase === 'idle', // Pause during restart
 });
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/src/components/ServerSettings/GpuAcceleration.tsx` around lines 32 - 38, The refetchInterval currently uses cudaStatusLoading which is only true during the initial load, so after the first fetch the condition always returns 10000; change the logic to pause polling using a state or react-query's fetch status instead. Replace the use of cudaStatusLoading in refetchInterval with a dedicated flag (e.g., isPollingPaused) or the query status flag (e.g., isFetching / isRefetching returned by useQuery) so polling is actually paused during active operations; update the refetchInterval to use that flag (refetchInterval: isPollingPaused ? false : 10000 or refetchInterval: isFetching ? false : 10000) and set/clear the isPollingPaused flag around the operations that should suspend polling. Ensure references to useQuery, refetchInterval, cudaStatusLoading, and apiClient.getCudaStatus are updated accordingly.
56-90: EventSource uses captured serverUrl — may become stale after restart.

The `serverUrl` is read once when the component mounts (line 19). If the server restarts and the URL changes (unlikely but possible in some configurations), the EventSource at line 61 would use the stale URL.

This is a minor edge case since the port is typically fixed, but worth noting for robustness.

Consider reading the current URL from the store when creating the EventSource:

```typescript
const currentUrl = useServerStore.getState().serverUrl;
const eventSource = new EventSource(`${currentUrl}/backend/cuda-progress`);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/src/components/ServerSettings/GpuAcceleration.tsx` around lines 56 - 90, The EventSource is created using the captured serverUrl inside the useEffect (and can become stale); change the EventSource creation to read the current value from the store at time of creation (e.g., call useServerStore.getState().serverUrl) rather than using the captured serverUrl variable, validate it is non-null before constructing the EventSource, and keep existing cleanup (eventSource.close()) and refetchCudaStatus/error handling in the same useEffect (symbols: useEffect, serverUrl, useServerStore.getState, EventSource, refetchCudaStatus, setDownloadProgress, setError).

.github/workflows/build-cuda.yml (1)
62-73: R2 upload failures are silently ignored.

If the `aws s3 cp` command fails (e.g., due to credential issues or network problems), the workflow continues without error. This could result in a release where the GitHub Release has split parts but R2 has no full binary.

Add error handling

```diff
 - name: Upload full binary to R2
   if: startsWith(github.ref, 'refs/tags/')
   shell: bash
   run: |
+    set -e
     VERSION="${GITHUB_REF#refs/tags/}"
     aws s3 cp backend/dist/voicebox-server-cuda.exe \
       "s3://voicebox-downloads/cuda/${VERSION}/voicebox-server-cuda.exe" \
       --endpoint-url "${{ secrets.R2_ENDPOINT }}"
+    echo "Successfully uploaded to R2: cuda/${VERSION}/voicebox-server-cuda.exe"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/build-cuda.yml around lines 62 - 73, The "Upload full binary to R2" job step currently runs the aws s3 cp command without failing the workflow on errors; update the step's run block to fail fast by enabling strict bash error handling (e.g., add set -euo pipefail at start) or append an explicit check to the aws s3 cp invocation (e.g., aws s3 cp ... || { echo "R2 upload failed" >&2; exit 1; }) so any upload error causes the job to fail and surface credentials/network errors; target the "Upload full binary to R2" step and the aws s3 cp command in the run block when applying the change.

backend/cuda_download.py (2)
136-142: Use `logging.exception` to capture traceback on failure.

When the download fails, `logging.error` loses the stack trace. Using `logging.exception` preserves the full traceback for debugging.

📝 Proposed fix

```diff
     except Exception as e:
         # Clean up on failure
         if temp_path.exists():
             temp_path.unlink()
-        logger.error(f"CUDA backend download failed: {e}")
+        logger.exception(f"CUDA backend download failed: {e}")
         progress.mark_error(PROGRESS_KEY, str(e))
         raise
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/cuda_download.py` around lines 136 - 142, The except block in backend/cuda_download.py currently logs the error with logger.error which discards the traceback; update the handler (in the except Exception as e: block that cleans up temp_path, calls progress.mark_error(PROGRESS_KEY, str(e)) and re-raises) to call logger.exception(...) instead of logger.error(...) so the full traceback is recorded while preserving the temp_path cleanup, progress.mark_error call, and the raise.
177-212: No error handling for individual part download failures.

If a part download fails (non-404 HTTP error, network issue), `response.raise_for_status()` raises an exception, but the partial file is left on disk. The cleanup in the caller handles this, but the error message won't indicate which part failed.

Consider wrapping with more context:

💡 Proposed enhancement

```diff
 async with client.stream("GET", part_url) as response:
-    response.raise_for_status()
+    try:
+        response.raise_for_status()
+    except httpx.HTTPStatusError as e:
+        raise RuntimeError(f"Failed to download part {i+1}/{len(parts)}: {part_name}") from e
     async for chunk in response.aiter_bytes(chunk_size=1024 * 1024):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/cuda_download.py` around lines 177 - 212, _download_split_parts currently lets exceptions from response.raise_for_status() bubble up with no per-part context and leaves only a generic error; wrap the per-part download (the async with client.stream(...) block inside the for loop that writes to open file f) in a try/except Exception as e to catch failures for each part_name, call progress.update_progress(PROGRESS_KEY, current=total_downloaded, total=0, filename=f"Part {i+1}/{len(parts)}", status="failed") and logger.error or logger.exception including the part_name and version, then re-raise a new RuntimeError or raise from the caught exception with a message like f"Failed downloading part {part_name} for {get_cuda_binary_name()}@{version}" so the caller gets clear context while allowing existing cleanup to run.

docs/plans/PROJECT_STATUS.md (1)
22-43: Add language specifier to fenced code blocks.

The ASCII diagram code blocks are missing language specifiers. While these are architecture diagrams rather than code, adding a language hint (e.g., `text` or `plaintext`) satisfies markdown linters and improves consistency.

📝 Proposed fix: change the opening fence from "```" to "```text" for the diagram containing "Tauri Shell (Rust)". Apply similarly at line 64.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/plans/PROJECT_STATUS.md` around lines 22 - 43, Add a language specifier to the fenced ASCII diagram blocks in PROJECT_STATUS.md so markdown linters accept them: locate the fenced block that starts with the ASCII diagram containing "Tauri Shell (Rust)" / "FastAPI Backend (backend/)" and the similar block around line 64, and change their opening fences from ``` to a fenced code block with a plain text hint (e.g., ```text or ```plaintext) so the diagram is treated as plain text.

docs/plans/CUDA_BACKEND_SWAP.md (1)
105-128: Document CUDA version requirements prominently.

The CI workflow example pins to CUDA 12.1 (`cu121`). Consider adding a note in the Problem or Solution section about supported CUDA versions and driver requirements, as this affects which NVIDIA GPUs will work.
Verify each finding against the current code and only fix it if needed. In `@docs/plans/CUDA_BACKEND_SWAP.md` around lines 105 - 128, The CI workflow example currently pins to CUDA 12.1 via the pip line "pip install torch --index-url https://download.pytorch.org/whl/cu121"; update the docs' Problem or Solution sections to prominently state supported CUDA versions and minimum NVIDIA driver requirements, call out that the example uses CUDA 12.1 (cu121) so users with older/newer GPUs may need different wheels, and add guidance or links for selecting compatible torch/cu versions and verifying driver compatibility so readers know which NVIDIA GPUs will work with the CUDA backend.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/build-cuda.yml:
- Around line 35-36: The YAML step "Verify CUDA support in torch" currently
embeds a Python one-liner containing colons which breaks YAML parsing; fix by
quoting the entire run string or converting it to a multi-line script block so
the colons are not interpreted as YAML mappings (i.e., replace the inline run
value with a quoted single-line string or a | multi-line block containing the
python -c command), ensuring the step name "Verify CUDA support in torch" and
its run content are preserved.
In `@backend/cuda_download.py`:
- Around line 229-237: The loop over checksum_urls that does "resp = await
client.get(url)" currently swallows all exceptions; change the except block to
catch Exception as e and log the failure (including the URL and exception
details/stack) before continuing so failures (network errors, parsing issues)
are visible; also log non-200 responses or malformed resp.text when expected_sha
can't be parsed (reference the checksum_urls loop, client.get call,
resp.status_code check, and expected_sha extraction) so you have clear
diagnostics for each attempted URL.
In `@backend/main.py`:
- Around line 1772-1780: The created background task for _download is currently
discarded and its errors are logged with logging.error (losing traceback) and
never surfaced to the frontend; change to store the returned task from
asyncio.create_task into a shared registry (e.g., a global dict like
backend_tasks or backend_errors) keyed by "cuda-backend", attach a
task.add_done_callback to the Task to call logging.exception on any exception
raised by cuda_download.download_cuda_binary (or remove the inner try and let
the callback handle exceptions), and write the exception/result into the shared
registry so the SSE/progress endpoint (which reads progress_key "cuda-backend")
can surface the failure to the user if they connect after the failure. Ensure
references in your changes include the async def _download function,
cuda_download.download_cuda_binary, asyncio.create_task, and the progress_key
"cuda-backend".
In `@backend/server.py`:
- Around line 67-77: Move the --version handling so argument parsing happens
before the heavy module-level imports (avoid importing backend.main or DB stack
when only printing version); parse args early using the existing
parser.add_argument("--version", action="store_true", ...) and immediately check
args.version, and if true import only backend.__version__ (not backend.main) and
print then sys.exit(0); ensure no other module-level imports that pull in
ML/runtime dependencies run before this check.
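The parse-before-import ordering can be sketched as below; the version string is a stand-in for the real `from backend import __version__`, and the function name is hypothetical:

```python
import argparse


def handle_early_flags(argv: list[str]) -> bool:
    """Return True if a cheap flag was handled and the process should exit.

    Intended to be called with sys.argv[1:] before any heavy module-level
    imports, so `--version` never pays the cost of loading the app or ML stack.
    """
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--version", action="store_true")
    args, _ = parser.parse_known_args(argv)
    if args.version:
        # Import only the version string here; backend.main stays unloaded.
        version = "0.0.0-dev"  # stand-in for `from backend import __version__`
        print(version)
        return True
    return False
```

server.py would call `sys.exit(0)` when this returns True, and only afterwards perform its heavy imports; that also keeps Tauri's pre-launch `--version` probe fast.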
- Around line 79-88: The code in server.py uses
os.environ.setdefault("VOICEBOX_BACKEND_VARIANT", "cpu") which preserves an
inherited CUDA value when the current binary name indicates CPU; change the
logic in the startup detection (where binary_name is derived from
sys.executable) to explicitly set os.environ["VOICEBOX_BACKEND_VARIANT"] = "cpu"
when "cuda" is not in binary_name so a CPU binary cannot report an inherited
"cuda" value (update the block that checks binary_name and sets
VOICEBOX_BACKEND_VARIANT accordingly).
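The explicit-assignment fix can be illustrated with this sketch (the function name is hypothetical; the env var name matches the prompt):

```python
import os


def set_backend_variant(binary_name: str, environ=os.environ) -> str:
    """Derive the backend variant from the binary name, overriding any inherited value."""
    variant = "cuda" if "cuda" in binary_name.lower() else "cpu"
    # Explicit assignment rather than setdefault(): a CPU binary must not
    # report a stale VOICEBOX_BACKEND_VARIANT=cuda inherited from its parent.
    environ["VOICEBOX_BACKEND_VARIANT"] = variant
    return variant

# At startup the caller would pass Path(sys.executable).name here.
```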
In `@scripts/split_binary.py`:
- Around line 21-38: The current split function reads the entire file into
memory (data = input_path.read_bytes()), which must be replaced with a streaming
approach: open the input file in binary mode, create a hashlib.sha256() object
and update it incrementally as you read fixed-size blocks, and simultaneously
write each block to successive part files (use a counter for part_index instead
of len(parts)). Keep output_dir.mkdir(...) but remove the large in-memory data
variable; after the read loop finalize the hex digest and write checksum_file
with f"{sha256} {input_path.name}\n". Ensure part naming logic
(f"{input_path.stem}.part{part_index:02d}{input_path.suffix}") and writing uses
the chunk_size argument and that files are written in binary mode so memory
usage stays bounded.
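A bounded-memory version of the split could look like this sketch; part naming and the checksum format follow the prompt, but it is not the PR's actual split_binary.py:

```python
import hashlib
from pathlib import Path

READ_BLOCK = 64 * 1024 * 1024  # 64 MB read granularity keeps memory bounded


def split_binary(input_path: Path, output_dir: Path, chunk_size: int = 1_900_000_000) -> Path:
    """Split input_path into parts of at most chunk_size bytes, hashing as we go."""
    output_dir.mkdir(parents=True, exist_ok=True)
    sha = hashlib.sha256()
    part_index = 0
    with input_path.open("rb") as src:
        while True:
            remaining = chunk_size
            part_name = f"{input_path.stem}.part{part_index:02d}{input_path.suffix}"
            wrote = 0
            with (output_dir / part_name).open("wb") as dst:
                while remaining > 0:
                    block = src.read(min(READ_BLOCK, remaining))
                    if not block:
                        break
                    sha.update(block)  # hash incrementally, never the whole file
                    dst.write(block)
                    wrote += len(block)
                    remaining -= len(block)
            if wrote == 0:
                (output_dir / part_name).unlink()  # drop empty trailing part
                break
            part_index += 1
    checksum_file = output_dir / f"{input_path.name}.sha256"
    checksum_file.write_text(f"{sha.hexdigest()} {input_path.name}\n")
    return checksum_file
```

Peak memory is bounded by `READ_BLOCK` regardless of binary size, and the checksum covers the reassembled whole, so the downloader can verify after concatenating the parts.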
In `@tauri/src-tauri/src/main.rs`:
- Around line 612-629: The 1s sleep in restart_server is fragile; replace it
with a retry loop that waits for the server port to be free before calling
start_server: after stop_server(state.clone()).await?, poll up to N times (e.g.,
10 attempts) using tokio::time::sleep with a short backoff (e.g., 200-500ms) and
on each attempt check port availability by trying to bind a TcpListener to the
server port (implement a small helper like try_bind_port(port) that returns
Ok(()) if bind succeeds then immediately drop the listener), only proceed to
call start_server(app, state, None). If start_server fails after the retries
return the error upward (i.e., propagate the Err<String>) so the frontend can
retry; keep function name restart_server and use stop_server and start_server
symbols as-is.
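The bind-probe retry idea is language-agnostic; here it is sketched in Python for brevity (the real change would live in Rust's restart_server, binding a TcpListener in a loop):

```python
import socket
import time


def wait_for_port_release(port: int, attempts: int = 10, delay: float = 0.3) -> bool:
    """Poll until `port` can be bound, i.e. the old server has fully released it."""
    for _ in range(attempts):
        try:
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
                # SO_REUSEADDR tolerates TIME_WAIT but not an active listener.
                s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
                s.bind(("127.0.0.1", port))
            return True  # bind succeeded; probe socket is dropped immediately
        except OSError:
            time.sleep(delay)
    return False
```

Probing with an actual bind, rather than sleeping a fixed second, makes the restart robust to slow shutdowns while keeping the happy path fast.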
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6d55ea08-83e7-44ad-82ee-d774fb39ca09
📒 Files selected for processing (20)
- .github/workflows/build-cuda.yml
- app/src/components/ServerSettings/GpuAcceleration.tsx
- app/src/components/ServerTab/ServerTab.tsx
- app/src/lib/api/client.ts
- app/src/lib/api/types.ts
- app/src/platform/types.ts
- backend/build_binary.py
- backend/cuda_download.py
- backend/main.py
- backend/models.py
- backend/requirements.txt
- backend/server.py
- docs/plans/CUDA_BACKEND_SWAP.md
- docs/plans/CUDA_BACKEND_SWAP_FINAL.md
- docs/plans/PR33_CUDA_PROVIDER_REVIEW.md
- docs/plans/PROJECT_STATUS.md
- scripts/split_binary.py
- tauri/src-tauri/src/main.rs
- tauri/src/platform/lifecycle.ts
- web/src/platform/lifecycle.ts
backend/cuda_download.py
Outdated
```python
for url in checksum_urls:
    try:
        resp = await client.get(url)
        if resp.status_code == 200:
            # Format: "sha256hex filename\n"
            expected_sha = resp.text.strip().split()[0]
            break
    except Exception:
        continue
```
Log exceptions when checksum fetch fails.
Silent `except Exception: continue` swallows errors without any visibility. If both checksum URLs fail for non-404 reasons (e.g., network issues, malformed response), there's no indication of what went wrong.
🔍 Proposed fix
try:
resp = await client.get(url)
if resp.status_code == 200:
# Format: "sha256hex filename\n"
expected_sha = resp.text.strip().split()[0]
break
- except Exception:
+ except Exception as e:
+ logger.debug(f"Failed to fetch checksum from {url}: {e}")
continue
🧰 Tools
🪛 Ruff (0.15.5)
[error] 236-237: try-except-continue detected, consider logging the exception
(S112)
[warning] 236-236: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/cuda_download.py` around lines 229 - 237, The loop over checksum_urls
that does "resp = await client.get(url)" currently swallows all exceptions;
change the except block to catch Exception as e and log the failure (including
the URL and exception details/stack) before continuing so failures (network
errors, parsing issues) are visible; also log non-200 responses or malformed
resp.text when expected_sha can't be parsed (reference the checksum_urls loop,
client.get call, resp.status_code check, and expected_sha extraction) so you
have clear diagnostics for each attempted URL.
async def _download():
    try:
        await cuda_download.download_cuda_binary()
    except Exception as e:
        import logging
        logging.getLogger(__name__).error(f"CUDA download failed: {e}")

asyncio.create_task(_download())
return {"message": "CUDA backend download started", "progress_key": "cuda-backend"}
Store task reference and improve error propagation.
The static analysis correctly identified issues here:
- The `asyncio.create_task()` return value is discarded — if the task fails, the exception is silently swallowed after logging
- Using `logging.error` instead of `logging.exception` loses the traceback
More importantly, if the download fails before the frontend connects to the SSE stream, the error is logged but never surfaced to the user.
Suggested improvements
+ _cuda_download_task: asyncio.Task | None = None # Module-level or use a proper task registry
async def _download():
try:
await cuda_download.download_cuda_binary()
except Exception as e:
- import logging
- logging.getLogger(__name__).error(f"CUDA download failed: {e}")
+ import logging
+ logging.getLogger(__name__).exception("CUDA download failed")
+ # Ensure error is propagated to progress manager for SSE consumers
+ from .utils.progress import get_progress_manager
+ get_progress_manager().update_progress(
+ model_name="cuda-backend",
+ current=0,
+ total=0,
+ filename="",
+ status="error",
+ error=str(e),
+ )
- asyncio.create_task(_download())
+ task = asyncio.create_task(_download())
+ task.add_done_callback(lambda t: t.exception() if not t.cancelled() else None)
🧰 Tools
🪛 Ruff (0.15.5)
[warning] 1775-1775: Do not catch blind exception: Exception
(BLE001)
[warning] 1777-1777: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
[warning] 1779-1779: Store a reference to the return value of asyncio.create_task
(RUF006)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/main.py` around lines 1772 - 1780, The created background task for
_download is currently discarded and its errors are logged with logging.error
(losing traceback) and never surfaced to the frontend; change to store the
returned task from asyncio.create_task into a shared registry (e.g., a global
dict like backend_tasks or backend_errors) keyed by "cuda-backend", attach a
task.add_done_callback to the Task to call logging.exception on any exception
raised by cuda_download.download_cuda_binary (or remove the inner try and let
the callback handle exceptions), and write the exception/result into the shared
registry so the SSE/progress endpoint (which reads progress_key "cuda-backend")
can surface the failure to the user if they connect after the failure. Ensure
references in your changes include the async def _download function,
cuda_download.download_cuda_binary, asyncio.create_task, and the progress_key
"cuda-backend".
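As a standalone illustration of the done-callback pattern suggested above (the `TASKS`/`ERRORS` registries and `flaky_download` are hypothetical names, not code from the PR):

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

# Shared registries: keep a strong reference to the task and record failures
# so a consumer that connects after the failure can still see what happened.
TASKS: dict[str, asyncio.Task] = {}
ERRORS: dict[str, str] = {}

def _on_done(key: str, task: asyncio.Task) -> None:
    """Record a background task's failure for late subscribers."""
    if task.cancelled():
        return
    exc = task.exception()  # retrieving it prevents "exception never retrieved"
    if exc is not None:
        logger.error("background task %s failed", key, exc_info=exc)
        ERRORS[key] = str(exc)

async def flaky_download() -> None:
    raise RuntimeError("network unreachable")

async def main() -> None:
    task = asyncio.create_task(flaky_download())
    TASKS["cuda-backend"] = task  # strong reference, per RUF006
    task.add_done_callback(lambda t: _on_done("cuda-backend", t))
    try:
        await task
    except RuntimeError:
        pass
    await asyncio.sleep(0)  # let the done callback run

asyncio.run(main())
```

The `ERRORS` entry is what an SSE/progress endpoint could surface to a client that connects after the download has already failed.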
# Detect backend variant from binary name
# voicebox-server-cuda → sets VOICEBOX_BACKEND_VARIANT=cuda
import os
binary_name = os.path.basename(sys.executable).lower()
if "cuda" in binary_name:
    os.environ["VOICEBOX_BACKEND_VARIANT"] = "cuda"
    logger.info("Backend variant: CUDA")
else:
    os.environ.setdefault("VOICEBOX_BACKEND_VARIANT", "cpu")
    logger.info(f"Backend variant: {os.environ['VOICEBOX_BACKEND_VARIANT']}")
Don’t preserve an inherited CUDA variant for CPU binaries.
Line 87 uses `setdefault`, so an inherited `VOICEBOX_BACKEND_VARIANT=cuda` leaves the CPU binary reporting itself as CUDA. Since `backend/main.py:200-210` surfaces this value in health, the UI can show the wrong backend after a CPU restart.
🩹 Suggested fix
 if "cuda" in binary_name:
     os.environ["VOICEBOX_BACKEND_VARIANT"] = "cuda"
     logger.info("Backend variant: CUDA")
 else:
-    os.environ.setdefault("VOICEBOX_BACKEND_VARIANT", "cpu")
+    os.environ["VOICEBOX_BACKEND_VARIANT"] = "cpu"
     logger.info(f"Backend variant: {os.environ['VOICEBOX_BACKEND_VARIANT']}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/server.py` around lines 79 - 88, The code in server.py uses
os.environ.setdefault("VOICEBOX_BACKEND_VARIANT", "cpu") which preserves an
inherited CUDA value when the current binary name indicates CPU; change the
logic in the startup detection (where binary_name is derived from
sys.executable) to explicitly set os.environ["VOICEBOX_BACKEND_VARIANT"] = "cpu"
when "cuda" is not in binary_name so a CPU binary cannot report an inherited
"cuda" value (update the block that checks binary_name and sets
VOICEBOX_BACKEND_VARIANT accordingly).
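The `setdefault` pitfall is easy to demonstrate on a plain dict, which `os.environ` mirrors for this purpose:

```python
env = {"VOICEBOX_BACKEND_VARIANT": "cuda"}  # inherited from the parent process

# setdefault only writes when the key is absent, so the inherited
# value survives and a CPU binary would still report "cuda"
env.setdefault("VOICEBOX_BACKEND_VARIANT", "cpu")
after_setdefault = env["VOICEBOX_BACKEND_VARIANT"]

# Unconditional assignment is what the CPU branch needs
env["VOICEBOX_BACKEND_VARIANT"] = "cpu"
after_assignment = env["VOICEBOX_BACKEND_VARIANT"]
```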
def split(input_path: Path, chunk_size: int, output_dir: Path):
    output_dir.mkdir(parents=True, exist_ok=True)
    data = input_path.read_bytes()
    total_size = len(data)

    # Write SHA-256 of the complete file
    sha256 = hashlib.sha256(data).hexdigest()
    checksum_file = output_dir / f"{input_path.stem}.sha256"
    checksum_file.write_text(f"{sha256} {input_path.name}\n")

    # Split into chunks
    parts = []
    for i in range(0, total_size, chunk_size):
        part_index = len(parts)
        part_name = f"{input_path.stem}.part{part_index:02d}{input_path.suffix}"
        part_path = output_dir / part_name
        part_path.write_bytes(data[i:i + chunk_size])
        parts.append(part_name)
Stream the input file instead of `read_bytes()`.
Lines 23-37 load the whole CUDA binary into memory before hashing and splitting. For the ~2.4 GB artifact in this PR, that can easily exhaust RAM or heavily thrash CI runners. Hash and write parts incrementally from a buffered stream.
🩹 Suggested direction
def split(input_path: Path, chunk_size: int, output_dir: Path):
output_dir.mkdir(parents=True, exist_ok=True)
- data = input_path.read_bytes()
- total_size = len(data)
+ total_size = input_path.stat().st_size
+ hasher = hashlib.sha256()
+ parts = []
+
+ with input_path.open("rb") as src:
+ part_index = 0
+ current_size = 0
+ part_file = None
+ try:
+ while chunk := src.read(8 * 1024 * 1024):
+ hasher.update(chunk)
+ offset = 0
+ while offset < len(chunk):
+ if part_file is None or current_size == chunk_size:
+ if part_file is not None:
+ part_file.close()
+ part_name = f"{input_path.stem}.part{part_index:02d}{input_path.suffix}"
+ parts.append(part_name)
+ part_file = (output_dir / part_name).open("wb")
+ part_index += 1
+ current_size = 0
+
+ write_size = min(chunk_size - current_size, len(chunk) - offset)
+ part_file.write(chunk[offset:offset + write_size])
+ current_size += write_size
+ offset += write_size
+ finally:
+ if part_file is not None:
+ part_file.close()
# Write SHA-256 of the complete file
- sha256 = hashlib.sha256(data).hexdigest()
+ sha256 = hasher.hexdigest()
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def split(input_path: Path, chunk_size: int, output_dir: Path):
    output_dir.mkdir(parents=True, exist_ok=True)
    total_size = input_path.stat().st_size
    hasher = hashlib.sha256()
    parts = []

    with input_path.open("rb") as src:
        part_index = 0
        current_size = 0
        part_file = None
        try:
            while chunk := src.read(8 * 1024 * 1024):
                hasher.update(chunk)
                offset = 0
                while offset < len(chunk):
                    if part_file is None or current_size == chunk_size:
                        if part_file is not None:
                            part_file.close()
                        part_name = f"{input_path.stem}.part{part_index:02d}{input_path.suffix}"
                        parts.append(part_name)
                        part_file = (output_dir / part_name).open("wb")
                        part_index += 1
                        current_size = 0

                    write_size = min(chunk_size - current_size, len(chunk) - offset)
                    part_file.write(chunk[offset:offset + write_size])
                    current_size += write_size
                    offset += write_size
        finally:
            if part_file is not None:
                part_file.close()

    # Write SHA-256 of the complete file
    sha256 = hasher.hexdigest()
    checksum_file = output_dir / f"{input_path.stem}.sha256"
    checksum_file.write_text(f"{sha256} {input_path.name}\n")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/split_binary.py` around lines 21 - 38, The current split function
reads the entire file into memory (data = input_path.read_bytes()), which must
be replaced with a streaming approach: open the input file in binary mode,
create a hashlib.sha256() object and update it incrementally as you read
fixed-size blocks, and simultaneously write each block to successive part files
(use a counter for part_index instead of len(parts)). Keep output_dir.mkdir(...)
but remove the large in-memory data variable; after the read loop finalize the
hex digest and write checksum_file with f"{sha256} {input_path.name}\n". Ensure
part naming logic (f"{input_path.stem}.part{part_index:02d}{input_path.suffix}")
and writing uses the chunk_size argument and that files are written in binary
mode so memory usage stays bounded.
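The invariant the streaming rewrite relies on, that feeding `hashlib.sha256` chunk by chunk produces the same digest as hashing the whole payload at once, can be checked in isolation:

```python
import hashlib

payload = bytes(range(256)) * 1000  # stand-in for a large binary (256 KB here)
chunk_size = 7000                   # deliberately not a divisor of len(payload)

# Hash incrementally while "splitting", as the streaming version would
hasher = hashlib.sha256()
parts = []
for i in range(0, len(payload), chunk_size):
    chunk = payload[i:i + chunk_size]
    hasher.update(chunk)
    parts.append(chunk)

# Reassembly round-trips, and the incremental digest equals the one-shot hash
assert b"".join(parts) == payload
assert hasher.hexdigest() == hashlib.sha256(payload).hexdigest()
```

This is why the split can bound its memory use to one read buffer regardless of the binary's size.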
#[command]
async fn restart_server(
    app: tauri::AppHandle,
    state: State<'_, ServerState>,
) -> Result<String, String> {
    println!("restart_server: stopping current server...");

    // Stop the current server
    stop_server(state.clone()).await?;

    // Wait for port to be released
    println!("restart_server: waiting for port release...");
    tokio::time::sleep(tokio::time::Duration::from_millis(1000)).await;

    // Start server again (will auto-detect CUDA binary)
    println!("restart_server: starting server...");
    start_server(app, state, None).await
}
1-second wait may be insufficient for reliable port release.
The restart_server command waits 1000ms between stopping and starting. On Windows, port release can take longer, especially if:
- The process is slow to terminate
- The OS holds the port in TIME_WAIT state
- Antivirus software delays process cleanup
If the port isn't released, start_server may fail or connect to a stale/zombie process.
Consider either:
- Increasing the wait time (2-3 seconds)
- Adding a retry loop that checks port availability
- Returning an error if start fails so frontend can retry
Suggested improvement — retry with port check
async fn restart_server(
app: tauri::AppHandle,
state: State<'_, ServerState>,
) -> Result<String, String> {
println!("restart_server: stopping current server...");
stop_server(state.clone()).await?;
- // Wait for port to be released
- println!("restart_server: waiting for port release...");
- tokio::time::sleep(tokio::time::Duration::from_millis(1000)).await;
+ // Wait for port to be released with retry
+ println!("restart_server: waiting for port release...");
+ for attempt in 0..10 {
+ tokio::time::sleep(tokio::time::Duration::from_millis(300)).await;
+ use std::net::TcpStream;
+ if TcpStream::connect_timeout(
+ &format!("127.0.0.1:{}", SERVER_PORT).parse().unwrap(),
+ std::time::Duration::from_millis(100),
+ ).is_err() {
+ // Port is free
+ println!("Port released after {}ms", (attempt + 1) * 300);
+ break;
+ }
+ if attempt == 9 {
+ println!("Warning: Port still in use after 3s, proceeding anyway");
+ }
+ }
println!("restart_server: starting server...");
start_server(app, state, None).await
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tauri/src-tauri/src/main.rs` around lines 612 - 629, The 1s sleep in
restart_server is fragile; replace it with a retry loop that waits for the
server port to be free before calling start_server: after
stop_server(state.clone()).await?, poll up to N times (e.g., 10 attempts) using
tokio::time::sleep with a short backoff (e.g., 200-500ms) and on each attempt
check port availability by trying to bind a TcpListener to the server port
(implement a small helper like try_bind_port(port) that returns Ok(()) if bind
succeeds then immediately drop the listener), only proceed to call
start_server(app, state, None). If start_server fails after the retries return
the error upward (i.e., propagate the Err<String>) so the frontend can retry;
keep function name restart_server and use stop_server and start_server symbols
as-is.
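The bind-based availability check described in the prompt can be sketched in Python under POSIX semantics (the Rust version would bind a `TcpListener` the same way); `port_is_free` is an illustrative helper, not code from the PR:

```python
import socket

def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True when binding succeeds, i.e. nothing is listening on the port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False

# Hold a port with a live listener, then release it
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0: the OS picks a free port
listener.listen(1)
port = listener.getsockname()[1]
busy_while_listening = not port_is_free(port)
listener.close()
free_after_close = port_is_free(port)
```

A restart loop would call such a check with a short sleep between attempts, instead of sleeping a fixed 1000ms and hoping the port is free.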
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tauri/src-tauri/src/main.rs (1)
231-265: ⚠️ Potential issue | 🟠 Major
Defer CPU sidecar resolution until the fallback branch.
This returns early if `sidecar("voicebox-server")` fails, so the CUDA path at Line 277 is never attempted even when a valid downloaded CUDA binary is present. Right now, CUDA startup still depends on the bundled CPU sidecar being healthy.

Proposed fix

-    let sidecar_result = app.shell().sidecar("voicebox-server");
-
-    let mut sidecar = match sidecar_result {
-        Ok(s) => s,
-        Err(e) => {
-            eprintln!("Failed to get sidecar: {}", e);
-            ...
-            return Err(format!("Failed to start server. In dev mode, run 'bun run dev:server' in a separate terminal."));
-        }
-    };
-
-    println!("Sidecar command created successfully");
-
     // Build common args
     let data_dir_str = data_dir
         .to_str()
@@
     let spawn_result = if let Some(ref cuda_path) = cuda_binary {
         println!("Launching CUDA backend: {:?}", cuda_path);
         let mut cmd = app.shell().command(cuda_path.to_str().unwrap());
         cmd = cmd.args(["--data-dir", &data_dir_str, "--port", &port_str]);
         if is_remote {
             cmd = cmd.args(["--host", "0.0.0.0"]);
         }
         cmd.spawn()
     } else {
+        let sidecar_result = app.shell().sidecar("voicebox-server");
+        let mut sidecar = match sidecar_result {
+            Ok(s) => s,
+            Err(e) => {
+                eprintln!("Failed to get sidecar: {}", e);
+                ...
+                return Err(format!("Failed to start server. In dev mode, run 'bun run dev:server' in a separate terminal."));
+            }
+        };
+
+        println!("Sidecar command created successfully");
         // Use the bundled CPU sidecar
         sidecar = sidecar.args(["--data-dir", &data_dir_str, "--port", &port_str]);
         if is_remote {
             sidecar = sidecar.args(["--host", "0.0.0.0"]);
         }

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tauri/src-tauri/src/main.rs` around lines 231 - 265, The current code returns early when app.shell().sidecar("voicebox-server") fails, preventing the CUDA/downloaded-binary startup path from being attempted; instead, change the handling so sidecar_result is converted into an Option (e.g., let sidecar: Option<_> = match sidecar_result { Ok(s) => Some(s), Err(e) => { log the error and run the dev-mode TCP check using SERVER_PORT as before; None } }); do not return Err there—only return an error after attempting the CUDA/downloaded-binary startup path (the code that currently runs after the CPU sidecar block) if that fallback also fails; reference the sidecar variable (Option) where the CPU-sidecar-specific startup was previously used so the code tries CUDA/downloaded binary when sidecar is None.
♻️ Duplicate comments (4)
.github/workflows/build-cuda.yml (1)
35-36: ⚠️ Potential issue | 🔴 Critical
Quote the CUDA verification command so the workflow parses.

Line 36 still embeds `:` inside an inline `run:` value, so the workflow is invalid YAML and GitHub will reject it before any job starts.

🐛 Proposed fix

  - name: Verify CUDA support in torch
-   run: python -c "import torch; print(f'CUDA available in build: {torch.cuda.is_available()}'); print(f'CUDA version: {torch.version.cuda}')"
+   run: |
+     python -c "import torch; print(f'CUDA available in build: {torch.cuda.is_available()}'); print(f'CUDA version: {torch.version.cuda}')"

#!/bin/bash
python -m pip install --quiet pyyaml
python - <<'PY'
from pathlib import Path
import sys
import yaml

path = Path(".github/workflows/build-cuda.yml")
try:
    yaml.safe_load(path.read_text())
    print("YAML parsed successfully")
except Exception as exc:
    print(f"YAML parse failed: {exc}")
    sys.exit(1)
PY

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In @.github/workflows/build-cuda.yml around lines 35 - 36, The "Verify CUDA support in torch" job step's run value contains unquoted colons which break YAML parsing; fix by making the run value a proper YAML string or a multiline block scalar (e.g., wrap the python command in quotes or use a |-style block) so the embedded ":" in "torch.version.cuda" doesn't invalidate the workflow; update the step named "Verify CUDA support in torch" accordingly.

backend/main.py (1)
1768-1779: ⚠️ Potential issue | 🟠 Major
Serialize CUDA download requests and keep the active task.

`get_cuda_binary_path()` only guards the completed binary. Two callers can reach Line 1779 before the first download finishes, and `backend/cuda_download.py:76-190` then has both tasks writing the same `.download` file and racing the final rename. Keeping a single shared task here also fixes the existing lost-exception path if the background job fails.

🛠️ One way to guard the endpoint

+_cuda_download_task: asyncio.Task | None = None
+
 @app.post("/backend/download-cuda")
 async def download_cuda_backend():
     """Download the CUDA backend binary. Returns immediately; track progress via SSE."""
     from . import cuda_download
+    global _cuda_download_task

     # Check if already downloaded
     if cuda_download.get_cuda_binary_path() is not None:
         raise HTTPException(status_code=409, detail="CUDA backend already downloaded")
+
+    if _cuda_download_task and not _cuda_download_task.done():
+        raise HTTPException(status_code=409, detail="CUDA backend download already in progress")

     async def _download():
-        try:
-            await cuda_download.download_cuda_binary()
-        except Exception as e:
-            import logging
-            logging.getLogger(__name__).error(f"CUDA download failed: {e}")
+        await cuda_download.download_cuda_binary()

-    asyncio.create_task(_download())
+    _cuda_download_task = asyncio.create_task(_download())
+    _cuda_download_task.add_done_callback(lambda t: t.exception() if not t.cancelled() else None)
     return {"message": "CUDA backend download started", "progress_key": "cuda-backend"}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@backend/main.py` around lines 1768 - 1779, Race occurs because get_cuda_binary_path() only detects completed installs; fix by serializing downloads and retaining the active background task. Introduce a module-level asyncio.Lock (e.g., _cuda_download_lock) and a shared task variable (e.g., _cuda_download_task); in the endpoint acquire the lock, if get_cuda_binary_path() is set return 409, if _cuda_download_task exists return an appropriate 202/ongoing response or re-use it, otherwise create and store a task that runs download_cuda_binary(); ensure the task has an exception handler or add a done callback to log/propagate errors so failures are not silently lost and clear _cuda_download_task on completion.

backend/server.py (2)
79-88: ⚠️ Potential issue | 🟠 Major
Overwrite the variant for CPU binaries instead of using `setdefault`.

If the parent process inherited `VOICEBOX_BACKEND_VARIANT=cuda`, Lines 87-88 preserve that value and `/health` will report the CPU binary as CUDA. The `else` branch should always write `"cpu"`.

🩹 Proposed fix

 if "cuda" in binary_name:
     os.environ["VOICEBOX_BACKEND_VARIANT"] = "cuda"
     logger.info("Backend variant: CUDA")
 else:
-    os.environ.setdefault("VOICEBOX_BACKEND_VARIANT", "cpu")
-    logger.info(f"Backend variant: {os.environ['VOICEBOX_BACKEND_VARIANT']}")
+    os.environ["VOICEBOX_BACKEND_VARIANT"] = "cpu"
+    logger.info("Backend variant: CPU")

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/server.py` around lines 79 - 88, The current logic in the binary detection block uses os.environ.setdefault("VOICEBOX_BACKEND_VARIANT", "cpu") which preserves an inherited VOICEBOX_BACKEND_VARIANT (e.g., "cuda") for CPU binaries; change this so the else branch unconditionally sets os.environ["VOICEBOX_BACKEND_VARIANT"] = "cpu" (update the code around binary_name detection in backend/server.py), and keep the logger.info call that logs the new value (logger.info(...)) so /health reflects the actual CPU binary.
67-77: ⚠️ Potential issue | 🟠 Major
`--version` still pays the full startup cost.

By the time Lines 74-77 run, Lines 35-40 have already imported `backend.main`, so `voicebox-server --version` stays slow and can still fail on machines missing runtime deps. Parse the flag before those imports and exit early.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/server.py` around lines 67 - 77, The --version flag is parsed after expensive imports (including backend.main), so running voicebox-server --version still triggers full startup; move the argparse creation and args = parser.parse_args() near the top of the module before importing heavy modules (avoid importing backend.main or other runtime deps), check if args.version immediately, and if true import backend.__version__ only, print the version and sys.exit(0); ensure no other imports that pull in backend.main occur before this early-exit check so the CLI can return quickly and safely.
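A minimal sketch of the early-exit shape that prompt describes; `__version__` and `main` here are stand-ins, not the actual `backend/server.py` code:

```python
__version__ = "0.0.0-example"  # stand-in for backend.__version__

def main(argv: list[str]) -> int:
    # Handle --version before any heavy imports (torch, FastAPI, backend.main)
    # become reachable, so the flag stays fast and dependency-free.
    if "--version" in argv[1:]:
        print(__version__)
        return 0
    # Heavy imports and server startup would only happen past this point,
    # e.g. `import backend.main` in the real module.
    return 1
```

Keeping the check above all heavy imports is what lets the Tauri side run `voicebox-server --version` cheaply before launching the binary.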
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@app/src/components/ServerSettings/GpuAcceleration.tsx`:
- Around line 28-38: The refetchInterval option currently references
cudaStatusLoading before that variable exists (causing a ReferenceError); remove
the self-reference and use a concrete expression instead — e.g., in the useQuery
call for apiClient.getCudaStatus() (queryKey ['cuda-status', serverUrl]) replace
refetchInterval: cudaStatusLoading ? false : 10000 with a constant or an
expression based on existing props like health (for example refetchInterval:
health ? 10000 : false) so the options object no longer reads cudaStatusLoading
before destructuring completes.
In `@backend/cuda_download.py`:
- Around line 172-175: The current code first unlinks final_path then renames
temp_path, which can leave no valid binary if the rename fails; replace that
sequence with an atomic swap by calling temp_path.replace(final_path) (use the
Path.replace/os.replace API) so the temp file atomically overwrites final_path;
update the block that references final_path and temp_path in
backend/cuda_download.py to use temp_path.replace(final_path) and remove the
unlink/rename calls.
- Around line 120-131: The checksum fetch currently swallows errors and
continues; change this to "fail closed": when performing the fetch with
client.get (sha_url / sha_resp) ensure non-200 responses, network exceptions, or
parse results that are not a 64-character hex (the expected_sha parsed from
sha_resp.text.strip().split()[0]) are treated as fatal—log an error with details
via logger.error and raise an exception (or exit) instead of logger.warning so
the install is aborted if the .sha256 is missing, malformed, or unreachable;
validate expected_sha format explicitly before proceeding.
- Around line 73-97: The download_cuda_binary function can run concurrently and
clobber shared files; serialize access by introducing a module-level async lock
(e.g. _cuda_download_lock = asyncio.Lock()) and wrap the critical
section—checking/removing temp_path, performing the download, writing to
temp_path, checksum verification, and the atomic rename to final_path—inside
"async with _cuda_download_lock:" so only one coroutine touches the .download
file, final_path, and progress key at a time; if cross-process safety is
required instead, use a filesystem lock (e.g. portalocker or flock on a .lock
file adjacent to binary_name) around the same critical section.
In `@docs/plans/PROJECT_STATUS.md`:
- Around line 440-460: The API quick reference table is out of sync with
backend/main.py; update the entries so they match the actual route signatures:
change the `/generate/stream` entry to indicate it streams audio/wav (not SSE),
remove the non-existent `GET /stories/{id}/items` row, and replace
`/stories/{id}/export` with the correct `/stories/{id}/export-audio` endpoint
and verbs; while editing, cross-check other rows (e.g., `/stories`,
`/stories/{id}`, `/models/*`) against backend/main.py to ensure methods and
descriptions exactly match the implementation.
---
Outside diff comments:
In `@tauri/src-tauri/src/main.rs`:
- Around line 231-265: The current code returns early when
app.shell().sidecar("voicebox-server") fails, preventing the
CUDA/downloaded-binary startup path from being attempted; instead, change the
handling so sidecar_result is converted into an Option (e.g., let sidecar:
Option<_> = match sidecar_result { Ok(s) => Some(s), Err(e) => { log the error
and run the dev-mode TCP check using SERVER_PORT as before; None } }); do not
return Err there—only return an error after attempting the
CUDA/downloaded-binary startup path (the code that currently runs after the CPU
sidecar block) if that fallback also fails; reference the sidecar variable
(Option) where the CPU-sidecar-specific startup was previously used so the code
tries CUDA/downloaded binary when sidecar is None.
---
Duplicate comments:
In @.github/workflows/build-cuda.yml:
- Around line 35-36: The "Verify CUDA support in torch" job step's run value
contains unquoted colons which break YAML parsing; fix by making the run value a
proper YAML string or a multiline block scalar (e.g., wrap the python command in
quotes or use a |-style block) so the embedded ":" in "torch.version.cuda"
doesn't invalidate the workflow; update the step named "Verify CUDA support in
torch" accordingly.
In `@backend/main.py`:
- Around line 1768-1779: Race occurs because get_cuda_binary_path() only detects
completed installs; fix by serializing downloads and retaining the active
background task. Introduce a module-level asyncio.Lock (e.g.,
_cuda_download_lock) and a shared task variable (e.g., _cuda_download_task); in
the endpoint acquire the lock, if get_cuda_binary_path() is set return 409, if
_cuda_download_task exists return an appropriate 202/ongoing response or re-use
it, otherwise create and store a task that runs download_cuda_binary(); ensure
the task has an exception handler or add a done callback to log/propagate errors
so failures are not silently lost and clear _cuda_download_task on completion.
In `@backend/server.py`:
- Around line 79-88: The current logic in the binary detection block uses
os.environ.setdefault("VOICEBOX_BACKEND_VARIANT", "cpu") which preserves an
inherited VOICEBOX_BACKEND_VARIANT (e.g., "cuda") for CPU binaries; change this
so the else branch unconditionally sets os.environ["VOICEBOX_BACKEND_VARIANT"] =
"cpu" (update the code around binary_name detection in backend/server.py), and
keep the logger.info call that logs the new value (logger.info(...)) so /health
reflects the actual CPU binary.
- Around line 67-77: The --version flag is parsed after expensive imports
(including backend.main), so running voicebox-server --version still triggers
full startup; move the argparse creation and args = parser.parse_args() near the
top of the module before importing heavy modules (avoid importing backend.main
or other runtime deps), check if args.version immediately, and if true import
backend.__version__ only, print the version and sys.exit(0); ensure no other
imports that pull in backend.main occur before this early-exit check so the CLI
can return quickly and safely.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: df835041-b635-49d5-9746-79baa842f849
📒 Files selected for processing (20)
- .github/workflows/build-cuda.yml
- app/src/components/ServerSettings/GpuAcceleration.tsx
- app/src/components/ServerTab/ServerTab.tsx
- app/src/lib/api/client.ts
- app/src/lib/api/types.ts
- app/src/platform/types.ts
- backend/build_binary.py
- backend/cuda_download.py
- backend/main.py
- backend/models.py
- backend/requirements.txt
- backend/server.py
- docs/plans/CUDA_BACKEND_SWAP.md
- docs/plans/CUDA_BACKEND_SWAP_FINAL.md
- docs/plans/PR33_CUDA_PROVIDER_REVIEW.md
- docs/plans/PROJECT_STATUS.md
- scripts/split_binary.py
- tauri/src-tauri/src/main.rs
- tauri/src/platform/lifecycle.ts
- web/src/platform/lifecycle.ts
🚧 Files skipped from review as they are similar to previous changes (7)
- backend/requirements.txt
- scripts/split_binary.py
- tauri/src/platform/lifecycle.ts
- web/src/platform/lifecycle.ts
- backend/models.py
- app/src/lib/api/types.ts
- app/src/lib/api/client.ts
const {
  data: cudaStatus,
  isLoading: cudaStatusLoading,
  refetch: refetchCudaStatus,
} = useQuery({
  queryKey: ['cuda-status', serverUrl],
  queryFn: () => apiClient.getCudaStatus(),
  refetchInterval: cudaStatusLoading ? false : 10000,
  retry: 1,
  enabled: !!health, // Only fetch when backend is reachable
});
🧩 Analysis chain
🏁 Script executed:
find . -name "GpuAcceleration.tsx" -type f
Repository: jamiepine/voicebox
Length of output: 117
🏁 Script executed:
cat -n ./app/src/components/ServerSettings/GpuAcceleration.tsx | head -50
Repository: jamiepine/voicebox
Length of output: 2500
🏁 Script executed:
rg "cudaStatusLoading" ./app/src/components/ServerSettings/GpuAcceleration.tsx
Repository: jamiepine/voicebox
Length of output: 151
🏁 Script executed:
cat -n ./app/src/components/ServerSettings/GpuAcceleration.tsx | sed -n '1,40p'
Repository: jamiepine/voicebox
Length of output: 2056
Remove the self-reference from refetchInterval.
Line 35 reads cudaStatusLoading before it exists. The destructuring assignment on lines 28-31 hasn't completed when the options object (lines 32-38) is evaluated, so cudaStatusLoading is not yet in scope. This will throw a ReferenceError at runtime. Use a constant interval or refactor to a separate query variable.
Proposed fix
- const {
- data: cudaStatus,
- isLoading: cudaStatusLoading,
- refetch: refetchCudaStatus,
- } = useQuery({
+ const cudaStatusQuery = useQuery({
queryKey: ['cuda-status', serverUrl],
queryFn: () => apiClient.getCudaStatus(),
- refetchInterval: cudaStatusLoading ? false : 10000,
+ refetchInterval: 10000,
retry: 1,
enabled: !!health, // Only fetch when backend is reachable
});
+
+ const {
+ data: cudaStatus,
+ isLoading: cudaStatusLoading,
+ refetch: refetchCudaStatus,
+ } = cudaStatusQuery;

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@app/src/components/ServerSettings/GpuAcceleration.tsx` around lines 28 - 38,
The refetchInterval option currently references cudaStatusLoading before that
variable exists (causing a ReferenceError); remove the self-reference and use a
concrete expression instead — e.g., in the useQuery call for
apiClient.getCudaStatus() (queryKey ['cuda-status', serverUrl]) replace
refetchInterval: cudaStatusLoading ? false : 10000 with a constant or an
expression based on existing props like health (for example refetchInterval:
health ? 10000 : false) so the options object no longer reads cudaStatusLoading
before destructuring completes.
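The failure mode is easy to reproduce outside React: any language that evaluates the right-hand side before binding the name rejects this pattern. A minimal Python analog (the `use_query` stub is hypothetical, standing in for the real hook):

```python
def use_query(options):
    # Stub standing in for the real hook: just echoes a loading flag.
    return {"isLoading": False, "options": options}

try:
    # The options dict reads cuda_status_loading, but that name is only
    # bound by the assignment below — evaluation order means the lookup
    # happens first, mirroring the TDZ ReferenceError in JavaScript.
    result = use_query({
        "refetchInterval": False if cuda_status_loading else 10_000,
    })
    cuda_status_loading = result["isLoading"]
    outcome = "no error"
except NameError:
    outcome = "NameError"

print(outcome)  # → NameError
```

The fix in either language is the same shape: make the interval depend on something already in scope, or bind the query result first and derive options from it.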
```python
async def download_cuda_binary(version: Optional[str] = None):
    """Download the CUDA backend binary from GitHub Releases.

    Downloads split parts listed in a manifest file, concatenates them,
    and verifies the SHA-256 checksum for integrity. Atomic write
    (temp file -> rename).

    Args:
        version: Version tag (e.g. "v0.2.0"). Defaults to current app version.
    """
    import httpx

    if version is None:
        version = f"v{__version__}"

    progress = get_progress_manager()
    binary_name = get_cuda_binary_name()
    dest_dir = get_backends_dir()
    final_path = dest_dir / binary_name
    temp_path = dest_dir / f"{binary_name}.download"

    # Clean up any leftover partial download
    if temp_path.exists():
        temp_path.unlink()
```
Serialize CUDA downloads before touching these shared paths.
The endpoint pre-check is non-atomic, so two requests can still reach this function together. Both coroutines would reuse the same .download file, the same final path, and the same cuda-backend progress key, which makes corruption and state clobbering possible.
Proposed fix
+import asyncio
import hashlib
import logging
import os
import sys
@@
PROGRESS_KEY = "cuda-backend"
+_cuda_download_lock = asyncio.Lock()
@@
async def download_cuda_binary(version: Optional[str] = None):
@@
- if version is None:
- version = f"v{__version__}"
-
- progress = get_progress_manager()
- binary_name = get_cuda_binary_name()
- dest_dir = get_backends_dir()
- final_path = dest_dir / binary_name
- temp_path = dest_dir / f"{binary_name}.download"
-
- # Clean up any leftover partial download
- if temp_path.exists():
- temp_path.unlink()
+ if _cuda_download_lock.locked():
+ raise RuntimeError("CUDA backend download already in progress")
+
+ async with _cuda_download_lock:
+ if version is None:
+ version = f"v{__version__}"
+
+ progress = get_progress_manager()
+ binary_name = get_cuda_binary_name()
+ dest_dir = get_backends_dir()
+ final_path = dest_dir / binary_name
+ temp_path = dest_dir / f"{binary_name}.download"
+
+ if temp_path.exists():
+ temp_path.unlink()
- logger.info(f"Starting CUDA backend download for {version}")
- progress.update_progress(
- PROGRESS_KEY, current=0, total=0,
- filename="Fetching manifest...", status="downloading",
- )
+ logger.info(f"Starting CUDA backend download for {version}")
+ progress.update_progress(
+ PROGRESS_KEY, current=0, total=0,
+ filename="Fetching manifest...", status="downloading",
+ )
- base_url = f"{GITHUB_RELEASES_URL}/{version}"
- stem = Path(binary_name).stem # voicebox-server-cuda
+ base_url = f"{GITHUB_RELEASES_URL}/{version}"
+ stem = Path(binary_name).stem # voicebox-server-cuda
- try:
- async with httpx.AsyncClient(follow_redirects=True, timeout=30.0) as client:
+ try:
+ async with httpx.AsyncClient(follow_redirects=True, timeout=30.0) as client:
...
- except Exception as e:
- # Clean up on failure
- if temp_path.exists():
- temp_path.unlink()
- logger.error(f"CUDA backend download failed: {e}")
- progress.mark_error(PROGRESS_KEY, str(e))
- raise
+ except Exception as e:
+ if temp_path.exists():
+ temp_path.unlink()
+ logger.error(f"CUDA backend download failed: {e}")
+ progress.mark_error(PROGRESS_KEY, str(e))
+ raise

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/cuda_download.py` around lines 73 - 97, The download_cuda_binary
function can run concurrently and clobber shared files; serialize access by
introducing a module-level async lock (e.g. _cuda_download_lock =
asyncio.Lock()) and wrap the critical section—checking/removing temp_path,
performing the download, writing to temp_path, checksum verification, and the
atomic rename to final_path—inside "async with _cuda_download_lock:" so only one
coroutine touches the .download file, final_path, and progress key at a time; if
cross-process safety is required instead, use a filesystem lock (e.g.
portalocker or flock on a .lock file adjacent to binary_name) around the same
critical section.
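The suggested pattern in isolation — a minimal, runnable sketch (the lock name and reject-rather-than-queue behavior follow the proposed fix; `do_download` is a stand-in for the real transfer):

```python
import asyncio

_cuda_download_lock = asyncio.Lock()

async def download_once(do_download):
    # Reject concurrent callers instead of queueing them, so a second
    # POST /backend/download-cuda fails fast while one is in flight.
    if _cuda_download_lock.locked():
        raise RuntimeError("CUDA backend download already in progress")
    async with _cuda_download_lock:
        return await do_download()

async def demo():
    started = asyncio.Event()

    async def slow_download():
        started.set()
        await asyncio.sleep(0.05)  # simulate network I/O
        return "downloaded"

    first = asyncio.create_task(download_once(slow_download))
    await started.wait()  # the first download now holds the lock
    try:
        await download_once(slow_download)
        second = "accepted"
    except RuntimeError:
        second = "rejected"
    return await first, second

result = asyncio.run(demo())
print(result)  # → ('downloaded', 'rejected')
```

Note that an `asyncio.Lock` only serializes within one process; if multiple server processes could share the backends directory, a filesystem lock on a sidecar `.lock` file (as the prompt mentions) would be needed instead.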
```python
# Fetch expected checksum (optional — for integrity verification)
expected_sha = None
try:
    sha_url = f"{base_url}/{stem}.sha256"
    sha_resp = await client.get(sha_url)
    if sha_resp.status_code == 200:
        # Format: "sha256hex  filename\n"
        expected_sha = sha_resp.text.strip().split()[0]
        logger.info(f"Expected SHA-256: {expected_sha[:16]}...")
except Exception:
    logger.warning("Could not fetch checksum file — skipping verification")
```
Fail closed if checksum retrieval or parsing fails.
This code installs an executable that tauri/src-tauri/src/main.rs later launches from disk. If the .sha256 fetch is missing, malformed, or temporarily unreachable, silently “skipping verification” turns a transient network problem into running an unauthenticated binary on the next restart.
Proposed fix
- # Fetch expected checksum (optional — for integrity verification)
- expected_sha = None
- try:
- sha_url = f"{base_url}/{stem}.sha256"
- sha_resp = await client.get(sha_url)
- if sha_resp.status_code == 200:
- # Format: "sha256hex filename\n"
- expected_sha = sha_resp.text.strip().split()[0]
- logger.info(f"Expected SHA-256: {expected_sha[:16]}...")
- except Exception:
- logger.warning("Could not fetch checksum file — skipping verification")
+ sha_url = f"{base_url}/{stem}.sha256"
+ sha_resp = await client.get(sha_url)
+ sha_resp.raise_for_status()
+ parts = sha_resp.text.strip().split()
+ if not parts:
+ raise ValueError("Checksum file is empty or malformed")
+ expected_sha = parts[0]
+ logger.info(f"Expected SHA-256: {expected_sha[:16]}...")

🧰 Tools
🪛 Ruff (0.15.5)
[warning] 129-129: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/cuda_download.py` around lines 120 - 131, The checksum fetch
currently swallows errors and continues; change this to "fail closed": when
performing the fetch with client.get (sha_url / sha_resp) ensure non-200
responses, network exceptions, or parse results that are not a 64-character hex
(the expected_sha parsed from sha_resp.text.strip().split()[0]) are treated as
fatal—log an error with details via logger.error and raise an exception (or
exit) instead of logger.warning so the install is aborted if the .sha256 is
missing, malformed, or unreachable; validate expected_sha format explicitly
before proceeding.
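A self-contained sketch of the fail-closed verification (the helper name and the strict 64-hex-character check are illustrative, not the module's actual code):

```python
import hashlib
import re

def verify_sha256(data: bytes, checksum_text: str) -> None:
    """Raise ValueError unless checksum_text holds a valid digest matching data."""
    parts = checksum_text.strip().split()
    # Fail closed: an empty or malformed .sha256 file aborts the install
    # instead of silently skipping verification.
    if not parts or not re.fullmatch(r"[0-9a-fA-F]{64}", parts[0]):
        raise ValueError("checksum file is empty or malformed")
    expected = parts[0].lower()
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected:
        raise ValueError(f"checksum mismatch: {actual[:16]}... != {expected[:16]}...")

payload = b"fake cuda binary"
good_line = hashlib.sha256(payload).hexdigest() + "  voicebox-server-cuda\n"
verify_sha256(payload, good_line)  # passes silently

try:
    verify_sha256(payload, "oops, not a digest")
except ValueError as exc:
    print(exc)  # → checksum file is empty or malformed
```

Treating the fetch the same way (raise on non-200 or network error rather than logging a warning) closes the remaining gap the review describes.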
```python
# Atomic move into place
if final_path.exists():
    final_path.unlink()
temp_path.rename(final_path)
```
Use an atomic replace here.
Deleting the current binary before renaming the temp file leaves a failure window where a known-good CUDA backend is gone if the rename fails. Path.replace()/os.replace() preserves the atomic swap semantics this comment is aiming for.
Proposed fix
- # Atomic move into place
- if final_path.exists():
- final_path.unlink()
- temp_path.rename(final_path)
+ # Atomic move into place
+ temp_path.replace(final_path)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/cuda_download.py` around lines 172 - 175, The current code first
unlinks final_path then renames temp_path, which can leave no valid binary if
the rename fails; replace that sequence with an atomic swap by calling
temp_path.replace(final_path) (use the Path.replace/os.replace API) so the temp
file atomically overwrites final_path; update the block that references
final_path and temp_path in backend/cuda_download.py to use
temp_path.replace(final_path) and remove the unlink/rename calls.
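The difference is observable with a short, self-contained demo (the paths here are throwaway temp files, not the real backends directory):

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as d:
    final_path = Path(d) / "voicebox-server-cuda"
    temp_path = Path(d) / "voicebox-server-cuda.download"

    final_path.write_bytes(b"old known-good binary")
    temp_path.write_bytes(b"newly downloaded binary")

    # Path.replace() maps to os.replace(): on POSIX this is a single
    # rename(2) that atomically overwrites the destination, so final_path
    # is never absent — unlike unlink() followed by rename(), which
    # leaves a window with no binary at all if the rename fails.
    temp_path.replace(final_path)

    installed = final_path.read_bytes()
    leftover = temp_path.exists()

print(installed, leftover)  # → b'newly downloaded binary' False
```

Both paths must live on the same filesystem for the rename to be atomic, which holds here because the temp file and the final binary share the backends directory.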
| Endpoint | Methods | Description |
|---|---|---|
| `/generate` | POST | Generate speech |
| `/generate/stream` | POST | Stream speech (SSE) |
| `/history` | GET | List generation history |
| `/history/{id}` | GET, DELETE | Get/delete generation |
| `/history/{id}/export` | GET | Export generation ZIP |
| `/history/{id}/export-audio` | GET | Export audio only |
| `/transcribe` | POST | Transcribe audio (Whisper) |
| `/models/status` | GET | All model statuses |
| `/models/download` | POST | Trigger model download |
| `/models/{name}` | DELETE | Delete downloaded model |
| `/models/load` | POST | Load model into memory |
| `/models/unload` | POST | Unload model |
| `/models/progress/{name}` | GET | SSE download progress |
| `/tasks/active` | GET | Active downloads/generations |
| `/stories` | POST, GET | Create/list stories |
| `/stories/{id}` | GET, PUT, DELETE | Story CRUD |
| `/stories/{id}/items` | POST, GET | Story items CRUD |
| `/stories/{id}/export` | GET | Export story audio |
| `/channels` | POST, GET | Audio channel CRUD |
| `/channels/{id}` | PUT, DELETE | Channel update/delete |
| `/cache/clear` | POST | Clear voice prompt cache |
The API quick reference is out of sync with the backend.
A few rows here don't match backend/main.py: /generate/stream streams audio/wav, not SSE; there is no GET /stories/{id}/items; and the story export route is /stories/{id}/export-audio, not /stories/{id}/export. Since this table reads like public API docs, those mismatches will mislead integrators.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/plans/PROJECT_STATUS.md` around lines 440 - 460, The API quick reference
table is out of sync with backend/main.py; update the entries so they match the
actual route signatures: change the `/generate/stream` entry to indicate it
streams audio/wav (not SSE), remove the non-existent `GET /stories/{id}/items`
row, and replace `/stories/{id}/export` with the correct
`/stories/{id}/export-audio` endpoint and verbs; while editing, cross-check
other rows (e.g., `/stories`, `/stories/{id}`, `/models/*`) against
backend/main.py to ensure methods and descriptions exactly match the
implementation.
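Drift like this can also be caught mechanically by diffing the documented paths against the route decorators; a hypothetical sketch (the regexes and inline samples are illustrative, not project code):

```python
import re

DOCS_TABLE = """
| `/generate` | POST | Generate speech |
| `/stories/{id}/export` | GET | Export story audio |
"""

MAIN_PY = '''
@app.post("/generate")
async def generate(): ...

@app.get("/stories/{story_id}/export-audio")
async def export_story_audio(): ...
'''

def normalize(path: str) -> str:
    # Collapse parameter names so {id} and {story_id} compare equal.
    return re.sub(r"\{[^}]+\}", "{param}", path)

documented = {normalize(p) for p in re.findall(r"\|\s*`([^`]+)`", DOCS_TABLE)}
implemented = {normalize(p) for p in re.findall(r'@app\.\w+\("([^"]+)"\)', MAIN_PY)}

stale = sorted(documented - implemented)
print(stale)  # → ['/stories/{param}/export']
```

Run against the real `docs/plans/PROJECT_STATUS.md` table and `backend/main.py`, a check like this would flag the stale export route in CI before it misleads integrators.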
Summary
How it works
User clicks Download CUDA Backend in Settings → backend fetches manifest from GitHub Releases → downloads split parts, concatenates them → SHA-256 integrity check → placed in `{app_data_dir}/backends/` → user clicks Switch to CUDA Backend → Tauri kills CPU process, launches CUDA binary → frontend reconnects via health polling.

What changed
Backend (Python)
- `backend/cuda_download.py` — Download module: fetches split parts from GitHub Releases, reassembles, integrity check, atomic writes, progress tracking via existing SSE system
- `backend/main.py` — 4 new endpoints: `GET /backend/cuda-status`, `POST /backend/download-cuda`, `DELETE /backend/cuda`, `GET /backend/cuda-progress`. Health endpoint now returns `backend_variant` field.
- `backend/server.py` — `--version` flag, auto-detect backend variant from binary name (sets `VOICEBOX_BACKEND_VARIANT` env var)
- `backend/build_binary.py` — `--cuda` flag for CUDA PyInstaller builds
- `backend/requirements.txt` — Added `httpx>=0.27.0`

Build & CI
- `scripts/split_binary.py` — Splits large binaries into <2 GB chunks with SHA-256 manifest for GitHub Releases
- `.github/workflows/build-cuda.yml` — CI workflow for building the CUDA binary, splitting it, and uploading to GitHub Releases

Tauri (Rust)
- `restart_server` command: stop → wait 1s → start
- `start_server` modified: checks for CUDA binary in `{data_dir}/backends/`, launches via `shell().command()` instead of sidecar
- Version mismatch check: runs `--version` before launch, falls back to CPU on mismatch

Frontend (React/TypeScript)
- `GpuAcceleration.tsx` — New component: shows backend variant (CPU/CUDA badge), GPU info, download with SSE progress bar, restart to switch, switch back to CPU, delete binary
- `CudaStatus`, `CudaDownloadProgress` interfaces; `HealthResponse` updated
- API client: `getCudaStatus()`, `downloadCudaBackend()`, `deleteCudaBackend()`
- `restartServer()` added to the platform interface, implemented for Tauri + Web
- `GpuAcceleration` card wired in (Tauri-only)

Docs
- `docs/plans/CUDA_BACKEND_SWAP.md` — Original 5-phase implementation plan
- `docs/plans/CUDA_BACKEND_SWAP_FINAL.md` — Final implementation summary with testing checklist
- `docs/plans/PROJECT_STATUS.md` — Full project triage
- `docs/plans/PR33_CUDA_PROVIDER_REVIEW.md` — Review of existing PR "CUDA GPU Support - External Provider Binaries" #33 (22 bugs, will not merge)

Design decisions
- `start_server` always prefers CUDA if present, so switching to CPU removes the binary. User can re-download later. Avoids persistent config.

Testing checklist
- `voicebox-server-cuda --version` prints correct version
- Place binary in `{data_dir}/backends/` → app auto-detects and uses it
- `/health` shows `backend_variant: "cuda"`
- After switching back to CPU, `/health` shows `backend_variant: "cpu"`

Summary by CodeRabbit