
feat: CUDA backend swap via binary download and restart #252

Open

jamiepine wants to merge 1 commit into main from feat/cuda-backend-swap

Conversation

@jamiepine (Owner) commented Mar 12, 2026

Summary

  • Adds the ability to download a CUDA-enabled backend binary (~2.4 GB) and swap it in via a backend-only restart — solving the #1 user pain point (19 open "GPU not detected" issues caused by GitHub's 2 GB release asset limit)
  • Binary is split into <2 GB chunks on GitHub Releases, downloaded and reassembled by the app with SHA-256 integrity verification
  • Frontend stays running during the swap. All UI state is preserved. One process at a time — no subprocess management, no provider architecture.
  • On all future app launches, the CUDA binary is auto-detected and used.

How it works

User clicks Download CUDA Backend in Settings → backend fetches manifest from GitHub Releases → downloads split parts, concatenates them → SHA-256 integrity check → placed in {app_data_dir}/backends/ → user clicks Switch to CUDA Backend → Tauri kills CPU process, launches CUDA binary → frontend reconnects via health polling.
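
The middle of that flow (reassemble the split parts, verify integrity, then atomically move the binary into place) can be sketched as follows. This is a minimal illustration, not the actual cuda_download.py code; the helper name and the 1 MiB block size are assumptions:

```python
import hashlib
from pathlib import Path

def assemble_and_verify(parts: list[Path], dest: Path, expected_sha256: str) -> None:
    # Concatenate the downloaded parts into a temp file, hashing as we go,
    # then move the result into place only if the checksum matches.
    tmp = dest.with_suffix(".tmp")
    sha = hashlib.sha256()
    with open(tmp, "wb") as out:
        for part in sorted(parts):  # part files sort by name: .part00, .part01, ...
            with open(part, "rb") as f:
                while chunk := f.read(1024 * 1024):  # stream in 1 MiB blocks
                    sha.update(chunk)
                    out.write(chunk)
    if sha.hexdigest() != expected_sha256:
        tmp.unlink()
        raise ValueError("SHA-256 mismatch: reassembled binary is corrupt")
    tmp.replace(dest)  # atomic rename on the same filesystem
```

Writing to a temp file and renaming only after verification means a crashed or corrupt download can never be mistaken for a valid binary on the next launch.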

What changed

Backend (Python)

  • backend/cuda_download.py — Download module: fetches split parts from GitHub Releases, reassembles, integrity check, atomic writes, progress tracking via existing SSE system
  • backend/main.py — 4 new endpoints: GET /backend/cuda-status, POST /backend/download-cuda, DELETE /backend/cuda, GET /backend/cuda-progress. Health endpoint now returns backend_variant field.
  • backend/server.py — --version flag, auto-detect backend variant from binary name (sets VOICEBOX_BACKEND_VARIANT env var)
  • backend/build_binary.py — --cuda flag for CUDA PyInstaller builds
  • backend/requirements.txt — Added httpx>=0.27.0

Build & CI

  • scripts/split_binary.py — Splits large binaries into <2 GB chunks with SHA-256 manifest for GitHub Releases
  • .github/workflows/build-cuda.yml — CI workflow for building CUDA binary, splitting, and uploading to GitHub Releases
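
The splitter can stay memory-bounded even for a ~2.4 GB input by streaming fixed-size blocks and hashing incrementally. A sketch under assumed names (this is not the actual scripts/split_binary.py; the signature and manifest filename are illustrative):

```python
import hashlib
from pathlib import Path

def split_binary(
    input_path: Path,
    output_dir: Path,
    chunk_size: int,
    block_size: int = 1024 * 1024,
) -> list:
    # Stream the binary into parts without loading it all into memory,
    # hashing incrementally so the manifest covers the whole original file.
    # A part may exceed chunk_size by up to block_size - 1 bytes, so pick
    # chunk_size with a safety margin under the 2 GB release-asset limit.
    output_dir.mkdir(parents=True, exist_ok=True)
    sha = hashlib.sha256()
    parts = []
    index = 0
    written = 0
    part_file = None
    with open(input_path, "rb") as src:
        while block := src.read(block_size):
            sha.update(block)
            if part_file is None or written >= chunk_size:
                if part_file is not None:
                    part_file.close()
                name = f"{input_path.stem}.part{index:02d}{input_path.suffix}"
                parts.append(output_dir / name)
                part_file = open(output_dir / name, "wb")
                index += 1
                written = 0
            part_file.write(block)
            written += len(block)
    if part_file is not None:
        part_file.close()
    # Two-space separator matches the sha256sum manifest format.
    (output_dir / f"{input_path.name}.sha256").write_text(
        f"{sha.hexdigest()}  {input_path.name}\n"
    )
    return parts
```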

Tauri (Rust)

  • restart_server command: stop → wait 1s → start
  • start_server modified: checks for CUDA binary in {data_dir}/backends/, launches via shell().command() instead of sidecar
  • Version mismatch protection: runs --version before launch, falls back to CPU on mismatch
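
That version check can be illustrated in Python, though the real implementation lives in the Tauri Rust layer; select_backend and the assumption that --version prints the bare version string are illustrative:

```python
from __future__ import annotations

import subprocess

def select_backend(cuda_binary: str | None, cpu_binary: str, app_version: str) -> str:
    # Prefer the CUDA binary, but only when its --version output matches the
    # app version; anything else (mismatch, missing, hung) falls back to CPU.
    if cuda_binary is None:
        return cpu_binary
    try:
        result = subprocess.run(
            [cuda_binary, "--version"], capture_output=True, text=True, timeout=10
        )
        if result.returncode == 0 and result.stdout.strip() == app_version:
            return cuda_binary
    except (OSError, subprocess.TimeoutExpired):
        pass  # unreadable or unresponsive binary: treat as a mismatch
    return cpu_binary
```

Treating any failure mode as a mismatch keeps the fallback path simple: the worst outcome of a bad CUDA binary is a CPU launch, never a broken app.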

Frontend (React/TypeScript)

  • GpuAcceleration.tsx — New component: shows backend variant (CPU/CUDA badge), GPU info, download with SSE progress bar, restart to switch, switch back to CPU, delete binary
  • API types — CudaStatus, CudaDownloadProgress interfaces; HealthResponse updated
  • API client — getCudaStatus(), downloadCudaBackend(), deleteCudaBackend()
  • Platform lifecycle — restartServer() added to interface, implemented for Tauri + Web
  • ServerTab — GpuAcceleration card wired in (Tauri-only)

Docs

  • docs/plans/CUDA_BACKEND_SWAP.md — Original 5-phase implementation plan
  • docs/plans/CUDA_BACKEND_SWAP_FINAL.md — Final implementation summary with testing checklist
  • docs/plans/PROJECT_STATUS.md — Full project triage
  • docs/plans/PR33_CUDA_PROVIDER_REVIEW.md — Review of existing PR #33 (CUDA GPU Support - External Provider Binaries); 22 bugs found, will not merge

Design decisions

  • Backend-only restart, not full app restart — Frontend survives, UI state preserved
  • No provider/subprocess architecture — One monolithic process, just a different binary. Explicitly not the provider approach from PR #33 (CUDA GPU Support - External Provider Binaries).
  • GitHub Releases only — Split parts with manifest, no external CDN needed
  • Data directory storage — CUDA binary persists across app updates, outside the app bundle
  • Switch to CPU = delete + restart — start_server always prefers CUDA if present, so switching to CPU removes the binary. User can re-download later. Avoids persistent config.
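
Because start_server always prefers a CUDA binary when one exists, selection needs no persisted config: the presence of the file is the state. A sketch (find_cuda_binary and the candidate file names are assumptions):

```python
from __future__ import annotations

from pathlib import Path

# Assumed binary names; the Windows build in CI produces a .exe variant.
CUDA_BINARY_NAMES = ("voicebox-server-cuda", "voicebox-server-cuda.exe")

def find_cuda_binary(data_dir: Path) -> Path | None:
    # Prefer a CUDA binary from {data_dir}/backends/ if one exists; deleting
    # this file and restarting is all "switch to CPU" has to do.
    backends = data_dir / "backends"
    for name in CUDA_BINARY_NAMES:
        candidate = backends / name
        if candidate.is_file():
            return candidate
    return None
```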

Testing checklist

  • voicebox-server-cuda --version prints correct version
  • Place CUDA binary in {data_dir}/backends/ → app auto-detects and uses it
  • Version mismatch → falls back to CPU
  • GpuAcceleration card shows correct state for each scenario
  • Download flow: POST triggers download, SSE progress, completion updates status
  • Switch to CUDA: restart works, /health shows backend_variant: "cuda"
  • Switch to CPU: deletes binary, restarts, shows backend_variant: "cpu"
  • Native GPU (macOS MPS): shows info message, no download section
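
Several of these checklist items reduce to polling /health after a restart and asserting on backend_variant. A stdlib-only helper for that kind of test might look like this (the backend itself uses httpx; wait_for_backend is a hypothetical test helper):

```python
import json
import time
import urllib.error
import urllib.request

def wait_for_backend(base_url: str, timeout_s: float = 30.0) -> dict:
    # Poll GET /health until the restarted backend answers, then return the
    # parsed payload so a test can assert on backend_variant ("cpu"/"cuda").
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=2) as resp:
                if resp.status == 200:
                    return json.loads(resp.read())
        except (urllib.error.URLError, OSError):
            pass  # server still restarting; retry
        time.sleep(0.2)
    raise TimeoutError(f"backend at {base_url} did not become healthy")
```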

Summary by CodeRabbit

  • New Features
    • GPU Acceleration panel: view GPU/VRAM, stream download progress, download/remove CUDA backend, switch between CPU/CUDA, and restart the server to apply changes (desktop).
  • Chores
    • Added CI workflow to build CUDA-enabled server binaries and publish artifacts.
    • Added a utility to split large binaries for release distribution.
  • Documentation
    • New CUDA implementation and project roadmap docs added.

@coderabbitai bot commented Mar 12, 2026

📝 Walkthrough

Adds end-to-end CUDA backend support: CI to build/split CUDA server binaries, backend endpoints and download/verify/streaming logic, frontend UI and API hooks to manage CUDA binaries, and Tauri commands to restart and launch the CUDA variant.

Changes

Cohort / File(s) — Summary

  • CI & Build Tools (.github/workflows/build-cuda.yml, backend/build_binary.py, scripts/split_binary.py) — New GitHub Actions workflow to build the CUDA binary, a modified build script supporting --cuda binary naming, and a splitter utility to produce Release-friendly chunked artifacts.
  • Backend: CUDA management (backend/cuda_download.py, backend/main.py, backend/models.py, backend/server.py, backend/requirements.txt) — New module to download/assemble/verify the CUDA binary, SSE progress streaming and endpoints (/backend/cuda-status, /backend/download-cuda, /backend/cuda, /backend/cuda-progress), health response extended with backend_variant, and the httpx dependency added.
  • Frontend types & client (app/src/lib/api/types.ts, app/src/lib/api/client.ts) — Added CudaStatus and CudaDownloadProgress types; extended HealthResponse with gpu_type, backend_type, backend_variant; added client methods getCudaStatus(), downloadCudaBackend(), deleteCudaBackend().
  • Frontend UI (app/src/components/ServerSettings/GpuAcceleration.tsx, app/src/components/ServerTab/ServerTab.tsx) — New GPU Acceleration UI component with SSE progress, download/delete/restart handlers and aggressive health polling; integrated into ServerTab for Tauri platforms.
  • Platform lifecycle / Tauri integration (app/src/platform/types.ts, tauri/src/platform/lifecycle.ts, web/src/platform/lifecycle.ts, tauri/src-tauri/src/main.rs) — Added restartServer() to the platform lifecycle; Tauri implements the restart_server command, and startup now detects/validates/launches the CUDA binary (version check) or falls back to CPU; the web lifecycle restart is a no-op.
  • Runtime entry & versioning (backend/server.py) — Added --version flag and runtime setting of VOICEBOX_BACKEND_VARIANT based on binary name (cuda vs cpu).
  • Documentation & planning (docs/plans/*, new files) — New and expanded docs describing the CUDA backend swap design, implementation, review, and project status.
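
The runtime variant detection in server.py can be sketched as below. Note the explicit assignment instead of os.environ.setdefault(), matching the reviewer's point that a stale VOICEBOX_BACKEND_VARIANT=cuda inherited from a parent process must not survive a CPU launch; detect_backend_variant is an illustrative name:

```python
import os
import sys
from pathlib import Path

def detect_backend_variant() -> str:
    # When frozen by PyInstaller, sys.executable is the shipped binary
    # (e.g. voicebox-server-cuda); under plain Python it is the interpreter.
    binary_name = Path(sys.executable).name.lower()
    variant = "cuda" if "cuda" in binary_name else "cpu"
    # Explicit assignment: setdefault() would keep an inherited "cuda" value
    # even when the running binary is the CPU build.
    os.environ["VOICEBOX_BACKEND_VARIANT"] = variant
    return variant
```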

Sequence Diagram

sequenceDiagram
    participant User
    participant UI as GpuAcceleration<br/>Component
    participant Client as Frontend<br/>API Client
    participant Backend as Backend<br/>API
    participant Storage as Release/R2<br/>Storage
    participant Tauri as Tauri<br/>Runtime

    User->>UI: Click "Download CUDA"
    UI->>Client: downloadCudaBackend()
    Client->>Backend: POST /backend/download-cuda
    Backend->>Client: { progress_key }
    Client->>Backend: EventSource /backend/cuda-progress
    Storage-->>Backend: Serve parts / manifest
    Backend->>Backend: Assemble & verify binary
    Backend-->>Client: SSE progress events / complete

    User->>UI: Click "Restart Server"
    UI->>Tauri: restartServer()
    Tauri->>Tauri: Stop current backend, wait for port release
    Tauri->>Tauri: Check data_dir/backends for CUDA binary
    Tauri->>Tauri: Verify version (--version)
    Tauri->>Tauri: Launch CUDA binary (or fallback)
    Tauri-->>UI: Return server URL/result
    Client->>Backend: Poll /health until ready
    Backend-->>Client: { backend_variant: "cuda", gpu_type: "..." }

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I nibbled bytes in moonlit code,

split the binaries down the road,
stitched CUDA pieces, set them free,
then hopped to restart the server tree —
faster hops, a GPU jubilee!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 76.92%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed — Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed — The title accurately describes the main change: adding CUDA backend swap functionality via binary download and restart.


Add the ability to download a CUDA-enabled backend binary (~2.4 GB) and
swap it in via a backend-only restart, solving the #1 user pain point
(19 open 'GPU not detected' issues caused by GitHub's 2 GB asset limit).

Backend:
- cuda_download.py: download from R2 (primary) or GitHub split-parts
  (fallback), SHA-256 verification, atomic writes, progress via SSE
- 4 new endpoints: GET/POST/DELETE /backend/cuda-*, GET cuda-progress
- server.py: --version flag, auto-detect variant from binary name
- build_binary.py: --cuda flag for CUDA PyInstaller builds
- split_binary.py: split large binaries into <2GB GitHub Release assets
- CI workflow for building CUDA binary

Tauri:
- restart_server command (stop -> wait -> start)
- start_server prefers CUDA binary from {data_dir}/backends/ if present
- Version mismatch check: runs --version before launching CUDA binary

Frontend:
- GpuAcceleration component: download, progress, restart, switch, delete
- API client + types for CUDA status and management
- Platform lifecycle: restartServer() on Tauri/Web
- Aggressive 1s health polling during restart for fast reconnection
@jamiepine force-pushed the feat/cuda-backend-swap branch from 44df09a to dccded4 on March 12, 2026 at 15:05
@coderabbitai bot left a comment

Actionable comments posted: 7

🧹 Nitpick comments (12)
web/src/platform/lifecycle.ts (1)

18-21: Mirror startServer() callback semantics here.

startServer() invokes onServerReady, but restartServer() currently just returns the URL. Keeping both methods consistent avoids platform-specific behavior if shared UI ever relies on the restart path to mark the server as ready again.

🩹 Suggested fix
   async restartServer(): Promise<string> {
     // No-op for web - server is managed externally
-    return import.meta.env.VITE_SERVER_URL || 'http://localhost:17493';
+    const serverUrl = import.meta.env.VITE_SERVER_URL || 'http://localhost:17493';
+    this.onServerReady?.();
+    return serverUrl;
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@web/src/platform/lifecycle.ts` around lines 18 - 21, The restartServer()
implementation should mirror startServer() by invoking the onServerReady
callback instead of just returning the URL; update the async restartServer() in
lifecycle.ts to call the same onServerReady handler (the one used by
startServer()) with the resolved URL (import.meta.env.VITE_SERVER_URL ||
'http://localhost:17493') before returning it so both startServer() and
restartServer() have consistent callback semantics.
docs/plans/PR33_CUDA_PROVIDER_REVIEW.md (1)

25-37: Add language specifier to fenced code block.

The ASCII diagram block lacks a language identifier. While this is cosmetic for documentation, adding a specifier improves markdown lint compliance and rendering consistency.

Suggested fix
-```
+```text
 ┌──────────────────────────────────────┐
 │  Main App (~150MB Win/Lin, ~300 Mac) │
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/plans/PR33_CUDA_PROVIDER_REVIEW.md` around lines 25 - 37, The fenced
ASCII diagram block in the doc (the triple-backtick block containing the box
diagram) is missing a language specifier; update the opening fence from ``` to
```text so the block becomes ```text followed by the diagram (e.g., the block
starting with "┌──────────────────────────────────────┐" and the lines
describing "Main App", "HTTP (localhost)", and "Provider Binary"); this will
satisfy markdown linting and ensure consistent rendering.
app/src/lib/api/client.ts (1)

320-331: Remove debug logging before merge.

The added console.log statements appear to be debugging artifacts. Production code should not include verbose logging with timestamps for routine API calls.

Remove debug logs
   async triggerModelDownload(modelName: string): Promise<{ message: string }> {
-    console.log(
-      '[API] triggerModelDownload called for:',
-      modelName,
-      'at',
-      new Date().toISOString(),
-    );
     const result = await this.request<{ message: string }>('/models/download', {
       method: 'POST',
       body: JSON.stringify({ model_name: modelName } as ModelDownloadRequest),
     });
-    console.log('[API] triggerModelDownload response:', result);
     return result;
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/src/lib/api/client.ts` around lines 320 - 331, The debug console.log
calls inside the triggerModelDownload flow should be removed: delete the two
console.log statements that log "[API] triggerModelDownload called for:" with
the timestamp and "[API] triggerModelDownload response:" so the method only
calls this.request and returns the result; ensure the surrounding code in the
same function/method (the POST to '/models/download' and return of result)
remains unchanged and check for any other stray console.log debug lines in the
same file.
docs/plans/CUDA_BACKEND_SWAP_FINAL.md (1)

11-19: Add language specifier to fenced code blocks.

The workflow diagram and pseudocode blocks at lines 11 and 98 lack language identifiers. Adding text or plaintext improves linter compliance.

Suggested fix for line 11
-```
+```text
 User clicks "Download CUDA Backend" in Settings
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/plans/CUDA_BACKEND_SWAP_FINAL.md` around lines 11 - 19, The fenced code
blocks for the workflow ("User clicks \"Download CUDA Backend\" in Settings →
Backend downloads binary...") and the later pseudocode block are missing
language specifiers; update each opening fence to include a plaintext language
(e.g., replace ``` with ```text or ```plaintext) so the linter recognizes them
and formatting is preserved—apply this change to the block starting with 'User
clicks "Download CUDA Backend"' and the corresponding pseudocode block further
down.
app/src/components/ServerSettings/GpuAcceleration.tsx (3)

92-115: Health polling lacks cleanup for in-flight requests.

The interval-based polling creates a new apiClient.getHealth() request every second. If requests are slow, multiple requests can be in flight simultaneously. When the server comes back, the first successful response clears the interval, but other in-flight requests may still resolve and trigger duplicate state updates.

This is unlikely to cause visible bugs but could lead to redundant invalidateQueries() calls.

Add request deduplication
   const startHealthPolling = useCallback(() => {
     if (healthPollRef.current) return;
+    let isPolling = true;

     healthPollRef.current = setInterval(async () => {
+      if (!isPolling) return;
       try {
         const result = await apiClient.getHealth();
-        if (result.status === 'healthy') {
+        if (result.status === 'healthy' && isPolling) {
+          isPolling = false;
           // Server is back up
           if (healthPollRef.current) {
             clearInterval(healthPollRef.current);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/src/components/ServerSettings/GpuAcceleration.tsx` around lines 92 - 115,
The polling loop in startHealthPolling spawns concurrent apiClient.getHealth()
calls and doesn't cancel or dedupe in-flight requests; add a mechanism (e.g., an
inFlightRequest ref or AbortController per request) to skip starting a new
request if one is pending and to abort any pending request when you clear the
interval; update the healthPollRef cleanup to both
clearInterval(healthPollRef.current) and cancel the current in-flight request,
and ensure the apiClient.getHealth call respects cancellation so that subsequent
resolving requests cannot cause duplicate setRestartPhase or
queryClient.invalidateQueries calls.

32-38: Clarify refetchInterval logic.

The condition cudaStatusLoading ? false : 10000 disables refetch while loading. However, cudaStatusLoading is only true during the initial load, not during background refetches. After the first successful fetch, cudaStatusLoading becomes false permanently, so this condition effectively always evaluates to 10000.

If the intent is to pause polling during active operations, consider using a dedicated state flag.

Clarification
   const {
     data: cudaStatus,
     isLoading: cudaStatusLoading,
     refetch: refetchCudaStatus,
   } = useQuery({
     queryKey: ['cuda-status', serverUrl],
     queryFn: () => apiClient.getCudaStatus(),
-    refetchInterval: cudaStatusLoading ? false : 10000,
+    refetchInterval: 10000, // Poll every 10s when enabled
     retry: 1,
-    enabled: !!health, // Only fetch when backend is reachable
+    enabled: !!health && restartPhase === 'idle', // Pause during restart
   });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/src/components/ServerSettings/GpuAcceleration.tsx` around lines 32 - 38,
The refetchInterval currently uses cudaStatusLoading which is only true during
the initial load, so after the first fetch the condition always returns 10000;
change the logic to pause polling using a state or react-query's fetch status
instead. Replace the use of cudaStatusLoading in refetchInterval with a
dedicated flag (e.g., isPollingPaused) or the query status flag (e.g.,
isFetching / isRefetching returned by useQuery) so polling is actually paused
during active operations; update the refetchInterval to use that flag
(refetchInterval: isPollingPaused ? false : 10000 or refetchInterval: isFetching
? false : 10000) and set/clear the isPollingPaused flag around the operations
that should suspend polling. Ensure references to useQuery, refetchInterval,
cudaStatusLoading, and apiClient.getCudaStatus are updated accordingly.

56-90: EventSource uses captured serverUrl — may become stale after restart.

The serverUrl is read once when the component mounts (line 19). If the server restarts and the URL changes (unlikely but possible in some configurations), the EventSource at line 61 would use the stale URL.

This is a minor edge case since the port is typically fixed, but worth noting for robustness.

Consider reading the current URL from the store when creating the EventSource:

const currentUrl = useServerStore.getState().serverUrl;
const eventSource = new EventSource(`${currentUrl}/backend/cuda-progress`);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/src/components/ServerSettings/GpuAcceleration.tsx` around lines 56 - 90,
The EventSource is created using the captured serverUrl inside the useEffect
(and can become stale); change the EventSource creation to read the current
value from the store at time of creation (e.g., call
useServerStore.getState().serverUrl) rather than using the captured serverUrl
variable, validate it is non-null before constructing the EventSource, and keep
existing cleanup (eventSource.close()) and refetchCudaStatus/error handling in
the same useEffect (symbols: useEffect, serverUrl, useServerStore.getState,
EventSource, refetchCudaStatus, setDownloadProgress, setError).
.github/workflows/build-cuda.yml (1)

62-73: R2 upload failures are silently ignored.

If the aws s3 cp command fails (e.g., due to credential issues or network problems), the workflow continues without error. This could result in a release where the GitHub Release has split parts but R2 has no full binary.

Add error handling
       - name: Upload full binary to R2
         if: startsWith(github.ref, 'refs/tags/')
         shell: bash
         run: |
+          set -e
           VERSION="${GITHUB_REF#refs/tags/}"
           aws s3 cp backend/dist/voicebox-server-cuda.exe \
             "s3://voicebox-downloads/cuda/${VERSION}/voicebox-server-cuda.exe" \
             --endpoint-url "${{ secrets.R2_ENDPOINT }}"
+          echo "Successfully uploaded to R2: cuda/${VERSION}/voicebox-server-cuda.exe"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/build-cuda.yml around lines 62 - 73, The "Upload full
binary to R2" job step currently runs the aws s3 cp command without failing the
workflow on errors; update the step's run block to fail fast by enabling strict
bash error handling (e.g., add set -euo pipefail at start) or append an explicit
check to the aws s3 cp invocation (e.g., aws s3 cp ... || { echo "R2 upload
failed" >&2; exit 1; }) so any upload error causes the
job to fail and surface credentials/network errors; target the "Upload full
binary to R2" step and the aws s3 cp command in the run block when applying the
change.
backend/cuda_download.py (2)

136-142: Use logging.exception to capture traceback on failure.

When the download fails, logging.error loses the stack trace. Using logging.exception preserves the full traceback for debugging.

📝 Proposed fix
     except Exception as e:
         # Clean up on failure
         if temp_path.exists():
             temp_path.unlink()
-        logger.error(f"CUDA backend download failed: {e}")
+        logger.exception(f"CUDA backend download failed: {e}")
         progress.mark_error(PROGRESS_KEY, str(e))
         raise
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/cuda_download.py` around lines 136 - 142, The except block in
backend/cuda_download.py currently logs the error with logger.error which
discards the traceback; update the handler (in the except Exception as e: block
that cleans up temp_path, calls progress.mark_error(PROGRESS_KEY, str(e)) and
re-raises) to call logger.exception(...) instead of logger.error(...) so the
full traceback is recorded while preserving the temp_path cleanup,
progress.mark_error call, and the raise.

177-212: No error handling for individual part download failures.

If a part download fails (non-404 HTTP error, network issue), response.raise_for_status() raises an exception, but the partial file is left on disk. The cleanup in the caller handles this, but the error message won't indicate which part failed.

Consider wrapping with more context:

💡 Proposed enhancement
                 async with client.stream("GET", part_url) as response:
-                    response.raise_for_status()
+                    try:
+                        response.raise_for_status()
+                    except httpx.HTTPStatusError as e:
+                        raise RuntimeError(f"Failed to download part {i+1}/{len(parts)}: {part_name}") from e
                     async for chunk in response.aiter_bytes(chunk_size=1024 * 1024):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/cuda_download.py` around lines 177 - 212, _download_split_parts
currently lets exceptions from response.raise_for_status() bubble up with no
per-part context and leaves only a generic error; wrap the per-part download
(the async with client.stream(...) block inside the for loop that writes to open
file f) in a try/except Exception as e to catch failures for each part_name,
call progress.update_progress(PROGRESS_KEY, current=total_downloaded, total=0,
filename=f"Part {i+1}/{len(parts)}", status="failed") and logger.error or
logger.exception including the part_name and version, then re-raise a new
RuntimeError or raise from the caught exception with a message like f"Failed
downloading part {part_name} for {get_cuda_binary_name()}@{version}" so the
caller gets clear context while allowing existing cleanup to run.
docs/plans/PROJECT_STATUS.md (1)

22-43: Add language specifier to fenced code blocks.

The ASCII diagram code blocks are missing language specifiers. While these are architecture diagrams rather than code, adding a language hint (e.g., text or plaintext) satisfies markdown linters and improves consistency.

📝 Proposed fix
-```
+```text
 ┌─────────────────────────────────────────────────────┐
 │  Tauri Shell (Rust)                                 │

Apply similarly at line 64.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/plans/PROJECT_STATUS.md` around lines 22 - 43, Add a language specifier
to the fenced ASCII diagram blocks in PROJECT_STATUS.md so markdown linters
accept them: locate the fenced block that starts with the ASCII diagram
containing "Tauri Shell (Rust)" / "FastAPI Backend (backend/)" and the similar
block around line 64, and change their opening fences from ``` to a fenced code
block with a plain text hint (e.g., ```text or ```plaintext) so the diagram is
treated as plain text.
docs/plans/CUDA_BACKEND_SWAP.md (1)

105-128: Document CUDA version requirements prominently.

The CI workflow example pins to CUDA 12.1 (cu121). Consider adding a note in the Problem or Solution section about supported CUDA versions and driver requirements, as this affects which NVIDIA GPUs will work.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/plans/CUDA_BACKEND_SWAP.md` around lines 105 - 128, The CI workflow
example currently pins to CUDA 12.1 via the pip line "pip install torch
--index-url https://download.pytorch.org/whl/cu121"; update the docs' Problem or
Solution sections to prominently state supported CUDA versions and minimum
NVIDIA driver requirements, call out that the example uses CUDA 12.1 (cu121) so
users with older/newer GPUs may need different wheels, and add guidance or links
for selecting compatible torch/cu versions and verifying driver compatibility so
readers know which NVIDIA GPUs will work with the CUDA backend.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/build-cuda.yml:
- Around line 35-36: The YAML step "Verify CUDA support in torch" currently
embeds a Python one-liner containing colons which breaks YAML parsing; fix by
quoting the entire run string or converting it to a multi-line script block so
the colons are not interpreted as YAML mappings (i.e., replace the inline run
value with a quoted single-line string or a | multi-line block containing the
python -c command), ensuring the step name "Verify CUDA support in torch" and
its run content are preserved.

In `@backend/cuda_download.py`:
- Around line 229-237: The loop over checksum_urls that does "resp = await
client.get(url)" currently swallows all exceptions; change the except block to
catch Exception as e and log the failure (including the URL and exception
details/stack) before continuing so failures (network errors, parsing issues)
are visible; also log non-200 responses or malformed resp.text when expected_sha
can't be parsed (reference the checksum_urls loop, client.get call,
resp.status_code check, and expected_sha extraction) so you have clear
diagnostics for each attempted URL.

In `@backend/main.py`:
- Around line 1772-1780: The created background task for _download is currently
discarded and its errors are logged with logging.error (losing traceback) and
never surfaced to the frontend; change to store the returned task from
asyncio.create_task into a shared registry (e.g., a global dict like
backend_tasks or backend_errors) keyed by "cuda-backend", attach a
task.add_done_callback to the Task to call logging.exception on any exception
raised by cuda_download.download_cuda_binary (or remove the inner try and let
the callback handle exceptions), and write the exception/result into the shared
registry so the SSE/progress endpoint (which reads progress_key "cuda-backend")
can surface the failure to the user if they connect after the failure. Ensure
references in your changes include the async def _download function,
cuda_download.download_cuda_binary, asyncio.create_task, and the progress_key
"cuda-backend".

In `@backend/server.py`:
- Around line 67-77: Move the --version handling so argument parsing happens
before the heavy module-level imports (avoid importing backend.main or DB stack
when only printing version); parse args early using the existing
parser.add_argument("--version", action="store_true", ...) and immediately check
args.version, and if true import only backend.__version__ (not backend.main) and
print then sys.exit(0); ensure no other module-level imports that pull in
ML/runtime dependencies run before this check.
- Around line 79-88: The code in server.py uses
os.environ.setdefault("VOICEBOX_BACKEND_VARIANT", "cpu") which preserves an
inherited CUDA value when the current binary name indicates CPU; change the
logic in the startup detection (where binary_name is derived from
sys.executable) to explicitly set os.environ["VOICEBOX_BACKEND_VARIANT"] = "cpu"
when "cuda" is not in binary_name so a CPU binary cannot report an inherited
"cuda" value (update the block that checks binary_name and sets
VOICEBOX_BACKEND_VARIANT accordingly).

In `@scripts/split_binary.py`:
- Around line 21-38: The current split function reads the entire file into
memory (data = input_path.read_bytes()), which must be replaced with a streaming
approach: open the input file in binary mode, create a hashlib.sha256() object
and update it incrementally as you read fixed-size blocks, and simultaneously
write each block to successive part files (use a counter for part_index instead
of len(parts)). Keep output_dir.mkdir(...) but remove the large in-memory data
variable; after the read loop finalize the hex digest and write checksum_file
with f"{sha256}  {input_path.name}\n". Ensure part naming logic
(f"{input_path.stem}.part{part_index:02d}{input_path.suffix}") and writing uses
the chunk_size argument and that files are written in binary mode so memory
usage stays bounded.

In `@tauri/src-tauri/src/main.rs`:
- Around line 612-629: The 1s sleep in restart_server is fragile; replace it
with a retry loop that waits for the server port to be free before calling
start_server: after stop_server(state.clone()).await?, poll up to N times (e.g.,
10 attempts) using tokio::time::sleep with a short backoff (e.g., 200-500ms) and
on each attempt check port availability by trying to bind a TcpListener to the
server port (implement a small helper like try_bind_port(port) that returns
Ok(()) if bind succeeds then immediately drop the listener), only proceed to
call start_server(app, state, None). If start_server fails after the retries
return the error upward (i.e., propagate the Err<String>) so the frontend can
retry; keep function name restart_server and use stop_server and start_server
symbols as-is.

---

Nitpick comments:
In @.github/workflows/build-cuda.yml:
- Around line 62-73: The "Upload full binary to R2" step currently runs the
aws s3 cp command without failing the workflow on errors; update the step's run
block to fail fast by enabling strict bash error handling (add set -euo
pipefail at the start) or append an explicit check to the aws s3 cp invocation
(e.g., aws s3 cp ... || { echo "R2 upload failed" >&2; exit 1; }) so any
upload error causes the job to fail and surfaces credentials/network errors;
target the "Upload full binary to R2" step and the aws s3 cp command in the
run block when applying the change.

In `@app/src/components/ServerSettings/GpuAcceleration.tsx`:
- Around line 92-115: The polling loop in startHealthPolling spawns concurrent
apiClient.getHealth() calls and doesn't cancel or dedupe in-flight requests; add
a mechanism (e.g., an inFlightRequest ref or AbortController per request) to
skip starting a new request if one is pending and to abort any pending request
when you clear the interval; update the healthPollRef cleanup to both
clearInterval(healthPollRef.current) and cancel the current in-flight request,
and ensure the apiClient.getHealth call respects cancellation so that subsequent
resolving requests cannot cause duplicate setRestartPhase or
queryClient.invalidateQueries calls.
- Around line 32-38: The refetchInterval currently uses cudaStatusLoading which
is only true during the initial load, so after the first fetch the condition
always returns 10000; change the logic to pause polling using a state or
react-query's fetch status instead. Replace the use of cudaStatusLoading in
refetchInterval with a dedicated flag (e.g., isPollingPaused) or the query
status flag (e.g., isFetching / isRefetching returned by useQuery) so polling is
actually paused during active operations; update the refetchInterval to use that
flag (refetchInterval: isPollingPaused ? false : 10000 or refetchInterval:
isFetching ? false : 10000) and set/clear the isPollingPaused flag around the
operations that should suspend polling. Ensure references to useQuery,
refetchInterval, cudaStatusLoading, and apiClient.getCudaStatus are updated
accordingly.
- Around line 56-90: The EventSource is created using the captured serverUrl
inside the useEffect (and can become stale); change the EventSource creation to
read the current value from the store at time of creation (e.g., call
useServerStore.getState().serverUrl) rather than using the captured serverUrl
variable, validate it is non-null before constructing the EventSource, and keep
existing cleanup (eventSource.close()) and refetchCudaStatus/error handling in
the same useEffect (symbols: useEffect, serverUrl, useServerStore.getState,
EventSource, refetchCudaStatus, setDownloadProgress, setError).

In `@app/src/lib/api/client.ts`:
- Around line 320-331: The debug console.log calls inside the
triggerModelDownload flow should be removed: delete the two console.log
statements that log "[API] triggerModelDownload called for:" with the timestamp
and "[API] triggerModelDownload response:" so the method only calls this.request
and returns the result; ensure the surrounding code in the same function/method
(the POST to '/models/download' and return of result) remains unchanged and
check for any other stray console.log debug lines in the same file.

In `@backend/cuda_download.py`:
- Around line 136-142: The except block in backend/cuda_download.py currently
logs the error with logger.error which discards the traceback; update the
handler (in the except Exception as e: block that cleans up temp_path, calls
progress.mark_error(PROGRESS_KEY, str(e)) and re-raises) to call
logger.exception(...) instead of logger.error(...) so the full traceback is
recorded while preserving the temp_path cleanup, progress.mark_error call, and
the raise.
- Around line 177-212: _download_split_parts currently lets exceptions from
response.raise_for_status() bubble up with no per-part context and leaves only a
generic error; wrap the per-part download (the async with client.stream(...)
block inside the for loop that writes to open file f) in a try/except Exception
as e to catch failures for each part_name, call
progress.update_progress(PROGRESS_KEY, current=total_downloaded, total=0,
filename=f"Part {i+1}/{len(parts)}", status="failed") and logger.error or
logger.exception including the part_name and version, then re-raise a new
RuntimeError or raise from the caught exception with a message like f"Failed
downloading part {part_name} for {get_cuda_binary_name()}@{version}" so the
caller gets clear context while allowing existing cleanup to run.
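A minimal runnable sketch of both suggestions above (`download_part` and `download_all` are stand-ins, not the module's real API): `logger.exception` records the traceback, and `raise ... from e` attaches per-part context while keeping the original error reachable:

```python
import io
import logging

# Capture log output in memory so the effect is visible.
logger = logging.getLogger("cuda_download_demo")
stream = io.StringIO()
logger.addHandler(logging.StreamHandler(stream))
logger.setLevel(logging.ERROR)

def download_part(part_name: str) -> bytes:
    # Stand-in for the real per-part HTTP download; always fails for the demo.
    raise ConnectionError("connection reset by peer")

def download_all(parts):
    for i, part_name in enumerate(parts):
        try:
            download_part(part_name)
        except Exception as e:
            # logger.exception logs the message *and* the full traceback.
            logger.exception(f"Failed downloading part {part_name} ({i + 1}/{len(parts)})")
            # raise ... from e preserves the original error as __cause__.
            raise RuntimeError(f"Failed downloading part {part_name}") from e

try:
    download_all(["voicebox-server-cuda.part00"])
except RuntimeError as err:
    caught = err
```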

In `@docs/plans/CUDA_BACKEND_SWAP_FINAL.md`:
- Around line 11-19: The fenced code blocks for the workflow ("User clicks
\"Download CUDA Backend\" in Settings → Backend downloads binary...") and the
later pseudocode block are missing language specifiers; update each opening
fence to include a plaintext language (e.g., replace ``` with ```text or
```plaintext) so the linter recognizes them and formatting is preserved—apply
this change to the block starting with 'User clicks "Download CUDA Backend"' and
the corresponding pseudocode block further down.

In `@docs/plans/CUDA_BACKEND_SWAP.md`:
- Around line 105-128: The CI workflow example currently pins to CUDA 12.1 via
the pip line "pip install torch --index-url
https://download.pytorch.org/whl/cu121"; update the docs' Problem or Solution
sections to prominently state supported CUDA versions and minimum NVIDIA driver
requirements, call out that the example uses CUDA 12.1 (cu121) so users with
older/newer GPUs may need different wheels, and add guidance or links for
selecting compatible torch/cu versions and verifying driver compatibility so
readers know which NVIDIA GPUs will work with the CUDA backend.

In `@docs/plans/PR33_CUDA_PROVIDER_REVIEW.md`:
- Around line 25-37: The fenced ASCII diagram block in the doc (the
triple-backtick block containing the box diagram) is missing a language
specifier; update the opening fence from ``` to ```text so the block becomes
```text followed by the diagram (e.g., the block starting with
"┌──────────────────────────────────────┐" and the lines describing "Main App",
"HTTP (localhost)", and "Provider Binary"); this will satisfy markdown linting
and ensure consistent rendering.

In `@docs/plans/PROJECT_STATUS.md`:
- Around line 22-43: Add a language specifier to the fenced ASCII diagram blocks
in PROJECT_STATUS.md so markdown linters accept them: locate the fenced block
that starts with the ASCII diagram containing "Tauri Shell (Rust)" / "FastAPI
Backend (backend/)" and the similar block around line 64, and change their
opening fences from ``` to a fenced code block with a plain text hint (e.g.,
```text or ```plaintext) so the diagram is treated as plain text.

In `@web/src/platform/lifecycle.ts`:
- Around line 18-21: The restartServer() implementation should mirror
startServer() by invoking the onServerReady callback instead of just returning
the URL; update the async restartServer() in lifecycle.ts to call the same
onServerReady handler (the one used by startServer()) with the resolved URL
(import.meta.env.VITE_SERVER_URL || 'http://localhost:17493') before returning
it so both startServer() and restartServer() have consistent callback semantics.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6d55ea08-83e7-44ad-82ee-d774fb39ca09

📥 Commits

Reviewing files that changed from the base of the PR and between 38bf96f and 44df09a.

📒 Files selected for processing (20)
  • .github/workflows/build-cuda.yml
  • app/src/components/ServerSettings/GpuAcceleration.tsx
  • app/src/components/ServerTab/ServerTab.tsx
  • app/src/lib/api/client.ts
  • app/src/lib/api/types.ts
  • app/src/platform/types.ts
  • backend/build_binary.py
  • backend/cuda_download.py
  • backend/main.py
  • backend/models.py
  • backend/requirements.txt
  • backend/server.py
  • docs/plans/CUDA_BACKEND_SWAP.md
  • docs/plans/CUDA_BACKEND_SWAP_FINAL.md
  • docs/plans/PR33_CUDA_PROVIDER_REVIEW.md
  • docs/plans/PROJECT_STATUS.md
  • scripts/split_binary.py
  • tauri/src-tauri/src/main.rs
  • tauri/src/platform/lifecycle.ts
  • web/src/platform/lifecycle.ts

Comment on lines +229 to +237
for url in checksum_urls:
    try:
        resp = await client.get(url)
        if resp.status_code == 200:
            # Format: "sha256hex  filename\n"
            expected_sha = resp.text.strip().split()[0]
            break
    except Exception:
        continue

⚠️ Potential issue | 🟡 Minor

Log exceptions when checksum fetch fails.

Silent except Exception: continue swallows errors without any visibility. If both checksum URLs fail for non-404 reasons (e.g., network issues, malformed response), there's no indication of what went wrong.

🔍 Proposed fix
             try:
                 resp = await client.get(url)
                 if resp.status_code == 200:
                     # Format: "sha256hex  filename\n"
                     expected_sha = resp.text.strip().split()[0]
                     break
-            except Exception:
+            except Exception as e:
+                logger.debug(f"Failed to fetch checksum from {url}: {e}")
                 continue
🧰 Tools
🪛 Ruff (0.15.5)

[error] 236-237: try-except-continue detected, consider logging the exception

(S112)


[warning] 236-236: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/cuda_download.py` around lines 229 - 237, The loop over checksum_urls
that does "resp = await client.get(url)" currently swallows all exceptions;
change the except block to catch Exception as e and log the failure (including
the URL and exception details/stack) before continuing so failures (network
errors, parsing issues) are visible; also log non-200 responses or malformed
resp.text when expected_sha can't be parsed (reference the checksum_urls loop,
client.get call, resp.status_code check, and expected_sha extraction) so you
have clear diagnostics for each attempted URL.

Comment on lines +1772 to +1780
async def _download():
    try:
        await cuda_download.download_cuda_binary()
    except Exception as e:
        import logging
        logging.getLogger(__name__).error(f"CUDA download failed: {e}")

asyncio.create_task(_download())
return {"message": "CUDA backend download started", "progress_key": "cuda-backend"}

⚠️ Potential issue | 🟡 Minor

Store task reference and improve error propagation.

The static analysis correctly identified issues here:

  1. asyncio.create_task() return value is discarded — if the task fails, the exception is silently swallowed after logging
  2. Using logging.error instead of logging.exception loses the traceback

More importantly, if the download fails before the frontend connects to the SSE stream, the error is logged but never surfaced to the user.

Suggested improvements
+    _cuda_download_task: asyncio.Task | None = None  # Module-level or use a proper task registry

     async def _download():
         try:
             await cuda_download.download_cuda_binary()
         except Exception as e:
-            import logging
-            logging.getLogger(__name__).error(f"CUDA download failed: {e}")
+            import logging
+            logging.getLogger(__name__).exception("CUDA download failed")
+            # Ensure error is propagated to progress manager for SSE consumers
+            from .utils.progress import get_progress_manager
+            get_progress_manager().update_progress(
+                model_name="cuda-backend",
+                current=0,
+                total=0,
+                filename="",
+                status="error",
+                error=str(e),
+            )

-    asyncio.create_task(_download())
+    task = asyncio.create_task(_download())
+    task.add_done_callback(lambda t: t.exception() if not t.cancelled() else None)
🧰 Tools
🪛 Ruff (0.15.5)

[warning] 1775-1775: Do not catch blind exception: Exception

(BLE001)


[warning] 1777-1777: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


[warning] 1779-1779: Store a reference to the return value of asyncio.create_task

(RUF006)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/main.py` around lines 1772 - 1780, The created background task for
_download is currently discarded and its errors are logged with logging.error
(losing traceback) and never surfaced to the frontend; change to store the
returned task from asyncio.create_task into a shared registry (e.g., a global
dict like backend_tasks or backend_errors) keyed by "cuda-backend", attach a
task.add_done_callback to the Task to call logging.exception on any exception
raised by cuda_download.download_cuda_binary (or remove the inner try and let
the callback handle exceptions), and write the exception/result into the shared
registry so the SSE/progress endpoint (which reads progress_key "cuda-backend")
can surface the failure to the user if they connect after the failure. Ensure
references in your changes include the async def _download function,
cuda_download.download_cuda_binary, asyncio.create_task, and the progress_key
"cuda-backend".

Comment on lines +79 to +88
# Detect backend variant from binary name
# voicebox-server-cuda → sets VOICEBOX_BACKEND_VARIANT=cuda
import os
binary_name = os.path.basename(sys.executable).lower()
if "cuda" in binary_name:
    os.environ["VOICEBOX_BACKEND_VARIANT"] = "cuda"
    logger.info("Backend variant: CUDA")
else:
    os.environ.setdefault("VOICEBOX_BACKEND_VARIANT", "cpu")
    logger.info(f"Backend variant: {os.environ['VOICEBOX_BACKEND_VARIANT']}")

⚠️ Potential issue | 🟠 Major

Don’t preserve an inherited CUDA variant for CPU binaries.

Line 87 uses setdefault, so an inherited VOICEBOX_BACKEND_VARIANT=cuda leaves the CPU binary reporting itself as CUDA. Since backend/main.py:200-210 surfaces this value in health, the UI can show the wrong backend after a CPU restart.

🩹 Suggested fix
         if "cuda" in binary_name:
             os.environ["VOICEBOX_BACKEND_VARIANT"] = "cuda"
             logger.info("Backend variant: CUDA")
         else:
-            os.environ.setdefault("VOICEBOX_BACKEND_VARIANT", "cpu")
+            os.environ["VOICEBOX_BACKEND_VARIANT"] = "cpu"
             logger.info(f"Backend variant: {os.environ['VOICEBOX_BACKEND_VARIANT']}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/server.py` around lines 79 - 88, The code in server.py uses
os.environ.setdefault("VOICEBOX_BACKEND_VARIANT", "cpu") which preserves an
inherited CUDA value when the current binary name indicates CPU; change the
logic in the startup detection (where binary_name is derived from
sys.executable) to explicitly set os.environ["VOICEBOX_BACKEND_VARIANT"] = "cpu"
when "cuda" is not in binary_name so a CPU binary cannot report an inherited
"cuda" value (update the block that checks binary_name and sets
VOICEBOX_BACKEND_VARIANT accordingly).
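The setdefault pitfall is easy to demonstrate in isolation (this simulates the inherited environment of a child process; it is not the live backend code):

```python
import os

# Simulate a CPU binary that inherited the variant from a CUDA parent process.
os.environ["VOICEBOX_BACKEND_VARIANT"] = "cuda"

# setdefault only writes when the key is absent, so the stale value survives...
os.environ.setdefault("VOICEBOX_BACKEND_VARIANT", "cpu")
after_setdefault = os.environ["VOICEBOX_BACKEND_VARIANT"]  # still "cuda"

# ...while plain assignment always overwrites it.
os.environ["VOICEBOX_BACKEND_VARIANT"] = "cpu"
after_assignment = os.environ["VOICEBOX_BACKEND_VARIANT"]
```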

Comment on lines +21 to +38
def split(input_path: Path, chunk_size: int, output_dir: Path):
    output_dir.mkdir(parents=True, exist_ok=True)
    data = input_path.read_bytes()
    total_size = len(data)

    # Write SHA-256 of the complete file
    sha256 = hashlib.sha256(data).hexdigest()
    checksum_file = output_dir / f"{input_path.stem}.sha256"
    checksum_file.write_text(f"{sha256}  {input_path.name}\n")

    # Split into chunks
    parts = []
    for i in range(0, total_size, chunk_size):
        part_index = len(parts)
        part_name = f"{input_path.stem}.part{part_index:02d}{input_path.suffix}"
        part_path = output_dir / part_name
        part_path.write_bytes(data[i:i + chunk_size])
        parts.append(part_name)

⚠️ Potential issue | 🟠 Major

Stream the input file instead of read_bytes().

Lines 23-37 load the whole CUDA binary into memory before hashing and splitting. For the ~2.4 GB artifact in this PR, that can easily exhaust RAM or heavily thrash CI runners. Hash and write parts incrementally from a buffered stream.

🩹 Suggested direction
 def split(input_path: Path, chunk_size: int, output_dir: Path):
     output_dir.mkdir(parents=True, exist_ok=True)
-    data = input_path.read_bytes()
-    total_size = len(data)
+    total_size = input_path.stat().st_size
+    hasher = hashlib.sha256()
+    parts = []
+
+    with input_path.open("rb") as src:
+        part_index = 0
+        current_size = 0
+        part_file = None
+        try:
+            while chunk := src.read(8 * 1024 * 1024):
+                hasher.update(chunk)
+                offset = 0
+                while offset < len(chunk):
+                    if part_file is None or current_size == chunk_size:
+                        if part_file is not None:
+                            part_file.close()
+                        part_name = f"{input_path.stem}.part{part_index:02d}{input_path.suffix}"
+                        parts.append(part_name)
+                        part_file = (output_dir / part_name).open("wb")
+                        part_index += 1
+                        current_size = 0
+
+                    write_size = min(chunk_size - current_size, len(chunk) - offset)
+                    part_file.write(chunk[offset:offset + write_size])
+                    current_size += write_size
+                    offset += write_size
+        finally:
+            if part_file is not None:
+                part_file.close()
 
     # Write SHA-256 of the complete file
-    sha256 = hashlib.sha256(data).hexdigest()
+    sha256 = hasher.hexdigest()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def split(input_path: Path, chunk_size: int, output_dir: Path):
    output_dir.mkdir(parents=True, exist_ok=True)
    data = input_path.read_bytes()
    total_size = len(data)

    # Write SHA-256 of the complete file
    sha256 = hashlib.sha256(data).hexdigest()
    checksum_file = output_dir / f"{input_path.stem}.sha256"
    checksum_file.write_text(f"{sha256}  {input_path.name}\n")

    # Split into chunks
    parts = []
    for i in range(0, total_size, chunk_size):
        part_index = len(parts)
        part_name = f"{input_path.stem}.part{part_index:02d}{input_path.suffix}"
        part_path = output_dir / part_name
        part_path.write_bytes(data[i:i + chunk_size])
        parts.append(part_name)

def split(input_path: Path, chunk_size: int, output_dir: Path):
    output_dir.mkdir(parents=True, exist_ok=True)
    total_size = input_path.stat().st_size
    hasher = hashlib.sha256()
    parts = []

    with input_path.open("rb") as src:
        part_index = 0
        current_size = 0
        part_file = None
        try:
            while chunk := src.read(8 * 1024 * 1024):
                hasher.update(chunk)
                offset = 0
                while offset < len(chunk):
                    if part_file is None or current_size == chunk_size:
                        if part_file is not None:
                            part_file.close()
                        part_name = f"{input_path.stem}.part{part_index:02d}{input_path.suffix}"
                        parts.append(part_name)
                        part_file = (output_dir / part_name).open("wb")
                        part_index += 1
                        current_size = 0

                    write_size = min(chunk_size - current_size, len(chunk) - offset)
                    part_file.write(chunk[offset:offset + write_size])
                    current_size += write_size
                    offset += write_size
        finally:
            if part_file is not None:
                part_file.close()

    # Write SHA-256 of the complete file
    sha256 = hasher.hexdigest()
    checksum_file = output_dir / f"{input_path.stem}.sha256"
    checksum_file.write_text(f"{sha256}  {input_path.name}\n")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/split_binary.py` around lines 21 - 38, The current split function
reads the entire file into memory (data = input_path.read_bytes()), which must
be replaced with a streaming approach: open the input file in binary mode,
create a hashlib.sha256() object and update it incrementally as you read
fixed-size blocks, and simultaneously write each block to successive part files
(use a counter for part_index instead of len(parts)). Keep output_dir.mkdir(...)
but remove the large in-memory data variable; after the read loop finalize the
hex digest and write checksum_file with f"{sha256}  {input_path.name}\n". Ensure
part naming logic (f"{input_path.stem}.part{part_index:02d}{input_path.suffix}")
and writing uses the chunk_size argument and that files are written in binary
mode so memory usage stays bounded.

Comment on lines +612 to +629
#[command]
async fn restart_server(
    app: tauri::AppHandle,
    state: State<'_, ServerState>,
) -> Result<String, String> {
    println!("restart_server: stopping current server...");

    // Stop the current server
    stop_server(state.clone()).await?;

    // Wait for port to be released
    println!("restart_server: waiting for port release...");
    tokio::time::sleep(tokio::time::Duration::from_millis(1000)).await;

    // Start server again (will auto-detect CUDA binary)
    println!("restart_server: starting server...");
    start_server(app, state, None).await
}

⚠️ Potential issue | 🟡 Minor

1-second wait may be insufficient for reliable port release.

The restart_server command waits 1000ms between stopping and starting. On Windows, port release can take longer, especially if:

  • The process is slow to terminate
  • The OS holds the port in TIME_WAIT state
  • Antivirus software delays process cleanup

If the port isn't released, start_server may fail or connect to a stale/zombie process.

Consider either:

  1. Increasing the wait time (2-3 seconds)
  2. Adding a retry loop that checks port availability
  3. Returning an error if start fails so frontend can retry
Suggested improvement — retry with port check
 async fn restart_server(
     app: tauri::AppHandle,
     state: State<'_, ServerState>,
 ) -> Result<String, String> {
     println!("restart_server: stopping current server...");
     stop_server(state.clone()).await?;

-    // Wait for port to be released
-    println!("restart_server: waiting for port release...");
-    tokio::time::sleep(tokio::time::Duration::from_millis(1000)).await;
+    // Wait for port to be released with retry
+    println!("restart_server: waiting for port release...");
+    for attempt in 0..10 {
+        tokio::time::sleep(tokio::time::Duration::from_millis(300)).await;
+        use std::net::TcpStream;
+        if TcpStream::connect_timeout(
+            &format!("127.0.0.1:{}", SERVER_PORT).parse().unwrap(),
+            std::time::Duration::from_millis(100),
+        ).is_err() {
+            // Port is free
+            println!("Port released after {}ms", (attempt + 1) * 300);
+            break;
+        }
+        if attempt == 9 {
+            println!("Warning: Port still in use after 3s, proceeding anyway");
+        }
+    }

     println!("restart_server: starting server...");
     start_server(app, state, None).await
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tauri/src-tauri/src/main.rs` around lines 612 - 629, The 1s sleep in
restart_server is fragile; replace it with a retry loop that waits for the
server port to be free before calling start_server: after
stop_server(state.clone()).await?, poll up to N times (e.g., 10 attempts) using
tokio::time::sleep with a short backoff (e.g., 200-500ms) and on each attempt
check port availability by trying to bind a TcpListener to the server port
(implement a small helper like try_bind_port(port) that returns Ok(()) if bind
succeeds then immediately drop the listener), only proceed to call
start_server(app, state, None). If start_server fails after the retries return
the error upward (i.e., propagate the Err<String>) so the frontend can retry;
keep function name restart_server and use stop_server and start_server symbols
as-is.
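The bind-probe idea translates directly to other languages; here is a Python sketch of the same check (the helper name `port_is_free` is illustrative, mirroring the suggested `try_bind_port`):

```python
import socket

def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is currently bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))  # succeeds only if the port is unbound
            return True
        except OSError:
            return False

# A restart loop would poll port_is_free(SERVER_PORT) with a short sleep
# between attempts instead of a single fixed 1s wait.
```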

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tauri/src-tauri/src/main.rs (1)

231-265: ⚠️ Potential issue | 🟠 Major

Defer CPU sidecar resolution until the fallback branch.

This returns early if sidecar("voicebox-server") fails, so the CUDA path at Line 277 is never attempted even when a valid downloaded CUDA binary is present. Right now, CUDA startup still depends on the bundled CPU sidecar being healthy.

Proposed fix
-    let sidecar_result = app.shell().sidecar("voicebox-server");
-
-    let mut sidecar = match sidecar_result {
-        Ok(s) => s,
-        Err(e) => {
-            eprintln!("Failed to get sidecar: {}", e);
-            ...
-            return Err(format!("Failed to start server. In dev mode, run 'bun run dev:server' in a separate terminal."));
-        }
-    };
-
-    println!("Sidecar command created successfully");
-
     // Build common args
     let data_dir_str = data_dir
         .to_str()
@@
     let spawn_result = if let Some(ref cuda_path) = cuda_binary {
         println!("Launching CUDA backend: {:?}", cuda_path);
         let mut cmd = app.shell().command(cuda_path.to_str().unwrap());
         cmd = cmd.args(["--data-dir", &data_dir_str, "--port", &port_str]);
         if is_remote {
             cmd = cmd.args(["--host", "0.0.0.0"]);
         }
         cmd.spawn()
     } else {
+        let sidecar_result = app.shell().sidecar("voicebox-server");
+        let mut sidecar = match sidecar_result {
+            Ok(s) => s,
+            Err(e) => {
+                eprintln!("Failed to get sidecar: {}", e);
+                ...
+                return Err(format!("Failed to start server. In dev mode, run 'bun run dev:server' in a separate terminal."));
+            }
+        };
+
+        println!("Sidecar command created successfully");
         // Use the bundled CPU sidecar
         sidecar = sidecar.args(["--data-dir", &data_dir_str, "--port", &port_str]);
         if is_remote {
             sidecar = sidecar.args(["--host", "0.0.0.0"]);
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tauri/src-tauri/src/main.rs` around lines 231 - 265, The current code returns
early when app.shell().sidecar("voicebox-server") fails, preventing the
CUDA/downloaded-binary startup path from being attempted; instead, change the
handling so sidecar_result is converted into an Option (e.g., let sidecar:
Option<_> = match sidecar_result { Ok(s) => Some(s), Err(e) => { log the error
and run the dev-mode TCP check using SERVER_PORT as before; None } }); do not
return Err there—only return an error after attempting the
CUDA/downloaded-binary startup path (the code that currently runs after the CPU
sidecar block) if that fallback also fails; reference the sidecar variable
(Option) where the CPU-sidecar-specific startup was previously used so the code
tries CUDA/downloaded binary when sidecar is None.
♻️ Duplicate comments (4)
.github/workflows/build-cuda.yml (1)

35-36: ⚠️ Potential issue | 🔴 Critical

Quote the CUDA verification command so the workflow parses.

Line 36 still embeds : inside an inline run: value, so the workflow is invalid YAML and GitHub will reject it before any job starts.

🐛 Proposed fix
       - name: Verify CUDA support in torch
-        run: python -c "import torch; print(f'CUDA available in build: {torch.cuda.is_available()}'); print(f'CUDA version: {torch.version.cuda}')"
+        run: |
+          python -c "import torch; print(f'CUDA available in build: {torch.cuda.is_available()}'); print(f'CUDA version: {torch.version.cuda}')"
#!/bin/bash
python -m pip install --quiet pyyaml
python - <<'PY'
from pathlib import Path
import sys
import yaml

path = Path(".github/workflows/build-cuda.yml")
try:
    yaml.safe_load(path.read_text())
    print("YAML parsed successfully")
except Exception as exc:
    print(f"YAML parse failed: {exc}")
    sys.exit(1)
PY
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/build-cuda.yml around lines 35 - 36, The "Verify CUDA
support in torch" job step's run value contains unquoted colons which break YAML
parsing; fix by making the run value a proper YAML string or a multiline block
scalar (e.g., wrap the python command in quotes or use a |-style block) so the
embedded ":" in "torch.version.cuda" doesn't invalidate the workflow; update the
step named "Verify CUDA support in torch" accordingly.
backend/main.py (1)

1768-1779: ⚠️ Potential issue | 🟠 Major

Serialize CUDA download requests and keep the active task.

get_cuda_binary_path() only guards the completed binary. Two callers can reach Line 1779 before the first download finishes, and backend/cuda_download.py:76-190 then has both tasks writing the same .download file and racing the final rename. Keeping a single shared task here also fixes the existing lost-exception path if the background job fails.

🛠️ One way to guard the endpoint
+_cuda_download_task: asyncio.Task | None = None
+
 `@app.post`("/backend/download-cuda")
 async def download_cuda_backend():
     """Download the CUDA backend binary. Returns immediately; track progress via SSE."""
     from . import cuda_download
+    global _cuda_download_task
 
     # Check if already downloaded
     if cuda_download.get_cuda_binary_path() is not None:
         raise HTTPException(status_code=409, detail="CUDA backend already downloaded")
+
+    if _cuda_download_task and not _cuda_download_task.done():
+        raise HTTPException(status_code=409, detail="CUDA backend download already in progress")
 
     async def _download():
-        try:
-            await cuda_download.download_cuda_binary()
-        except Exception as e:
-            import logging
-            logging.getLogger(__name__).error(f"CUDA download failed: {e}")
+        await cuda_download.download_cuda_binary()
 
-    asyncio.create_task(_download())
+    _cuda_download_task = asyncio.create_task(_download())
+    _cuda_download_task.add_done_callback(lambda t: t.exception() if not t.cancelled() else None)
     return {"message": "CUDA backend download started", "progress_key": "cuda-backend"}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/main.py` around lines 1768 - 1779, Race occurs because
get_cuda_binary_path() only detects completed installs; fix by serializing
downloads and retaining the active background task. Introduce a module-level
asyncio.Lock (e.g., _cuda_download_lock) and a shared task variable (e.g.,
_cuda_download_task); in the endpoint acquire the lock, if
get_cuda_binary_path() is set return 409, if _cuda_download_task exists return
an appropriate 202/ongoing response or re-use it, otherwise create and store a
task that runs download_cuda_binary(); ensure the task has an exception handler
or add a done callback to log/propagate errors so failures are not silently lost
and clear _cuda_download_task on completion.
backend/server.py (2)

79-88: ⚠️ Potential issue | 🟠 Major

Overwrite the variant for CPU binaries instead of using setdefault.

If the parent process inherited VOICEBOX_BACKEND_VARIANT=cuda, Lines 87-88 preserve that value and /health will report the CPU binary as CUDA. The else branch should always write "cpu".

🩹 Proposed fix
         if "cuda" in binary_name:
             os.environ["VOICEBOX_BACKEND_VARIANT"] = "cuda"
             logger.info("Backend variant: CUDA")
         else:
-            os.environ.setdefault("VOICEBOX_BACKEND_VARIANT", "cpu")
-            logger.info(f"Backend variant: {os.environ['VOICEBOX_BACKEND_VARIANT']}")
+            os.environ["VOICEBOX_BACKEND_VARIANT"] = "cpu"
+            logger.info("Backend variant: CPU")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/server.py` around lines 79 - 88, The current logic in the binary
detection block uses os.environ.setdefault("VOICEBOX_BACKEND_VARIANT", "cpu")
which preserves an inherited VOICEBOX_BACKEND_VARIANT (e.g., "cuda") for CPU
binaries; change this so the else branch unconditionally sets
os.environ["VOICEBOX_BACKEND_VARIANT"] = "cpu" (update the code around
binary_name detection in backend/server.py), and keep the logger.info call that
logs the new value (logger.info(...)) so /health reflects the actual CPU binary.
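The difference between setdefault and plain assignment is easy to demonstrate standalone (the env var name is from the PR; the values are illustrative):

```python
import os

# Simulate a parent process that exported the variant before spawning us.
os.environ["VOICEBOX_BACKEND_VARIANT"] = "cuda"

# setdefault only writes when the key is absent, so the inherited value wins.
os.environ.setdefault("VOICEBOX_BACKEND_VARIANT", "cpu")
print(os.environ["VOICEBOX_BACKEND_VARIANT"])  # cuda (wrong for a CPU binary)

# Plain assignment always reflects the binary actually running.
os.environ["VOICEBOX_BACKEND_VARIANT"] = "cpu"
print(os.environ["VOICEBOX_BACKEND_VARIANT"])  # cpu
```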

67-77: ⚠️ Potential issue | 🟠 Major

--version still pays the full startup cost.

By the time Lines 74-77 run, Lines 35-40 have already imported backend.main, so voicebox-server --version stays slow and can still fail on machines missing runtime deps. Parse the flag before those imports and exit early.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/server.py` around lines 67 - 77, The --version flag is parsed after
expensive imports (including backend.main), so running voicebox-server --version
still triggers full startup; move the argparse creation and args =
parser.parse_args() near the top of the module before importing heavy modules
(avoid importing backend.main or other runtime deps), check if args.version
immediately, and if true import backend.__version__ only, print the version and
sys.exit(0); ensure no other imports that pull in backend.main occur before this
early-exit check so the CLI can return quickly and safely.
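A sketch of the early-exit shape the prompt describes; the function name and stand-in version string are hypothetical, and the real CLI would keep argparse for its remaining flags:

```python
import sys

__version__ = "0.0.0"  # stand-in; the real value lives in the backend package

def handle_version_flag(argv: list[str]) -> bool:
    """Print the version and return True when --version is present.

    Intended to run before any heavy imports, so `voicebox-server --version`
    exits without importing backend.main, torch, etc.
    """
    if "--version" in argv[1:]:
        print(__version__)
        return True
    return False

if handle_version_flag(sys.argv):
    sys.exit(0)

# Heavy imports (backend.main and friends) would happen only below this line.
```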
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@app/src/components/ServerSettings/GpuAcceleration.tsx`:
- Around line 28-38: The refetchInterval option currently references
cudaStatusLoading before that variable exists (causing a ReferenceError); remove
the self-reference and use a concrete expression instead — e.g., in the useQuery
call for apiClient.getCudaStatus() (queryKey ['cuda-status', serverUrl]) replace
refetchInterval: cudaStatusLoading ? false : 10000 with a constant or an
expression based on existing props like health (for example refetchInterval:
health ? 10000 : false) so the options object no longer reads cudaStatusLoading
before destructuring completes.

In `@backend/cuda_download.py`:
- Around line 172-175: The current code first unlinks final_path then renames
temp_path, which can leave no valid binary if the rename fails; replace that
sequence with an atomic swap by calling temp_path.replace(final_path) (use the
Path.replace/os.replace API) so the temp file atomically overwrites final_path;
update the block that references final_path and temp_path in
backend/cuda_download.py to use temp_path.replace(final_path) and remove the
unlink/rename calls.
- Around line 120-131: The checksum fetch currently swallows errors and
continues; change this to "fail closed": when performing the fetch with
client.get (sha_url / sha_resp) ensure non-200 responses, network exceptions, or
parse results that are not a 64-character hex (the expected_sha parsed from
sha_resp.text.strip().split()[0]) are treated as fatal—log an error with details
via logger.error and raise an exception (or exit) instead of logger.warning so
the install is aborted if the .sha256 is missing, malformed, or unreachable;
validate expected_sha format explicitly before proceeding.
- Around line 73-97: The download_cuda_binary function can run concurrently and
clobber shared files; serialize access by introducing a module-level async lock
(e.g. _cuda_download_lock = asyncio.Lock()) and wrap the critical
section—checking/removing temp_path, performing the download, writing to
temp_path, checksum verification, and the atomic rename to final_path—inside
"async with _cuda_download_lock:" so only one coroutine touches the .download
file, final_path, and progress key at a time; if cross-process safety is
required instead, use a filesystem lock (e.g. portalocker or flock on a .lock
file adjacent to binary_name) around the same critical section.

In `@docs/plans/PROJECT_STATUS.md`:
- Around line 440-460: The API quick reference table is out of sync with
backend/main.py; update the entries so they match the actual route signatures:
change the `/generate/stream` entry to indicate it streams audio/wav (not SSE),
remove the non-existent `GET /stories/{id}/items` row, and replace
`/stories/{id}/export` with the correct `/stories/{id}/export-audio` endpoint
and verbs; while editing, cross-check other rows (e.g., `/stories`,
`/stories/{id}`, `/models/*`) against backend/main.py to ensure methods and
descriptions exactly match the implementation.

---

Outside diff comments:
In `@tauri/src-tauri/src/main.rs`:
- Around line 231-265: The current code returns early when
app.shell().sidecar("voicebox-server") fails, preventing the
CUDA/downloaded-binary startup path from being attempted; instead, change the
handling so sidecar_result is converted into an Option (e.g., let sidecar:
Option<_> = match sidecar_result { Ok(s) => Some(s), Err(e) => { log the error
and run the dev-mode TCP check using SERVER_PORT as before; None } }); do not
return Err there—only return an error after attempting the
CUDA/downloaded-binary startup path (the code that currently runs after the CPU
sidecar block) if that fallback also fails; reference the sidecar variable
(Option) where the CPU-sidecar-specific startup was previously used so the code
tries CUDA/downloaded binary when sidecar is None.

---

Duplicate comments:
In @.github/workflows/build-cuda.yml:
- Around line 35-36: The "Verify CUDA support in torch" job step's run value
contains unquoted colons which break YAML parsing; fix by making the run value a
proper YAML string or a multiline block scalar (e.g., wrap the python command in
quotes or use a |-style block) so the embedded ":" in "torch.version.cuda"
doesn't invalidate the workflow; update the step named "Verify CUDA support in
torch" accordingly.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: df835041-b635-49d5-9746-79baa842f849

📥 Commits

Reviewing files that changed from the base of the PR and between 44df09a and dccded4.

📒 Files selected for processing (20)
  • .github/workflows/build-cuda.yml
  • app/src/components/ServerSettings/GpuAcceleration.tsx
  • app/src/components/ServerTab/ServerTab.tsx
  • app/src/lib/api/client.ts
  • app/src/lib/api/types.ts
  • app/src/platform/types.ts
  • backend/build_binary.py
  • backend/cuda_download.py
  • backend/main.py
  • backend/models.py
  • backend/requirements.txt
  • backend/server.py
  • docs/plans/CUDA_BACKEND_SWAP.md
  • docs/plans/CUDA_BACKEND_SWAP_FINAL.md
  • docs/plans/PR33_CUDA_PROVIDER_REVIEW.md
  • docs/plans/PROJECT_STATUS.md
  • scripts/split_binary.py
  • tauri/src-tauri/src/main.rs
  • tauri/src/platform/lifecycle.ts
  • web/src/platform/lifecycle.ts
🚧 Files skipped from review as they are similar to previous changes (7)
  • backend/requirements.txt
  • scripts/split_binary.py
  • tauri/src/platform/lifecycle.ts
  • web/src/platform/lifecycle.ts
  • backend/models.py
  • app/src/lib/api/types.ts
  • app/src/lib/api/client.ts

Comment on lines +28 to +38
  const {
    data: cudaStatus,
    isLoading: cudaStatusLoading,
    refetch: refetchCudaStatus,
  } = useQuery({
    queryKey: ['cuda-status', serverUrl],
    queryFn: () => apiClient.getCudaStatus(),
    refetchInterval: cudaStatusLoading ? false : 10000,
    retry: 1,
    enabled: !!health, // Only fetch when backend is reachable
  });

⚠️ Potential issue | 🔴 Critical


Remove the self-reference from refetchInterval.

Line 35 reads cudaStatusLoading inside the same statement that declares it. The destructuring target is still in its temporal dead zone while the useQuery options object (lines 32-38) is evaluated, so accessing cudaStatusLoading throws a ReferenceError at runtime. Use a constant interval or destructure from a separate query variable.

Proposed fix
-  const {
-    data: cudaStatus,
-    isLoading: cudaStatusLoading,
-    refetch: refetchCudaStatus,
-  } = useQuery({
+  const cudaStatusQuery = useQuery({
     queryKey: ['cuda-status', serverUrl],
     queryFn: () => apiClient.getCudaStatus(),
-    refetchInterval: cudaStatusLoading ? false : 10000,
+    refetchInterval: 10000,
     retry: 1,
     enabled: !!health, // Only fetch when backend is reachable
   });
+
+  const {
+    data: cudaStatus,
+    isLoading: cudaStatusLoading,
+    refetch: refetchCudaStatus,
+  } = cudaStatusQuery;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/src/components/ServerSettings/GpuAcceleration.tsx` around lines 28 - 38,
The refetchInterval option currently references cudaStatusLoading before that
variable exists (causing a ReferenceError); remove the self-reference and use a
concrete expression instead — e.g., in the useQuery call for
apiClient.getCudaStatus() (queryKey ['cuda-status', serverUrl]) replace
refetchInterval: cudaStatusLoading ? false : 10000 with a constant or an
expression based on existing props like health (for example refetchInterval:
health ? 10000 : false) so the options object no longer reads cudaStatusLoading
before destructuring completes.

Comment on lines +73 to +97
async def download_cuda_binary(version: Optional[str] = None):
    """Download the CUDA backend binary from GitHub Releases.

    Downloads split parts listed in a manifest file, concatenates them,
    and verifies the SHA-256 checksum for integrity. Atomic write
    (temp file -> rename).

    Args:
        version: Version tag (e.g. "v0.2.0"). Defaults to current app version.
    """
    import httpx

    if version is None:
        version = f"v{__version__}"

    progress = get_progress_manager()
    binary_name = get_cuda_binary_name()
    dest_dir = get_backends_dir()
    final_path = dest_dir / binary_name
    temp_path = dest_dir / f"{binary_name}.download"

    # Clean up any leftover partial download
    if temp_path.exists():
        temp_path.unlink()


⚠️ Potential issue | 🔴 Critical

Serialize CUDA downloads before touching these shared paths.

The endpoint pre-check is non-atomic, so two requests can still reach this function together. Both coroutines would reuse the same .download file, the same final path, and the same cuda-backend progress key, which makes corruption and state clobbering possible.

Proposed fix
+import asyncio
 import hashlib
 import logging
 import os
 import sys
@@
 PROGRESS_KEY = "cuda-backend"
+_cuda_download_lock = asyncio.Lock()
@@
 async def download_cuda_binary(version: Optional[str] = None):
@@
-    if version is None:
-        version = f"v{__version__}"
-
-    progress = get_progress_manager()
-    binary_name = get_cuda_binary_name()
-    dest_dir = get_backends_dir()
-    final_path = dest_dir / binary_name
-    temp_path = dest_dir / f"{binary_name}.download"
-
-    # Clean up any leftover partial download
-    if temp_path.exists():
-        temp_path.unlink()
+    if _cuda_download_lock.locked():
+        raise RuntimeError("CUDA backend download already in progress")
+
+    async with _cuda_download_lock:
+        if version is None:
+            version = f"v{__version__}"
+
+        progress = get_progress_manager()
+        binary_name = get_cuda_binary_name()
+        dest_dir = get_backends_dir()
+        final_path = dest_dir / binary_name
+        temp_path = dest_dir / f"{binary_name}.download"
+
+        if temp_path.exists():
+            temp_path.unlink()
 
-    logger.info(f"Starting CUDA backend download for {version}")
-    progress.update_progress(
-        PROGRESS_KEY, current=0, total=0,
-        filename="Fetching manifest...", status="downloading",
-    )
+        logger.info(f"Starting CUDA backend download for {version}")
+        progress.update_progress(
+            PROGRESS_KEY, current=0, total=0,
+            filename="Fetching manifest...", status="downloading",
+        )
 
-    base_url = f"{GITHUB_RELEASES_URL}/{version}"
-    stem = Path(binary_name).stem  # voicebox-server-cuda
+        base_url = f"{GITHUB_RELEASES_URL}/{version}"
+        stem = Path(binary_name).stem  # voicebox-server-cuda
 
-    try:
-        async with httpx.AsyncClient(follow_redirects=True, timeout=30.0) as client:
+        try:
+            async with httpx.AsyncClient(follow_redirects=True, timeout=30.0) as client:
               ...
-    except Exception as e:
-        # Clean up on failure
-        if temp_path.exists():
-            temp_path.unlink()
-        logger.error(f"CUDA backend download failed: {e}")
-        progress.mark_error(PROGRESS_KEY, str(e))
-        raise
+        except Exception as e:
+            if temp_path.exists():
+                temp_path.unlink()
+            logger.error(f"CUDA backend download failed: {e}")
+            progress.mark_error(PROGRESS_KEY, str(e))
+            raise
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/cuda_download.py` around lines 73 - 97, The download_cuda_binary
function can run concurrently and clobber shared files; serialize access by
introducing a module-level async lock (e.g. _cuda_download_lock =
asyncio.Lock()) and wrap the critical section—checking/removing temp_path,
performing the download, writing to temp_path, checksum verification, and the
atomic rename to final_path—inside "async with _cuda_download_lock:" so only one
coroutine touches the .download file, final_path, and progress key at a time; if
cross-process safety is required instead, use a filesystem lock (e.g.
portalocker or flock on a .lock file adjacent to binary_name) around the same
critical section.
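The reject-while-locked pattern suggested above can be sketched standalone (the task names and the sleep stand-in are illustrative):

```python
import asyncio

_download_lock = asyncio.Lock()
events: list[str] = []

async def download(name: str) -> None:
    # Reject, rather than queue, a second concurrent download attempt.
    if _download_lock.locked():
        events.append(f"{name}: rejected")
        return
    async with _download_lock:
        events.append(f"{name}: started")
        await asyncio.sleep(0.01)  # stand-in for the network and file work
        events.append(f"{name}: finished")

async def main():
    # Two requests arriving together: only the first touches the shared paths.
    await asyncio.gather(download("a"), download("b"))

asyncio.run(main())
print(events)  # ['a: started', 'b: rejected', 'a: finished']
```

Note this only serializes coroutines within one process; as the prompt says, cross-process safety would need a filesystem lock instead.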

Comment on lines +120 to +131
            # Fetch expected checksum (optional — for integrity verification)
            expected_sha = None
            try:
                sha_url = f"{base_url}/{stem}.sha256"
                sha_resp = await client.get(sha_url)
                if sha_resp.status_code == 200:
                    # Format: "sha256hex  filename\n"
                    expected_sha = sha_resp.text.strip().split()[0]
                    logger.info(f"Expected SHA-256: {expected_sha[:16]}...")
            except Exception:
                logger.warning("Could not fetch checksum file — skipping verification")


⚠️ Potential issue | 🔴 Critical

Fail closed if checksum retrieval or parsing fails.

This code installs an executable that tauri/src-tauri/src/main.rs later launches from disk. If the .sha256 asset is missing, malformed, or temporarily unreachable, silently "skipping verification" turns a transient network problem into running an unverified binary on the next restart.

Proposed fix
-            # Fetch expected checksum (optional — for integrity verification)
-            expected_sha = None
-            try:
-                sha_url = f"{base_url}/{stem}.sha256"
-                sha_resp = await client.get(sha_url)
-                if sha_resp.status_code == 200:
-                    # Format: "sha256hex  filename\n"
-                    expected_sha = sha_resp.text.strip().split()[0]
-                    logger.info(f"Expected SHA-256: {expected_sha[:16]}...")
-            except Exception:
-                logger.warning("Could not fetch checksum file — skipping verification")
+            sha_url = f"{base_url}/{stem}.sha256"
+            sha_resp = await client.get(sha_url)
+            sha_resp.raise_for_status()
+            parts = sha_resp.text.strip().split()
+            if not parts:
+                raise ValueError("Checksum file is empty or malformed")
+            expected_sha = parts[0]
+            logger.info(f"Expected SHA-256: {expected_sha[:16]}...")
🧰 Tools
🪛 Ruff (0.15.5)

[warning] 129-129: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/cuda_download.py` around lines 120 - 131, The checksum fetch
currently swallows errors and continues; change this to "fail closed": when
performing the fetch with client.get (sha_url / sha_resp) ensure non-200
responses, network exceptions, or parse results that are not a 64-character hex
(the expected_sha parsed from sha_resp.text.strip().split()[0]) are treated as
fatal—log an error with details via logger.error and raise an exception (or
exit) instead of logger.warning so the install is aborted if the .sha256 is
missing, malformed, or unreachable; validate expected_sha format explicitly
before proceeding.
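A standalone sketch of the fail-closed parsing the prompt asks for; the helper name is hypothetical, and the real code would call it on sha_resp.text:

```python
import re

SHA256_RE = re.compile(r"[0-9a-f]{64}")

def parse_sha256_file(text: str) -> str:
    """Parse `sha256sum`-style content ("<hex>  <filename>") and fail closed.

    Raises ValueError on anything that is not a 64-character hex digest, so a
    missing or malformed .sha256 aborts the install instead of silently
    skipping verification.
    """
    parts = text.strip().split()
    if not parts or not SHA256_RE.fullmatch(parts[0].lower()):
        raise ValueError(f"malformed .sha256 content: {text[:80]!r}")
    return parts[0].lower()
```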

Comment on lines +172 to +175
        # Atomic move into place
        if final_path.exists():
            final_path.unlink()
        temp_path.rename(final_path)

⚠️ Potential issue | 🟠 Major

Use an atomic replace here.

Deleting the current binary before renaming the temp file leaves a failure window where a known-good CUDA backend is gone if the rename fails. Path.replace()/os.replace() preserves the atomic swap semantics this comment is aiming for.

Proposed fix
-        # Atomic move into place
-        if final_path.exists():
-            final_path.unlink()
-        temp_path.rename(final_path)
+        # Atomic move into place
+        temp_path.replace(final_path)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/cuda_download.py` around lines 172 - 175, The current code first
unlinks final_path then renames temp_path, which can leave no valid binary if
the rename fails; replace that sequence with an atomic swap by calling
temp_path.replace(final_path) (use the Path.replace/os.replace API) so the temp
file atomically overwrites final_path; update the block that references
final_path and temp_path in backend/cuda_download.py to use
temp_path.replace(final_path) and remove the unlink/rename calls.

Comment on lines +440 to +460
| `/generate` | POST | Generate speech |
| `/generate/stream` | POST | Stream speech (SSE) |
| `/history` | GET | List generation history |
| `/history/{id}` | GET, DELETE | Get/delete generation |
| `/history/{id}/export` | GET | Export generation ZIP |
| `/history/{id}/export-audio` | GET | Export audio only |
| `/transcribe` | POST | Transcribe audio (Whisper) |
| `/models/status` | GET | All model statuses |
| `/models/download` | POST | Trigger model download |
| `/models/{name}` | DELETE | Delete downloaded model |
| `/models/load` | POST | Load model into memory |
| `/models/unload` | POST | Unload model |
| `/models/progress/{name}` | GET | SSE download progress |
| `/tasks/active` | GET | Active downloads/generations |
| `/stories` | POST, GET | Create/list stories |
| `/stories/{id}` | GET, PUT, DELETE | Story CRUD |
| `/stories/{id}/items` | POST, GET | Story items CRUD |
| `/stories/{id}/export` | GET | Export story audio |
| `/channels` | POST, GET | Audio channel CRUD |
| `/channels/{id}` | PUT, DELETE | Channel update/delete |
| `/cache/clear` | POST | Clear voice prompt cache |

⚠️ Potential issue | 🟡 Minor

The API quick reference is out of sync with the backend.

A few rows here don't match backend/main.py: /generate/stream streams audio/wav, not SSE; there is no GET /stories/{id}/items; and the story export route is /stories/{id}/export-audio, not /stories/{id}/export. Since this table reads like public API docs, those mismatches will mislead integrators.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/plans/PROJECT_STATUS.md` around lines 440 - 460, The API quick reference
table is out of sync with backend/main.py; update the entries so they match the
actual route signatures: change the `/generate/stream` entry to indicate it
streams audio/wav (not SSE), remove the non-existent `GET /stories/{id}/items`
row, and replace `/stories/{id}/export` with the correct
`/stories/{id}/export-audio` endpoint and verbs; while editing, cross-check
other rows (e.g., `/stories`, `/stories/{id}`, `/models/*`) against
backend/main.py to ensure methods and descriptions exactly match the
implementation.
