Closed
109 changes: 109 additions & 0 deletions docs/superpowers/specs/2026-06-21-native-overlay-retry-audio-design.md
@@ -0,0 +1,109 @@
# Native Overlay, Retry, and Recording Asset Design

- Date: 2026-06-21
- Status: Approved
- Branch: `codex/native-cpal-capture`

## Goal

Make the macOS recording main path independent of WebView control logic, then make ASR failure and late-result behavior recoverable by using saved WAV recordings as retryable transcription assets.

The user-visible behavior outside recording, overlay feedback, retry, and history playback should remain unchanged.

## Execution Order

1. Native overlay and native cue playback.
2. Fix late ASR result handling.
3. Add recording asset, history playback, retry, and retention policy.

## Phase 1: Native Main Path

On macOS, the recording main path should no longer depend on WebView for recording lifecycle control or cue playback.

The existing native overlay remains the visual surface. It should handle actionable failure states directly, including a retry icon button. The retry control must be visually subtle and fit the current glass pill style. Its maximum visual footprint must not exceed the current recording waveform element, so the control remains refined rather than dominant.

Failure overlay behavior:

- Show the existing failure text style.
- Show a refresh-style icon button only, without text.
- Display a 5-second countdown affordance around the retry button.
- If the user does not click within 5 seconds, hide the overlay.
- The failed transcription attempt remains available in input history when a WAV exists.
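
The failure-overlay rules above amount to a small timer-driven state machine. The following is a minimal sketch of that timing logic only, not the overlay implementation; `FailureOverlay`, `OverlayAction`, and `RETRY_WINDOW` are hypothetical names introduced here for illustration.

```rust
use std::time::{Duration, Instant};

/// Retry window from the spec: the overlay hides after 5 seconds without a click.
const RETRY_WINDOW: Duration = Duration::from_secs(5);

/// What the overlay should do next (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
enum OverlayAction {
    /// Keep showing the retry button; `remaining` can drive the countdown affordance.
    KeepVisible { remaining: Duration },
    /// The window elapsed without a click: hide the overlay.
    Hide,
    /// The user clicked retry inside the window.
    StartRetry,
}

/// Tracks when the failure state was first shown (hypothetical type).
struct FailureOverlay {
    shown_at: Instant,
}

impl FailureOverlay {
    fn new(shown_at: Instant) -> Self {
        Self { shown_at }
    }

    /// Called on every UI tick to update or dismiss the overlay.
    fn on_tick(&self, now: Instant) -> OverlayAction {
        let elapsed = now.saturating_duration_since(self.shown_at);
        if elapsed >= RETRY_WINDOW {
            OverlayAction::Hide
        } else {
            OverlayAction::KeepVisible {
                remaining: RETRY_WINDOW - elapsed,
            }
        }
    }

    /// Called when the retry button is clicked; late clicks are ignored.
    fn on_retry_click(&self, now: Instant) -> OverlayAction {
        if now.saturating_duration_since(self.shown_at) < RETRY_WINDOW {
            OverlayAction::StartRetry
        } else {
            OverlayAction::Hide
        }
    }
}
```

Passing `now` in explicitly, rather than calling `Instant::now()` inside the methods, keeps the countdown deterministic under test.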

Cue playback should move to native playback on macOS so start/end cues do not depend on the overlay WebView. Windows can keep the current WebView path unless the native implementation is naturally cross-platform.

## Phase 2: Late ASR Result Handling

The current Doubao flow can return a partial result when `commit_and_await_final` times out after 5 seconds, while the server may continue sending a more complete result afterward. This causes premature paste of incomplete text.

The fix should prefer correctness over premature paste:

- Do not paste a partial result merely because the 5-second commit wait elapsed.
- If the session has not produced a reliable final result by the deadline, mark the attempt as failed or retryable instead of pasting known-incomplete text.
- If a definite final result or terminal close arrives within the accepted completion window, paste normally.
- The saved WAV should make manual retry cheap, so retryable failure is better than silently pasting partial text.
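
One way to express the rules above is a pure decision function over the state of the session when the wait ends. This is a sketch under assumptions: `CommitOutcome`, `Resolution`, and `resolve_commit` are hypothetical names, and the real `commit_and_await_final` may report its result differently.

```rust
/// State of the ASR session when the commit wait ends (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
enum CommitOutcome {
    /// The server sent a definite final result inside the completion window.
    Final(String),
    /// The session closed terminally inside the window; the text is treated as complete.
    TerminalClose(String),
    /// The deadline passed with only partial text available.
    TimedOut { partial: String },
}

/// What the caller should do with the attempt (hypothetical type).
#[derive(Debug, PartialEq)]
enum Resolution {
    /// Paste the text through the normal path.
    Paste(String),
    /// Record a failed, retryable attempt; nothing is pasted.
    Retryable { reason: String },
}

/// Prefer correctness over a premature paste: partial text is never pasted.
fn resolve_commit(outcome: CommitOutcome) -> Resolution {
    match outcome {
        CommitOutcome::Final(text) | CommitOutcome::TerminalClose(text) => {
            Resolution::Paste(text)
        }
        CommitOutcome::TimedOut { partial } => Resolution::Retryable {
            reason: format!(
                "no final result before the deadline ({} characters of partial text discarded)",
                partial.chars().count()
            ),
        },
    }
}
```

Keeping the decision separate from the network code lets the commit-timeout behavior be unit-tested without a live ASR session.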

## Phase 3: Recording Assets and Retry

Each transcription attempt should have a durable record that can represent success or failure.

History entries should support:

- `status`: success or failed.
- `text`: successful final text, or a short failure description.
- `audioPath`: saved WAV path when available.
- `error`: failure reason when applicable.
- `retryOf`: optional original entry timestamp or ID.

Successful entries continue to count toward usage statistics. Failed entries should appear in input history but should not increase total session or character counts.
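
A rough sketch of the record and the statistics rule, assuming Rust-side snake_case equivalents of the fields listed above. The concrete types (`u64` for `retry_of`, `String` paths) and the names `HistoryEntry`, `EntryStatus`, `UsageStats`, and `usage_stats` are assumptions, and serialization is omitted.

```rust
/// Outcome of a transcription attempt (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum EntryStatus {
    Success,
    Failed,
}

/// One history record; mirrors `status`, `text`, `audioPath`, `error`, `retryOf`.
#[derive(Debug, Clone)]
struct HistoryEntry {
    status: EntryStatus,
    /// Final text on success, or a short failure description.
    text: String,
    /// Saved WAV path, when one exists.
    audio_path: Option<String>,
    /// Failure reason, when applicable.
    error: Option<String>,
    /// Timestamp or ID of the entry this attempt retries (assumed to be a u64 here).
    retry_of: Option<u64>,
}

/// Totals shown in usage statistics (hypothetical type).
#[derive(Debug, Default, PartialEq)]
struct UsageStats {
    sessions: usize,
    characters: usize,
}

/// Only successful entries count toward session and character totals.
fn usage_stats(entries: &[HistoryEntry]) -> UsageStats {
    entries
        .iter()
        .filter(|entry| entry.status == EntryStatus::Success)
        .fold(UsageStats::default(), |mut acc, entry| {
            acc.sessions += 1;
            acc.characters += entry.text.chars().count();
            acc
        })
}
```

Counting with `chars()` rather than `len()` keeps the character total meaningful for non-ASCII transcripts.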

Retry behavior:

- Retry uses the saved WAV, not the microphone.
- Retry can be triggered from the native failure overlay within 5 seconds.
- Retry can also be triggered from Settings home input history.
- A successful retry creates or updates a successful history record and follows the normal paste/clipboard/statistics path.
- If recording retention is disabled, the failed WAV is deleted after a retry succeeds.

History UI behavior:

- Successful rows show play, copy, and delete icon buttons.
- Failed rows show play, retry, and delete icon buttons.
- Buttons must match the current input-record action style: orange solid rounded-square icon buttons with white line icons.

## Recording Retention Setting

Add an app setting for whether to retain recordings.

Default: disabled.

When enabled:

- Keep successful and failed recordings from the most recent month.
- Prune older recordings and references.

When disabled:

- Keep only recordings needed for failed retryable entries.
- Delete recordings after successful transcription or successful retry.
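
The two modes reduce to a single keep-or-delete predicate per recording. A minimal sketch, assuming "1 month" means 30 days and that a recording's age is known; `should_keep_recording`, `AttemptStatus`, and `RETENTION_WINDOW` are hypothetical names.

```rust
use std::time::Duration;

/// Assumption: the spec's "1 month" is treated as 30 days.
const RETENTION_WINDOW: Duration = Duration::from_secs(30 * 24 * 60 * 60);

/// Status of the history entry that owns the recording (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum AttemptStatus {
    Success,
    Failed,
}

/// Decide whether a saved WAV should be kept.
///
/// - Retention enabled: keep any recording younger than the window.
/// - Retention disabled: keep only recordings that back a failed, retryable entry.
fn should_keep_recording(retention_enabled: bool, status: AttemptStatus, age: Duration) -> bool {
    if retention_enabled {
        age <= RETENTION_WINDOW
    } else {
        status == AttemptStatus::Failed
    }
}
```

With retention disabled, a successful retry changes the entry to a success, so the same predicate then reports that the WAV can be deleted. The spec does not say whether old failed recordings expire when retention is disabled; this sketch keeps them.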

## Testing

Backend:

- Unit tests for history serialization/backward compatibility.
- Unit tests for retention pruning decisions.
- Tests for retrying a WAV through the same ASR path where practical.
- Tests for Doubao commit timeout behavior so partial text is not treated as successful final output.

Frontend/settings:

- Tests for history rows with success and failure states.
- Tests for play/retry button bridge calls.

Manual:

- Network timeout creates a failed history entry with WAV.
- Native overlay retry starts a transcription attempt from WAV.
- Settings history retry works after overlay disappears.
- Successful retry removes failed-only WAV when retention is disabled.
1 change: 1 addition & 0 deletions package.json
@@ -28,6 +28,7 @@
   "scripts": {
     "tauri": "tauri",
     "dev": "tauri dev",
+    "dev:no-watch": "tauri dev --no-watch",
     "dev:web": "vite",
     "build:web": "vite build",
     "pack": "tsx --env-file=.env scripts/pack.ts",
3 changes: 3 additions & 0 deletions src-tauri/Cargo.toml
@@ -59,3 +59,6 @@ wiremock = "0.6"
 objc2 = "0.6"
 objc2-app-kit = "0.3"
 objc2-foundation = "0.3"
+# Native microphone capture is macOS-only (native_audio is cfg(macos)); keeping
+# cpal off other targets avoids pulling ALSA (libasound2-dev) into the Linux CI.
+cpal = "0.15"
10 changes: 10 additions & 0 deletions src-tauri/src/app_state.rs
@@ -32,6 +32,9 @@ pub struct AppInner {
     /// once it attaches. Always accessed while holding `asr_session` to stay
     /// ordered against the drain.
     pub pending_audio: Mutex<Vec<Vec<f32>>>,
+    /// Full-session 16k mono PCM captured from the same stream sent to ASR.
+    /// Saved as a WAV when a recording is finalized, for diagnostics and review.
+    pub recording_audio: Mutex<Vec<f32>>,
     /// Resolves when the background ASR connect finishes (Ok) or fails (Err).
     /// `stop_recording` awaits this when the user stops before the session is ready.
     pub connect_rx: Mutex<Option<tokio::sync::oneshot::Receiver<Result<(), String>>>>,
@@ -44,6 +47,10 @@ pub struct AppInner {
     /// audio, so already-recognized text is accumulated here and prepended to the
     /// new session's output. Reset at the start of every recording.
     pub accumulated_text: Mutex<String>,
+    /// Native microphone capture used on macOS to avoid WebView/WebRTC input
+    /// processing. Other platforms keep the renderer getUserMedia path.
+    #[cfg(target_os = "macos")]
+    pub native_audio: Mutex<Option<crate::native_audio::NativeAudioCapture>>,
 }

 pub type AppHandle = Arc<AppInner>;
@@ -67,8 +74,11 @@ pub fn create_app_state(
         pending_audio_warmup: Mutex::new(None),
         latest_transcript: Mutex::new((String::new(), String::new())),
         pending_audio: Mutex::new(Vec::new()),
+        recording_audio: Mutex::new(Vec::new()),
         connect_rx: Mutex::new(None),
         session_epoch: std::sync::atomic::AtomicU64::new(0),
         accumulated_text: Mutex::new(String::new()),
+        #[cfg(target_os = "macos")]
+        native_audio: Mutex::new(None),
     })
 }
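
For reference, turning a buffer like `recording_audio` into a WAV needs only a 44-byte RIFF header ahead of the samples. The PR's own WAV writer is not part of this excerpt, so the following is a minimal std-only sketch: `encode_wav_mono_16bit` is a hypothetical helper name, and 16-bit PCM output is an assumption.

```rust
/// Encode mono f32 samples in [-1.0, 1.0] as a 16-bit PCM WAV byte buffer.
/// Hypothetical helper for illustration; not the PR's actual writer.
fn encode_wav_mono_16bit(samples: &[f32], sample_rate: u32) -> Vec<u8> {
    let data_len = (samples.len() * 2) as u32; // 2 bytes per 16-bit sample
    let mut out = Vec::with_capacity(44 + samples.len() * 2);

    // RIFF container header.
    out.extend_from_slice(b"RIFF");
    out.extend_from_slice(&(36 + data_len).to_le_bytes()); // bytes after this field
    out.extend_from_slice(b"WAVE");

    // "fmt " chunk describing uncompressed PCM.
    out.extend_from_slice(b"fmt ");
    out.extend_from_slice(&16u32.to_le_bytes()); // fmt chunk size
    out.extend_from_slice(&1u16.to_le_bytes()); // audio format 1 = PCM
    out.extend_from_slice(&1u16.to_le_bytes()); // channels: mono
    out.extend_from_slice(&sample_rate.to_le_bytes());
    out.extend_from_slice(&(sample_rate * 2).to_le_bytes()); // byte rate
    out.extend_from_slice(&2u16.to_le_bytes()); // block align
    out.extend_from_slice(&16u16.to_le_bytes()); // bits per sample

    // "data" chunk with little-endian i16 samples.
    out.extend_from_slice(b"data");
    out.extend_from_slice(&data_len.to_le_bytes());
    for &sample in samples {
        let value = (sample.clamp(-1.0, 1.0) * 32767.0).round() as i16;
        out.extend_from_slice(&value.to_le_bytes());
    }
    out
}
```

For the 16 kHz mono stream described in the field comment, the call would be `encode_wav_mono_16bit(&samples, 16_000)`, with the returned bytes written to the recording path.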
78 changes: 49 additions & 29 deletions src-tauri/src/commands.rs
@@ -289,49 +289,38 @@ fn compute_audio_level(samples: &[f32]) -> Option<f64> {
     Some((rms * 13.0 + peak * 2.8).powf(0.82).min(1.0))
 }

-/// Receive an audio chunk from the renderer (base64-encoded i16 PCM),
-/// decode to f32 samples and forward to the active ASR session.
-#[tauri::command]
-pub async fn send_audio_chunk(
-    _app: AppHandle,
-    state: State<'_, AppState>,
-    base64_chunk: String,
-) -> Result<serde_json::Value, String> {
-    use base64::Engine as _;
+pub(crate) async fn append_audio_samples(
+    app: &AppHandle,
+    state: &AppState,
+    samples: Vec<f32>,
+) -> bool {
     use std::sync::atomic::{AtomicU64, Ordering};
     static CHUNK_COUNT: AtomicU64 = AtomicU64::new(0);
     let n = CHUNK_COUNT.fetch_add(1, Ordering::Relaxed);
     if n == 0 || n.is_multiple_of(50) {
         log_audio!(
             debug,
-            "Received chunk #{} ({} bytes base64)",
+            "Received audio chunk #{} ({} samples)",
             n,
-            base64_chunk.len()
+            samples.len()
         );
     }

-    // Decode base64 → i16 PCM bytes → f32 samples
-    let bytes = match base64::engine::general_purpose::STANDARD.decode(&base64_chunk) {
-        Ok(data) => data,
-        Err(_) => {
-            log_audio!(warn, "Chunk #{} base64 decode failed", n);
-            return Ok(serde_json::json!({ "ok": false, "message": "音频数据解码失败" }));
-        }
-    };
-    let samples: Vec<f32> = bytes
-        .chunks_exact(2)
-        .map(|chunk| {
-            let sample = i16::from_le_bytes([chunk[0], chunk[1]]);
-            sample as f32 / 32768.0
-        })
-        .collect();
+    state
+        .recording_audio
+        .lock()
+        .await
+        .extend_from_slice(&samples);

     // Drive the native waveform (macOS only) from the same PCM the ASR receives,
     // whether the chunk is sent immediately or buffered.
     #[cfg(target_os = "macos")]
     if let Some(level) = compute_audio_level(&samples) {
-        crate::overlay::set_audio_level(&_app, level);
+        crate::overlay::set_audio_level(app, level);
     }
+    // `app` only drives the macOS native waveform above; unused on other platforms.
+    #[cfg(not(target_os = "macos"))]
+    let _ = app;

     // Hold the `asr_session` lock across the decision so buffering stays ordered
     // against the background connect task's drain (same lock), guaranteeing no
@@ -340,7 +329,7 @@ pub async fn send_audio_chunk(
     if let Some(ref session) = *session {
         if session.is_ready() {
             session.append_audio(&samples);
-            return Ok(serde_json::json!({ "ok": true }));
+            return false;
         }
     }

@@ -350,6 +339,7 @@ pub async fn send_audio_chunk(
     let mut pending = state.pending_audio.lock().await;
     if pending.len() < MAX_PENDING_CHUNKS {
         pending.push(samples);
+        return true;
     } else if n.is_multiple_of(50) {
         log_audio!(
             warn,
@@ -358,7 +348,37 @@ pub async fn send_audio_chunk(
             n
         );
     }
-    Ok(serde_json::json!({ "ok": true, "buffered": true }))
+    true
 }

+/// Receive an audio chunk from the renderer (base64-encoded i16 PCM),
+/// decode to f32 samples and forward to the active ASR session.
+#[tauri::command]
+pub async fn send_audio_chunk(
+    app: AppHandle,
+    state: State<'_, AppState>,
+    base64_chunk: String,
+) -> Result<serde_json::Value, String> {
+    use base64::Engine as _;
+
+    // Decode base64 → i16 PCM bytes → f32 samples
+    let bytes = match base64::engine::general_purpose::STANDARD.decode(&base64_chunk) {
+        Ok(data) => data,
+        Err(_) => {
+            log_audio!(warn, "Audio chunk base64 decode failed");
+            return Ok(serde_json::json!({ "ok": false, "message": "音频数据解码失败" }));
+        }
+    };
+    let samples: Vec<f32> = bytes
+        .chunks_exact(2)
+        .map(|chunk| {
+            let sample = i16::from_le_bytes([chunk[0], chunk[1]]);
+            sample as f32 / 32768.0
+        })
+        .collect();
+
+    let buffered = append_audio_samples(&app, &state, samples).await;
+    Ok(serde_json::json!({ "ok": true, "buffered": buffered }))
+}
+
 /// Notify that audio has stopped in the renderer.