diff --git a/ROBUSTNESS-PLAN.md b/ROBUSTNESS-PLAN.md new file mode 100644 index 0000000..442dd4d --- /dev/null +++ b/ROBUSTNESS-PLAN.md @@ -0,0 +1,319 @@ +# LANStreamer robustness plan + +Breaking the system down by domain with explicit failure-mode analysis. Each layer has clear boundaries, enumerated failure modes, and testable mitigation. + +--- + +## Layer 1: Environment detection & setup + +### 1.1 Icecast discovery + +- **What could go wrong?** + - Not installed + - Installed but not in PATH + - Installed but wrong version (flags changed) + - Multiple versions on system + - Config file missing or corrupted + - Port conflicts (8000 already in use) + - Permissions issues (can't write to config/logs) +- **How do you test for it?** + - `which icecast` / `where icecast` + - `icecast --version` and parse output + - Try to load/warm-start the config and catch errors + - Check port availability before starting +- **Mitigation** + - Clear error messages with install links + - Prompt to download missing binary + - Auto-generate config from template + - Offer alternative port if 8000 taken + - Validate config before launching + +### 1.2 FFmpeg discovery + +- **What could go wrong?** + - Not installed + - In PATH but lacking codecs (libmp3lame, libopus, etc.) 
+ - Wrong version (some flag changes between versions) + - FFmpeg exists but ffprobe doesn't (or vice versa) +- **How do you test for it?** + - `ffmpeg -version` and parse for codecs + - `ffprobe -version` for metadata capabilities + - Try a dummy encode to verify codec support +- **Mitigation** + - Codec-specific error messages ("You need FFmpeg with libmp3lame") + - Offer bundled/config instructions for common OSs + - Graceful degradation if some codecs missing + +### 1.3 Audio device discovery + +- **What could go wrong?** + - No audio devices found + - Device name has special characters that break CLI + - Device disconnected mid-stream + - Device name changes between sessions (Windows "Microphone (2)" vs "Microphone (3)") + - Permission denied (macOS/Windows privacy controls) +- **How do you test for it?** + - `ffmpeg -list_devices true -f dshow -i dummy` (Windows) + - `ffmpeg -f avfoundation -list_devices true -i ""` (macOS) + - `pactl list sources` (Linux PulseAudio) + - Store device fingerprint/hash to detect renaming +- **Mitigation** + - Cache device mappings with stable IDs + - Device aliasing/user-friendly names + - Graceful reconnection attempts + - Clear permission prompts + +--- + +## Layer 2: Stream lifecycle management + +### 2.1 Stream creation + +- **What could go wrong?** + - Duplicate stream name + - Invalid characters in stream name (Icecast mount point restrictions) + - Source limit reached (Icecast default: 2 sources) + - Port collision if assigning unique ports per stream +- **Testing & mitigation** + - Validate name against regex before creating + - Check Icecast `` limit before creating + - Pre-flight checks: "Can this stream actually start?" 
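The 2.1 pre-flight checks could be sketched as one pure function that collects every problem before any process is spawned. This is a minimal sketch, not the project's implementation: `preflightStream`, the `SAFE_NAME` allow-list, and the `maxSources` parameter are hypothetical names, and the regex is a deliberately conservative assumption rather than Icecast's actual mount point rules.

```javascript
// Hypothetical pre-flight check for stream creation (section 2.1).
// SAFE_NAME is an assumed, conservative allow-list; it is NOT taken from
// Icecast's documentation. maxSources must be supplied by the caller
// (for example, read from the Icecast config).
const SAFE_NAME = /^[A-Za-z0-9][A-Za-z0-9_-]{0,62}$/;

function preflightStream(name, activeNames, maxSources) {
  const problems = [];

  // Reject anything that is not a plain, shell-safe and URL-safe token.
  if (typeof name !== "string" || !SAFE_NAME.test(name)) {
    problems.push("invalid-name");
  }

  // Compare case-insensitively so "Main" and "main" count as duplicates.
  const wanted = String(name).toLowerCase();
  if (activeNames.some((n) => n.toLowerCase() === wanted)) {
    problems.push("duplicate-name");
  }

  // Refuse to start when the configured source limit is already used up.
  if (activeNames.length >= maxSources) {
    problems.push("source-limit-reached");
  }

  return { ok: problems.length === 0, problems };
}
```

Returning a list of problem codes instead of throwing on the first one lets the UI show every reason a stream cannot start in a single message.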
+ +### 2.2 Stream start (spawn FFmpeg) + +- **What could go wrong?** + - FFmpeg exits immediately (codec issue, device busy) + - FFmpeg starts but no audio (silent stream) + - FFmpeg starts but wrong sample rate/channels + - Icecast rejects the connection (wrong password, mount exists) +- **Testing & mitigation** + - Health check: verify FFmpeg process stays alive > 5 seconds + - Probe Icecast mount point to confirm audio is flowing + - Parse FFmpeg stderr for known error patterns + - Exponential backoff on retry + +### 2.3 Stream stop/cleanup + +- **What could go wrong?** + - FFmpeg won't die (zombie process) + - Icecast mount point stays active + - Port not released + - State desync (UI thinks stopped, but still running) +- **Testing & mitigation** + - Force-kill if graceful shutdown fails + - Explicit Icecast mount unregister + - State reconciliation on startup ("What streams are actually running?") + - Cleanup orphaned processes on server start + +### 2.4 Stream restart (hot reload) + +- **What could go wrong?** + - Old process dies but new one fails to start + - Brief downtime that drops all listeners + - Config change invalidates active stream +- **Testing & mitigation** + - Atomic switch: start new before killing old (if source allows) + - Graceful handoff with listener notification + - Rollback on failure + +--- + +## Layer 3: Audio pipeline + +### 3.1 Audio capture + +- **What could go wrong?** + - Sample rate mismatch (device 48 kHz, expect 44.1 kHz) + - Channel mismatch (stereo vs mono) + - Device goes silent (mic unplugged, app stops granting permission) + - Clipping/distortion (gain too high) +- **Testing & mitigation** + - Auto-detect device capabilities and configure accordingly + - Audio level monitoring (alert if silent for > X seconds) + - Auto-normalize or gain controls + - Fall back to default device if selected fails + +### 3.2 Encoding + +- **What could go wrong?** + - Wrong codec for format (MP3 vs Opus vs AAC) + - Bitrate too high for 
listeners' bandwidth + - CPU overload (encoding quality too high) +- **Testing & mitigation** + - Codec compatibility matrix (what works where) + - Adaptive bitrate based on listener count + - CPU monitoring, stepping quality down if needed + +### 3.3 Icecast integration + +- **What could go wrong?** + - Authentication failure (wrong password in config) + - Mount point already exists + - Icecast crashes or restarts + - Network hiccup between FFmpeg and Icecast +- **Testing & mitigation** + - Connection health monitoring + - Auto-reconnect with backoff + - Heartbeat/ping to detect Icecast aliveness + - Config validation before use + +--- + +## Layer 4: Listener delivery + +### 4.1 Stream discovery + +- **What could go wrong?** + - Listener can't find server (wrong IP, firewall) + - Server IP changes (DHCP) + - mDNS/Bonjour not working +- **Testing & mitigation** + - Display server IP prominently, include QR code + - Stable mDNS hostname (hostname.local) instead of a raw IP + - Simple connectivity test on listener page ("Can I reach the server?") + +### 4.2 Playback + +- **What could go wrong?** + - Browser doesn't support codec + - Buffer underruns (stuttering) + - High latency (not suitable for translation) + - Listener device goes to sleep, loses connection +- **Testing & mitigation** + - Codec fallback cascade (Opus → MP3 → AAC) + - Adaptive buffering based on network conditions + - Auto-reconnect with UI indication + - Latency display (so admin knows if there's an issue) + +--- + +## Layer 5: State & configuration + +### 5.1 Persistent state + +- **What could go wrong?** + - Config file corrupted (bad JSON) + - Config file permissions issue + - Multiple processes writing to same config + - Config schema changes break old configs +- **Testing & mitigation** + - Config validation on load (schema, JSON parse) + - Backup/rollback before writes + - Atomic file writes (write temp, then rename) + - Schema migration path + +### 5.2 Runtime state + +- **What could go wrong?** + - State 
desync (what's running vs. what we think is running) + - Memory leak in long-running process + - Event listeners not cleaned up +- **Testing & mitigation** + - State reconciliation function (ask OS "what's actually running") + - Regular health checks + - Explicit cleanup on all state changes + +--- + +## Layer 6: Error handling & diagnostics + +### 6.1 Error detection + +- **What could go wrong?** + - FFmpeg error patterns not recognized + - False positives (normal noise treated as error) + - Errors swallowed somewhere in the stack +- **Testing & mitigation** + - Comprehensive error pattern library (errorDiagnostics.js) + - Error taxonomy (fatal, recoverable, warning, info) + - Error context (stream name, device, timestamp) + +### 6.2 Error recovery + +- **What could go wrong?** + - Retry loop never succeeds (hammering resources) + - Recovery makes things worse + - User not informed what's happening +- **Testing & mitigation** + - Max retry limits with exponential backoff + - Recovery strategies per error type (some need manual intervention) + - User-facing messages with actionable next steps + +--- + +## Layer 7: Admin UI + +### 7.1 Dashboard + +- **What could go wrong?** + - UI shows stale state (not synced to backend) + - Controls don't work (but don't show error) + - Overwhelming amount of information +- **Testing & mitigation** + - Real-time state sync (Socket.io or polling) + - Optimistic UI updates with rollback on failure + - Progressive disclosure (show essentials, details on demand) + +### 7.2 Stream controls + +- **What could go wrong?** + - Click "Start" but nothing happens + - Click "Stop" and it asks "Are you sure?" every time + - No indication of what's happening (starting... started... 
failed) +- **Testing & mitigation** + - Loading states for all async actions + - Clear success/error feedback + - Keyboard shortcuts for power users + - Bulk operations (start all, stop all) + +--- + +## Layer 8: Listener UI + +### 8.1 Accessibility + +- **What could go wrong?** + - Not keyboard navigable + - No screen reader support + - Poor contrast + - Small touch targets on mobile +- **Testing & mitigation** + - WCAG 2.1 AA compliance + - Keyboard navigation audit + - Screen reader testing + - Touch target minimum 44×44 px + +### 8.2 Usability + +- **What could go wrong?** + - Too much information (what do I click?) + - No volume control + - Can't tell which stream is which + - No error state ("Why isn't it playing?") +- **Testing & mitigation** + - Simple primary action: "Play stream" + - Stream metadata (name, source, listener count) + - Clear error messages with next steps + - Mobile-first responsive design + +--- + +## Refactoring path + +1. **Layer 1** — Solid environment detection and validation. Everything else depends on this. +2. **Layer 2 (Lifecycle)** — Clean up how streams are created/started/stopped. Core orchestration. +3. **Layer 6 (Errors)** — Build the diagnostic system alongside each layer, not after. +4. **Layer 3 (Audio pipeline)** — Once FFmpeg is reliable, make the audio flow reliable. +5. **Layers 4, 7, 8 (UI/UX)** — Once the backend is solid, improve the UI thoughtfully. + +--- + +## Scope note: internet streaming + +Building internet streaming from scratch is reinventing the wheel. For streaming to the world, use: + +- YouTube Live (RTMP) +- Twitch (RTMP) +- Restream.io / Castr (multi-platform) + +**LANStreamer's strength is local, low-latency, private streaming. Lean into that.**