
Feature/whisper speech recognition#103

Open
pzauner wants to merge 14 commits into palsoftware:main from pzauner:feature/whisper-speech-recognition

Conversation

@pzauner (Collaborator) commented Jan 4, 2026

feat: Complete Whisper speech recognition implementation

  • Implement OpenAI Whisper API integration with cloud-based transcription
  • Implement OpenRouter Audio API with multiple model support and pricing
  • Add local ONNX Whisper support (WIP) with DocWolle models (not yet working correctly)
  • Implement comprehensive usage statistics tracking:
    • Total cost tracking (USD)
    • Word count aggregation
    • Words-per-minute (WPM) calculation
    • Per-model breakdown with usage metrics
  • Add system language fallback for automatic locale detection
  • Add model compatibility warnings for unsupported variants
  • Integrate ONNX Runtime (v1.17.0) for local inference
  • Auto-enable speech engines on selection
  • Remove redundant toggles from sub-screens
  • Remove Audio Debug menu and related UI
  • Clean up development documentation files
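The usage-statistics tracking listed above (total cost, word count, WPM, per-model breakdown) could look roughly like this minimal sketch; the names `UsageTracker`, `UsageStats`, and `record` are illustrative, not the actual classes in this PR:

```kotlin
// Hypothetical sketch of per-model usage statistics: cost in USD,
// word count, and words-per-minute derived from session durations.
data class UsageStats(
    var totalCostUsd: Double = 0.0,
    var totalWords: Int = 0,
    var totalDurationSec: Double = 0.0,
) {
    // Words per minute aggregated over all recorded sessions.
    val wpm: Double
        get() = if (totalDurationSec > 0) totalWords / (totalDurationSec / 60.0) else 0.0
}

class UsageTracker {
    private val perModel = mutableMapOf<String, UsageStats>()

    fun record(model: String, costUsd: Double, transcript: String, durationSec: Double) {
        val stats = perModel.getOrPut(model) { UsageStats() }
        stats.totalCostUsd += costUsd
        stats.totalWords += transcript.trim().split(Regex("\\s+")).count { it.isNotEmpty() }
        stats.totalDurationSec += durationSec
    }

    fun statsFor(model: String): UsageStats? = perModel[model]
}

fun main() {
    val tracker = UsageTracker()
    tracker.record("whisper-1", 0.006, "hello world from whisper", 6.0)
    println(tracker.statsFor("whisper-1")) // 4 words over 6 s -> 40 WPM
}
```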

Features:
✅ Three speech recognition engines: Google Stock, OpenAI API, OpenRouter API
✅ Dynamic model listing with real-time pricing from OpenRouter
✅ API key validation with visual feedback
✅ Usage statistics with model-level breakdown
✅ Proper error handling and logging
✅ Graceful fallbacks for missing configurations

Tested with:

  • Google Flash models on OpenRouter
  • OpenAI Whisper API

pzauner added 13 commits January 4, 2026 02:22
This commit adds the foundational components for Whisper-based offline speech recognition:

Core Components:
- WhisperModel: Enum defining available models (Tiny, Base, Small)
- WhisperRecordBuffer: Audio buffer management for PCM samples
- WhisperRecorder: Audio recording with Voice Activity Detection (VAD)
- WhisperEngine: TensorFlow Lite inference engine for Whisper
- WhisperUtil: Mel spectrogram calculation and token decoding
- WhisperRecognitionManager: High-level API similar to SpeechRecognitionManager
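The buffer component above (PCM16 samples at 16 kHz, 30 s cap) could be sketched like this; `RecordBuffer` is an illustrative stand-in, not the actual `WhisperRecordBuffer`:

```kotlin
import java.io.ByteArrayOutputStream

// Minimal sketch of a PCM16 record buffer: 16 kHz mono, 30 s cap,
// ByteArray storage (2 bytes per sample), with conversion to the
// normalized floats the mel-spectrogram stage consumes.
class RecordBuffer(
    sampleRate: Int = 16_000,
    maxSeconds: Int = 30,
) {
    private val maxBytes = sampleRate * maxSeconds * 2 // PCM16 = 2 bytes/sample
    private val data = ByteArrayOutputStream()

    // Append a chunk; returns false once the duration cap is reached.
    fun append(chunk: ByteArray): Boolean {
        if (data.size() + chunk.size > maxBytes) return false
        data.write(chunk)
        return true
    }

    // Decode little-endian PCM16 to floats in [-1, 1].
    fun toFloatSamples(): FloatArray {
        val bytes = data.toByteArray()
        return FloatArray(bytes.size / 2) { i ->
            val lo = bytes[2 * i].toInt() and 0xFF
            val hi = bytes[2 * i + 1].toInt()
            ((hi shl 8) or lo) / 32768f
        }
    }
}
```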

Dependencies:
- TensorFlow Lite 2.15.0 (inference engine)
- TensorFlow Lite Support 0.4.4 (tensor utilities)
- Android VAD WebRTC 2.0.9 (voice activity detection)

Settings Integration:
- Added WhisperSettings to SettingsManager
- getUseWhisper/setUseWhisper for toggle
- getWhisperModel/setWhisperModel for model selection

Localization:
- Complete EN and DE translations for Whisper UI strings

Technical Details:
- Uses TFLite models from Hugging Face (DocWolle/whisper_tflite_models)
- Supports multilingual (99 languages) and English-only models
- Automatic silence detection with VAD for seamless UX
- 16kHz audio sampling as required by Whisper
- Max 30s recording duration

Next Steps:
- Model download UI
- Integration into PhysicalKeyboardInputMethodService
- Settings screen for model management

Related to #whisper-integration
- Created WhisperModelDownloader for Hugging Face model downloads
- Added Speech Recognition settings category (EN/DE)
- Progress tracking for downloads
- Atomic file writes with temp files
- Shared vocab file management
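The atomic-write scheme above (stream to a temp file, then rename into place so a partial download never looks like a valid model) can be sketched as follows; the function name and callback shape are illustrative, not the actual `WhisperModelDownloader` API:

```kotlin
import java.io.File
import java.net.URL

// Sketch of an atomic model download: write to "<name>.tmp", then rename.
// Progress is reported as total bytes copied so far.
fun downloadModel(url: String, dest: File, onProgress: (Long) -> Unit): Result<File> = runCatching {
    val tmp = File(dest.parentFile, dest.name + ".tmp")
    URL(url).openStream().use { input ->
        tmp.outputStream().use { output ->
            val buf = ByteArray(8 * 1024)
            var total = 0L
            while (true) {
                val n = input.read(buf)
                if (n < 0) break
                output.write(buf, 0, n)
                total += n
                onProgress(total)
            }
        }
    }
    // Rename is the atomic "commit" step; readers only ever see complete files.
    check(tmp.renameTo(dest)) { "atomic rename failed" }
    dest
}
```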

Next: Complete WhisperSettingsScreen UI and integrate into SettingsScreen
- Fixed VAD initialization (VadWebRTC instead of Vad)
- Removed incorrect listener API (use isSpeech() directly)
- Changed audio buffer from FloatArray to ByteArray (PCM16)
- Aligned with Whisper+ implementation
- Proper VAD frame size handling (480 samples = 30ms)
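The frame-size handling above comes down to feeding the VAD exact fixed-size frames: 480 samples = 30 ms at 16 kHz, i.e. 960 bytes of PCM16. A minimal sketch of that chunking (the per-frame speech check itself lives in the VAD library):

```kotlin
// Fixed-size VAD framing: WebRTC VAD expects exact frames.
const val SAMPLE_RATE = 16_000
const val FRAME_SAMPLES = 480             // 30 ms at 16 kHz
const val FRAME_BYTES = FRAME_SAMPLES * 2 // PCM16 = 2 bytes per sample

// Split a PCM16 byte stream into full 30 ms frames; a trailing
// partial frame is dropped rather than passed to the VAD.
fun frames(pcm: ByteArray): List<ByteArray> =
    (0..pcm.size - FRAME_BYTES step FRAME_BYTES)
        .map { pcm.copyOfRange(it, it + FRAME_BYTES) }

fun main() {
    val pcm = ByteArray(FRAME_BYTES * 3 + 100) // 3 full frames + leftover
    println(frames(pcm).size) // 3
}
```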
- Fix WhisperRecorder.kt to use correct Vad.builder() API
- Fix WhisperRecordBuffer.kt to use ByteArray instead of FloatArray
- Fix SettingsScreen.kt indentation and brace issues
- Add Speech Recognition navigation to main settings
- All menu items now properly visible in settings
- Implement WhisperSettingsScreen with model selection
- Add model download progress tracking
- Implement delete model functionality
- Add toggle between Google and Whisper recognition
- All strings and translations already in place
- UI shows download status, model size, and description
- Add WhisperRecognitionManager variable to PhysicalKeyboardInputMethodService
- Implement toggle logic between Google and Whisper based on settings
- Update startSpeechRecognition to check useWhisper setting
- Update stopSpeechRecognition to stop both managers
- Add error toast for Whisper recognition failures
- Maintain consistent UI callbacks for both recognition modes
…wnload

- Add detailed Log.d statements for download progress
- Add Toast notifications for success/error
- Catch and display exceptions in download coroutine
- Log Result status from downloadModel
- Improve user feedback for download issues
- Change from whisper_tiny_en.tflite to whisper-tiny-en.tflite
- Change from whisper_base.tflite to whisper-base.tflite
- Change from whisper_small.tflite to whisper-small.tflite
- Use correct filenames from DocWolle/whisper_tflite_models repo
- Add comprehensive download logging and error messages
- Fix Toast threading issues (moved to Main thread)
- Add progress indicator in download button

Fixes 404 errors when downloading models from Hugging Face
CRITICAL FIX - Previous filenames were completely wrong!

Correct filenames from DocWolle/whisper_tflite_models:
- whisper-tiny.en.tflite (with DOT between tiny and en!)
- whisper-base.TOP_WORLD.tflite (not just whisper-base.tflite!)
- whisper-small.tflite (this one was correct)

Also updated file sizes to match actual models:
- Tiny: 42 MB (was 75 MB)
- Base TOP_WORLD: 108 MB (was 150 MB)
- Small: 388 MB (was 500 MB)

This should fix the 404 errors when downloading models.
- Change from global isDownloading to per-model downloadingModel
- Show spinner only on the model being downloaded
- Show progress bar inline in model card (always visible during download)
- Display download percentage next to progress bar
- Remove global progress indicator at bottom
- Only disable other download buttons while one is active

This fixes the UI issues where all buttons showed spinners
and progress was only visible when scrolling.
Critical fix for Whisper audio processing:

- Fix incorrect buffer size calculation (was dividing by 8 unnecessarily)
- Use fold() to calculate total tensor size from shape
- Add mel spectrogram size validation and adjustment
- Pad or truncate mel spectrogram to match expected tensor size
- Add comprehensive logging for tensor shapes and sizes
- Handle size mismatches gracefully instead of crashing

This fixes the BufferOverflowException at WhisperEngine.kt:150
when trying to process audio for speech recognition.

The buffer now correctly allocates expectedInputSize * 4 bytes
instead of using the incorrect Float.SIZE_BYTES / 8 formula.
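The fix described above can be sketched as follows: fold the tensor shape to get the element count, allocate `count * 4` bytes (Float32), and pad or truncate the mel spectrogram to fit instead of overflowing. The shape `[1, 80, 3000]` is a typical Whisper mel input, used here as an assumed example:

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Compute the expected element count from the input tensor shape,
// size the buffer as elementCount * 4 bytes (Float32), and adjust the
// mel spectrogram to fit rather than throwing BufferOverflowException.
fun prepareInput(shape: IntArray, mel: FloatArray): ByteBuffer {
    val expected = shape.fold(1) { acc, dim -> acc * dim } // e.g. [1, 80, 3000] -> 240000
    // copyOf both truncates (when too long) and zero-pads (when too short).
    val adjusted = if (mel.size != expected) mel.copyOf(expected) else mel
    val buffer = ByteBuffer.allocateDirect(expected * 4).order(ByteOrder.nativeOrder())
    adjusted.forEach { buffer.putFloat(it) }
    buffer.rewind()
    return buffer
}
```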
- Implement OpenAI Whisper API integration with cloud-based transcription
- Implement OpenRouter Audio API with multiple model support and pricing
- Add local ONNX Whisper support (WIP) with DocWolle models
- Implement comprehensive usage statistics tracking:
  * Total cost tracking (USD)
  * Word count aggregation
  * Words-per-minute (WPM) calculation
  * Per-model breakdown with usage metrics
- Add system language fallback for automatic locale detection
- Add model compatibility warnings for unsupported variants
- Integrate ONNX Runtime (v1.17.0) for local inference
- Support multiple audio formats (WAV, MP3, ONNX)
- Auto-enable speech engines on selection
- Remove redundant toggles from sub-screens
- Remove Audio Debug menu and related UI
- Clean up development documentation files

Features:
✅ Three speech recognition engines: Google Stock, OpenAI API, OpenRouter API
✅ Dynamic model listing with real-time pricing from OpenRouter
✅ API key validation with visual feedback
✅ Usage statistics with model-level breakdown
✅ Proper error handling and logging
✅ Graceful fallbacks for missing configurations

Tested with:
- Google Flash models on OpenRouter
- OpenAI Whisper API
- System language detection across locales
@pzauner (Collaborator, Author) commented Jan 4, 2026

#98

@pzauner (Collaborator, Author) commented Jan 5, 2026

(Screenshots: Screenshot_20260105-004432, Screenshot_20260105-004526)

Language selection is only used as a fallback; the models are multilingual by default.

@pzauner (Collaborator, Author) commented Jan 5, 2026

TODO: proper token accounting for audio.
We currently bill at the normal input-token rate, but audio input is priced differently:
e.g. Flash 2.5 is $0.30/M for normal input tokens vs. $1/M for audio tokens.

Still dirt cheap, and so much better than Android's default speech recognition.
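For a sense of scale, here is the arithmetic using the rates quoted above; the 50,000 audio tokens per session is a made-up illustrative figure:

```kotlin
// Cost for N tokens at a given $/M-tokens rate.
fun costUsd(tokens: Long, pricePerMillionUsd: Double): Double =
    tokens * pricePerMillionUsd / 1_000_000.0

fun main() {
    val audioTokens = 50_000L // hypothetical session
    println(costUsd(audioTokens, 0.30)) // billed as normal input tokens
    println(costUsd(audioTokens, 1.00)) // billed at the audio-token rate
}
```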

- **Onboarding AI Features**: Add AI Features and Voice Input Button toggles to tutorial screen for granular control during setup
- **Speech Engine Defaults**: Change default engine to Google Speech Recognition instead of Whisper
- **Long-Press Mic Button**: Long-press mic button navigates directly to Speech Recognition Settings
- **Error Handling**: Suppress non-critical error toasts (NO_MATCH, SPEECH_TIMEOUT); only show critical errors
- **SpeechRecognizer Lifecycle**: Properly destroy and recreate SpeechRecognizer after each recognition session to prevent ERROR_CLIENT
- **OpenRouter Audio Filtering**: Filter models to show only those supporting audio input modalities
- **OpenRouter Audio Pricing**:
  - Correctly parse per-token pricing from API and convert to per-million-tokens format
  - Display audio-specific pricing with proper formatting ($/M audio tokens)
  - Fallback to prompt pricing if audio pricing unavailable
- **Availability Checks**: Show toast when speech recognition unavailable on device
- **Voice Input Button Logic**: Display mic button when explicitly enabled, show warning toast when AI Features disabled
- **Tests**: Add instrumented tests for OpenRouter audio model filtering, pricing extraction, and label formatting
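The pricing conversion described above (per-token decimal strings from the API, displayed as $/M tokens, with a prompt-price fallback) could be sketched like this; the `Pricing` field names are illustrative, not the exact OpenRouter response schema:

```kotlin
import java.util.Locale

// Per-token prices arrive as decimal strings; the UI wants $/M tokens.
// Fall back to the prompt price when no audio price is available.
data class Pricing(val prompt: String?, val audio: String?)

fun perMillionLabel(p: Pricing): String {
    val perToken = (p.audio ?: p.prompt)?.toDoubleOrNull() ?: return "n/a"
    val perMillion = perToken * 1_000_000
    return String.format(Locale.US, "\$%.2f/M audio tokens", perMillion)
}

fun main() {
    println(perMillionLabel(Pricing(prompt = "0.0000003", audio = "0.000001")))
    println(perMillionLabel(Pricing(prompt = "0.0000003", audio = null))) // fallback
}
```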
@palsoftware (Owner) commented:

I'm sorry, and I thank you for your work and dedication, but I don't think we need this in this phase of the project.

@pzauner pzauner force-pushed the main branch 10 times, most recently from fe7c448 to 35c9b8e on March 6, 2026 at 21:09