Feature/whisper speech recognition #103
Open
pzauner wants to merge 14 commits into palsoftware:main
Conversation
This commit adds the foundational components for Whisper-based offline speech recognition.

Core Components:
- WhisperModel: Enum defining available models (Tiny, Base, Small)
- WhisperRecordBuffer: Audio buffer management for PCM samples
- WhisperRecorder: Audio recording with Voice Activity Detection (VAD)
- WhisperEngine: TensorFlow Lite inference engine for Whisper
- WhisperUtil: Mel spectrogram calculation and token decoding
- WhisperRecognitionManager: High-level API similar to SpeechRecognitionManager

Dependencies:
- TensorFlow Lite 2.15.0 (inference engine)
- TensorFlow Lite Support 0.4.4 (tensor utilities)
- Android VAD WebRTC 2.0.9 (voice activity detection)

Settings Integration:
- Added WhisperSettings to SettingsManager
- getUseWhisper/setUseWhisper for the toggle
- getWhisperModel/setWhisperModel for model selection

Localization:
- Complete EN and DE translations for Whisper UI strings

Technical Details:
- Uses TFLite models from Hugging Face (DocWolle/whisper_tflite_models)
- Supports multilingual (99 languages) and English-only models
- Automatic silence detection with VAD for a seamless UX
- 16 kHz audio sampling, as required by Whisper
- Max 30 s recording duration

Next Steps:
- Model download UI
- Integration into PhysicalKeyboardInputMethodService
- Settings screen for model management

Related to #whisper-integration
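The "16 kHz, max 30 s, PCM" constraints above fix the recorder's buffer math. A minimal sketch of that arithmetic (plain Java for illustration; the project code is Kotlin, and these names are not the PR's actual API):

```java
// Sketch of the buffer math behind "16 kHz sampling, max 30 s":
// Whisper consumes 16 kHz mono PCM16, i.e. 2 bytes per sample.
public class RecordBufferMath {
    static final int SAMPLE_RATE_HZ = 16_000;
    static final int MAX_RECORDING_SECONDS = 30;
    static final int BYTES_PER_SAMPLE = 2; // PCM16

    // Largest buffer the recorder ever needs for one utterance
    static int maxBufferSizeBytes() {
        return SAMPLE_RATE_HZ * MAX_RECORDING_SECONDS * BYTES_PER_SAMPLE;
    }

    // How many seconds of audio a given byte count represents
    static double recordedSeconds(int bufferedBytes) {
        return bufferedBytes / (double) (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE);
    }
}
```

At these rates a full 30 s utterance is 960,000 bytes, which is why a fixed-size buffer is practical here.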
- Created WhisperModelDownloader for Hugging Face model downloads
- Added Speech Recognition settings category (EN/DE)
- Progress tracking for downloads
- Atomic file writes with temp files
- Shared vocab file management

Next: Complete WhisperSettingsScreen UI and integrate into SettingsScreen
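A minimal sketch of the "atomic file writes with temp files" idea from this commit (plain Java; the project is Kotlin, and `writeAtomically` is an illustrative name, not the PR's API): download into a `.tmp` sibling, then rename into place, so a half-written model file is never visible under its final name.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class AtomicWrite {
    static void writeAtomically(File target, byte[] bytes) throws IOException {
        File tmp = new File(target.getParentFile(), target.getName() + ".tmp");
        Files.write(tmp.toPath(), bytes);   // partial writes only ever touch the temp file
        if (!tmp.renameTo(target)) {        // rename is atomic on the same filesystem
            tmp.delete();
            throw new IOException("could not move " + tmp.getName() + " into place");
        }
    }
}
```

If the download is interrupted, only the `.tmp` file is left behind and can be cleaned up or resumed on the next attempt.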
- Fixed VAD initialization (VadWebRTC instead of Vad)
- Removed incorrect listener API (use isSpeech() directly)
- Changed audio buffer from FloatArray to ByteArray (PCM16)
- Aligned with the Whisper+ implementation
- Proper VAD frame size handling (480 samples = 30 ms)
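The frame math behind "480 samples = 30 ms" can be sketched as follows (plain Java for illustration; the project is Kotlin): WebRTC VAD consumes fixed-size frames, and at 16 kHz a 30 ms frame is 480 samples, i.e. 960 bytes of PCM16.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class VadFraming {
    static final int FRAME_SAMPLES = 480;             // 16_000 Hz * 0.030 s
    static final int FRAME_BYTES = FRAME_SAMPLES * 2; // PCM16 = 2 bytes/sample

    // Split a PCM16 buffer into whole VAD frames; a trailing partial
    // frame is dropped because the VAD cannot consume it.
    static List<byte[]> splitIntoFrames(byte[] pcm16) {
        List<byte[]> frames = new ArrayList<>();
        for (int off = 0; off + FRAME_BYTES <= pcm16.length; off += FRAME_BYTES) {
            frames.add(Arrays.copyOfRange(pcm16, off, off + FRAME_BYTES));
        }
        return frames;
    }
}
```

Each frame would then be passed to the VAD's `isSpeech()` check individually.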
- Fix WhisperRecorder.kt to use the correct Vad.builder() API
- Fix WhisperRecordBuffer.kt to use ByteArray instead of FloatArray
- Fix SettingsScreen.kt indentation and brace issues
- Add Speech Recognition navigation to the main settings
- All menu items are now properly visible in settings
- Implement WhisperSettingsScreen with model selection
- Add model download progress tracking
- Implement delete-model functionality
- Add toggle between Google and Whisper recognition
- All strings and translations already in place
- UI shows download status, model size, and description
- Add WhisperRecognitionManager variable to PhysicalKeyboardInputMethodService
- Implement toggle logic between Google and Whisper based on settings
- Update startSpeechRecognition to check the useWhisper setting
- Update stopSpeechRecognition to stop both managers
- Add error toast for Whisper recognition failures
- Maintain consistent UI callbacks for both recognition modes
…wnload
- Add detailed Log.d statements for download progress
- Add Toast notifications for success/error
- Catch and display exceptions in the download coroutine
- Log the Result status from downloadModel
- Improve user feedback for download issues
- Change from whisper_tiny_en.tflite to whisper-tiny-en.tflite
- Change from whisper_base.tflite to whisper-base.tflite
- Change from whisper_small.tflite to whisper-small.tflite
- Use the correct filenames from the DocWolle/whisper_tflite_models repo
- Add comprehensive download logging and error messages
- Fix Toast threading issues (moved to Main thread)
- Add progress indicator in the download button

Fixes 404 errors when downloading models from Hugging Face
CRITICAL FIX: the previous filenames were completely wrong!

Correct filenames from DocWolle/whisper_tflite_models:
- whisper-tiny.en.tflite (with a DOT between tiny and en!)
- whisper-base.TOP_WORLD.tflite (not just whisper-base.tflite!)
- whisper-small.tflite (this one was correct)

Also updated file sizes to match the actual models:
- Tiny: 42 MB (was 75 MB)
- Base TOP_WORLD: 108 MB (was 150 MB)
- Small: 388 MB (was 500 MB)

This should fix the 404 errors when downloading models.
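The filename/size mapping above could live in an enum like the following sketch (plain Java for illustration; the project is Kotlin, the enum itself is hypothetical, and the values are taken from this commit message; the Hugging Face `resolve/main` URL form is a standard direct-download pattern, not quoted from the PR):

```java
public enum ModelFile {
    TINY_EN("whisper-tiny.en.tflite", 42),       // note the dot between "tiny" and "en"
    BASE("whisper-base.TOP_WORLD.tflite", 108),
    SMALL("whisper-small.tflite", 388);

    final String fileName;
    final int sizeMb;

    ModelFile(String fileName, int sizeMb) {
        this.fileName = fileName;
        this.sizeMb = sizeMb;
    }

    // Hugging Face direct-download URL (repo per the PR description)
    String downloadUrl() {
        return "https://huggingface.co/DocWolle/whisper_tflite_models/resolve/main/" + fileName;
    }
}
```

Keeping the filename in one place like this is what prevents the hand-typed-filename drift that caused the 404s.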
- Change from a global isDownloading flag to a per-model downloadingModel
- Show the spinner only on the model being downloaded
- Show the progress bar inline in the model card (always visible during download)
- Display the download percentage next to the progress bar
- Remove the global progress indicator at the bottom
- Only disable the other download buttons while one is active

This fixes the UI issues where all buttons showed spinners and progress was only visible when scrolling.
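The state change above can be sketched as follows (plain Java; the real UI is Kotlin, and the class and method names here are illustrative): track *which* model is downloading rather than a single boolean, and derive both the spinner and the button-enabled state from it.

```java
public class DownloadUiState {
    final String downloadingModel; // null when nothing is downloading

    DownloadUiState(String downloadingModel) {
        this.downloadingModel = downloadingModel;
    }

    // Spinner appears only on the card whose model is downloading
    boolean showsSpinner(String model) {
        return model.equals(downloadingModel);
    }

    // Other models' buttons are disabled while one download is active
    boolean buttonEnabled(String model) {
        return downloadingModel == null || model.equals(downloadingModel);
    }
}
```

Deriving both flags from one field makes the "all buttons show spinners" bug structurally impossible.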
Critical fix for Whisper audio processing:
- Fix incorrect buffer size calculation (it was dividing by 8 unnecessarily)
- Use fold() to calculate the total tensor size from the shape
- Add mel spectrogram size validation and adjustment
- Pad or truncate the mel spectrogram to match the expected tensor size
- Add comprehensive logging for tensor shapes and sizes
- Handle size mismatches gracefully instead of crashing

This fixes the BufferOverflowException at WhisperEngine.kt:150 when trying to process audio for speech recognition. The buffer now correctly allocates expectedInputSize * 4 bytes instead of using the incorrect Float.SIZE_BYTES / 8 formula.
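A minimal sketch of the sizing logic this commit describes (plain Java; WhisperEngine.kt is Kotlin and these helper names are illustrative): fold the tensor shape into an element count, allocate 4 bytes per float, and pad or truncate the mel spectrogram to match instead of overflowing the buffer.

```java
import java.util.Arrays;

public class TensorSizing {
    // The fold() from the commit message: product over all shape dimensions
    static int expectedElementCount(int[] shape) {
        int count = 1;
        for (int dim : shape) count *= dim;
        return count;
    }

    // 4 bytes per float32 element, without the bogus "/ 8"
    static int bufferSizeBytes(int[] shape) {
        return expectedElementCount(shape) * 4;
    }

    // Arrays.copyOf truncates when the input is longer and zero-pads when shorter
    static float[] fitToTensor(float[] mel, int[] shape) {
        return Arrays.copyOf(mel, expectedElementCount(shape));
    }
}
```

For a typical Whisper encoder input shape of [1, 80, 3000] this yields 240,000 elements, i.e. a 960,000-byte buffer.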
- Implement OpenAI Whisper API integration with cloud-based transcription
- Implement OpenRouter Audio API with multiple model support and pricing
- Add local ONNX Whisper support (WIP) with DocWolle models
- Implement comprehensive usage statistics tracking:
  * Total cost tracking (USD)
  * Word count aggregation
  * Words-per-minute (WPM) calculation
  * Per-model breakdown with usage metrics
- Add system-language fallback for automatic locale detection
- Add model compatibility warnings for unsupported variants
- Integrate ONNX Runtime (v1.17.0) for local inference
- Support multiple audio formats (WAV, MP3, ONNX)
- Auto-enable speech engines on selection
- Remove redundant toggles from sub-screens
- Remove the Audio Debug menu and related UI
- Clean up development documentation files

Features:
✅ Three speech recognition engines: Google Stock, OpenAI API, OpenRouter API
✅ Dynamic model listing with real-time pricing from OpenRouter
✅ API key validation with visual feedback
✅ Usage statistics with model-level breakdown
✅ Proper error handling and logging
✅ Graceful fallbacks for missing configurations

Tested with:
- Google Flash models on OpenRouter
- OpenAI Whisper API
- System language detection across locales
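The WPM metric from the usage-statistics list can be sketched like this (plain Java; the project is Kotlin, and this class is illustrative, not the PR's implementation): aggregate word counts and audio duration across transcriptions, then divide.

```java
public class UsageStats {
    int totalWords = 0;
    double totalSeconds = 0.0;

    // Count whitespace-separated words and accumulate audio duration
    void record(String transcript, double seconds) {
        String trimmed = transcript.trim();
        if (!trimmed.isEmpty()) totalWords += trimmed.split("\\s+").length;
        totalSeconds += seconds;
    }

    double wordsPerMinute() {
        return totalSeconds == 0.0 ? 0.0 : totalWords / (totalSeconds / 60.0);
    }
}
```

Dividing by audio duration rather than wall-clock time measures speaking rate, which is the more stable metric across sessions.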
Author:
TODO: proper token calculation for audio; however, it's still dirt cheap and sooo much better than using Android's default speech recognition.
- **Onboarding AI Features**: Add AI Features and Voice Input Button toggles to the tutorial screen for granular control during setup
- **Speech Engine Defaults**: Change the default engine to Google Speech Recognition instead of Whisper
- **Long-Press Mic Button**: Long-pressing the mic button navigates directly to the Speech Recognition settings
- **Error Handling**: Suppress non-critical error toasts (NO_MATCH, SPEECH_TIMEOUT); only show critical errors
- **SpeechRecognizer Lifecycle**: Properly destroy and recreate the SpeechRecognizer after each recognition session to prevent ERROR_CLIENT
- **OpenRouter Audio Filtering**: Filter models to show only those supporting audio input modalities
- **OpenRouter Audio Pricing**:
  - Correctly parse per-token pricing from the API and convert it to a per-million-tokens format
  - Display audio-specific pricing with proper formatting ($/M audio tokens)
  - Fall back to prompt pricing if audio pricing is unavailable
- **Availability Checks**: Show a toast when speech recognition is unavailable on the device
- **Voice Input Button Logic**: Display the mic button when explicitly enabled; show a warning toast when AI Features are disabled
- **Tests**: Add instrumented tests for OpenRouter audio model filtering, pricing extraction, and label formatting
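The pricing conversion and fallback described above can be sketched as follows (plain Java; the project is Kotlin, and the parameter names here are assumptions, not OpenRouter's actual response fields): OpenRouter reports per-token USD prices, and the UI shows $/M tokens, preferring the audio price and falling back to the prompt price.

```java
import java.util.Locale;

public class AudioPricing {
    static double perMillion(double perToken) {
        return perToken * 1_000_000;
    }

    // audioPerToken may be null when the model has no audio-specific price
    static String audioPriceLabel(Double audioPerToken, double promptPerToken) {
        double perToken = (audioPerToken != null) ? audioPerToken : promptPerToken;
        return String.format(Locale.US, "$%.2f/M audio tokens", perMillion(perToken));
    }
}
```

Pinning `Locale.US` keeps the decimal separator a dot regardless of the device locale, which matters for a price label.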
Owner:
I'm sorry, and I thank you for your work and dedication, but I don't think we need this in this phase of the project.
Force-pushed from fe7c448 to 35c9b8e


feat: Complete Whisper speech recognition implementation