
Feature/whisper speech recognition#103

Open
pzauner wants to merge 14 commits into palsoftware:main from pzauner:feature/whisper-speech-recognition

Conversation

@pzauner (Collaborator) commented Jan 4, 2026

feat: Complete Whisper speech recognition implementation

  • Implement OpenAI Whisper API integration with cloud-based transcription
  • Implement OpenRouter Audio API with multiple model support and pricing
  • Add local ONNX Whisper support (WIP) with DocWolle models (not yet working correctly)
  • Implement comprehensive usage statistics tracking:
    • Total cost tracking (USD)
    • Word count aggregation
    • Words-per-minute (WPM) calculation
    • Per-model breakdown with usage metrics
  • Add system language fallback for automatic locale detection
  • Add model compatibility warnings for unsupported variants
  • Integrate ONNX Runtime (v1.17.0) for local inference
  • Auto-enable speech engines on selection
  • Remove redundant toggles from sub-screens
  • Remove Audio Debug menu and related UI
  • Clean up development documentation files
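The usage-statistics tracking listed above (total cost, word count, WPM, per-model breakdown) could look roughly like this minimal sketch; the names `UsageTracker`, `UsageStats`, and `record` are illustrative, not the actual classes in this PR:

```kotlin
// Hypothetical sketch of per-model usage statistics: cost in USD,
// word count, and words-per-minute derived from session durations.
data class UsageStats(
    var totalCostUsd: Double = 0.0,
    var totalWords: Int = 0,
    var totalDurationSec: Double = 0.0,
) {
    // Words per minute aggregated over all recorded sessions.
    val wpm: Double
        get() = if (totalDurationSec > 0) totalWords / (totalDurationSec / 60.0) else 0.0
}

class UsageTracker {
    private val perModel = mutableMapOf<String, UsageStats>()

    fun record(model: String, costUsd: Double, transcript: String, durationSec: Double) {
        val stats = perModel.getOrPut(model) { UsageStats() }
        stats.totalCostUsd += costUsd
        stats.totalWords += transcript.trim().split(Regex("\\s+")).count { it.isNotEmpty() }
        stats.totalDurationSec += durationSec
    }

    fun statsFor(model: String): UsageStats? = perModel[model]
}

fun main() {
    val tracker = UsageTracker()
    tracker.record("whisper-1", 0.006, "hello world from whisper", 6.0)
    println(tracker.statsFor("whisper-1")) // 4 words over 6 s -> 40 WPM
}
```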

Features:
✅ Three speech recognition engines: Google Stock, OpenAI API, OpenRouter API
✅ Dynamic model listing with real-time pricing from OpenRouter
✅ API key validation with visual feedback
✅ Usage statistics with model-level breakdown
✅ Proper error handling and logging
✅ Graceful fallbacks for missing configurations

Tested with:

  • Google Flash models on OpenRouter
  • OpenAI Whisper API

pzauner added 13 commits January 4, 2026 02:22
This commit adds the foundational components for Whisper-based offline speech recognition:

Core Components:
- WhisperModel: Enum defining available models (Tiny, Base, Small)
- WhisperRecordBuffer: Audio buffer management for PCM samples
- WhisperRecorder: Audio recording with Voice Activity Detection (VAD)
- WhisperEngine: TensorFlow Lite inference engine for Whisper
- WhisperUtil: Mel spectrogram calculation and token decoding
- WhisperRecognitionManager: High-level API similar to SpeechRecognitionManager
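The buffer component above (PCM16 samples at 16 kHz, 30 s cap) could be sketched like this; `RecordBuffer` is an illustrative stand-in, not the actual `WhisperRecordBuffer`:

```kotlin
import java.io.ByteArrayOutputStream

// Minimal sketch of a PCM16 record buffer: 16 kHz mono, 30 s cap,
// ByteArray storage (2 bytes per sample), with conversion to the
// normalized floats the mel-spectrogram stage consumes.
class RecordBuffer(
    sampleRate: Int = 16_000,
    maxSeconds: Int = 30,
) {
    private val maxBytes = sampleRate * maxSeconds * 2 // PCM16 = 2 bytes/sample
    private val data = ByteArrayOutputStream()

    // Append a chunk; returns false once the duration cap is reached.
    fun append(chunk: ByteArray): Boolean {
        if (data.size() + chunk.size > maxBytes) return false
        data.write(chunk)
        return true
    }

    // Decode little-endian PCM16 to floats in [-1, 1].
    fun toFloatSamples(): FloatArray {
        val bytes = data.toByteArray()
        return FloatArray(bytes.size / 2) { i ->
            val lo = bytes[2 * i].toInt() and 0xFF
            val hi = bytes[2 * i + 1].toInt()
            ((hi shl 8) or lo) / 32768f
        }
    }
}
```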

Dependencies:
- TensorFlow Lite 2.15.0 (inference engine)
- TensorFlow Lite Support 0.4.4 (tensor utilities)
- Android VAD WebRTC 2.0.9 (voice activity detection)

Settings Integration:
- Added WhisperSettings to SettingsManager
- getUseWhisper/setUseWhisper for toggle
- getWhisperModel/setWhisperModel for model selection

Localization:
- Complete EN and DE translations for Whisper UI strings

Technical Details:
- Uses TFLite models from Hugging Face (DocWolle/whisper_tflite_models)
- Supports multilingual (99 languages) and English-only models
- Automatic silence detection with VAD for seamless UX
- 16kHz audio sampling as required by Whisper
- Max 30s recording duration

Next Steps:
- Model download UI
- Integration into PhysicalKeyboardInputMethodService
- Settings screen for model management

Related to #whisper-integration
- Created WhisperModelDownloader for Hugging Face model downloads
- Added Speech Recognition settings category (EN/DE)
- Progress tracking for downloads
- Atomic file writes with temp files
- Shared vocab file management
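The atomic-write scheme above (stream to a temp file, then rename into place so a partial download never looks like a valid model) can be sketched as follows; the function name and callback shape are illustrative, not the actual `WhisperModelDownloader` API:

```kotlin
import java.io.File
import java.net.URL

// Sketch of an atomic model download: write to "<name>.tmp", then rename.
// Progress is reported as total bytes copied so far.
fun downloadModel(url: String, dest: File, onProgress: (Long) -> Unit): Result<File> = runCatching {
    val tmp = File(dest.parentFile, dest.name + ".tmp")
    URL(url).openStream().use { input ->
        tmp.outputStream().use { output ->
            val buf = ByteArray(8 * 1024)
            var total = 0L
            while (true) {
                val n = input.read(buf)
                if (n < 0) break
                output.write(buf, 0, n)
                total += n
                onProgress(total)
            }
        }
    }
    // Rename is the atomic "commit" step; readers only ever see complete files.
    check(tmp.renameTo(dest)) { "atomic rename failed" }
    dest
}
```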

Next: Complete WhisperSettingsScreen UI and integrate into SettingsScreen
- Fixed VAD initialization (VadWebRTC instead of Vad)
- Removed incorrect listener API (use isSpeech() directly)
- Changed audio buffer from FloatArray to ByteArray (PCM16)
- Aligned with Whisper+ implementation
- Proper VAD frame size handling (480 samples = 30ms)
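The frame-size handling above comes down to feeding the VAD exact fixed-size frames: 480 samples = 30 ms at 16 kHz, i.e. 960 bytes of PCM16. A minimal sketch of that chunking (the per-frame speech check itself lives in the VAD library):

```kotlin
// Fixed-size VAD framing: WebRTC VAD expects exact frames.
const val SAMPLE_RATE = 16_000
const val FRAME_SAMPLES = 480             // 30 ms at 16 kHz
const val FRAME_BYTES = FRAME_SAMPLES * 2 // PCM16 = 2 bytes per sample

// Split a PCM16 byte stream into full 30 ms frames; a trailing
// partial frame is dropped rather than passed to the VAD.
fun frames(pcm: ByteArray): List<ByteArray> =
    (0..pcm.size - FRAME_BYTES step FRAME_BYTES)
        .map { pcm.copyOfRange(it, it + FRAME_BYTES) }

fun main() {
    val pcm = ByteArray(FRAME_BYTES * 3 + 100) // 3 full frames + leftover
    println(frames(pcm).size) // 3
}
```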
- Fix WhisperRecorder.kt to use correct Vad.builder() API
- Fix WhisperRecordBuffer.kt to use ByteArray instead of FloatArray
- Fix SettingsScreen.kt indentation and brace issues
- Add Speech Recognition navigation to main settings
- All menu items now properly visible in settings
- Implement WhisperSettingsScreen with model selection
- Add model download progress tracking
- Implement delete model functionality
- Add toggle between Google and Whisper recognition
- All strings and translations already in place
- UI shows download status, model size, and description
- Add WhisperRecognitionManager variable to PhysicalKeyboardInputMethodService
- Implement toggle logic between Google and Whisper based on settings
- Update startSpeechRecognition to check useWhisper setting
- Update stopSpeechRecognition to stop both managers
- Add error toast for Whisper recognition failures
- Maintain consistent UI callbacks for both recognition modes
…wnload

- Add detailed Log.d statements for download progress
- Add Toast notifications for success/error
- Catch and display exceptions in download coroutine
- Log Result status from downloadModel
- Improve user feedback for download issues
- Change from whisper_tiny_en.tflite to whisper-tiny-en.tflite
- Change from whisper_base.tflite to whisper-base.tflite
- Change from whisper_small.tflite to whisper-small.tflite
- Use correct filenames from DocWolle/whisper_tflite_models repo
- Add comprehensive download logging and error messages
- Fix Toast threading issues (moved to Main thread)
- Add progress indicator in download button

Fixes 404 errors when downloading models from Hugging Face
CRITICAL FIX - Previous filenames were completely wrong!

Correct filenames from DocWolle/whisper_tflite_models:
- whisper-tiny.en.tflite (with DOT between tiny and en!)
- whisper-base.TOP_WORLD.tflite (not just whisper-base.tflite!)
- whisper-small.tflite (this one was correct)

Also updated file sizes to match actual models:
- Tiny: 42 MB (was 75 MB)
- Base TOP_WORLD: 108 MB (was 150 MB)
- Small: 388 MB (was 500 MB)

This should fix the 404 errors when downloading models.
- Change from global isDownloading to per-model downloadingModel
- Show spinner only on the model being downloaded
- Show progress bar inline in model card (always visible during download)
- Display download percentage next to progress bar
- Remove global progress indicator at bottom
- Only disable other download buttons while one is active

This fixes the UI issues where all buttons showed spinners
and progress was only visible when scrolling.
Critical fix for Whisper audio processing:

- Fix incorrect buffer size calculation (was dividing by 8 unnecessarily)
- Use fold() to calculate total tensor size from shape
- Add mel spectrogram size validation and adjustment
- Pad or truncate mel spectrogram to match expected tensor size
- Add comprehensive logging for tensor shapes and sizes
- Handle size mismatches gracefully instead of crashing

This fixes the BufferOverflowException at WhisperEngine.kt:150
when trying to process audio for speech recognition.

The buffer now correctly allocates expectedInputSize * 4 bytes
instead of using the incorrect Float.SIZE_BYTES / 8 formula.
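The fix described above can be sketched as follows: fold the tensor shape to get the element count, allocate `count * 4` bytes (Float32), and pad or truncate the mel spectrogram to fit instead of overflowing. The shape `[1, 80, 3000]` is a typical Whisper mel input, used here as an assumed example:

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Compute the expected element count from the input tensor shape,
// size the buffer as elementCount * 4 bytes (Float32), and adjust the
// mel spectrogram to fit rather than throwing BufferOverflowException.
fun prepareInput(shape: IntArray, mel: FloatArray): ByteBuffer {
    val expected = shape.fold(1) { acc, dim -> acc * dim } // e.g. [1, 80, 3000] -> 240000
    // copyOf both truncates (when too long) and zero-pads (when too short).
    val adjusted = if (mel.size != expected) mel.copyOf(expected) else mel
    val buffer = ByteBuffer.allocateDirect(expected * 4).order(ByteOrder.nativeOrder())
    adjusted.forEach { buffer.putFloat(it) }
    buffer.rewind()
    return buffer
}
```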
- Implement OpenAI Whisper API integration with cloud-based transcription
- Implement OpenRouter Audio API with multiple model support and pricing
- Add local ONNX Whisper support (WIP) with DocWolle models
- Implement comprehensive usage statistics tracking:
  * Total cost tracking (USD)
  * Word count aggregation
  * Words-per-minute (WPM) calculation
  * Per-model breakdown with usage metrics
- Add system language fallback for automatic locale detection
- Add model compatibility warnings for unsupported variants
- Integrate ONNX Runtime (v1.17.0) for local inference
- Support multiple audio formats (WAV, MP3, ONNX)
- Auto-enable speech engines on selection
- Remove redundant toggles from sub-screens
- Remove Audio Debug menu and related UI
- Clean up development documentation files

Features:
✅ Three speech recognition engines: Google Stock, OpenAI API, OpenRouter API
✅ Dynamic model listing with real-time pricing from OpenRouter
✅ API key validation with visual feedback
✅ Usage statistics with model-level breakdown
✅ Proper error handling and logging
✅ Graceful fallbacks for missing configurations

Tested with:
- Google Flash models on OpenRouter
- OpenAI Whisper API
- System language detection across locales
@pzauner (Collaborator, Author) commented Jan 4, 2026

#98

@pzauner (Collaborator, Author) commented Jan 5, 2026

(Screenshots: Screenshot_20260105-004432, Screenshot_20260105-004526)

Language selection is only used as a fallback; the models are multilingual by default.

@pzauner (Collaborator, Author) commented Jan 5, 2026

TODO: proper token accounting for audio.
We currently bill at the normal input-token rate, but audio input is priced differently:
e.g. Flash 2.5 is $0.30/M for normal input tokens vs. $1/M for audio tokens.

Still dirt cheap, and so much better than Android's default speech recognition.
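For a sense of scale, here is the arithmetic using the rates quoted above; the 50,000 audio tokens per session is a made-up illustrative figure:

```kotlin
// Cost for N tokens at a given $/M-tokens rate.
fun costUsd(tokens: Long, pricePerMillionUsd: Double): Double =
    tokens * pricePerMillionUsd / 1_000_000.0

fun main() {
    val audioTokens = 50_000L // hypothetical session
    println(costUsd(audioTokens, 0.30)) // billed as normal input tokens
    println(costUsd(audioTokens, 1.00)) // billed at the audio-token rate
}
```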

- **Onboarding AI Features**: Add AI Features and Voice Input Button toggles to tutorial screen for granular control during setup
- **Speech Engine Defaults**: Change default engine to Google Speech Recognition instead of Whisper
- **Long-Press Mic Button**: Long-press mic button navigates directly to Speech Recognition Settings
- **Error Handling**: Suppress non-critical error toasts (NO_MATCH, SPEECH_TIMEOUT); only show critical errors
- **SpeechRecognizer Lifecycle**: Properly destroy and recreate SpeechRecognizer after each recognition session to prevent ERROR_CLIENT
- **OpenRouter Audio Filtering**: Filter models to show only those supporting audio input modalities
- **OpenRouter Audio Pricing**:
  - Correctly parse per-token pricing from API and convert to per-million-tokens format
  - Display audio-specific pricing with proper formatting ($/M audio tokens)
  - Fallback to prompt pricing if audio pricing unavailable
- **Availability Checks**: Show toast when speech recognition unavailable on device
- **Voice Input Button Logic**: Display mic button when explicitly enabled, show warning toast when AI Features disabled
- **Tests**: Add instrumented tests for OpenRouter audio model filtering, pricing extraction, and label formatting
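The pricing conversion described above (per-token decimal strings from the API, displayed as $/M tokens, with a prompt-price fallback) could be sketched like this; the `Pricing` field names are illustrative, not the exact OpenRouter response schema:

```kotlin
import java.util.Locale

// Per-token prices arrive as decimal strings; the UI wants $/M tokens.
// Fall back to the prompt price when no audio price is available.
data class Pricing(val prompt: String?, val audio: String?)

fun perMillionLabel(p: Pricing): String {
    val perToken = (p.audio ?: p.prompt)?.toDoubleOrNull() ?: return "n/a"
    val perMillion = perToken * 1_000_000
    return String.format(Locale.US, "\$%.2f/M audio tokens", perMillion)
}

fun main() {
    println(perMillionLabel(Pricing(prompt = "0.0000003", audio = "0.000001")))
    println(perMillionLabel(Pricing(prompt = "0.0000003", audio = null))) // fallback
}
```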
@palsoftware (Owner) commented:

I'm sorry, and I thank you for your work and dedication, but I don't think we need this in this phase of the project.

@pzauner pzauner force-pushed the main branch 10 times, most recently from fe7c448 to 35c9b8e on March 6, 2026 at 21:09