Talkie is a speech recognition application that transcribes audio input and simulates keyboard events to inject text into the active window. It runs continuously in the background with a Tk-based control interface.
The application monitors microphone input, performs voice activity detection, transcribes speech using configurable recognition engines, applies grammar error correction (punctuation, capitalization, homophones), and types the results via the Linux uinput subsystem.
- Real-time audio transcription
- Multiple speech recognition engines (Vosk, Sherpa-ONNX, Faster-Whisper)
- Voice activity detection with configurable threshold
- Grammar error correction (GEC) with Intel NPU acceleration
- Punctuation and capitalization restoration
- Homophone correction (their/there/they're, etc.)
- Keyboard event simulation via uinput
- Text preprocessing (punctuation commands, number conversion)
- External control via file-based IPC
- Persistent JSON configuration
- Single-instance enforcement
- Feedback logging for STT correction learning
src/
├── talkie.tcl # Main application entry point
├── talkie.sh # Startup script (handles OpenVINO paths, CLI)
├── config.tcl # Configuration management
├── engine.tcl # Audio capture + speech processing workers
├── audio.tcl # Result display, transcription state, device enumeration
├── worker.tcl # Reusable worker thread abstraction
├── output.tcl # Keyboard output (worker thread)
├── gec_worker.tcl # GEC pipeline (worker thread)
├── textproc.tcl # Text preprocessing and voice commands
├── coprocess.tcl # External engine communication
├── ui-layout.tcl # Tk interface
├── feedback.tcl # Unified feedback logging for correction learning
├── vosk.tcl # Vosk engine bindings
├── gec/ # Grammar Error Correction
│ ├── gec.tcl # OpenVINO critcl bindings (C code)
│ ├── pipeline.tcl # GEC pipeline orchestration
│ ├── punctcap.tcl # Punctuation and capitalization module
│ ├── homophone.tcl # Homophone correction module
│ ├── grammar.tcl # Grammar correction (T5-based)
│ └── tokens.tcl # BERT vocabulary constants
├── pa/ # PortAudio critcl bindings
├── audio/ # Audio energy calculation critcl bindings
├── vosk/ # Vosk critcl bindings
├── uinput/ # uinput critcl bindings
└── engines/ # External engine wrappers (Sherpa, Faster-Whisper)
Audio processing is fully decoupled from the main thread through a multi-worker architecture:
┌─────────────────────────────────────────────────────────────────┐
│                           Main Thread                           │
│  ┌──────────────────────┐  ┌─────────────────────────────────┐  │
│  │   Tk GUI (5Hz)       │  │   Result Display                │  │
│  │   - Controls         │  │   - final_text(), partial_text()│  │
│  │   - Audio level bar  │  │   - Timing info display         │  │
│  └──────────────────────┘  └─────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
        ▲                                ▲
        │ thread::send -async            │ thread::send -async
        │ (UI updates)                   │ (display notifications)
        │                                │
┌───────┴───────────────┐  ┌─────────────┴───────────────────────┐
│   Audio Worker        │  │   GEC Worker                        │
│  ┌─────────────────┐  │  │  ┌───────────────────────────────┐  │
│  │ PortAudio       │──┼──│─▶│ Homophone Correction (ELECTRA)│  │
│  │ Callbacks (40Hz)│  │  │  │ Punctuation/Caps (DistilBERT) │  │
│  └─────────────────┘  │  │  │ Grammar (T5, optional)        │  │
└───────────────────────┘  │  └───────────────┬───────────────┘  │
            │              └──────────────────┼──────────────────┘
            │ thread::send -async             │ thread::send -async
            ▼                                 ▼
┌────────────────────────────┐  ┌─────────────────────────────────┐
│   Processing Worker        │  │   Output Worker                 │
│  ┌──────────────────────┐  │  │  ┌───────────────────────────┐  │
│  │ VAD (fixed threshold)│  │  │  │ uinput Keyboard           │  │
│  │ Vosk Recognition     │──┼──│  │ Simulation                │  │
│  │ (or coprocess)       │  │  │  └───────────────────────────┘  │
│  └──────────────────────┘  │  └─────────────────────────────────┘
└────────────────────────────┘
Pipeline: Audio → Processing → GEC → Output
└──▶ Main (display)
Data Flow:
- Audio Worker: PortAudio delivers 25ms chunks, queues to Processing (never blocks)
- Processing Worker: VAD threshold detection + speech recognition
- GEC Worker: Grammar correction via OpenVINO (Intel NPU accelerated)
- Output Worker: Keyboard simulation via uinput
- Main Thread: GUI updates throttled to 5Hz
talkie.tcl: Application initialization, single-instance enforcement, module loading
talkie.sh: Startup script that sets up OpenVINO/NPU library paths and provides CLI commands
config.tcl: JSON configuration file management (~/.talkie.conf), file watching for external state changes (~/.talkie), variable traces for hot-swapping engines/devices
engine.tcl: Creates two worker threads - Audio Worker (captures audio, queues to processing) and Processing Worker (VAD, speech recognition). Includes health monitoring to detect frozen audio streams.
audio.tcl: Display callbacks for results, transcription state management, audio device enumeration
gec_worker.tcl: Dedicated worker thread for grammar error correction pipeline. Receives final results from Processing, sends corrected text to Output.
worker.tcl: Reusable worker thread abstraction using Tcl Thread package. Provides create, send, send_async, exists, destroy operations.
output.tcl: Keyboard simulation via uinput on dedicated worker thread. Async text output to avoid blocking other threads.
gec/: Grammar Error Correction using OpenVINO for neural inference (Intel NPU accelerated):
- gec.tcl - OpenVINO critcl bindings (C code)
- pipeline.tcl - GEC orchestration
- punctcap.tcl - DistilBERT for punctuation/capitalization
- homophone.tcl - ELECTRA for homophone correction
feedback.tcl: Unified feedback logging to ~/.config/talkie/feedback.jsonl. Captures GEC corrections and text injections.
textproc.tcl: Punctuation command processing, number-to-digit conversion
ui-layout.tcl: Tk GUI with transcription controls, real-time displays (5Hz updates), parameter adjustment
- Linux kernel with uinput support
- Tcl/Tk 8.6 or later
- PortAudio
- User must be a member of the input group for uinput access
- Intel CPU with NPU (e.g., Core Ultra series) - optional but recommended
- OpenVINO (built from source with NPU support)
- Intel NPU driver (linux-npu-driver)
- Tk - GUI framework
- Thread - Worker thread management
- json - JSON parsing/generation
- jbr::unix - Unix utilities
- jbr::filewatch - File monitoring
- pa - PortAudio bindings (critcl)
- audio - Audio energy calculation (critcl)
- uinput - Keyboard simulation (critcl)
- vosk - Vosk speech engine (critcl)
- gec - OpenVINO inference bindings (critcl)
Download and place in models/ directory:
- Vosk: models/vosk/vosk-model-en-us-0.22-lgraph
- Sherpa-ONNX: models/sherpa-onnx/ (streaming models)
- Faster-Whisper: models/faster-whisper/ (CTranslate2 models)
Place in models/gec/:
- distilbert-punct-cap.onnx - Punctuation and capitalization
- electra-small-generator.onnx - Homophone correction
cd src
make build

This compiles the PortAudio, audio processing, uinput, and Vosk critcl packages.
# Load uinput kernel module
sudo modprobe uinput
# Add permanent loading (optional)
echo "uinput" | sudo tee /etc/modules-load.d/uinput.conf
# Add user to input group
sudo usermod -a -G input $USER
# Log out and back in for group membership to take effect

Download the appropriate model files for your chosen engine and place them in the models/ directory.
For Vosk:
mkdir -p models/vosk
cd models/vosk
wget https://alphacephei.com/vosk/models/vosk-model-en-us-0.22-lgraph.zip
unzip vosk-model-en-us-0.22-lgraph.zip

cd src
./talkie.sh

The GUI window will appear. Only one instance can run at a time; additional launches will raise the existing window.
The startup script automatically configures OpenVINO library paths for GEC inference and pins to P-cores on Intel hybrid CPUs.
./talkie.sh start # Enable transcription (and mute audio if slim available)
./talkie.sh stop # Disable transcription (and unmute audio)
./talkie.sh toggle # Toggle transcription state
./talkie.sh state # Display current state as JSON
./talkie.sh --help # Show help

Transcription state can be controlled by modifying ~/.talkie:
echo '{"transcribing": true}' > ~/.talkie # Start transcription
echo '{"transcribing": false}' > ~/.talkie # Stop transcription

The application monitors this file and updates state within 500ms.
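The echo commands above can be wrapped in a small function so that a script cannot write anything other than the two documented states. This is a minimal sketch that assumes only the file format shown above; the name `talkie_set_state` and its optional path argument are hypothetical and not part of Talkie.

```shell
# Hypothetical helper (not part of Talkie): write the transcription state
# to the control file. Arguments: "true" or "false", then an optional path
# (defaults to ~/.talkie).
talkie_set_state() {
    state=$1
    file=${2:-"$HOME/.talkie"}
    case $state in
        true|false) ;;
        *)
            echo "usage: talkie_set_state true|false [file]" >&2
            return 2
            ;;
    esac
    # Same JSON document as the echo commands above.
    printf '{"transcribing": %s}\n' "$state" > "$file"
}
```

Calling `talkie_set_state true` would then start transcription and `talkie_set_state false` would stop it, subject to the same 500ms polling delay.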
During transcription, speak these commands to insert punctuation:
- "period" → .
- "comma" → ,
- "question mark" → ?
- "exclamation mark" → !
- "colon" → :
- "semicolon" → ;
- "new line" → \n
- "new paragraph" → \n\n
Spoken numbers are converted to digits: "twenty five" → "25"
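The punctuation commands listed above amount to a substitution table applied to the recognized text. The sketch below illustrates that idea with sed; it is not how textproc.tcl is implemented, the function name `punctuate` is made up for this example, and "new line", "new paragraph", and number conversion are left out.

```shell
# Illustrative only: a naive sed-based stand-in for the punctuation commands.
# A command is replaced when it follows a space and is followed by a space
# or the end of the line, so "periodic" is left alone.
punctuate() {
    sed -E \
        -e 's/ question mark( |$)/?\1/g' \
        -e 's/ exclamation mark( |$)/!\1/g' \
        -e 's/ semicolon( |$)/;\1/g' \
        -e 's/ period( |$)/.\1/g' \
        -e 's/ comma( |$)/,\1/g' \
        -e 's/ colon( |$)/:\1/g'
}
```

For example, `echo "hello comma how are you question mark" | punctuate` prints `hello, how are you?`.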
Configuration file: ~/.talkie.conf (JSON format)
{
"speech_engine": "vosk",
"input_device": "default",
"audio_threshold": 25.0,
"silence_seconds": 0.3,
"min_duration": 0.30,
"lookback_seconds": 0.5,
"spike_suppression_seconds": 0.3,
"confidence_threshold": 100,
"vosk_modelfile": "vosk-model-en-us-0.22-lgraph",
"vosk_beam": 10,
"vosk_lattice": 5,
"gec_homophone": 1,
"gec_punctcap": 1,
"gec_grammar": 0,
"typing_delay_ms": 5
}

speech_engine: Recognition engine ("vosk", "sherpa", or "faster-whisper")
input_device: Audio input device name ("default" or specific device)
audio_threshold: Voice activity detection threshold (0-100). Audio above this level triggers speech detection.
silence_seconds: Silence duration before finalizing utterance (seconds)
min_duration: Minimum speech duration to accept (seconds). Shorter segments are discarded.
lookback_seconds: Pre-speech audio buffer duration (seconds)
spike_suppression_seconds: Cooldown period after speech ends before accepting new segments (prevents noise spikes)
confidence_threshold: Minimum recognition confidence for output (0-400)
vosk_beam: Beam search width for Vosk (higher = more accurate, slower)
vosk_lattice: Lattice beam width for Vosk
gec_homophone: Enable homophone correction (0/1)
gec_punctcap: Enable punctuation and capitalization (0/1)
gec_grammar: Enable T5-based grammar correction (0/1, experimental)
typing_delay_ms: Delay between keystrokes when simulating typing
Sample rate and buffer size are automatically detected from the audio device (~16kHz, 25ms chunks).
All parameters can be adjusted via the GUI or by editing the configuration file directly.
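To show how audio_threshold, silence_seconds, and min_duration interact, the sketch below segments a stream of per-chunk energy levels with a fixed threshold. It is an illustration under stated assumptions, not Talkie's implementation: the function name `vad_segments` is hypothetical, each input line stands for one 25ms chunk, trailing silence is assumed not to count toward min_duration, and lookback and spike suppression are ignored.

```shell
# Illustrative sketch, not taken from engine.tcl. Reads one energy level per
# line (one line per 25 ms chunk) and prints "first last" chunk numbers for
# each accepted speech segment.
# Arguments: threshold, silence_seconds, min_duration (defaults 25, 0.3, 0.3).
vad_segments() {
    awk -v threshold="${1:-25}" -v silence_s="${2:-0.3}" -v min_s="${3:-0.3}" '
    function finish() {
        # Trailing silence is not counted toward the speech duration.
        if (last_loud - start + 1 >= min_chunks)
            print start, last_loud
        active = 0
    }
    BEGIN {
        chunk_s = 0.025
        # Convert durations to whole chunk counts to avoid float comparisons.
        silence_chunks = int(silence_s / chunk_s + 0.5)
        min_chunks = int(min_s / chunk_s + 0.5)
    }
    {
        if ($1 + 0 > threshold + 0) {
            if (!active) { active = 1; start = NR }
            last_loud = NR
        } else if (active && NR - last_loud >= silence_chunks) {
            finish()
        }
    }
    END { if (active) finish() }
    '
}
```

With the default values, 12 consecutive quiet chunks (0.3s) close a segment, and a segment with fewer than 12 loud-to-loud chunks is discarded as too short.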
Talkie logs events to ~/.config/talkie/feedback.jsonl in JSON Lines format for analyzing STT accuracy.
| Type | Description | Fields |
|---|---|---|
| gec | GEC correction applied | input, output |
| inject | Text sent to uinput | text |
{"ts":1705500000000,"type":"gec","input":"their going","output":"they're going"}
{"ts":1705500000050,"type":"inject","text":"they're going"}

View GEC corrections:
jq 'select(.type == "gec")' ~/.config/talkie/feedback.jsonl

- Sample Rate: 16kHz (detected from device)
- Chunk Size: 25ms (~400 frames at 16kHz)
- Callback Rate: 40Hz on audio worker thread
- VAD: Fixed threshold with spike suppression
- Lookback: Configurable pre-speech audio buffering (default 0.5s)
- Homophone correction: 20-50ms per phrase
- Punctuation/capitalization: 8-15ms per phrase
- Total GEC: 30-65ms per phrase
- Decoupled Audio: Audio capture never blocks on recognition
- Pipeline Architecture: Audio → Processing → GEC → Output
- UI Responsiveness: GUI updates throttled to 5Hz
- Health Monitoring: Automatic restart of frozen audio streams
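The audio figures in this list follow from the sample rate and chunk length. The arithmetic can be reproduced for other devices with a few lines of shell; `chunk_stats` is a hypothetical name used only for this example, and the 500ms default is the documented lookback of 0.5s.

```shell
# Derive per-chunk figures from a sample rate (Hz), a chunk length (ms),
# and a lookback duration (ms), using integer arithmetic.
chunk_stats() {
    rate=${1:-16000}
    chunk_ms=${2:-25}
    lookback_ms=${3:-500}
    echo "frames_per_chunk=$((rate * chunk_ms / 1000))"
    echo "callbacks_per_second=$((1000 / chunk_ms))"
    echo "lookback_chunks=$((lookback_ms / chunk_ms))"
}
```

With the defaults this reports 400 frames per chunk, 40 callbacks per second, and 20 chunks of lookback.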
cd src
make build # Build all critcl packages

Individual packages:
cd src/pa && make # PortAudio bindings
cd src/audio && make # Audio energy calculation
cd src/uinput && make # Keyboard simulation
cd src/vosk && make # Vosk speech recognition
cd src/gec && make # OpenVINO GEC inference

- Add entry to engine_registry in src/engine.tcl
- For coprocess engines: create wrapper script in src/engines/
- For critcl engines: create package directory with critcl code and Tcl interface
Run the application with console output visible:
cd src
./talkie.sh 2>&1 | tee talkie.log

Debug output shows VAD state, segment timing, and GEC processing times.
ERROR: Cannot write to /dev/uinput
Verify user is in input group and has logged out/in:
groups | grep input

Void Linux: The /dev/uinput device needs group permissions set:
# Quick fix (temporary)
make fix-uinput
# Permanent fix: install runit service
make install-uinput-service

ERROR: /dev/uinput device not found
Load the uinput kernel module:
sudo modprobe uinput

List available audio devices and update configuration:
pactl list sources short # For PulseAudio systems

Verify model path in configuration matches actual model location in models/ directory.
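The two uinput errors described in this section can be told apart before launching Talkie. The function below is a sketch of such a check; `check_uinput` is a hypothetical helper, not a command shipped with Talkie, and it only tests that the device node exists and is writable by the current user.

```shell
# Hypothetical helper: report why uinput output might fail.
# Argument: device path (defaults to /dev/uinput).
check_uinput() {
    dev=${1:-/dev/uinput}
    if [ ! -e "$dev" ]; then
        echo "missing: $dev (try: sudo modprobe uinput)"
        return 1
    fi
    if [ ! -w "$dev" ]; then
        echo "not writable: $dev (check membership of the input group)"
        return 2
    fi
    echo "ok: $dev"
}
```

A return status of 1 corresponds to the "device not found" case and 2 to the "cannot write" case.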
MIT
