secforge/WhisperTray

WhisperTray

A Windows system-tray application for real-time speech-to-text dictation, powered by OpenAI's Whisper running entirely on the GPU via DirectCompute.

Press a hotkey, speak, and the transcribed text is typed into whatever window has focus — as if you had typed it with the keyboard.

Acknowledgments

The GPU inference engine is based on Const-me/Whisper by Konstantin, a Windows-native port of whisper.cpp that replaces CUDA/GGML with DirectCompute (D3D 11.0). All credit for the core inference engine, the DirectCompute shader pipeline, and the COM-lite interface layer goes to Konstantin's excellent work.

The WhisperTray application, the build scripts, and various cleanups are original to this repository.

Features

  • Global hotkey (Ctrl+Shift+Space by default) to start/stop capture
  • System tray icon with color-coded status (gray = idle, gold = loading, green = listening)
  • Automatic voice activity detection — recognizes speech segments and pauses
  • Hallucination filtering — strips phantom tokens like [BLANK_AUDIO] and (music)
  • Direct text injection via SendInput — works in any application, full Unicode support
  • GPU-accelerated inference via DirectCompute — no CUDA required, works on any D3D 11.0 GPU (NVIDIA, AMD, Intel)
  • Configurable microphone, language, model, hotkey, and timing via INI settings
  • No .NET runtime required — pure C++ with statically linked CRT
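
The hallucination filtering listed above can be pictured as a small text pass over each transcribed segment. The following is a minimal sketch, not WhisperTray's actual filter: the function name `stripAnnotations` is hypothetical, and it assumes that every bracketed or parenthesized span is a phantom annotation, so it would also drop legitimate parenthetical text.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not from the WhisperTray source): removes bracketed or
// parenthesized annotations such as "[BLANK_AUDIO]" or "(music)" and collapses
// the whitespace they leave behind.
std::string stripAnnotations(const std::string& text)
{
    std::string out;
    int depth = 0;  // nesting depth across [] and ()
    for (char c : text) {
        if (c == '[' || c == '(') { ++depth; continue; }
        if (c == ']' || c == ')') { if (depth > 0) --depth; continue; }
        if (depth > 0) continue;  // inside an annotation: drop the character

        const bool isSpace = std::isspace(static_cast<unsigned char>(c)) != 0;
        // Skip leading whitespace and collapse runs of whitespace to one space.
        if (isSpace && (out.empty() || out.back() == ' ')) continue;
        out.push_back(isSpace ? ' ' : c);
    }
    // Remove a single trailing space left by a stripped annotation.
    if (!out.empty() && out.back() == ' ') out.pop_back();
    return out;
}
```

A segment that consists only of an annotation comes back empty, so a caller could skip text injection for it entirely.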

Requirements

  • 64-bit Windows 8.1 or later
  • DirectX 11.0 capable GPU
  • CPU with AVX1 and F16C support
  • A Whisper GGML model file (see Models)

Building

Prerequisites: Visual Studio 2026 Build Tools (or the full IDE) with the following components:

  • Desktop development with C++ workload
  • MSVC v143+ C++ compiler toolset
  • Windows 10/11 SDK

The full Visual Studio IDE is not required — the free Build Tools package is sufficient. No .NET SDK needed.

Run from the repository root:

build.bat

The build has four steps:

  1. ComputeShaders — compiles 46 HLSL compute shaders to DXBC bytecode
  2. CompressShaders — LZ4-compresses the shader binaries into a blob for static linking
  3. Whisper.lib — builds the DirectCompute inference engine as a static library
  4. WhisperTray.exe — builds the tray application (statically links the inference engine)

Alternatively, open WhisperTray.sln in Visual Studio and build Release|x64.

Building from WSL

Building directly from a WSL mount (e.g. /mnt/c/...) can cause issues with MSBuild because of path handling and file locking across the Windows/Linux filesystem boundary. To work around this, run wsl-build.bat, which copies the source tree to a native Windows directory (%SystemDrive%\Whisper-Compiling), runs the build there, and copies the results back to dist/.

Usage

  1. Download a GGML model file (see Models).
  2. Run dist\WhisperTray.exe. On first launch, it will prompt you to select a model file.
  3. Press Ctrl+Shift+Space (or click the tray icon) to start listening. The tray icon turns green.
  4. Speak. Transcribed text is typed into the focused window.
  5. Press the hotkey again (or click the tray icon) to stop.
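
For step 4, text injected through SendInput is usually sent as Unicode key events (the KEYEVENTF_UNICODE flag), where each event carries one UTF-16 code unit. The sketch below shows only the portable encoding step and assumes WhisperTray takes that route, which this README does not state. The function name `toUtf16` is illustrative.

```cpp
#include <string>

// Illustrative helper: encodes Unicode code points as UTF-16 code units.
// Code points above U+FFFF become a surrogate pair, i.e. two code units,
// which a Unicode key-event sender would emit as two consecutive events.
std::u16string toUtf16(const std::u32string& codePoints)
{
    std::u16string units;
    for (char32_t cp : codePoints) {
        // Skip values that are not valid Unicode scalar values.
        if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) continue;

        if (cp < 0x10000) {
            units.push_back(static_cast<char16_t>(cp));
        } else {
            const char32_t v = cp - 0x10000;  // 20-bit offset
            units.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
            units.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
        }
    }
    return units;
}
```

On Windows, each resulting code unit would go into the `wScan` field of a `KEYBDINPUT` structure, once for key-down and once for key-up.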

Right-click the tray icon to:

  • Select a different microphone
  • Change the model file

Settings

Settings are stored in %APPDATA%\WhisperTray\settings.ini:

| Setting | Default | Description |
| --- | --- | --- |
| ModelPath | (prompts on first run) | Path to the GGML .bin model file |
| LanguageCode | en | Whisper language code |
| HotkeyModifiers | Ctrl+Shift | Hotkey modifier keys |
| HotkeyVirtualKey | Space | Hotkey key |
| MinDuration | 3.0 s | Minimum audio chunk duration |
| MaxDuration | 11.0 s | Maximum audio chunk duration |
| PauseDuration | 0.6 s | Silence gap that ends a speech segment |
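
One plausible way the three timing settings interact is sketched below. It is an assumption for illustration, not the engine's actual segmentation code: a chunk is closed once it is at least MinDuration long and ends in PauseDuration of silence, or unconditionally when it reaches MaxDuration. The `ChunkTiming` struct, the `splitIntoChunks` function, and the per-frame speech flags are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical mirror of the timing settings, in seconds.
struct ChunkTiming {
    double minDuration   = 3.0;
    double maxDuration   = 11.0;
    double pauseDuration = 0.6;
};

// Given one speech/silence flag per fixed-length audio frame, returns the
// frame indices (exclusive) at which a chunk would be closed and transcribed.
std::vector<std::size_t> splitIntoChunks(const std::vector<bool>& isSpeech,
                                         double frameSeconds,
                                         const ChunkTiming& t = ChunkTiming{})
{
    std::vector<std::size_t> cuts;
    double chunkLen = 0.0;  // length of the chunk being accumulated
    double silence  = 0.0;  // length of the current trailing silence
    for (std::size_t i = 0; i < isSpeech.size(); ++i) {
        chunkLen += frameSeconds;
        silence = isSpeech[i] ? 0.0 : silence + frameSeconds;

        const bool pauseCut  = chunkLen >= t.minDuration && silence >= t.pauseDuration;
        const bool forcedCut = chunkLen >= t.maxDuration;
        if (pauseCut || forcedCut) {
            cuts.push_back(i + 1);
            chunkLen = 0.0;
            silence  = 0.0;
        }
    }
    return cuts;
}
```

Under this model, a pause that arrives before MinDuration has elapsed does not end the chunk, and uninterrupted speech is still cut at MaxDuration so that latency stays bounded.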

Models

Download GGML model files from Hugging Face. Supported sizes:

| Model | Parameters | Disk Size | Notes |
| --- | --- | --- | --- |
| tiny | 39M | ~75 MB | Fastest, lowest accuracy |
| base | 74M | ~142 MB | Good balance for quick dictation |
| small | 244M | ~466 MB | Better accuracy |
| medium | 769M | ~1.5 GB | High accuracy |
| large | 1550M | ~2.9 GB | Best accuracy, requires significant VRAM |

For real-time dictation, base or small models are recommended.
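
The disk sizes in the table are close to two bytes per parameter, which is what you would expect if most weights are stored as 16-bit floats. The helper below is a rough sketch built on that assumption. Its name `estimatedMiB` is illustrative, and it ignores file headers, the vocabulary, and any tensors stored at other precisions.

```cpp
#include <cstdint>

// Rough size estimate in MiB, assuming every parameter is a 16-bit float.
constexpr double estimatedMiB(std::uint64_t parameters)
{
    return static_cast<double>(parameters) * 2.0 / (1024.0 * 1024.0);
}
```

For example, `estimatedMiB(39000000)` gives roughly 74 for the tiny model and `estimatedMiB(244000000)` roughly 465 for small. Treat the result as a lower bound on the memory the weights need, not as a VRAM requirement.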

License

This project is licensed under the Mozilla Public License 2.0.

Based on:

  • Const-me/Whisper by Konstantin — DirectCompute inference engine (MPL 2.0)
  • whisper.cpp by Georgi Gerganov — original C/C++ Whisper port (MIT)
  • LZ4 by Yann Collet — compression library (BSD 2-Clause)
