Skip to content

hehehai/voxt

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

133 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Voxt Logo

Voxt

A macOS menu bar voice input and translation app. Hold to speak, release to paste.
AI transcription with different rules for different apps and URLs.

English · 简体中文 · Report Issues · Prompt

image

✨ Feature Overview Speak, don't type

Speak and turn voice into text fn

  • Live transcription while you speak, with real-time text preview.
  • Result enhancement: remove filler words, add punctuation automatically, and customize prompts your own way.
  • App Branch groups let different apps or URLs use different enhancement rules and prompts, for coding, chat, email, and more.
  • Multilingual support with smooth mixed-language input.

Speak and translate right away fn+shift

  • AI translation immediately after transcription.
  • Selected-text translation: highlight text and translate it directly with a shortcut.
  • Custom translation prompts and terminology guidance, so output matches your habits.
  • Separate model selection for translation, so you can pick the strongest or fastest model for the job.

Use voice as a prompt fn+control

  • Example: "Help me write a 200-word self-introduction." Your speech becomes the prompt, and the result is inserted automatically.
  • Rewrite selected text by voice, for example: "Make this shorter and smoother."
  • More than voice input: it also works like a voice-driven AI assistant.

Download / Install

brew tap hehehai/tap
brew install --cask voxt

Model Support

image

Voxt separates ASR provider models and LLM provider models. They are used for speech-to-text, text enhancement, translation, and rewrite flows respectively.

System dictation is also supported through Apple Dictation, though multilingual coverage is more limited.

Local Models

With newer macOS versions and MLX support, Voxt currently ships with 5 built-in local ASR options in code, plus a set of downloadable local LLM models for enhancement, translation, and rewriting.

Note

"Current status / errors" below comes from the current project code. "Language support / speed / recommendation" is summarized from model cards plus project descriptions. Speed and recommendation are for model selection guidance, not a unified benchmark.

Voxt also supports Direct Dictation via Apple SFSpeechRecognizer:

  • Best for: quick setup when you do not want to download local models yet.
  • Limitation: relatively limited multilingual support.
  • Requirements: microphone permission plus speech recognition permission.
  • Common error: Speech Recognition permission is required for Direct Dictation.

Local ASR Models

Model Repository ID Size Language Support Speed Recommendation Current Status
Qwen3-ASR 0.6B (4bit) mlx-community/Qwen3-ASR-0.6B-4bit 0.6B / 4bit 30 languages including Chinese, English, Cantonese, and more Fast High Default local ASR, best overall quality/speed balance
Qwen3-ASR 1.7B (bf16) mlx-community/Qwen3-ASR-1.7B-bf16 1.7B / bf16 Same multilingual family as 0.6B Medium Very high Accuracy-first option with higher memory and storage cost
Voxtral Realtime Mini 4B (fp16) mlx-community/Voxtral-Mini-4B-Realtime-2602-fp16 4B / fp16 13 languages including Chinese, English, Japanese, Korean, and more Medium Medium-high Realtime-oriented model with the largest footprint in this list
Parakeet 0.6B mlx-community/parakeet-tdt-0.6b-v3 0.6B / bf16 Model card lists 25 languages; project copy positions it as lightweight English-first STT Very fast Medium-high Lightweight high-speed option, especially suitable for English-heavy workflows
GLM-ASR Nano (4bit) mlx-community/GLM-ASR-Nano-2512-4bit MLX 4bit, about 1.28 GB Current model card clearly states Chinese and English Fast High Smallest footprint, ideal for quick drafts and low-friction deployment

Common local ASR errors / states:

  • Invalid model identifier
  • Model repository unavailable (..., HTTP 401/404)
  • Download failed (...)
  • Model load failed (...)
  • Size unavailable
  • If you accidentally point to an alignment-only repo, Voxt will show alignment-only and not supported by Voxt transcription

Local LLM Models

Model Repository ID Size Language Bias Speed Recommendation Best For
Qwen2 1.5B Instruct Qwen/Qwen2-1.5B-Instruct 1.5B Balanced Chinese / English Fast High Lightweight cleanup and simple translation
Qwen2.5 3B Instruct Qwen/Qwen2.5-3B-Instruct 3B Balanced Chinese / English Medium-fast High More stable enhancement and formatting
Qwen3 4B (4bit) mlx-community/Qwen3-4B-4bit 4B / 4bit Chinese / English / multilingual Medium-fast Very high Best overall local balance for enhancement and translation
Qwen3 8B (4bit) mlx-community/Qwen3-8B-4bit 8B / 4bit Chinese / English / multilingual Medium-slow Very high Stronger rewriting, translation, and structured output
GLM-4 9B (4bit) mlx-community/GLM-4-9B-0414-4bit 9B / 4bit Chinese / English / multilingual Slow Very high Chinese rewriting and more complex prompt workflows
Llama 3.2 3B Instruct (4bit) mlx-community/Llama-3.2-3B-Instruct-4bit 3B / 4bit English-first, multilingual usable Medium-fast Medium-high Lightweight local rewriting
Llama 3.2 1B Instruct (4bit) mlx-community/Llama-3.2-1B-Instruct-4bit 1B / 4bit English-first, multilingual usable Very fast Medium Lowest-resource local enhancement
Meta Llama 3 8B Instruct (4bit) mlx-community/Meta-Llama-3-8B-Instruct-4bit 8B / 4bit English-first, multilingual usable Medium-slow Medium-high General enhancement, summarization, rewriting
Meta Llama 3.1 8B Instruct (4bit) mlx-community/Meta-Llama-3.1-8B-Instruct-4bit 8B / 4bit English-first, multilingual usable Medium-slow High Stable general-purpose local LLM
Mistral 7B Instruct v0.3 (4bit) mlx-community/Mistral-7B-Instruct-v0.3-4bit 7B / 4bit Stronger in English and European languages Medium High Concise rewrites and formatting cleanup
Mistral Nemo Instruct 2407 (4bit) mlx-community/Mistral-Nemo-Instruct-2407-4bit Nemo family / 4bit English-first, multilingual usable Medium-slow High More complex local enhancement tasks
Gemma 2 2B IT (4bit) mlx-community/gemma-2-2b-it-4bit 2B / 4bit English-first, multilingual usable Fast Medium-high Lightweight text cleanup
Gemma 2 9B IT (4bit) mlx-community/gemma-2-9b-it-4bit 9B / 4bit English-first, multilingual usable Slow High Higher-quality local polishing and translation

Common local LLM errors / states:

  • Custom LLM model is not installed locally.
  • Invalid local model path.
  • Invalid model identifier
  • No downloadable files were found for this model.
  • Downloaded files are incomplete.
  • Download failed: ...
  • Size unavailable

Remote Provider Models

For faster or more realtime transcription and enhancement, configure Remote ASR and Remote LLM separately in Model Settings. The tables below list only the provider entry points and recommended defaults that Voxt currently exposes in code.

Note

For the setup tutorial prompt below, you can give it to any AI assistant and let it help you complete the application and configuration process.

https://raw.githubusercontent.com/hehehai/voxt/refs/heads/main/docs/README.md
https://raw.githubusercontent.com/hehehai/voxt/refs/heads/main/docs/RemoteModel.md
How do I get started configuring remote ASR and LLM? I want to use Doubao ASR and Alibaba Cloud Bailian LLM. Please give me the full application and configuration workflow.

1. For every step that requires visiting a website, include the exact URL.
2. Point out the important notes and required configuration items.
3. Make the key steps more detailed.

For fuller provider notes, signup links, endpoints, and configuration examples, see docs/RemoteModel.md.

Remote ASR Providers

Provider Built-in Model Options Language Support Realtime Support Speed Recommendation Current Integration
OpenAI Whisper / Transcribe whisper-1, gpt-4o-mini-transcribe, gpt-4o-transcribe Multilingual Partial. Voxt currently uses file-based transcription, with optional chunked pseudo-realtime preview Medium High v1/audio/transcriptions
Doubao ASR volc.bigasr.sauc.duration Chinese-first, well suited to mixed Chinese/English realtime usage Yes Fast High Streaming WebSocket ASR
GLM ASR glm-asr-2512, glm-asr-1 Officially positioned for broad scenarios and accents; Voxt currently integrates it as standard upload-based transcription No (current implementation is upload transcription) Medium Medium-high HTTP transcription endpoint
Aliyun Bailian ASR qwen3-asr-flash-realtime, fun-asr-realtime, paraformer-realtime-* Depends on model family: Qwen3 ASR is multilingual, Fun/Paraformer cover Chinese-English or broader multilingual use Yes Fast High Realtime WebSocket ASR, with separate endpoints for Qwen / Fun / Paraformer families

Common remote ASR errors / states:

  • Needs Setup
  • Missing API key for OpenAI / GLM / Aliyun
  • Missing Access Token or App ID for Doubao
  • Invalid ASR endpoint URL
  • Invalid WebSocket endpoint URL
  • Connection failed (HTTP %d). %@
  • No valid ASR response packet.
  • Doubao may also fail on GZIP init / decode; Aliyun may additionally fail with task-failed or auth-related 403 responses

Remote LLM Providers

Provider Built-in Recommended Model API Style Main Use Current Status
Anthropic claude-sonnet-4-6 Native Anthropic Enhancement / translation / rewrite Integrated
Google gemini-2.5-pro Native Gemini Enhancement / translation / rewrite Integrated
OpenAI gpt-5.2 OpenAI-compatible Enhancement / translation / rewrite Integrated
Ollama qwen2.5 OpenAI-compatible Local or self-hosted LLM gateway Integrated
DeepSeek deepseek-chat OpenAI-compatible Enhancement / translation / rewrite Integrated
OpenRouter openrouter/auto OpenAI-compatible Auto-routing across providers Integrated
xAI (Grok) grok-4 OpenAI-compatible Enhancement / translation / rewrite Integrated
Z.ai glm-5 OpenAI-compatible Enhancement / translation / rewrite Integrated
Volcengine doubao-seed-2-0-pro-260215 OpenAI-compatible Enhancement / translation / rewrite Integrated
Kimi kimi-k2.5 OpenAI-compatible Enhancement / translation / rewrite Integrated
LM Studio llama3.1 OpenAI-compatible Local or self-hosted LLM gateway Integrated
MiniMax MiniMax-M2.5 Native MiniMax Enhancement / translation / rewrite Integrated
Aliyun Bailian qwen-plus-latest OpenAI-compatible Enhancement / translation / rewrite Integrated

Common remote LLM errors / states:

  • Needs Setup
  • Missing provider-specific API key for Anthropic / Google / MiniMax
  • Invalid endpoint URL / Invalid Google endpoint URL
  • Invalid server response.
  • Server reachable, but authentication failed (HTTP 401/403).
  • Connection failed (HTTP %d). %@
  • Runtime failures can also appear as Remote LLM request failed (...) or Remote LLM returned no text content.

Shortcuts

image

Voxt includes two built-in shortcut presets (fn Combo / command Combo) and also supports fully custom bindings. Each shortcut set can use one of two trigger styles:

  • Tap (Press to Toggle): press once to start, press again to stop
  • Long Press (Release to End): hold to start, release to stop

The examples below use the default fn Combo preset.

fn Combo

Shortcut Action Typical Use Default Interaction
fn Standard transcription Voice input and speech-to-text After recording ends, Voxt enhances and outputs the result into the current input target
fn+shift Transcribe and translate Speak-then-translate, multilingual input If text is already selected, Voxt translates the selection directly instead of opening the recording flow
fn+control Transcribe and rewrite / prompt Voice-driven prompt generation, or rewriting selected text by voice If text is selected, Voxt rewrites against the selection; otherwise it treats your speech as an instruction and generates the result

You can think of them as three working modes:

  • fn: turn what you say into text
  • fn+shift: turn what you say into a target language, or directly translate selected text
  • fn+control: treat your speech as a prompt and let the model generate, rewrite, or polish text

Detailed behavior:

  • fn standard transcription
    • Tap mode: press fn to start recording, then press fn again to stop
    • Long-press mode: hold fn to record, release to stop
    • Best for quick input, meeting notes, chat replies, and email drafts
  • fn+shift transcribe + translate
    • Tap mode: press fn+shift to start recording; to stop, either press fn or press fn+shift again
    • Long-press mode: hold fn+shift to record, release to stop
    • If text is already selected when triggered, Voxt translates the selection directly without using the microphone flow
    • Best for mixed-language typing, cross-language chat, and quick paragraph translation
  • fn+control transcribe + rewrite / prompt
    • Tap mode: press fn+control to start recording, then press fn to stop
    • Long-press mode: hold fn+control to record, release to stop
    • Your dictated content is treated as an instruction, for example: "Make this reply more polite" or "Shorten this paragraph"
    • If text is selected, Voxt uses the selection as source material and returns a rewritten final result based on your spoken instruction
    • If nothing is selected, it behaves more like a voice-driven AI assistant input flow

Interaction details:

  • In tap mode, fn is the unified stop key. That means once a translation session has started, pressing fn can also end it.
  • To avoid accidental stops, Voxt ignores immediate repeated taps during the very short window right after recording starts.
  • fn+shift and fn+control have higher priority than plain fn, so combo presses are not misclassified as regular transcription.
  • All shortcuts can be remapped in Settings, and you can switch to the command Combo preset at any time.

App Settings

image

General controls app-level behavior and day-to-day usage preferences. Unlike the Model page, this is not where you choose which ASR or LLM to run. It is where you define how Voxt records, appears on screen, outputs results, starts with macOS, and manages network/configuration behavior.

Current General settings fall into these groups:

Configuration

  • Export current General, Model, App Branch, and shortcut settings to JSON
  • Import settings from JSON to quickly move your setup to another Mac
  • Sensitive fields are replaced with placeholders during export and must be filled in again after import

Useful for:

  • syncing settings across multiple devices
  • backing up your current workflow
  • cloning the same model / shortcut / grouping setup quickly

Audio

  • Choose the microphone input device
  • Turn interaction sounds on or off
  • Switch interaction sound presets and preview them directly

This section controls where audio comes from and whether Voxt gives you audible start/finish feedback. It matters if you use multiple microphones, external audio devices, or a specific input chain.

Transcription UI

  • Set the floating transcription overlay position

The overlay shows waveform, preview text, and processing state during recording. This setting controls where it appears so it does not block your workspace.

Interface Language

  • Change the app interface language
  • Currently supports English, Chinese, and Japanese

If the system language is not supported, Voxt falls back to English.

Translation

  • Set the default target language for the translation shortcut

This mainly affects the dedicated translation action, such as the default fn+shift flow. In practice, it decides which language transcription should be translated into by default.

Model Storage

  • View the current model storage path
  • Open the model folder in Finder
  • Change where new local models are stored

This is especially important for local model users.

Important

After you change the model storage path, previously downloaded models are not migrated automatically, and models in the old path are not detected in the new one. In most cases, you will need to download local models again.

Output

  • Also copy result to clipboard
  • Translate selected text with translation shortcut
  • App Enhancement (Beta)

This section controls how Voxt returns output and whether context-aware enhancement is enabled:

  • When "Also copy result to clipboard" is on, Voxt auto-pastes the result and also keeps it in the clipboard
  • When "Translate selected text with translation shortcut" is on, the translation shortcut directly translates and replaces the current selection if any text is highlighted
  • When App Enhancement is enabled, Voxt shows and activates app- and URL-aware enhancement configuration

Logging

  • Toggle hotkey debug logs
  • Toggle LLM debug logs

Useful when diagnosing:

  • why a shortcut did not trigger
  • why a combo key was misdetected
  • what the local or remote LLM request actually sent
  • why model output did not match expectations

Recommended default: keep logging off, and only enable it temporarily while debugging.

App Behavior

  • Launch at Login: start Voxt automatically at system login
  • Show in Dock: show or hide Voxt in the macOS Dock
  • Automatically check for updates: background update checks
  • Proxy: follow system proxy, disable proxy, or use a custom proxy

This group is about how the app behaves on your Mac:

  • If you want Voxt to stay in the menu bar all the time, enable launch at login
  • If you want faster access from the Dock, enable Dock visibility
  • If you use remote models in a restricted network, company network, or proxy environment, Proxy settings directly affect remote ASR and remote LLM connectivity

Current custom proxy support includes:

  • HTTP
  • HTTPS
  • SOCKS5

Host, port, username, and password can be configured. However, in the current codebase, username and password are stored but not yet injected automatically into every request path, which matters in more complex proxy setups.

Permissions

image

Voxt permissions are split by function. If you only use basic voice input, only the core permissions are needed. If you want stronger context awareness, such as URL-based App Branch matching, enable the extra permissions only when needed.

Important

If you just want to get Voxt working quickly, start with Microphone. If you use the default fn shortcut set and want results to be written back into other apps automatically, it is strongly recommended to enable both Accessibility and Input Monitoring.

Core Permissions

Permission Typical Importance Used For What Happens If Not Granted
Microphone Required Recording, speech-to-text, local ASR, remote ASR, translation, rewrite flows Recording cannot start
Speech Recognition Optional / as needed Only for Direct Dictation / Apple SFSpeechRecognizer Only system dictation becomes unavailable; MLX and remote ASR still work
Accessibility Strongly recommended Global hotkeys, automatically pasting results back into other apps, reading some UI context Recording still works, but auto-paste and some cross-app interactions are limited
Input Monitoring Strongly recommended More reliable global modifier hotkeys, especially fn, fn+shift, and fn+control Global shortcuts may become unstable, fail, or misfire
Automation Optional Reading the current browser tab URL for App Branch URL matching App Branch can still match by foreground app, but not by webpage URL

Additional notes:

  • Microphone permission is a hard requirement for the recording pipeline, regardless of whether you use local models, remote ASR, translation, or rewrite flows.
  • Speech Recognition permission is only for Apple system dictation. If you only use MLX Audio (On-device) or Remote ASR, you can leave it off.
  • Accessibility is not just for "seeing the UI". It is also used to write results back into other apps automatically. Without it, Voxt can still work, but results are more likely to stay in the clipboard for manual paste.
  • Input Monitoring mainly exists to make modifier-only shortcuts more reliable, which is why it is strongly recommended for the default fn shortcut set.

What Is App Branch? (Beta)

image

Important

App Branch is not enabled by default. You must first turn on App Enhancement in General -> Output before App Branch groups and URL-based behavior take effect.

App Branch is best understood as "switch prompts and rules automatically based on the current context."

You can group apps or URLs and assign a separate prompt to each group. In different contexts, Voxt automatically switches enhancement, translation, and rewrite behavior. For example:

  • in an IDE, it can bias toward code, commands, and technical terminology
  • in chat apps, it can bias toward shorter, more conversational replies
  • in email or document tools, it can bias toward formal wording and full sentences
  • on a specific website, it can apply that site's vocabulary, format, or tone

App Branch currently supports two matching layers:

  • match by foreground app: for example Xcode, Cursor, WeChat, or a browser
  • match by active browser tab URL: for example github.com/*, docs.google.com/*, mail.google.com/*

App Branch Permissions

App Branch itself does not always require extra permissions. It depends on how deep you want matching to go:

  • If you only group by foreground app, browser automation permission is usually not needed
  • If you group by browser URL, you must grant Automation permission to the corresponding browser so Voxt can read the active tab URL
  • If scripting-based URL reads fail in some browsers, Voxt can also try Accessibility as a fallback path

In practice:

  • app-level grouping has relatively low permission requirements
  • webpage-level grouping requires additional browser automation approval

App Branch URL Authorization

If you want to use URL rules, this is the most important permission area:

  • Voxt requests browser automation access to read the current active tab URL
  • Without access to the current URL, Voxt cannot determine whether a URL group matches
  • Without this permission, Voxt still works, but falls back to the global prompt or app-only matching

Tip

Only authorize the browsers you actually want to use for URL grouping. The safest workflow is to grant and test them one by one in Settings > Permissions > App Branch URL Authorization.

Built-in or supported browser URL read targets in the current project include:

  • Safari / Safari Technology Preview
  • Google Chrome
  • Microsoft Edge
  • Brave
  • Arc
  • plus any custom browsers you add manually in Settings

Recommendations:

  • only authorize the browsers you really need for URL grouping
  • grant and test them one by one in Settings > Permissions > App Branch URL Authorization
  • if you see Browser URL read test failed: permission denied., it usually means browser automation has not been approved yet

License

Apache 2.0. See LICENSE.

About

🎙️Voice input and translation app for macOS. Press to talk, release to paste.

Topics

Resources

License

Stars

Watchers

Forks

Contributors

Languages