Real-time AI voice conversations in Discord — powered by OpenAI Realtime API
A Discord voice bot that joins your voice channel and has real-time spoken conversations with you. No text-to-speech pipeline. No transcription middleware. Just raw voice in, voice out — speech-to-speech AI with sub-second latency.
~300 lines of code. No frameworks, no magic.
Built by Digital Forge Studios. Free and open source.
| Feature | Description |
|---|---|
| 🎤 Bidirectional Voice | Speak naturally, hear AI responses in real-time |
| 🧠 Semantic VAD | AI-powered turn detection — knows when you're done talking vs. just pausing |
| 🗣️ Barge-In | Interrupt the bot mid-sentence. It stops and listens. |
| 🔒 DAVE E2EE | Discord's mandatory end-to-end voice encryption, handled transparently |
| 🛠️ Agentic Tools | Web search, weather, file reading, shell commands, Discord messaging |
| ⚙️ Fully Configurable | Voice, personality, VAD mode, eagerness, temperature — all via env vars |
| 👥 Per-User Audio | Discord sends separate streams per speaker — no diarization needed |
| 🐳 Docker Ready | Dockerfile included for containerized deployment |
You speak → Discord Opus → decode → downsample 48kHz stereo → 24kHz mono
→ base64 PCM16 → OpenAI Realtime API (WebSocket)
AI responds → base64 PCM16 24kHz mono → upsample → 48kHz stereo
→ PlaybackStream → AudioPlayer → Discord voice channel
- Discord connection — `discord.js` + `@discordjs/voice` handle the gateway, voice connection, and DAVE E2EE (via `@snazzah/davey` + `sodium-native`)
- Audio receive — subscribes to each user's Opus stream individually (Discord sends per-user streams, not a mix)
- Downsampling — Discord sends 48kHz stereo Opus → decode to PCM → downsample to 24kHz mono (what OpenAI expects)
- OpenAI Realtime API — persistent WebSocket connection, streams audio bidirectionally, handles VAD/turn detection server-side
- Upsampling — OpenAI sends 24kHz mono PCM16 → upsample to 48kHz stereo → push to Readable stream → Discord plays it
- Tool calling — model invokes functions mid-conversation, we execute and feed results back, model speaks the answer
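The two resampling steps above can be sketched as follows. This is a minimal illustration, not the code in index.js: the function names are hypothetical, the 2:1 conversion is a naive decimate-and-repeat with no anti-aliasing filter, and it assumes the decoded audio is little-endian PCM16. The `input_audio_buffer.append` event is the Realtime API message for sending base64 audio.

```javascript
// Sketch only: naive 2:1 resampling between Discord's format (48 kHz stereo
// PCM16) and the Realtime API's format (24 kHz mono PCM16).
// Function names are illustrative and not taken from index.js.

// 48 kHz stereo -> 24 kHz mono: average left/right, keep every second frame.
function downsampleToMono24k(pcm48kStereo) {
  const frames = Math.floor(pcm48kStereo.length / 4); // 2 channels * 2 bytes
  const outSamples = Math.floor(frames / 2);
  const out = Buffer.alloc(outSamples * 2);
  for (let i = 0; i < outSamples; i++) {
    const offset = i * 2 * 4; // byte offset of every second stereo frame
    const left = pcm48kStereo.readInt16LE(offset);
    const right = pcm48kStereo.readInt16LE(offset + 2);
    out.writeInt16LE(Math.round((left + right) / 2), i * 2);
  }
  return out;
}

// 24 kHz mono -> 48 kHz stereo: repeat each sample for two frames, both channels.
function upsampleToStereo48k(pcm24kMono) {
  const samples = Math.floor(pcm24kMono.length / 2);
  const out = Buffer.alloc(samples * 8); // 2 frames * 2 channels * 2 bytes
  for (let i = 0; i < samples; i++) {
    const sample = pcm24kMono.readInt16LE(i * 2);
    for (let j = 0; j < 4; j++) out.writeInt16LE(sample, i * 8 + j * 2);
  }
  return out;
}

// Wrap a 24 kHz mono chunk in the JSON event that appends input audio.
function toAppendEvent(pcm24kMono) {
  return JSON.stringify({
    type: 'input_audio_buffer.append',
    audio: pcm24kMono.toString('base64'),
  });
}
```

Sample repetition is the cheapest possible upsampler; a production bridge might prefer linear interpolation or a proper low-pass filter if audio quality matters more than simplicity.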
- Node.js >= 18
- A Discord bot with voice permissions
- OpenAI Realtime API access (via Azure AI Foundry or OpenAI directly)
- Go to Discord Developer Portal
- Create a new application → Bot → copy the token
- Enable Privileged Gateway Intents: Server Members, Message Content
- Invite to your server with permissions `36700160` (Connect + Speak + Use Voice Activity):
https://discord.com/oauth2/authorize?client_id=YOUR_APP_ID&scope=bot&permissions=36700160
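The permissions integer is the bitwise OR of three Discord permission flags. The sketch below shows where `36700160` comes from; the constant and function names are illustrative, and `YOUR_APP_ID` is the same placeholder used in the URL above.

```javascript
// Discord permission bit positions, per Discord's permissions documentation.
const CONNECT = 1n << 20n;
const SPEAK = 1n << 21n;
const USE_VAD = 1n << 25n; // "Use Voice Activity"

const permissions = CONNECT | SPEAK | USE_VAD; // 36700160n

// Build the invite URL for a given application (client) ID.
function inviteUrl(clientId) {
  const params = new URLSearchParams({
    client_id: clientId,
    scope: 'bot',
    permissions: permissions.toString(),
  });
  return `https://discord.com/oauth2/authorize?${params}`;
}
```

If you later grant the bot more permissions, OR the extra flags into `permissions` rather than editing the number by hand.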
| Provider | Model | Notes |
|---|---|---|
| Azure AI Foundry (recommended) | `gpt-realtime-mini` / `gpt-realtime-1.5` | Deploy in Azure AI Studio |
| OpenAI | `gpt-realtime` | Direct Realtime API endpoint |
git clone https://github.com/digitalforgeca/vox-discord.git
cd vox-discord
npm install
cp .env.example .env
# Edit .env with your credentials
npm start
The bot joins the configured voice channel automatically. Start talking.
Alternatively, install the package from npm:
npm install @digitalforgestudios/vox-discord
All configuration is via environment variables (.env file):
| Variable | Description |
|---|---|
| `DISCORD_TOKEN` | Discord bot token |
| `DISCORD_GUILD_ID` | Server ID |
| `DISCORD_CHANNEL_ID` | Voice channel ID |
| `OPENAI_REALTIME_ENDPOINT` | WebSocket endpoint URL |
| `OPENAI_REALTIME_API_KEY` | API key |
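A startup check for the required variables might look like the sketch below. The helper names `findMissingVars` and `assertConfig` are hypothetical; the README does not say whether index.js validates its configuration this way.

```javascript
// The five required variables from the table above.
const REQUIRED_VARS = [
  'DISCORD_TOKEN',
  'DISCORD_GUILD_ID',
  'DISCORD_CHANNEL_ID',
  'OPENAI_REALTIME_ENDPOINT',
  'OPENAI_REALTIME_API_KEY',
];

// Return the names of required variables that are unset or blank.
function findMissingVars(env) {
  return REQUIRED_VARS.filter((name) => !env[name] || env[name].trim() === '');
}

// Throw a descriptive error if anything required is missing.
function assertConfig(env) {
  const missing = findMissingVars(env);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(', ')}`);
  }
}

// Typical use at startup (hypothetical): assertConfig(process.env);
```

Failing fast here gives a clearer message than a rejected Discord login or a WebSocket handshake error later on.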
| Variable | Default | Description |
|---|---|---|
| `OPENAI_REALTIME_MODEL` | `gpt-realtime-mini` | Model deployment name |
| `VOICE_SYSTEM_PROMPT` | Generic assistant | Personality / character instructions |
| `VOX_VOICE` | `alloy` | Voice: alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, cedar |
| `VOX_TEMPERATURE` | `0.8` | Response creativity (0.0–1.2) |
| Variable | Default | Description |
|---|---|---|
| `VOX_VAD_TYPE` | `semantic_vad` | `semantic_vad` (recommended), `server_vad`, or `off` |
| `VOX_EAGERNESS` | `medium` | Semantic VAD: low (patient), medium (balanced), high (snappy) |
| `VOX_THRESHOLD` | `0.6` | Server VAD: sensitivity 0.0–1.0 |
| `VOX_SILENCE_DURATION` | `500` | Server VAD: silence ms before turn ends |
Tip: Use `semantic_vad` — it uses the model itself to understand when you're done speaking, not just silence detection. It's the difference between a bot that interrupts your pauses and one that actually listens.
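One way these variables could map onto a Realtime API `session.update` message is sketched below. The function names are hypothetical, the default instructions string is a stand-in for the bot's actual "generic assistant" prompt, and the flat `session` layout is an assumption: the exact nesting of `voice` and `turn_detection` differs between Realtime API versions, so check it against the version you target.

```javascript
// Build the turn_detection object for the chosen VAD mode.
// Unrecognized values fall back to semantic_vad in this sketch.
function buildTurnDetection(env) {
  const type = env.VOX_VAD_TYPE || 'semantic_vad';
  if (type === 'off') return null; // no automatic turn detection
  if (type === 'server_vad') {
    return {
      type: 'server_vad',
      threshold: Number(env.VOX_THRESHOLD || 0.6),
      silence_duration_ms: Number(env.VOX_SILENCE_DURATION || 500),
    };
  }
  return { type: 'semantic_vad', eagerness: env.VOX_EAGERNESS || 'medium' };
}

// Assemble a session.update event from the environment (assumed layout).
function buildSessionUpdate(env) {
  return {
    type: 'session.update',
    session: {
      voice: env.VOX_VOICE || 'alloy',
      temperature: Number(env.VOX_TEMPERATURE || 0.8),
      // Placeholder default; the real default prompt is not shown in this README.
      instructions: env.VOICE_SYSTEM_PROMPT || 'You are a helpful assistant.',
      turn_detection: buildTurnDetection(env),
    },
  };
}
```

Note that `VOX_THRESHOLD` and `VOX_SILENCE_DURATION` only take effect in `server_vad` mode, and `VOX_EAGERNESS` only in `semantic_vad` mode, which matches the prefixes in the table above.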
The bot can call tools mid-conversation:
| Tool | Description |
|---|---|
| 🔍 `web_search` | Search the web for current information |
| 🕐 `get_time` | Current date and time |
| 🌤️ `get_weather` | Weather for any location |
| 📄 `read_file` | Read project files |
| 💻 `run_command` | Execute shell commands (sandboxed) |
| 📨 `send_discord_message` | Post to Discord channels |
Tools are defined in tools.js — add your own by following the pattern.
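The actual pattern in tools.js is not reproduced in this README, so the following is only a sketch of what a tool and its dispatch could look like. The `roll_dice` tool, the `{ definition, handler }` shape, and `runToolCall` are all hypothetical. The `definition` object follows the Realtime API's function-tool format, and the returned event is the `function_call_output` item that feeds a result back to the model.

```javascript
// Hypothetical tool: the name, schema, and handler are illustrative only.
const rollDiceTool = {
  definition: {
    type: 'function',
    name: 'roll_dice',
    description: 'Roll an N-sided die and return the result.',
    parameters: {
      type: 'object',
      properties: { sides: { type: 'integer', minimum: 2 } },
      required: ['sides'],
    },
  },
  async handler({ sides }) {
    return { result: 1 + Math.floor(Math.random() * sides) };
  },
};

// Execute a function call from the model and build the event that returns
// its output. `call` carries name, call_id, and a JSON string of arguments.
async function runToolCall(tools, call) {
  const tool = tools.find((t) => t.definition.name === call.name);
  let output;
  try {
    if (!tool) throw new Error(`Unknown tool: ${call.name}`);
    output = await tool.handler(JSON.parse(call.arguments || '{}'));
  } catch (err) {
    // Report failures to the model instead of crashing the voice session.
    output = { error: err.message };
  }
  return {
    type: 'conversation.item.create',
    item: {
      type: 'function_call_output',
      call_id: call.call_id,
      output: JSON.stringify(output),
    },
  };
}
```

After sending the returned event over the WebSocket, the client would typically send a `response.create` event so the model speaks the answer.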
| Model | Cost/min | 10-min chat |
|---|---|---|
| `gpt-realtime-mini` | ~$0.03–$0.10 | ~$0.30–$1.00 |
| `gpt-realtime-1.5` | ~$0.10–$0.30 | ~$1.00–$3.00 |
Tips to reduce cost:
- Use `semantic_vad` (smarter turn detection = fewer false triggers)
- Increase `VOX_THRESHOLD` in noisy environments
- Use `gpt-realtime-mini` for casual conversation
- Keep system prompts concise (charged as input every turn)
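The per-session figures in the cost table are just the per-minute range multiplied by the session length. A small estimator, using only the approximate rates quoted above (the function name is illustrative, and real bills depend on token usage rather than wall-clock time):

```javascript
// Approximate per-minute cost ranges in USD, copied from the table above.
const COST_PER_MINUTE_USD = {
  'gpt-realtime-mini': { low: 0.03, high: 0.10 },
  'gpt-realtime-1.5': { low: 0.10, high: 0.30 },
};

// Estimate the cost range of a conversation, rounded to whole cents.
function estimateCost(model, minutes) {
  const rate = COST_PER_MINUTE_USD[model];
  if (!rate) throw new Error(`No rate known for model: ${model}`);
  const toCents = (usd) => Math.round(usd * 100) / 100;
  return { low: toCents(rate.low * minutes), high: toCents(rate.high * minutes) };
}
```

For example, `estimateCost('gpt-realtime-mini', 60)` gives a rough range for an hour-long session at the same rates.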
A local CLI tool for generating configs interactively:
node control.js
Lets you tweak VAD mode, eagerness, voice, temperature, and system prompt — then outputs the env vars to paste into .env.
docker build -t vox-discord .
docker run --env-file .env vox-discord
vox-discord/
├── index.js # Main bot — Discord voice + OpenAI Realtime bridge (~300 lines)
├── tools.js # Agentic tool definitions
├── control.js # Local configuration CLI
├── .env.example # Environment variable template
├── Dockerfile # Container build
└── package.json # Dependencies
PRs welcome. Keep it lean — the beauty is in the simplicity.
MIT — do whatever you want with it.
Built with 🪽 by Digital Forge Studios