| 1 | +# Voice Activity Detector |
| 2 | + |
| 3 | +The Voice Activity Detector (VAD) detects when speech is present in an audio stream. It analyzes short frames of PCM audio and reports the probability that each frame contains voice. |
| 4 | + |
| 5 | +This can be used to: |
| 6 | +- Drive push-to-talk or auto-mute logic |
| 7 | +- Skip encoding/sending silence to save bandwidth |
| 8 | +- Trigger UI indicators when the user is speaking |
| 9 | + |
| 10 | +API: `dev.onvoid.webrtc.media.audio.VoiceActivityDetector` |
| 11 | + |
| 12 | +## Overview |
| 13 | + |
| 14 | +`VoiceActivityDetector` exposes a minimal API: |
| 15 | +- `process(byte[] audio, int samplesPerChannel, int sampleRate)`: Analyze one audio frame. |
| 16 | +- `getLastVoiceProbability()`: Retrieve the probability (0.0..1.0) that the last processed frame contained voice. |
| 17 | +- `dispose()`: Release native resources. Always call this when done. |
| 18 | + |
| 19 | +Internally, the VAD uses a native implementation optimized for real-time analysis. The class itself does not resample or mix channels, so the audio you pass in must already match the `sampleRate` argument and the format described in the next section. |
| 20 | + |
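Because the detector does not mix channels itself, stereo capture has to be reduced to mono before it reaches `process`. The following is a minimal sketch of one way to do that, assuming interleaved stereo (left sample, then right sample) in signed 16-bit little-endian PCM. `PcmDownmix` and `stereoToMono` are hypothetical helper names and are not part of the library.

```java
final class PcmDownmix {

    /**
     * Averages interleaved stereo 16-bit little-endian PCM into mono.
     * Assumes the input layout is L0 R0 L1 R1 ... (hypothetical helper,
     * not part of the webrtc-java API).
     */
    static byte[] stereoToMono(byte[] stereo) {
        if (stereo.length % 4 != 0) {
            throw new IllegalArgumentException("Expected whole stereo frames (4 bytes each)");
        }

        byte[] mono = new byte[stereo.length / 2];

        for (int in = 0, out = 0; in < stereo.length; in += 4, out += 2) {
            // Assemble signed 16-bit samples from little-endian byte pairs.
            int left = (short) ((stereo[in] & 0xFF) | (stereo[in + 1] << 8));
            int right = (short) ((stereo[in + 2] & 0xFF) | (stereo[in + 3] << 8));

            // Average the two channels; the result always fits in 16 bits.
            int mixed = (left + right) / 2;

            // Write the mixed sample back as little-endian.
            mono[out] = (byte) mixed;
            mono[out + 1] = (byte) (mixed >> 8);
        }

        return mono;
    }
}
```

The mono buffer holds half as many bytes as the stereo input, and the per-channel sample count passed to `process` is `mono.length / 2`.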
| 21 | +## Audio format expectations |
| 22 | + |
| 23 | +- PCM signed 16-bit little-endian (typical Java byte[] from microphone capture via this library) |
| 24 | +- Mono is recommended. For stereo input, either downmix to mono before calling `process`, or set `samplesPerChannel` to the number of samples in each channel rather than the total across channels |
| 25 | +- Frame size: commonly 10 ms per call (e.g., 160 samples at 16 kHz, which is 320 bytes of 16-bit mono PCM) |
| 26 | +- Supported sample rates: 8 kHz, 16 kHz, 32 kHz, 48 kHz (use one of these for best results) |
| 27 | + |
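The frame-size arithmetic implied by the list above can be captured in a small helper. This is only a sketch: `VadFrameFormat` and its methods are hypothetical names, and the set of accepted rates simply mirrors the rates listed in this guide.

```java
final class VadFrameFormat {

    /** Signed 16-bit PCM uses two bytes per sample. */
    static final int BYTES_PER_SAMPLE = 2;

    /** Returns true for the sample rates listed in this guide. */
    static boolean isSupportedRate(int sampleRate) {
        return sampleRate == 8000 || sampleRate == 16000
                || sampleRate == 32000 || sampleRate == 48000;
    }

    /** Number of samples per channel in one frame of the given duration. */
    static int samplesPerChannel(int sampleRate, int frameMs) {
        return sampleRate * frameMs / 1000;
    }

    /** Size in bytes of the buffer that holds one frame. */
    static int frameBytes(int sampleRate, int frameMs, int channels) {
        return samplesPerChannel(sampleRate, frameMs) * channels * BYTES_PER_SAMPLE;
    }
}
```

For a 10 ms mono frame at 16 kHz, `samplesPerChannel` gives 160 and `frameBytes` gives 320.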
| 28 | +## Basic usage |
| 29 | + |
| 30 | +```java |
| 31 | +import dev.onvoid.webrtc.media.audio.VoiceActivityDetector; |
| 32 | + |
| 33 | +// Create the detector |
| 34 | +VoiceActivityDetector vad = new VoiceActivityDetector(); |
| 35 | + |
| 36 | +try { |
| 37 | + // Example parameters |
| 38 | + int sampleRate = 16000; // 16 kHz |
| 39 | + int frameMs = 10; // 10 ms frames |
| 40 | + int samplesPerChannel = sampleRate * frameMs / 1000; // 160 samples |
| 41 | + |
| 42 | + // audioFrame must contain 16-bit PCM data for one frame (mono) |
| 43 | + byte[] audioFrame = new byte[samplesPerChannel * 2]; // 2 bytes per sample |
| 44 | + |
| 45 | + // Fill audioFrame from your audio source here |
| 46 | + // ... |
| 47 | + |
| 48 | + // Analyze frame |
| 49 | + vad.process(audioFrame, samplesPerChannel, sampleRate); |
| 50 | + |
| 51 | + // Query probability of voice in the last frame |
| 52 | + float prob = vad.getLastVoiceProbability(); // 0.0 .. 1.0 |
| 53 | + |
| 54 | + boolean isSpeaking = prob >= 0.5f; // choose a threshold that works for your app |
| 55 | + |
| 56 | +} |
| 57 | +finally { |
| 58 | + // Always release resources |
| 59 | + vad.dispose(); |
| 60 | +} |
| 61 | +``` |
| 62 | + |
| 63 | +## Continuous processing loop |
| 64 | + |
| 65 | +```java |
| 66 | +VoiceActivityDetector vad = new VoiceActivityDetector(); |
| 67 | + |
| 68 | +try { |
| 69 | + int sampleRate = 16000; |
| 70 | + int frameMs = 10; |
| 71 | + int samplesPerChannel = sampleRate * frameMs / 1000; // 160 samples |
| 72 | + byte[] audioFrame = new byte[samplesPerChannel * 2]; |
| 73 | + |
| 74 | + while (running) { |
| 75 | + // Read PCM frame from your capture pipeline into audioFrame |
| 76 | + // ... |
| 77 | + |
| 78 | + vad.process(audioFrame, samplesPerChannel, sampleRate); |
| 79 | + float prob = vad.getLastVoiceProbability(); |
| 80 | + |
| 81 | + if (prob > 0.8f) { |
| 82 | + // High confidence of speech |
| 83 | + // e.g., enable VU meter, unmute, or mark active speaker |
| 84 | + } |
| 85 | + else { |
| 86 | + // Likely silence or noise |
| 87 | + } |
| 88 | + } |
| 89 | +} |
| 90 | +finally { |
| 91 | + vad.dispose(); |
| 92 | +} |
| 93 | +``` |
| 94 | + |
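The loop above leaves open how `audioFrame` gets filled. If your capture pipeline happens to expose raw PCM as a `java.io.InputStream` (an assumption; this guide does not prescribe a source), one possible sketch reads exactly one frame per call and reports when the stream can no longer supply a full frame. `PcmFrameReader` is a hypothetical helper and requires Java 9 or later for `InputStream.readNBytes`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class PcmFrameReader {

    /**
     * Fills the frame buffer completely from the stream.
     * Returns false when the stream ends before a full frame is available,
     * in which case the buffer contents should not be passed to the detector.
     */
    static boolean readFrame(InputStream in, byte[] frame) {
        try {
            // readNBytes blocks until the requested length is read or the stream ends.
            return in.readNBytes(frame, 0, frame.length) == frame.length;
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In the loop, `while (running && PcmFrameReader.readFrame(in, audioFrame))` would then guarantee that every call to `process` sees a complete frame.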
| 95 | +## Tips and best practices |
| 96 | + |
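| 97 | +- Threshold selection: Start with a threshold between 0.5 and 0.8 and tune for your environment. Lower values catch quiet speech but admit more noise; higher values do the opposite. |
| 98 | +- Frame size consistency: Use the same frame duration and sample rate for every call on a given detector instance. |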
| 99 | +- Resource management: VAD holds native resources; ensure `dispose()` is called. |
| 100 | +- Preprocessing: Consider using `AudioProcessing` (noise suppression, gain control) before VAD for improved robustness in noisy environments. See the Audio Processing guide. |
| 101 | + |
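A single fixed threshold can flicker when the probability hovers near it. One common remedy, shown here only as a sketch, is to combine two thresholds (hysteresis) with a short hangover so that brief dips do not end a speech segment. `SpeechGate` is a hypothetical class, not part of the library, and the values in the usage note are illustrative rather than recommended.

```java
final class SpeechGate {

    private final float onThreshold;
    private final float offThreshold;
    private final int hangoverFrames;

    private boolean speaking;
    private int quietFrames;

    /**
     * @param onThreshold    probability at or above which speech starts
     * @param offThreshold   probability below which a frame counts as quiet
     * @param hangoverFrames consecutive quiet frames tolerated before speech ends
     */
    SpeechGate(float onThreshold, float offThreshold, int hangoverFrames) {
        if (offThreshold > onThreshold) {
            throw new IllegalArgumentException("offThreshold must not exceed onThreshold");
        }
        this.onThreshold = onThreshold;
        this.offThreshold = offThreshold;
        this.hangoverFrames = hangoverFrames;
    }

    /** Feeds one per-frame probability and returns the smoothed speaking state. */
    boolean update(float probability) {
        if (probability >= onThreshold) {
            speaking = true;
            quietFrames = 0;
        }
        else if (speaking) {
            if (probability < offThreshold) {
                quietFrames++;
                // End the segment only after the hangover has been exceeded.
                if (quietFrames > hangoverFrames) {
                    speaking = false;
                    quietFrames = 0;
                }
            }
            else {
                // Between the two thresholds: stay on and reset the quiet count.
                quietFrames = 0;
            }
        }
        return speaking;
    }
}
```

With `new SpeechGate(0.8f, 0.5f, 20)` and 10 ms frames, the gate stays open through 20 consecutive quiet frames (200 ms) and closes on the next one. Feed it the value returned by `getLastVoiceProbability()` after each `process` call.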
| 102 | +## Related guides |
| 103 | + |
| 104 | +- [Audio Processing](guide/audio_processing.md) |