Skip to content

Comments

Fix key ratchet race condition during DAVE epoch transitions#5

Open
MohmmedAshraf wants to merge 1 commit intodisgoorg:masterfrom
MohmmedAshraf:fix/dave-epoch-key-ratchet-race
Open

Fix key ratchet race condition during DAVE epoch transitions#5
MohmmedAshraf wants to merge 1 commit intodisgoorg:masterfrom
MohmmedAshraf:fix/dave-epoch-key-ratchet-race

Conversation

@MohmmedAshraf
Copy link

Problem

When a user joins or leaves a DAVE-enabled voice channel, Discord triggers a DAVE_PROTOCOL_PREPARE_EPOCH event. The current code flow creates a race condition:

  1. prepareEpoch() calls Session.Init() which destroys the old MLS crypto state
  2. Discord then sends DavePrepareTransitionsetupKeyRatchetForUser(self)
  3. GetKeyRatchet() returns a Go struct wrapping a NULL C pointer (old state destroyed, new handshake not done yet)
  4. SetPassthroughMode(false) is called → encryptor expects encryption but has no keys
  5. Every Encrypt() call fails with ErrMissingKeyRatchet for ~1 second until the new MLS Welcome arrives

This means every member join/leave in a DAVE-enabled channel causes ~1 second of complete audio failure for all bots using golibdave.

Root Cause

Two issues:

  1. prepareEpoch destroys crypto state without a safety net. Session.Init() invalidates all key ratchets, but the encryptor/decryptors are still set to passthrough=false — so they try to encrypt/decrypt with destroyed keys.

  2. setupKeyRatchetForUser unconditionally disables passthrough. Even when GetKeyRatchet() returns nil (NULL C handle wrapped in a non-nil Go struct), the code sets SetPassthroughMode(false) and proceeds. The newKeyRatchet function doesn't check for nil C handles, unlike the existing newWelcomeResult which already does.

Fix

Three targeted changes:

1. prepareEpoch — reset to passthrough before destroying crypto state

func (s *session) prepareEpoch(epoch int, protocolVersion uint16) {
    if epoch != mlsNewGroupExpectedEpoch {
        return
    }

    // NEW: Safe fallback — passthrough while MLS renegotiates
    s.encryptor.SetPassthroughMode(true)
    for _, dec := range s.decryptors {
        dec.TransitionToPassthroughMode(true)
    }

    s.session.Init(protocolVersion, uint64(s.channelID), string(s.selfUserID))
}

2. setupKeyRatchetForUser — guard against nil key ratchets

If encryption is enabled but GetKeyRatchet() returns nil (MLS handshake still in progress), skip the transition and keep passthrough active. The next MLS Welcome/Commit will call prepareTransition again with valid keys.

3. newKeyRatchet — return nil for nil C handles

Matches the existing pattern used by newWelcomeResult. Prevents wrapping NULL C pointers in Go structs that appear non-nil to callers.

Testing

Tested in production with a Discord voice relay bot (listener + 3 speaker bots across multiple guilds):

  • Continuous audio relay with zero packet drops (pkt_dropped=0)
  • Multiple epoch transitions triggered by member joins/leaves — all completed successfully
  • Key ratchets correctly established after each MLS handshake
  • No ErrMissingKeyRatchet errors

Context

DAVE becomes mandatory on March 1, 2026. This race condition affects any bot using golibdave in channels where members join/leave while audio is active — which is essentially all real-world usage.

When a user joins or leaves a voice channel, Discord triggers a
DAVE_PROTOCOL_PREPARE_EPOCH event. The prepareEpoch() function calls
Session.Init() which destroys the old MLS crypto state. Discord then
sends DavePrepareTransition which calls setupKeyRatchetForUser, but
at this point GetKeyRatchet() returns a Go struct wrapping a NULL C
pointer because the new MLS handshake hasn't completed yet.

This causes SetPassthroughMode(false) to be called on the encryptor
which expects encryption keys that don't exist yet, resulting in every
Encrypt() call failing with ErrMissingKeyRatchet until the new MLS
Welcome message arrives (~1 second later).

This fix addresses the race in two places:

1. prepareEpoch: Reset encryptor and all decryptors to passthrough
   mode BEFORE calling Session.Init(), so audio can continue flowing
   unencrypted during the brief MLS renegotiation window rather than
   failing entirely.

2. setupKeyRatchetForUser: Guard against nil key ratchets returned by
   GetKeyRatchet(). If encryption is enabled but no valid key ratchet
   exists yet (because the MLS handshake is still in progress), skip
   the transition and keep passthrough mode active. The next MLS
   Welcome/Commit will call prepareTransition again with valid keys.

3. newKeyRatchet: Return nil when the underlying C handle is nil,
   matching the existing pattern used by newWelcomeResult. This
   prevents wrapping NULL C pointers in Go structs that appear
   non-nil to callers.
@topi314
Copy link
Member

topi314 commented Feb 23, 2026

@davfsa can you review this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants