Skip to content

Performant lock-free inter-thread message queue (MPSC/SPSC) + reactor integration #81

@EdmondDantes

Description

@EdmondDantes

Motivation

We need a high-performance, non-blocking inter-thread message queue for worker-to-worker / reactor-to-worker communication. Concrete drivers:

  • Cross-worker HTTP/3 (HTTP/3: cross-worker connection migration (SO_REUSEPORT CID steering) #72): a UDP datagram for a connection owned by worker B can land on worker A's socket (SO_REUSEPORT). We must hand the packet (or a steering command) to B without blocking A's reactor — see the companion ACK-timing issue: a blocking handoff would stall ACKs. The current php-async channel ABI is blocking (send/receive suspend, poll_cb is not a coroutine), so it can't be used on the reactor path as-is. We need either a non-suspending try_send/try_receive, or our own MPSC + trigger.
  • WebSocket (WebSocket support (RFC 6455) #2): per-connection outbound queues, broadcast/fan-out, and offloading heavy handlers (see WebSocket section below).
  • General: offload target for heavy request handlers so the transport reactor stays non-blocking.

Requirements

  1. Lock-free / non-blocking on the hot path — producer (and ideally consumer) must not block; a blocked enqueue on the reactor thread would stall the transport.
  2. Bounded by default — fixed memory, deterministic backpressure; ties into our OOM-firewall philosophy. Unbounded node queues risk OOM under overload.
  3. Event-loop integration — wake the consumer's libuv/io_uring loop via eventfd/uv_async_send, lost-wakeup-safe (see pitfalls below).
  4. Batch drain — consumer drains many items per wakeup to amortize the syscall and coalesce writes (we already batch TLS records).
  5. Cache-friendly — head/tail on separate cache lines, padded slots, no false sharing.
  6. Shape: MPSC is the primary need (one consumer per worker, many producers). SPSC where a single producer suffices. MPMC only if we add a shared work-stealing pool.

Algorithm survey

Algorithm Shape Bounded Alloc on hot path Cost / op Notes
rigtorp / Lamport ring SPSC yes no ~load/store, no CAS Fastest possible; cached head/tail to cut coherency traffic. Use when exactly one producer.
Vyukov intrusive MPSC MPSC no (node-based) node is intrusive (no malloc if embedded) 1 XCHG + 1 release store on push The standard MPSC. Producer is wait-free (single atomic exchange, no retry). Consumer can briefly stall if a producer is preempted between XCHG and the follow-up store (not linearizable, "almost lock-free"). Unbounded ⇒ needs a separate backpressure counter.
Vyukov bounded MPMC (array + per-slot sequence) MPMC (works as MPSC) yes no 1 CAS, no amortization Per-slot sequence number instead of flags; producers/consumers don't touch the same data; cache-line aligned slots; fails cleanly on full (natural backpressure). Best "one size fits most" for bounded server use.
Michael & Scott MPMC no malloc per node 2 CAS + retries The 1996 classic, correctness baseline. Slower under contention; needs ABA protection + hazard-pointer/epoch reclamation. Not recommended as primary.
moodycamel ConcurrentQueue MPMC optional block-based very high throughput Set of per-producer sub-queues (only per-producer FIFO, no global order). Excellent general MPMC; heavier/complex; C++.
LMAX Disruptor MP→MC ring yes no (preallocated) barrier/seq Ring + sequence barriers; ~order-of-magnitude lower latency / ~8× throughput vs lock-based queue in LMAX's tests. Shines for broadcast (every consumer sees every item) and pipelined stages. Heavier framework; best for high-rate fan-out.

Recommendation (starting point)

  • Reactor↔worker / cross-worker H3 handoff: Vyukov bounded MPMC array used as MPSC — bounded (OOM-safe), zero alloc, single CAS, clean full-handling for backpressure. If profiling shows the producer XCHG of the intrusive MPSC wins and unboundedness is acceptable with an explicit length cap, consider that instead.
  • SPSC fast lanes: rigtorp-style ring where one producer is guaranteed.
  • Wakeup: eventfd (or uv_async_send) with the drain-and-rearm protocol below.
  • Avoid Michael&Scott (reclamation pain) and the Disruptor (framework weight) unless a broadcast use case demands the latter.

Wakeup integration & the lost-wakeup pitfall

Pattern (edge-triggered, must be race-free):

producer:  enqueue(item)
           if (queue_was_empty)  signal(eventfd)   // only first item signals
consumer:  on wakeup: drain ALL items
           mark empty; then re-check queue once more before sleeping (double-check)

Pitfalls to get right:

  • Lost wakeup: consumer sees empty → decides to sleep, but a producer enqueued + signaled in between. Fix with the double-check (re-scan after arming) and correct acquire/release ordering on the empty flag.
  • Signal coalescing: eventfd/uv_async coalesce; never assume one signal == one item. Always drain to empty.
  • Closed-fd race: don't write() a wakeup fd that may already be closed during shutdown (cf. Netty/Python eventfd races).

Memory reclamation

Node-based unbounded queues (Vyukov intrusive, M&S) need safe reclamation (hazard pointers / epoch-based / RCU) for the consumer to free nodes producers may still touch. Bounded array queues sidestep this entirely — another reason to prefer them for our hot paths.

Application: WebSocket (#2) — which queue types pay off

WebSocket is long-lived full-duplex over a single ordered stream; the framing layer must serialize writes per connection. Queue choices:

  • Per-connection outbound queue = bounded SPSC (single-threaded worker) or bounded MPSC (multi-thread producers). A single writer drains, frames (RFC 6455), optionally permessage-deflate-compresses, and coalesces into one writev/TLS record. Per-connection FIFO must be preserved; global order across connections is irrelevant → favors per-connection queues / per-producer sub-queues over one shared MPMC.
  • Backpressure is mandatory (slow client is the Add multi-protocol server description to README #1 WS threat — unbounded growth → OOM/crash). Bounded queue with low/high-water marks and an explicit policy per threshold:
    • block/pause the producer (apply backpressure upstream),
    • drop non-priority messages,
    • conflate ("latest value wins" — overwrite stale entries; ideal for state-sync / market-data style feeds; effectively a 1-slot overwriting mailbox),
    • close the slow connection at a critical mark.
  • Control-frame priority lane. ping/pong/close must not sit behind a large data backlog (RFC 6455 allows interjecting control frames between data fragments). Use a small priority queue / separate lane drained first; also lets us split "low-value tick burst" from "high-value message".
  • Broadcast / fan-out (chat, pub/sub): prefer per-subscriber queues over a shared MPMC so one slow subscriber doesn't head-of-line-block the others (backpressure isolation; queued items aren't duplicated into write buffers until processed). For very high-rate "every consumer sees every message" feeds, a Disruptor-style ring is the canonical fit.
  • Inbound heavy handlers: parse frames on the IO thread, hand the message to a worker via MPSC — same non-blocking constraint as the H3 reactor (companion issue).

Open questions / decisions

  • Bounded MPMC-as-MPSC vs intrusive MPSC+length-cap — decide after a micro-benchmark on our workload (producer cost vs consumer-stall vs alloc).
  • Fix the handoff at php-async ABI level (try_send/try_receive non-suspending) or implement our own MPSC+trigger in the server? (HTTP/3: cross-worker connection migration (SO_REUSEPORT CID steering) #72 blocker.)
  • Default queue depths / water-mark policy for WS, and conflation opt-in API.
  • Backpressure surface for WS: setter on HttpServerConfig (consistent with [setter-not-ini]).

References

Companion: ACK-timing / don't-block-the-reactor issue. Related: #72 (cross-worker H3), #2 (WebSocket).

Metadata

Metadata

Assignees

Labels

enhancementNew feature or request

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions