You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
We need a high-performance, non-blocking inter-thread message queue for worker-to-worker / reactor-to-worker communication. Concrete drivers:
Cross-worker HTTP/3 (HTTP/3: cross-worker connection migration (SO_REUSEPORT CID steering) #72): a UDP datagram for a connection owned by worker B can land on worker A's socket (SO_REUSEPORT). We must hand the packet (or a steering command) to B without blocking A's reactor — see the companion ACK-timing issue: a blocking handoff would stall ACKs. The current php-async channel ABI is blocking (send/receive suspend, poll_cb is not a coroutine), so it can't be used on the reactor path as-is. We need either a non-suspending try_send/try_receive, or our own MPSC + trigger.
WebSocket (WebSocket support (RFC 6455) #2): per-connection outbound queues, broadcast/fan-out, and offloading heavy handlers (see WebSocket section below).
General: offload target for heavy request handlers so the transport reactor stays non-blocking.
Requirements
Lock-free / non-blocking on the hot path — producer (and ideally consumer) must not block; a blocked enqueue on the reactor thread would stall the transport.
Bounded by default — fixed memory, deterministic backpressure; ties into our OOM-firewall philosophy. Unbounded node queues risk OOM under overload.
Event-loop integration — wake the consumer's libuv/io_uring loop via eventfd/uv_async_send, lost-wakeup-safe (see pitfalls below).
Batch drain — consumer drains many items per wakeup to amortize the syscall and coalesce writes (we already batch TLS records).
Cache-friendly — head/tail on separate cache lines, padded slots, no false sharing.
Shape: MPSC is the primary need (one consumer per worker, many producers). SPSC where a single producer suffices. MPMC only if we add a shared work-stealing pool.
Algorithm survey
Algorithm
Shape
Bounded
Alloc on hot path
Cost / op
Notes
rigtorp / Lamport ring
SPSC
yes
no
~load/store, no CAS
Fastest possible; cached head/tail to cut coherency traffic. Use when exactly one producer.
Vyukov intrusive MPSC
MPSC
no (node-based)
node is intrusive (no malloc if embedded)
1 XCHG + 1 release store on push
The standard MPSC. Producer is wait-free (single atomic exchange, no retry). Consumer can briefly stall if a producer is preempted between XCHG and the follow-up store (not linearizable, "almost lock-free"). Unbounded ⇒ needs a separate backpressure counter.
Vyukov bounded MPMC (array + per-slot sequence)
MPMC (works as MPSC)
yes
no
1 CAS, no amortization
Per-slot sequence number instead of flags; producers/consumers don't touch the same data; cache-line aligned slots; fails cleanly on full (natural backpressure). Best "one size fits most" for bounded server use.
Michael & Scott
MPMC
no
malloc per node
2 CAS + retries
The 1996 classic, correctness baseline. Slower under contention; needs ABA protection + hazard-pointer/epoch reclamation. Not recommended as primary.
moodycamel ConcurrentQueue
MPMC
optional
block-based
very high throughput
Set of per-producer sub-queues (only per-producer FIFO, no global order). Excellent general MPMC; heavier/complex; C++.
LMAX Disruptor
MP→MC ring
yes
no (preallocated)
barrier/seq
Ring + sequence barriers; ~order-of-magnitude lower latency / ~8× throughput vs lock-based queue in LMAX's tests. Shines for broadcast (every consumer sees every item) and pipelined stages. Heavier framework; best for high-rate fan-out.
Recommendation (starting point)
Reactor↔worker / cross-worker H3 handoff:Vyukov bounded MPMC array used as MPSC — bounded (OOM-safe), zero alloc, single CAS, clean full-handling for backpressure. If profiling shows the producer XCHG of the intrusive MPSC wins and unboundedness is acceptable with an explicit length cap, consider that instead.
SPSC fast lanes: rigtorp-style ring where one producer is guaranteed.
Wakeup:eventfd (or uv_async_send) with the drain-and-rearm protocol below.
Avoid Michael&Scott (reclamation pain) and the Disruptor (framework weight) unless a broadcast use case demands the latter.
Wakeup integration & the lost-wakeup pitfall
Pattern (edge-triggered, must be race-free):
producer: enqueue(item)
if (queue_was_empty) signal(eventfd) // only first item signals
consumer: on wakeup: drain ALL items
mark empty; then re-check queue once more before sleeping (double-check)
Pitfalls to get right:
Lost wakeup: consumer sees empty → decides to sleep, but a producer enqueued + signaled in between. Fix with the double-check (re-scan after arming) and correct acquire/release ordering on the empty flag.
Signal coalescing:eventfd/uv_async coalesce; never assume one signal == one item. Always drain to empty.
Closed-fd race: don't write() a wakeup fd that may already be closed during shutdown (cf. Netty/Python eventfd races).
Memory reclamation
Node-based unbounded queues (Vyukov intrusive, M&S) need safe reclamation (hazard pointers / epoch-based / RCU) for the consumer to free nodes producers may still touch. Bounded array queues sidestep this entirely — another reason to prefer them for our hot paths.
Application: WebSocket (#2) — which queue types pay off
WebSocket is long-lived full-duplex over a single ordered stream; the framing layer must serialize writes per connection. Queue choices:
Per-connection outbound queue = bounded SPSC (single-threaded worker) or bounded MPSC (multi-thread producers). A single writer drains, frames (RFC 6455), optionally permessage-deflate-compresses, and coalesces into one writev/TLS record. Per-connection FIFO must be preserved; global order across connections is irrelevant → favors per-connection queues / per-producer sub-queues over one shared MPMC.
Backpressure is mandatory (slow client is the Add multi-protocol server description to README #1 WS threat — unbounded growth → OOM/crash). Bounded queue with low/high-water marks and an explicit policy per threshold:
block/pause the producer (apply backpressure upstream),
drop non-priority messages,
conflate ("latest value wins" — overwrite stale entries; ideal for state-sync / market-data style feeds; effectively a 1-slot overwriting mailbox),
close the slow connection at a critical mark.
Control-frame priority lane.ping/pong/close must not sit behind a large data backlog (RFC 6455 allows interjecting control frames between data fragments). Use a small priority queue / separate lane drained first; also lets us split "low-value tick burst" from "high-value message".
Broadcast / fan-out (chat, pub/sub): prefer per-subscriber queues over a shared MPMC so one slow subscriber doesn't head-of-line-block the others (backpressure isolation; queued items aren't duplicated into write buffers until processed). For very high-rate "every consumer sees every message" feeds, a Disruptor-style ring is the canonical fit.
Inbound heavy handlers: parse frames on the IO thread, hand the message to a worker via MPSC — same non-blocking constraint as the H3 reactor (companion issue).
Open questions / decisions
Bounded MPMC-as-MPSC vs intrusive MPSC+length-cap — decide after a micro-benchmark on our workload (producer cost vs consumer-stall vs alloc).
Motivation
We need a high-performance, non-blocking inter-thread message queue for worker-to-worker / reactor-to-worker communication. Concrete drivers:
send/receivesuspend,poll_cbis not a coroutine), so it can't be used on the reactor path as-is. We need either a non-suspendingtry_send/try_receive, or our own MPSC + trigger.Requirements
eventfd/uv_async_send, lost-wakeup-safe (see pitfalls below).Algorithm survey
Recommendation (starting point)
eventfd(oruv_async_send) with the drain-and-rearm protocol below.Wakeup integration & the lost-wakeup pitfall
Pattern (edge-triggered, must be race-free):
Pitfalls to get right:
eventfd/uv_asynccoalesce; never assume one signal == one item. Always drain to empty.write()a wakeup fd that may already be closed during shutdown (cf. Netty/Python eventfd races).Memory reclamation
Node-based unbounded queues (Vyukov intrusive, M&S) need safe reclamation (hazard pointers / epoch-based / RCU) for the consumer to free nodes producers may still touch. Bounded array queues sidestep this entirely — another reason to prefer them for our hot paths.
Application: WebSocket (#2) — which queue types pay off
WebSocket is long-lived full-duplex over a single ordered stream; the framing layer must serialize writes per connection. Queue choices:
permessage-deflate-compresses, and coalesces into onewritev/TLS record. Per-connection FIFO must be preserved; global order across connections is irrelevant → favors per-connection queues / per-producer sub-queues over one shared MPMC.ping/pong/closemust not sit behind a large data backlog (RFC 6455 allows interjecting control frames between data fragments). Use a small priority queue / separate lane drained first; also lets us split "low-value tick burst" from "high-value message".Open questions / decisions
try_send/try_receivenon-suspending) or implement our own MPSC+trigger in the server? (HTTP/3: cross-worker connection migration (SO_REUSEPORT CID steering) #72 blocker.)References
Companion: ACK-timing / don't-block-the-reactor issue. Related: #72 (cross-worker H3), #2 (WebSocket).