Performant lock-free inter-thread message queue (MPSC/SPSC) + reactor integration

## Motivation

We need a **high-performance, non-blocking inter-thread message queue** for worker-to-worker / reactor-to-worker communication. Concrete drivers:

- **Cross-worker HTTP/3 (#72):** a UDP datagram for a connection owned by worker B can land on worker A's socket (SO_REUSEPORT). We must hand the packet (or a steering command) to B **without blocking A's reactor** — see the companion ACK-timing issue: a blocking handoff would stall ACKs. The current php-async channel ABI is blocking (`send`/`receive` suspend, `poll_cb` is not a coroutine), so it can't be used on the reactor path as-is. We need either a non-suspending `try_send`/`try_receive`, or our own MPSC + trigger.
- **WebSocket (#2):** per-connection outbound queues, broadcast/fan-out, and offloading heavy handlers (see WebSocket section below).
- **General:** offload target for heavy request handlers so the transport reactor stays non-blocking.

## Requirements

1. **Lock-free / non-blocking on the hot path** — producer (and ideally consumer) must not block; a blocked enqueue on the reactor thread would stall the transport.
2. **Bounded by default** — fixed memory, deterministic backpressure; ties into our OOM-firewall philosophy. Unbounded node queues risk OOM under overload.
3. **Event-loop integration** — wake the consumer's libuv/io_uring loop via `eventfd`/`uv_async_send`, **lost-wakeup-safe** (see pitfalls below).
4. **Batch drain** — consumer drains many items per wakeup to amortize the syscall and coalesce writes (we already batch TLS records).
5. **Cache-friendly** — head/tail on separate cache lines, padded slots, no false sharing.
6. **Shape: MPSC is the primary need** (one consumer per worker, many producers). SPSC where a single producer suffices. MPMC only if we add a shared work-stealing pool.

## Algorithm survey

| Algorithm | Shape | Bounded | Alloc on hot path | Cost / op | Notes |
|---|---|---|---|---|---|
| **rigtorp / Lamport ring** | SPSC | yes | no | ~load/store, no CAS | Fastest possible; cached head/tail to cut coherency traffic. Use when exactly one producer. |
| **Vyukov intrusive MPSC** | MPSC | no (node-based) | node is intrusive (no malloc if embedded) | **1 XCHG** + 1 release store on push | The standard MPSC. Producer is wait-free (single atomic exchange, no retry). Consumer can briefly stall if a producer is preempted between XCHG and the follow-up store (not linearizable, "almost lock-free"). Unbounded ⇒ needs a separate backpressure counter. |
| **Vyukov bounded MPMC (array + per-slot sequence)** | MPMC (works as MPSC) | **yes** | **no** | 1 CAS, no amortization | Per-slot sequence number instead of flags; producers/consumers don't touch the same data; cache-line aligned slots; **fails cleanly on full** (natural backpressure). Best "one size fits most" for bounded server use. |
| **Michael & Scott** | MPMC | no | malloc per node | 2 CAS + retries | The 1996 classic, correctness baseline. Slower under contention; needs ABA protection + hazard-pointer/epoch reclamation. Not recommended as primary. |
| **moodycamel ConcurrentQueue** | MPMC | optional | block-based | very high throughput | Set of **per-producer sub-queues** (only per-producer FIFO, no global order). Excellent general MPMC; heavier/complex; C++. |
| **LMAX Disruptor** | MP→MC ring | yes | no (preallocated) | barrier/seq | Ring + sequence barriers; ~order-of-magnitude lower latency / ~8× throughput vs lock-based queue in LMAX's tests. Shines for **broadcast** (every consumer sees every item) and pipelined stages. Heavier framework; best for high-rate fan-out. |

### Recommendation (starting point)

- **Reactor↔worker / cross-worker H3 handoff:** **Vyukov bounded MPMC array used as MPSC** — bounded (OOM-safe), zero alloc, single CAS, clean full-handling for backpressure. If profiling shows the producer XCHG of the intrusive MPSC wins and unboundedness is acceptable with an explicit length cap, consider that instead.
- **SPSC fast lanes:** rigtorp-style ring where one producer is guaranteed.
- **Wakeup:** `eventfd` (or `uv_async_send`) with the drain-and-rearm protocol below.
- Avoid Michael&Scott (reclamation pain) and the Disruptor (framework weight) unless a broadcast use case demands the latter.

## Wakeup integration & the lost-wakeup pitfall

Pattern (edge-triggered, must be race-free):

```
producer:  enqueue(item)
           if (queue_was_empty)  signal(eventfd)   // only first item signals
consumer:  on wakeup: drain ALL items
           mark empty; then re-check queue once more before sleeping (double-check)
```

Pitfalls to get right:
- **Lost wakeup:** consumer sees empty → decides to sleep, but a producer enqueued + signaled in between. Fix with the double-check (re-scan after arming) and correct acquire/release ordering on the empty flag.
- **Signal coalescing:** `eventfd`/`uv_async` coalesce; never assume one signal == one item. Always drain to empty.
- **Closed-fd race:** don't `write()` a wakeup fd that may already be closed during shutdown (cf. Netty/Python eventfd races).

## Memory reclamation

Node-based unbounded queues (Vyukov intrusive, M&S) need safe reclamation (hazard pointers / epoch-based / RCU) for the consumer to free nodes producers may still touch. **Bounded array queues sidestep this entirely** — another reason to prefer them for our hot paths.

## Application: WebSocket (#2) — which queue types pay off

WebSocket is long-lived full-duplex over a single ordered stream; the framing layer must serialize writes per connection. Queue choices:

- **Per-connection outbound queue = bounded SPSC (single-threaded worker) or bounded MPSC (multi-thread producers).** A single writer drains, frames (RFC 6455), optionally `permessage-deflate`-compresses, and coalesces into one `writev`/TLS record. Per-connection FIFO must be preserved; global order across connections is irrelevant → favors per-connection queues / per-producer sub-queues over one shared MPMC.
- **Backpressure is mandatory** (slow client is the #1 WS threat — unbounded growth → OOM/crash). Bounded queue with low/high-water marks and an explicit policy per threshold:
  - **block/pause** the producer (apply backpressure upstream),
  - **drop** non-priority messages,
  - **conflate** ("latest value wins" — overwrite stale entries; ideal for state-sync / market-data style feeds; effectively a 1-slot overwriting mailbox),
  - **close** the slow connection at a critical mark.
- **Control-frame priority lane.** `ping`/`pong`/`close` must not sit behind a large data backlog (RFC 6455 allows interjecting control frames between data fragments). Use a small **priority queue / separate lane** drained first; also lets us split "low-value tick burst" from "high-value message".
- **Broadcast / fan-out (chat, pub/sub):** prefer **per-subscriber queues** over a shared MPMC so one slow subscriber doesn't head-of-line-block the others (backpressure isolation; queued items aren't duplicated into write buffers until processed). For very high-rate "every consumer sees every message" feeds, a **Disruptor-style ring** is the canonical fit.
- **Inbound heavy handlers:** parse frames on the IO thread, hand the message to a worker via **MPSC** — same non-blocking constraint as the H3 reactor (companion issue).

## Open questions / decisions

- [ ] Bounded MPMC-as-MPSC vs intrusive MPSC+length-cap — decide after a micro-benchmark on our workload (producer cost vs consumer-stall vs alloc).
- [ ] Fix the handoff at php-async ABI level (`try_send`/`try_receive` non-suspending) or implement our own MPSC+trigger in the server? (#72 blocker.)
- [ ] Default queue depths / water-mark policy for WS, and conflation opt-in API.
- [ ] Backpressure surface for WS: setter on HttpServerConfig (consistent with [setter-not-ini]).

## References

- Vyukov — intrusive MPSC: https://www.1024cores.net/home/lock-free-algorithms/queues/intrusive-mpsc-node-based-queue
- Vyukov — bounded MPMC: https://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue
- rigtorp SPSCQueue + ring-buffer optimization: https://github.com/rigtorp/SPSCQueue , https://rigtorp.se/ringbuffer/
- moodycamel ConcurrentQueue design: https://moodycamel.com/blog/2014/detailed-design-of-a-lock-free-queue
- LMAX Disruptor: https://lmax-exchange.github.io/disruptor/disruptor.html
- Michael & Scott (1996) lock-free queue (baseline)
- Lost-wakeup with eventfd loops (Netty #9362, Python bpo-30038): https://github.com/netty/netty/issues/9362
- WebSocket broadcasting & per-subscriber queues (python-websockets): https://websockets.readthedocs.io/en/stable/topics/broadcast.html
- WebSocket memory/backpressure: https://websockets.readthedocs.io/en/stable/topics/memory.html
- RFC 6455 (WebSocket framing, control frames): https://www.rfc-editor.org/rfc/rfc6455.html

Companion: ACK-timing / don't-block-the-reactor issue. Related: #72 (cross-worker H3), #2 (WebSocket).


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performant lock-free inter-thread message queue (MPSC/SPSC) + reactor integration #81

Motivation

Requirements

Algorithm survey

Recommendation (starting point)

Wakeup integration & the lost-wakeup pitfall

Memory reclamation

Application: WebSocket (#2) — which queue types pay off

Open questions / decisions

References

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Algorithm	Shape	Bounded	Alloc on hot path	Cost / op	Notes
rigtorp / Lamport ring	SPSC	yes	no	~load/store, no CAS	Fastest possible; cached head/tail to cut coherency traffic. Use when exactly one producer.
Vyukov intrusive MPSC	MPSC	no (node-based)	node is intrusive (no malloc if embedded)	1 XCHG + 1 release store on push	The standard MPSC. Producer is wait-free (single atomic exchange, no retry). Consumer can briefly stall if a producer is preempted between XCHG and the follow-up store (not linearizable, "almost lock-free"). Unbounded ⇒ needs a separate backpressure counter.
Vyukov bounded MPMC (array + per-slot sequence)	MPMC (works as MPSC)	yes	no	1 CAS, no amortization	Per-slot sequence number instead of flags; producers/consumers don't touch the same data; cache-line aligned slots; fails cleanly on full (natural backpressure). Best "one size fits most" for bounded server use.
Michael & Scott	MPMC	no	malloc per node	2 CAS + retries	The 1996 classic, correctness baseline. Slower under contention; needs ABA protection + hazard-pointer/epoch reclamation. Not recommended as primary.
moodycamel ConcurrentQueue	MPMC	optional	block-based	very high throughput	Set of per-producer sub-queues (only per-producer FIFO, no global order). Excellent general MPMC; heavier/complex; C++.
LMAX Disruptor	MP→MC ring	yes	no (preallocated)	barrier/seq	Ring + sequence barriers; ~order-of-magnitude lower latency / ~8× throughput vs lock-based queue in LMAX's tests. Shines for broadcast (every consumer sees every item) and pipelined stages. Heavier framework; best for high-rate fan-out.

Performant lock-free inter-thread message queue (MPSC/SPSC) + reactor integration #81

Description

Motivation

Requirements

Algorithm survey

Recommendation (starting point)

Wakeup integration & the lost-wakeup pitfall

Memory reclamation

Application: WebSocket (#2) — which queue types pay off

Open questions / decisions

References

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions