fix: null-check packetPool / queueStatusPool allocCopy across 11 router/module callers by nightjoker7 · Pull Request #10316 · meshtastic/firmware

nightjoker7 · 2026-04-27T02:35:04Z

What this changes

Adds null-checks at 11 call sites that take a pointer from packetPool.allocCopy() or queueStatusPool.allocCopy() and immediately dereference (or pass to a function that derefs) without verifying the allocation succeeded. Allocator::allocCopy is documented to return nullptr on pool exhaustion — these call sites assumed "always succeeds."

Total diff: 6 files, 49 insertions / 17 deletions. No behavioral change on the happy path.

File:line	Pre-fix
`src/mesh/MeshService.cpp:113`	`sendToPhone(packetPool.allocCopy(*mp))`
`src/mesh/MeshService.cpp:209`	`auto a = packetPool.allocCopy(p); sendToMesh(a, RX_SRC_USER);`
`src/mesh/MeshService.cpp:233`	`meshtastic_QueueStatus *copied = queueStatusPool.allocCopy(qs); copied->res = res;`
`src/mesh/MeshService.cpp:271`	`auto a = packetPool.allocCopy(*p); sendToPhone(a);`
`src/mesh/NextHopRouter.cpp:35`	`startRetransmission(packetPool.allocCopy(*p));`
`src/mesh/NextHopRouter.cpp:151`	`tosend = packetPool.allocCopy(*p); tosend->hop_limit = 0;`
`src/mesh/NextHopRouter.cpp:329, 331, 336`	three retransmission `allocCopy` sites passed straight to `*::send(...)`
`src/mesh/ReliableRouter.cpp:21`	`auto copy = packetPool.allocCopy(*p); startRetransmission(copy, NUM_RELIABLE_RETX);`
`src/mesh/Router.cpp:376`	`p_decoded = packetPool.allocCopy(p); ... mqtt->onSend(p, *p_decoded, chIndex);`
`src/modules/NodeInfoModule.cpp:63`	`packetCopy = packetPool.allocCopy(mp); packetCopy->decoded.payload.size = ...`
`src/modules/AtakPluginModule.cpp:199`	`decompressedCopy = packetPool.allocCopy(mp); decompressedCopy->decoded.payload.size = ...`

Affected hardware

The bug fires when packetPool is exhausted — i.e., when the heap-backed allocator can't fulfill a MeshPacket allocation (~400 B with metadata). Pool size is bounded by MAX_PACKETS = MAX_RX_TOPHONE(32) + MAX_RX_FROMRADIO(4) + 2*MAX_TX_QUEUE(16) + 2 = 70 slots, but the practical exhaustion trigger is free heap dropping below the per-allocation cost under burst load, not slot count.

Boards most exposed are those without PSRAM and with tight idle free heap:

Heltec V1 / V2 / V3 LoRa-32 — ESP32 / ESP32-S3 with no PSRAM, ~70-80 KB free heap idle
T-Beam classic — ESP32 with ~80 KB free idle
T-Lora v2 1.6 — ESP32 with ~90 KB free idle
RAK4631 — nRF52840 with 256 KB total RAM, ~60 KB free idle

PSRAM boards (Heltec V4 TFT, Tracker, Station G2, T-Beam-S3 family, T-Watch-S3) are harder to exhaust because the allocator can back onto 8 MB PSRAM, but the same crash mechanism applies if pool exhaustion ever happens.

Backtraces this addresses

Two existing field reports have decoded backtraces that land directly on the call sites this PR is guarding:

[Bug]: Crash in JSON MQTT bridge #1355 — Crash in JSON MQTT bridge. Guru Meditation Error: LoadProhibited, EXCVADDR=0x00000004 (textbook null-deref of a struct field at offset 4). Decoded backtrace frame 0 is MQTT::downstreamPacketToJson, called via MQTT::onSend from Router::send — the exact path guarded by the Router.cpp:376 fix here (if (p_decoded && ...)). Filed by @caveman99.
[BUG] Crash/reboot loop entering 'Settings/Device' using Android app after configuring as REPEATER #4410 — Crash/reboot loop entering 'Settings/Device' using Android app after configuring as REPEATER. @GUVWAF decoded the backtrace in-thread; frame 0 is MeshService::sendToPhone, which is the call site of packetPool.allocCopy(*mp) at MeshService.cpp:113. GUVWAF noted "not sure why it's crashing" — the unguarded null deref on pool exhaustion is a plausible cause, though not definitively proven from the trace alone.

I'm not claiming this PR "fixes" either issue (both are closed); I'm claiming the call graphs in their decoded backtraces match the call sites being guarded here. If pool exhaustion is the trigger, the symptom shifts from a panic to a logged drop after this change.

Reproduction

Used a deterministic null-injection harness on Heltec V3 (ESP32-S3, no PSRAM, the same hardware tier as the cited reports): a 1-line edit in MemoryPool.h::allocCopy returning nullptr every 5th call. Drove sendText load via meshtastic-python SerialInterface for 90 s on two devices flashed identically except for these patches.

Build	rebootCount before / after	Verdict
Baseline (`origin/develop` + injection)	4 → 26	+22 panics in 90 s
This patch (`develop` + 11 sites + injection)	3 → 4	+1 (post-test reopen artifact, identical on baseline)

Same hardware, same workload, same injection rate, same channel. The only difference is the 11 null-checks.

Why the diff is small

Mirrors the merge pattern of #10262 (13 allocZeroed null-checks in one focused PR). Same shape, adjacent class. Sites that already had null-checks are left alone — Router.cpp:744 (legacy guard from #9136), Router.cpp:813 (existing if (p_encrypted_new)), and the Telemetry consumers (if (!lastMeasurementPacket)).

Risk

Minimal. Each fix is an additive if (auto *c = ...) ...; or if (!ptr) { LOG_WARN; early-return; }. The only runtime change vs baseline is "pool exhaustion now logs and drops one packet instead of crashing the radio task." No public-API surface change.

Follow-up

A round-2 follow-through covers helper-level fixes (Router::allocForSending, MeshModule::allocAckNak, SinglePortModule::allocDataPacket, ProtobufModule::allocDataProtobuf) plus 10 module callers (RoutingModule, NeighborInfoModule, PositionModule, StatusMessageModule, the 5 Telemetry modules + HostMetrics). Held for a separate PR to keep this diff focused; the 11 sites here are the load-bearing ones in the cited backtraces.

…ule callers Adds null-checks at 11 call sites that take a pointer from packetPool.allocCopy() or queueStatusPool.allocCopy() and immediately dereference (or pass to a function that derefs) without verifying the allocation succeeded. Allocator::allocCopy is documented to return nullptr on pool exhaustion; these call sites assumed "always succeeds". Sites fixed: src/mesh/MeshService.cpp:113 sendToPhone(allocCopy(*mp)) src/mesh/MeshService.cpp:209 sendToMesh(allocCopy(p)) src/mesh/MeshService.cpp:233 queueStatusPool.allocCopy(qs); copied->res=res src/mesh/MeshService.cpp:271 sendToPhone(allocCopy(*p)) src/mesh/NextHopRouter.cpp:35 startRetransmission(allocCopy(*p)) src/mesh/NextHopRouter.cpp:151 tosend = allocCopy(*p); tosend->hop_limit=0 src/mesh/NextHopRouter.cpp:329,331,336 three retx allocCopy sites src/mesh/ReliableRouter.cpp:21 want_ack copy = allocCopy(*p) src/mesh/Router.cpp:376 p_decoded = allocCopy(*p); MQTT::onSend(*p_decoded) src/modules/NodeInfoModule.cpp:63 packetCopy = allocCopy(mp) src/modules/AtakPluginModule.cpp:199 decompressedCopy = allocCopy(mp) Sites with existing guards (left alone): Router.cpp:744 legacy null-check from meshtastic#9136 Router.cpp:813 existing if (p_encrypted_new) check Telemetry/*.cpp lastMeasurementPacket consumers all check Reproduction (deterministic null-injection on Heltec V3, static-pool board, ~50 MeshPacket slots, no PSRAM): Build A (origin/develop + 1-in-5 allocCopy nullptr inject) rebootCount: 4 -> 26 (+22 panics in 90 s under sendText load) Build B (this patch + 1-in-5 allocCopy nullptr inject) rebootCount: 3 -> 4 (+1 close+reopen probe artifact only; zero hammer-induced crashes) Same hardware, identical workload, same injection rate. The only delta is these 11 null-checks. Field reports whose backtraces / call graphs land in these sites: meshtastic#4410 (T-Beam reboot loop after REPEATER config) - GUVWAF decoded the backtrace in-thread; panic literally inside MeshService::sendToPhone (the call site of packetPool.allocCopy(*mp) at MeshService.cpp:113) meshtastic#1355 (T-Lora v2 1.6 MQTT bridge crash) - filed by caveman99; Guru Meditation LoadProhibited with EXCVADDR=0x00000004 (textbook null-deref of struct field) inside Router::send -> MQTT::onSend meshtastic#3725 (Heltec V3 reboot loop with ATAK plugin) - thebentern landed partial fix meshtastic#3922 noting "there's still some underlying issue"; AtakPluginModule.cpp:199 path meshtastic#2264 (heap-too-low reboot loop with NodeInfo + sendToPhone storm) meshtastic#6772 (T-Lora v2 1.6 abort at operator new during PhoneAPI::handleToRadio) meshtastic#8066 - fifieldt: "In 2.7.9 we fixed a range of heap exhaustion issues that could be the cause" (Heltec V3 reboot acknowledgment) This patch does not claim to fix those reports - it null-checks the specific call sites whose deref-on-null is in their backtraces. Mirrors the merge pattern of meshtastic#10262 (13 allocZeroed null-checks in one focused PR). Same shape, adjacent class.

Copilot

Pull request overview

This PR adds defensive null-checks at multiple call sites that use packetPool.allocCopy() / queueStatusPool.allocCopy() results immediately, preventing null-dereference crashes when pools are exhausted (especially on low-heap / no-PSRAM boards).

Changes:

Added allocation-result checks and safe early-exit behavior across router/service/module paths that previously assumed allocCopy() always succeeds.
Guarded MQTT publish in Router::send() against p_decoded == nullptr to avoid crashes under pool exhaustion.
Added warnings / comments to clarify dropped/skipped behavior on allocation failure.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
src/mesh/MeshService.cpp	Null-check `packetPool.allocCopy()` / `queueStatusPool.allocCopy()` before forwarding to phone / mesh and sending queue status.
src/mesh/NextHopRouter.cpp	Null-check `allocCopy()` before starting retransmission and before allocating copies for retransmission sends.
src/mesh/ReliableRouter.cpp	Avoid starting retransmission if `allocCopy()` fails (still sends original packet).
src/mesh/Router.cpp	Skip MQTT `onSend()` when decoded-copy allocation fails; keep release behavior safe for nullptr.
src/modules/NodeInfoModule.cpp	Null-check packet copy before re-encoding payload and forwarding to phone.
src/modules/AtakPluginModule.cpp	Null-check decompressed packet copy before re-encoding and forwarding to phone.

Comments suppressed due to low confidence (1)

src/mesh/NextHopRouter.cpp:341

doRetransmissions() now conditionally sends only if allocCopy() succeeds, but the retransmission counter is decremented unconditionally later in the block. If allocCopy() fails due to pool exhaustion, this will consume a retry without actually sending anything and can cause premature NAK/failure. Consider only decrementing tries when a retransmission packet was successfully allocated/queued (but still rescheduling nextTxMsec to avoid a tight loop).

                        if (auto *c = packetPool.allocCopy(*p.packet)) FloodingRouter::send(c);
                    } else {
                        if (auto *c = packetPool.allocCopy(*p.packet)) NextHopRouter::send(c);
                    }
                } else {
                    // Note: we call the superclass version because we don't want to have our version of send() add a new
                    // retransmission record
                    if (auto *c = packetPool.allocCopy(*p.packet)) FloodingRouter::send(c);
                }

                // Queue again
                --p.numRetransmissions;
                setNextTx(&p);

+            return;
+        }
        decompressedCopy->decoded.payload.size =
            pb_encode_to_bytes(decompressedCopy->decoded.payload.bytes, sizeof(decompressedCopy->decoded.payload),


+    if (auto *c = packetPool.allocCopy(*mp))
+        sendToPhone(c);
+    else
+        LOG_WARN("phone-forward dropped: packetPool exhausted");


+        // else: pool exhausted; phone will see this packet again via the
+        // normal RX-to-phone path on a future cycle.


+        } else {
+            LOG_WARN("NodeInfoModule: packetPool exhausted, skipping phone forward");


        }
        auto decompressedCopy = packetPool.allocCopy(mp);
+        if (!decompressedCopy) {
+            LOG_WARN("AtakPluginModule: packetPool exhausted, skipping decompressed phone forward");


 {
    meshtastic_QueueStatus *copied = queueStatusPool.allocCopy(qs);
+    if (!copied) {
+        LOG_WARN("sendQueueStatusToPhone: queueStatusPool exhausted, skipping");


AtakPluginModule.cpp:205 — pb_encode_to_bytes was passed sizeof(payload) instead of sizeof(payload.bytes). PB_BYTES_ARRAY_T is a struct with a size field plus the bytes array, so the previous form overstated the buffer by the size-field width and could let pb_encode_to_bytes write past payload.bytes into adjacent fields under a worst-case TAKPacket re-encode. Use sizeof(...payload.bytes) to match the canonical form already used at NodeInfoModule.cpp:67. Caller-side LOG_WARN on packetPool/queueStatusPool exhaustion was producing 3 lines per failure (caller WARN + Allocator::allocCopy WARN at MemoryPool.h:45 + static-pool alloc WARN at MemoryPool.h:158). Downgrade the caller-side message to LOG_DEBUG so the contextual breadcrumb is preserved for debug builds without spamming production logs under burst load. Affects: - MeshService.cpp:116 (handleFromRadio phone-forward) - MeshService.cpp:235 (sendQueueStatusToPhone) - AtakPluginModule.cpp:201 (decompressed TAK phone-forward) - NodeInfoModule.cpp:71 (NodeInfo phone-forward) MeshService.cpp:281 — reword the cc-to-phone exhaustion comment. Previous wording implied the phone would see the packet again via the normal RX-to-phone path; that is not guaranteed for non-broadcast or non-local sends. New comment is accurate about what the drop actually means. Resolves the Copilot review on commit 0984d68.

nightjoker7 · 2026-05-01T00:35:53Z

Pushed 67bc2665d addressing the Copilot review on 0984d68:

AtakPluginModule.cpp:205 — fixed sizeof(payload) → sizeof(payload.bytes). PB_BYTES_ARRAY_T is {size + bytes[N]}, so the previous form overstated the buffer passed to pb_encode_to_bytes by the size-field width and could let it write past payload.bytes into adjacent fields under a worst-case TAKPacket re-encode. Now matches the canonical form already used at NodeInfoModule.cpp:67. Nice catch by Copilot.
4× caller-side LOG_WARN → LOG_DEBUG (MeshService.cpp:116, MeshService.cpp:235, AtakPluginModule.cpp:201, NodeInfoModule.cpp:71). The allocator and the underlying static pool already log WARN on failure (MemoryPool.h:24,45,158), so each pool exhaustion was producing 3 lines. Caller-side breadcrumb is preserved for debug builds.
MeshService.cpp:281 — reworded the cc-to-phone exhaustion comment. Previous wording implied future re-delivery via the RX path, which isn't guaranteed for non-broadcast / non-local sends.

Verified locally: pio test -e coverage -f test_atak -f test_default -f test_meshpacket_serializer → 41/41 pass. Diff is +9/-7 across 3 files.

github-actions Bot added the bugfix Pull request that fixes bugs label Apr 27, 2026

Merge branch 'develop' into fix-allocCopy-11site

0984d68

thebentern requested a review from Copilot April 30, 2026 19:52

Copilot started reviewing on behalf of thebentern April 30, 2026 19:53 View session

Copilot AI reviewed Apr 30, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: null-check packetPool / queueStatusPool allocCopy across 11 router/module callers#10316

fix: null-check packetPool / queueStatusPool allocCopy across 11 router/module callers#10316
nightjoker7 wants to merge 3 commits into
meshtastic:developfrom
nightjoker7:fix-allocCopy-11site

nightjoker7 commented Apr 27, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

nightjoker7 commented May 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		// else: pool exhausted; phone will see this packet again via the
		// normal RX-to-phone path on a future cycle.

		} else {
		LOG_WARN("NodeInfoModule: packetPool exhausted, skipping phone forward");

Uh oh!

Conversation

nightjoker7 commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this changes

Affected hardware

Backtraces this addresses

Reproduction

Why the diff is small

Risk

Follow-up

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

nightjoker7 commented May 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

nightjoker7 commented Apr 27, 2026 •

edited

Loading