fix: null-check packetPool / queueStatusPool allocCopy across 11 router/module callers#10316
Open
nightjoker7 wants to merge 3 commits into
Open
fix: null-check packetPool / queueStatusPool allocCopy across 11 router/module callers#10316nightjoker7 wants to merge 3 commits into
nightjoker7 wants to merge 3 commits into
Conversation
…ule callers Adds null-checks at 11 call sites that take a pointer from packetPool.allocCopy() or queueStatusPool.allocCopy() and immediately dereference (or pass to a function that derefs) without verifying the allocation succeeded. Allocator::allocCopy is documented to return nullptr on pool exhaustion; these call sites assumed "always succeeds". Sites fixed: src/mesh/MeshService.cpp:113 sendToPhone(allocCopy(*mp)) src/mesh/MeshService.cpp:209 sendToMesh(allocCopy(p)) src/mesh/MeshService.cpp:233 queueStatusPool.allocCopy(qs); copied->res=res src/mesh/MeshService.cpp:271 sendToPhone(allocCopy(*p)) src/mesh/NextHopRouter.cpp:35 startRetransmission(allocCopy(*p)) src/mesh/NextHopRouter.cpp:151 tosend = allocCopy(*p); tosend->hop_limit=0 src/mesh/NextHopRouter.cpp:329,331,336 three retx allocCopy sites src/mesh/ReliableRouter.cpp:21 want_ack copy = allocCopy(*p) src/mesh/Router.cpp:376 p_decoded = allocCopy(*p); MQTT::onSend(*p_decoded) src/modules/NodeInfoModule.cpp:63 packetCopy = allocCopy(mp) src/modules/AtakPluginModule.cpp:199 decompressedCopy = allocCopy(mp) Sites with existing guards (left alone): Router.cpp:744 legacy null-check from meshtastic#9136 Router.cpp:813 existing if (p_encrypted_new) check Telemetry/*.cpp lastMeasurementPacket consumers all check Reproduction (deterministic null-injection on Heltec V3, static-pool board, ~50 MeshPacket slots, no PSRAM): Build A (origin/develop + 1-in-5 allocCopy nullptr inject) rebootCount: 4 -> 26 (+22 panics in 90 s under sendText load) Build B (this patch + 1-in-5 allocCopy nullptr inject) rebootCount: 3 -> 4 (+1 close+reopen probe artifact only; zero hammer-induced crashes) Same hardware, identical workload, same injection rate. The only delta is these 11 null-checks. Field reports whose backtraces / call graphs land in these sites: meshtastic#4410 (T-Beam reboot loop after REPEATER config) - GUVWAF decoded the backtrace in-thread; panic literally inside MeshService::sendToPhone (the call site of packetPool.allocCopy(*mp) at MeshService.cpp:113) meshtastic#1355 (T-Lora v2 1.6 MQTT bridge crash) - filed by caveman99; Guru Meditation LoadProhibited with EXCVADDR=0x00000004 (textbook null-deref of struct field) inside Router::send -> MQTT::onSend meshtastic#3725 (Heltec V3 reboot loop with ATAK plugin) - thebentern landed partial fix meshtastic#3922 noting "there's still some underlying issue"; AtakPluginModule.cpp:199 path meshtastic#2264 (heap-too-low reboot loop with NodeInfo + sendToPhone storm) meshtastic#6772 (T-Lora v2 1.6 abort at operator new during PhoneAPI::handleToRadio) meshtastic#8066 - fifieldt: "In 2.7.9 we fixed a range of heap exhaustion issues that could be the cause" (Heltec V3 reboot acknowledgment) This patch does not claim to fix those reports - it null-checks the specific call sites whose deref-on-null is in their backtraces. Mirrors the merge pattern of meshtastic#10262 (13 allocZeroed null-checks in one focused PR). Same shape, adjacent class.
Contributor
There was a problem hiding this comment.
Pull request overview
This PR adds defensive null-checks at multiple call sites that use packetPool.allocCopy() / queueStatusPool.allocCopy() results immediately, preventing null-dereference crashes when pools are exhausted (especially on low-heap / no-PSRAM boards).
Changes:
- Added allocation-result checks and safe early-exit behavior across router/service/module paths that previously assumed
allocCopy()always succeeds. - Guarded MQTT publish in
Router::send()againstp_decoded == nullptrto avoid crashes under pool exhaustion. - Added warnings / comments to clarify dropped/skipped behavior on allocation failure.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| src/mesh/MeshService.cpp | Null-check packetPool.allocCopy() / queueStatusPool.allocCopy() before forwarding to phone / mesh and sending queue status. |
| src/mesh/NextHopRouter.cpp | Null-check allocCopy() before starting retransmission and before allocating copies for retransmission sends. |
| src/mesh/ReliableRouter.cpp | Avoid starting retransmission if allocCopy() fails (still sends original packet). |
| src/mesh/Router.cpp | Skip MQTT onSend() when decoded-copy allocation fails; keep release behavior safe for nullptr. |
| src/modules/NodeInfoModule.cpp | Null-check packet copy before re-encoding payload and forwarding to phone. |
| src/modules/AtakPluginModule.cpp | Null-check decompressed packet copy before re-encoding and forwarding to phone. |
Comments suppressed due to low confidence (1)
src/mesh/NextHopRouter.cpp:341
- doRetransmissions() now conditionally sends only if allocCopy() succeeds, but the retransmission counter is decremented unconditionally later in the block. If allocCopy() fails due to pool exhaustion, this will consume a retry without actually sending anything and can cause premature NAK/failure. Consider only decrementing tries when a retransmission packet was successfully allocated/queued (but still rescheduling nextTxMsec to avoid a tight loop).
if (auto *c = packetPool.allocCopy(*p.packet)) FloodingRouter::send(c);
} else {
if (auto *c = packetPool.allocCopy(*p.packet)) NextHopRouter::send(c);
}
} else {
// Note: we call the superclass version because we don't want to have our version of send() add a new
// retransmission record
if (auto *c = packetPool.allocCopy(*p.packet)) FloodingRouter::send(c);
}
// Queue again
--p.numRetransmissions;
setNextTx(&p);
| return; | ||
| } | ||
| decompressedCopy->decoded.payload.size = | ||
| pb_encode_to_bytes(decompressedCopy->decoded.payload.bytes, sizeof(decompressedCopy->decoded.payload), |
| if (auto *c = packetPool.allocCopy(*mp)) | ||
| sendToPhone(c); | ||
| else | ||
| LOG_WARN("phone-forward dropped: packetPool exhausted"); |
Comment on lines
+280
to
+281
| // else: pool exhausted; phone will see this packet again via the | ||
| // normal RX-to-phone path on a future cycle. |
Comment on lines
+70
to
+71
| } else { | ||
| LOG_WARN("NodeInfoModule: packetPool exhausted, skipping phone forward"); |
| } | ||
| auto decompressedCopy = packetPool.allocCopy(mp); | ||
| if (!decompressedCopy) { | ||
| LOG_WARN("AtakPluginModule: packetPool exhausted, skipping decompressed phone forward"); |
| { | ||
| meshtastic_QueueStatus *copied = queueStatusPool.allocCopy(qs); | ||
| if (!copied) { | ||
| LOG_WARN("sendQueueStatusToPhone: queueStatusPool exhausted, skipping"); |
AtakPluginModule.cpp:205 — pb_encode_to_bytes was passed sizeof(payload) instead of sizeof(payload.bytes). PB_BYTES_ARRAY_T is a struct with a size field plus the bytes array, so the previous form overstated the buffer by the size-field width and could let pb_encode_to_bytes write past payload.bytes into adjacent fields under a worst-case TAKPacket re-encode. Use sizeof(...payload.bytes) to match the canonical form already used at NodeInfoModule.cpp:67. Caller-side LOG_WARN on packetPool/queueStatusPool exhaustion was producing 3 lines per failure (caller WARN + Allocator::allocCopy WARN at MemoryPool.h:45 + static-pool alloc WARN at MemoryPool.h:158). Downgrade the caller-side message to LOG_DEBUG so the contextual breadcrumb is preserved for debug builds without spamming production logs under burst load. Affects: - MeshService.cpp:116 (handleFromRadio phone-forward) - MeshService.cpp:235 (sendQueueStatusToPhone) - AtakPluginModule.cpp:201 (decompressed TAK phone-forward) - NodeInfoModule.cpp:71 (NodeInfo phone-forward) MeshService.cpp:281 — reword the cc-to-phone exhaustion comment. Previous wording implied the phone would see the packet again via the normal RX-to-phone path; that is not guaranteed for non-broadcast or non-local sends. New comment is accurate about what the drop actually means. Resolves the Copilot review on commit 0984d68.
Contributor
Author
|
Pushed
Verified locally: |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this changes
Adds null-checks at 11 call sites that take a pointer from
packetPool.allocCopy()orqueueStatusPool.allocCopy()and immediately dereference (or pass to a function that derefs) without verifying the allocation succeeded.Allocator::allocCopyis documented to returnnullptron pool exhaustion — these call sites assumed "always succeeds."Total diff: 6 files, 49 insertions / 17 deletions. No behavioral change on the happy path.
src/mesh/MeshService.cpp:113sendToPhone(packetPool.allocCopy(*mp))src/mesh/MeshService.cpp:209auto a = packetPool.allocCopy(p); sendToMesh(a, RX_SRC_USER);src/mesh/MeshService.cpp:233meshtastic_QueueStatus *copied = queueStatusPool.allocCopy(qs); copied->res = res;src/mesh/MeshService.cpp:271auto a = packetPool.allocCopy(*p); sendToPhone(a);src/mesh/NextHopRouter.cpp:35startRetransmission(packetPool.allocCopy(*p));src/mesh/NextHopRouter.cpp:151tosend = packetPool.allocCopy(*p); tosend->hop_limit = 0;src/mesh/NextHopRouter.cpp:329, 331, 336allocCopysites passed straight to*::send(...)src/mesh/ReliableRouter.cpp:21auto copy = packetPool.allocCopy(*p); startRetransmission(copy, NUM_RELIABLE_RETX);src/mesh/Router.cpp:376p_decoded = packetPool.allocCopy(*p); ... mqtt->onSend(*p, *p_decoded, chIndex);src/modules/NodeInfoModule.cpp:63packetCopy = packetPool.allocCopy(mp); packetCopy->decoded.payload.size = ...src/modules/AtakPluginModule.cpp:199decompressedCopy = packetPool.allocCopy(mp); decompressedCopy->decoded.payload.size = ...Affected hardware
The bug fires when
packetPoolis exhausted — i.e., when the heap-backed allocator can't fulfill aMeshPacketallocation (~400 B with metadata). Pool size is bounded byMAX_PACKETS = MAX_RX_TOPHONE(32) + MAX_RX_FROMRADIO(4) + 2*MAX_TX_QUEUE(16) + 2 = 70slots, but the practical exhaustion trigger is free heap dropping below the per-allocation cost under burst load, not slot count.Boards most exposed are those without PSRAM and with tight idle free heap:
PSRAM boards (Heltec V4 TFT, Tracker, Station G2, T-Beam-S3 family, T-Watch-S3) are harder to exhaust because the allocator can back onto 8 MB PSRAM, but the same crash mechanism applies if pool exhaustion ever happens.
Backtraces this addresses
Two existing field reports have decoded backtraces that land directly on the call sites this PR is guarding:
[Bug]: Crash in JSON MQTT bridge #1355 —
Crash in JSON MQTT bridge.Guru Meditation Error: LoadProhibited,EXCVADDR=0x00000004(textbook null-deref of a struct field at offset 4). Decoded backtrace frame 0 isMQTT::downstreamPacketToJson, called viaMQTT::onSendfromRouter::send— the exact path guarded by theRouter.cpp:376fix here (if (p_decoded && ...)). Filed by @caveman99.[BUG] Crash/reboot loop entering 'Settings/Device' using Android app after configuring as REPEATER #4410 —
Crash/reboot loop entering 'Settings/Device' using Android app after configuring as REPEATER. @GUVWAF decoded the backtrace in-thread; frame 0 isMeshService::sendToPhone, which is the call site ofpacketPool.allocCopy(*mp)atMeshService.cpp:113. GUVWAF noted "not sure why it's crashing" — the unguarded null deref on pool exhaustion is a plausible cause, though not definitively proven from the trace alone.I'm not claiming this PR "fixes" either issue (both are closed); I'm claiming the call graphs in their decoded backtraces match the call sites being guarded here. If pool exhaustion is the trigger, the symptom shifts from a panic to a logged drop after this change.
Reproduction
Used a deterministic null-injection harness on Heltec V3 (ESP32-S3, no PSRAM, the same hardware tier as the cited reports): a 1-line edit in
MemoryPool.h::allocCopyreturningnullptrevery 5th call. Drove sendText load viameshtastic-python SerialInterfacefor 90 s on two devices flashed identically except for these patches.origin/develop+ injection)develop+ 11 sites + injection)Same hardware, same workload, same injection rate, same channel. The only difference is the 11 null-checks.
Why the diff is small
Mirrors the merge pattern of #10262 (13
allocZeroednull-checks in one focused PR). Same shape, adjacent class. Sites that already had null-checks are left alone —Router.cpp:744(legacy guard from #9136),Router.cpp:813(existingif (p_encrypted_new)), and the Telemetry consumers (if (!lastMeasurementPacket)).Risk
Minimal. Each fix is an additive
if (auto *c = ...) ...;orif (!ptr) { LOG_WARN; early-return; }. The only runtime change vs baseline is "pool exhaustion now logs and drops one packet instead of crashing the radio task." No public-API surface change.Follow-up
A round-2 follow-through covers helper-level fixes (
Router::allocForSending,MeshModule::allocAckNak,SinglePortModule::allocDataPacket,ProtobufModule::allocDataProtobuf) plus 10 module callers (RoutingModule, NeighborInfoModule, PositionModule, StatusMessageModule, the 5 Telemetry modules + HostMetrics). Held for a separate PR to keep this diff focused; the 11 sites here are the load-bearing ones in the cited backtraces.