Skip to content

fix: null-check packetPool / queueStatusPool allocCopy across 11 router/module callers#10316

Open
nightjoker7 wants to merge 3 commits into
meshtastic:developfrom
nightjoker7:fix-allocCopy-11site
Open

fix: null-check packetPool / queueStatusPool allocCopy across 11 router/module callers#10316
nightjoker7 wants to merge 3 commits into
meshtastic:developfrom
nightjoker7:fix-allocCopy-11site

Conversation

@nightjoker7
Copy link
Copy Markdown
Contributor

@nightjoker7 nightjoker7 commented Apr 27, 2026

What this changes

Adds null-checks at 11 call sites that take a pointer from packetPool.allocCopy() or queueStatusPool.allocCopy() and immediately dereference (or pass to a function that derefs) without verifying the allocation succeeded. Allocator::allocCopy is documented to return nullptr on pool exhaustion — these call sites assumed "always succeeds."

Total diff: 6 files, 49 insertions / 17 deletions. No behavioral change on the happy path.

File:line Pre-fix
src/mesh/MeshService.cpp:113 sendToPhone(packetPool.allocCopy(*mp))
src/mesh/MeshService.cpp:209 auto a = packetPool.allocCopy(p); sendToMesh(a, RX_SRC_USER);
src/mesh/MeshService.cpp:233 meshtastic_QueueStatus *copied = queueStatusPool.allocCopy(qs); copied->res = res;
src/mesh/MeshService.cpp:271 auto a = packetPool.allocCopy(*p); sendToPhone(a);
src/mesh/NextHopRouter.cpp:35 startRetransmission(packetPool.allocCopy(*p));
src/mesh/NextHopRouter.cpp:151 tosend = packetPool.allocCopy(*p); tosend->hop_limit = 0;
src/mesh/NextHopRouter.cpp:329, 331, 336 three retransmission allocCopy sites passed straight to *::send(...)
src/mesh/ReliableRouter.cpp:21 auto copy = packetPool.allocCopy(*p); startRetransmission(copy, NUM_RELIABLE_RETX);
src/mesh/Router.cpp:376 p_decoded = packetPool.allocCopy(*p); ... mqtt->onSend(*p, *p_decoded, chIndex);
src/modules/NodeInfoModule.cpp:63 packetCopy = packetPool.allocCopy(mp); packetCopy->decoded.payload.size = ...
src/modules/AtakPluginModule.cpp:199 decompressedCopy = packetPool.allocCopy(mp); decompressedCopy->decoded.payload.size = ...

Affected hardware

The bug fires when packetPool is exhausted — i.e., when the heap-backed allocator can't fulfill a MeshPacket allocation (~400 B with metadata). Pool size is bounded by MAX_PACKETS = MAX_RX_TOPHONE(32) + MAX_RX_FROMRADIO(4) + 2*MAX_TX_QUEUE(16) + 2 = 70 slots, but the practical exhaustion trigger is free heap dropping below the per-allocation cost under burst load, not slot count.

Boards most exposed are those without PSRAM and with tight idle free heap:

  • Heltec V1 / V2 / V3 LoRa-32 — ESP32 / ESP32-S3 with no PSRAM, ~70-80 KB free heap idle
  • T-Beam classic — ESP32 with ~80 KB free idle
  • T-Lora v2 1.6 — ESP32 with ~90 KB free idle
  • RAK4631 — nRF52840 with 256 KB total RAM, ~60 KB free idle

PSRAM boards (Heltec V4 TFT, Tracker, Station G2, T-Beam-S3 family, T-Watch-S3) are harder to exhaust because the allocator can back onto 8 MB PSRAM, but the same crash mechanism applies if pool exhaustion ever happens.

Backtraces this addresses

Two existing field reports have decoded backtraces that land directly on the call sites this PR is guarding:

  • [Bug]: Crash in JSON MQTT bridge #1355Crash in JSON MQTT bridge. Guru Meditation Error: LoadProhibited, EXCVADDR=0x00000004 (textbook null-deref of a struct field at offset 4). Decoded backtrace frame 0 is MQTT::downstreamPacketToJson, called via MQTT::onSend from Router::send — the exact path guarded by the Router.cpp:376 fix here (if (p_decoded && ...)). Filed by @caveman99.

  • [BUG] Crash/reboot loop entering 'Settings/Device' using Android app after configuring as REPEATER #4410Crash/reboot loop entering 'Settings/Device' using Android app after configuring as REPEATER. @GUVWAF decoded the backtrace in-thread; frame 0 is MeshService::sendToPhone, which is the call site of packetPool.allocCopy(*mp) at MeshService.cpp:113. GUVWAF noted "not sure why it's crashing" — the unguarded null deref on pool exhaustion is a plausible cause, though not definitively proven from the trace alone.

I'm not claiming this PR "fixes" either issue (both are closed); I'm claiming the call graphs in their decoded backtraces match the call sites being guarded here. If pool exhaustion is the trigger, the symptom shifts from a panic to a logged drop after this change.

Reproduction

Used a deterministic null-injection harness on Heltec V3 (ESP32-S3, no PSRAM, the same hardware tier as the cited reports): a 1-line edit in MemoryPool.h::allocCopy returning nullptr every 5th call. Drove sendText load via meshtastic-python SerialInterface for 90 s on two devices flashed identically except for these patches.

Build rebootCount before / after Verdict
Baseline (origin/develop + injection) 4 → 26 +22 panics in 90 s
This patch (develop + 11 sites + injection) 3 → 4 +1 (post-test reopen artifact, identical on baseline)

Same hardware, same workload, same injection rate, same channel. The only difference is the 11 null-checks.

Why the diff is small

Mirrors the merge pattern of #10262 (13 allocZeroed null-checks in one focused PR). Same shape, adjacent class. Sites that already had null-checks are left alone — Router.cpp:744 (legacy guard from #9136), Router.cpp:813 (existing if (p_encrypted_new)), and the Telemetry consumers (if (!lastMeasurementPacket)).

Risk

Minimal. Each fix is an additive if (auto *c = ...) ...; or if (!ptr) { LOG_WARN; early-return; }. The only runtime change vs baseline is "pool exhaustion now logs and drops one packet instead of crashing the radio task." No public-API surface change.

Follow-up

A round-2 follow-through covers helper-level fixes (Router::allocForSending, MeshModule::allocAckNak, SinglePortModule::allocDataPacket, ProtobufModule::allocDataProtobuf) plus 10 module callers (RoutingModule, NeighborInfoModule, PositionModule, StatusMessageModule, the 5 Telemetry modules + HostMetrics). Held for a separate PR to keep this diff focused; the 11 sites here are the load-bearing ones in the cited backtraces.

…ule callers

Adds null-checks at 11 call sites that take a pointer from
packetPool.allocCopy() or queueStatusPool.allocCopy() and immediately
dereference (or pass to a function that derefs) without verifying the
allocation succeeded. Allocator::allocCopy is documented to return
nullptr on pool exhaustion; these call sites assumed "always succeeds".

Sites fixed:
  src/mesh/MeshService.cpp:113   sendToPhone(allocCopy(*mp))
  src/mesh/MeshService.cpp:209   sendToMesh(allocCopy(p))
  src/mesh/MeshService.cpp:233   queueStatusPool.allocCopy(qs); copied->res=res
  src/mesh/MeshService.cpp:271   sendToPhone(allocCopy(*p))
  src/mesh/NextHopRouter.cpp:35  startRetransmission(allocCopy(*p))
  src/mesh/NextHopRouter.cpp:151 tosend = allocCopy(*p); tosend->hop_limit=0
  src/mesh/NextHopRouter.cpp:329,331,336  three retx allocCopy sites
  src/mesh/ReliableRouter.cpp:21 want_ack copy = allocCopy(*p)
  src/mesh/Router.cpp:376        p_decoded = allocCopy(*p); MQTT::onSend(*p_decoded)
  src/modules/NodeInfoModule.cpp:63       packetCopy = allocCopy(mp)
  src/modules/AtakPluginModule.cpp:199    decompressedCopy = allocCopy(mp)

Sites with existing guards (left alone):
  Router.cpp:744  legacy null-check from meshtastic#9136
  Router.cpp:813  existing if (p_encrypted_new) check
  Telemetry/*.cpp lastMeasurementPacket consumers all check

Reproduction (deterministic null-injection on Heltec V3, static-pool board,
~50 MeshPacket slots, no PSRAM):

  Build A (origin/develop + 1-in-5 allocCopy nullptr inject)
    rebootCount: 4 -> 26  (+22 panics in 90 s under sendText load)
  Build B (this patch    + 1-in-5 allocCopy nullptr inject)
    rebootCount: 3 -> 4   (+1 close+reopen probe artifact only;
                            zero hammer-induced crashes)

Same hardware, identical workload, same injection rate. The only delta
is these 11 null-checks.

Field reports whose backtraces / call graphs land in these sites:

  meshtastic#4410 (T-Beam reboot loop after REPEATER config) - GUVWAF decoded the
        backtrace in-thread; panic literally inside MeshService::sendToPhone
        (the call site of packetPool.allocCopy(*mp) at MeshService.cpp:113)
  meshtastic#1355 (T-Lora v2 1.6 MQTT bridge crash) - filed by caveman99;
        Guru Meditation LoadProhibited with EXCVADDR=0x00000004 (textbook
        null-deref of struct field) inside Router::send -> MQTT::onSend
  meshtastic#3725 (Heltec V3 reboot loop with ATAK plugin) - thebentern landed
        partial fix meshtastic#3922 noting "there's still some underlying issue";
        AtakPluginModule.cpp:199 path
  meshtastic#2264 (heap-too-low reboot loop with NodeInfo + sendToPhone storm)
  meshtastic#6772 (T-Lora v2 1.6 abort at operator new during PhoneAPI::handleToRadio)
  meshtastic#8066 - fifieldt: "In 2.7.9 we fixed a range of heap exhaustion issues
         that could be the cause" (Heltec V3 reboot acknowledgment)

This patch does not claim to fix those reports - it null-checks the
specific call sites whose deref-on-null is in their backtraces.

Mirrors the merge pattern of meshtastic#10262 (13 allocZeroed null-checks in one
focused PR). Same shape, adjacent class.
@github-actions github-actions Bot added the bugfix Pull request that fixes bugs label Apr 27, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds defensive null-checks at multiple call sites that use packetPool.allocCopy() / queueStatusPool.allocCopy() results immediately, preventing null-dereference crashes when pools are exhausted (especially on low-heap / no-PSRAM boards).

Changes:

  • Added allocation-result checks and safe early-exit behavior across router/service/module paths that previously assumed allocCopy() always succeeds.
  • Guarded MQTT publish in Router::send() against p_decoded == nullptr to avoid crashes under pool exhaustion.
  • Added warnings / comments to clarify dropped/skipped behavior on allocation failure.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
src/mesh/MeshService.cpp Null-check packetPool.allocCopy() / queueStatusPool.allocCopy() before forwarding to phone / mesh and sending queue status.
src/mesh/NextHopRouter.cpp Null-check allocCopy() before starting retransmission and before allocating copies for retransmission sends.
src/mesh/ReliableRouter.cpp Avoid starting retransmission if allocCopy() fails (still sends original packet).
src/mesh/Router.cpp Skip MQTT onSend() when decoded-copy allocation fails; keep release behavior safe for nullptr.
src/modules/NodeInfoModule.cpp Null-check packet copy before re-encoding payload and forwarding to phone.
src/modules/AtakPluginModule.cpp Null-check decompressed packet copy before re-encoding and forwarding to phone.
Comments suppressed due to low confidence (1)

src/mesh/NextHopRouter.cpp:341

  • doRetransmissions() now conditionally sends only if allocCopy() succeeds, but the retransmission counter is decremented unconditionally later in the block. If allocCopy() fails due to pool exhaustion, this will consume a retry without actually sending anything and can cause premature NAK/failure. Consider only decrementing tries when a retransmission packet was successfully allocated/queued (but still rescheduling nextTxMsec to avoid a tight loop).
                        if (auto *c = packetPool.allocCopy(*p.packet)) FloodingRouter::send(c);
                    } else {
                        if (auto *c = packetPool.allocCopy(*p.packet)) NextHopRouter::send(c);
                    }
                } else {
                    // Note: we call the superclass version because we don't want to have our version of send() add a new
                    // retransmission record
                    if (auto *c = packetPool.allocCopy(*p.packet)) FloodingRouter::send(c);
                }

                // Queue again
                --p.numRetransmissions;
                setNextTx(&p);

Comment thread src/modules/AtakPluginModule.cpp Outdated
return;
}
decompressedCopy->decoded.payload.size =
pb_encode_to_bytes(decompressedCopy->decoded.payload.bytes, sizeof(decompressedCopy->decoded.payload),
Comment thread src/mesh/MeshService.cpp Outdated
if (auto *c = packetPool.allocCopy(*mp))
sendToPhone(c);
else
LOG_WARN("phone-forward dropped: packetPool exhausted");
Comment thread src/mesh/MeshService.cpp Outdated
Comment on lines +280 to +281
// else: pool exhausted; phone will see this packet again via the
// normal RX-to-phone path on a future cycle.
Comment thread src/modules/NodeInfoModule.cpp Outdated
Comment on lines +70 to +71
} else {
LOG_WARN("NodeInfoModule: packetPool exhausted, skipping phone forward");
Comment thread src/modules/AtakPluginModule.cpp Outdated
}
auto decompressedCopy = packetPool.allocCopy(mp);
if (!decompressedCopy) {
LOG_WARN("AtakPluginModule: packetPool exhausted, skipping decompressed phone forward");
Comment thread src/mesh/MeshService.cpp Outdated
{
meshtastic_QueueStatus *copied = queueStatusPool.allocCopy(qs);
if (!copied) {
LOG_WARN("sendQueueStatusToPhone: queueStatusPool exhausted, skipping");
AtakPluginModule.cpp:205 — pb_encode_to_bytes was passed sizeof(payload)
instead of sizeof(payload.bytes). PB_BYTES_ARRAY_T is a struct with a size
field plus the bytes array, so the previous form overstated the buffer by
the size-field width and could let pb_encode_to_bytes write past
payload.bytes into adjacent fields under a worst-case TAKPacket re-encode.
Use sizeof(...payload.bytes) to match the canonical form already used at
NodeInfoModule.cpp:67.

Caller-side LOG_WARN on packetPool/queueStatusPool exhaustion was producing
3 lines per failure (caller WARN + Allocator::allocCopy WARN at
MemoryPool.h:45 + static-pool alloc WARN at MemoryPool.h:158). Downgrade
the caller-side message to LOG_DEBUG so the contextual breadcrumb is
preserved for debug builds without spamming production logs under burst
load. Affects:
  - MeshService.cpp:116 (handleFromRadio phone-forward)
  - MeshService.cpp:235 (sendQueueStatusToPhone)
  - AtakPluginModule.cpp:201 (decompressed TAK phone-forward)
  - NodeInfoModule.cpp:71 (NodeInfo phone-forward)

MeshService.cpp:281 — reword the cc-to-phone exhaustion comment. Previous
wording implied the phone would see the packet again via the normal
RX-to-phone path; that is not guaranteed for non-broadcast or non-local
sends. New comment is accurate about what the drop actually means.

Resolves the Copilot review on commit 0984d68.
@nightjoker7
Copy link
Copy Markdown
Contributor Author

Pushed 67bc2665d addressing the Copilot review on 0984d68:

  • AtakPluginModule.cpp:205 — fixed sizeof(payload)sizeof(payload.bytes). PB_BYTES_ARRAY_T is {size + bytes[N]}, so the previous form overstated the buffer passed to pb_encode_to_bytes by the size-field width and could let it write past payload.bytes into adjacent fields under a worst-case TAKPacket re-encode. Now matches the canonical form already used at NodeInfoModule.cpp:67. Nice catch by Copilot.
  • 4× caller-side LOG_WARNLOG_DEBUG (MeshService.cpp:116, MeshService.cpp:235, AtakPluginModule.cpp:201, NodeInfoModule.cpp:71). The allocator and the underlying static pool already log WARN on failure (MemoryPool.h:24,45,158), so each pool exhaustion was producing 3 lines. Caller-side breadcrumb is preserved for debug builds.
  • MeshService.cpp:281 — reworded the cc-to-phone exhaustion comment. Previous wording implied future re-delivery via the RX path, which isn't guaranteed for non-broadcast / non-local sends.

Verified locally: pio test -e coverage -f test_atak -f test_default -f test_meshpacket_serializer → 41/41 pass. Diff is +9/-7 across 3 files.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bugfix Pull request that fixes bugs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants