Restore OTA CRC integrity check and add hard reset after BLE DFU#21

Closed
4np wants to merge 3 commits into oltaco:master from 4np:fix-ota-crc-discard-and-no-reset-after-ble-dfu

@4np 4np commented Mar 22, 2026

BLE OTA firmware update succeeds but application never boots

This pull request addresses #15 by fixing two bugs in the BLE OTA DFU path that together prevent the application from starting after a successful over-the-air firmware update. The DFU upload completes without error, but the device silently fails to boot the new firmware and does not come online.

Bug 1 — OTA CRC is discarded, permanently disabling the boot-time integrity check.
dfu_init_postvalidate() computed and validated the image CRC but never returned it to the caller. m_image_crc remained 0 after every OTA, causing bank_0_crc = 0 to be written to bootloader settings. Because bootloader_app_is_valid() skips the CRC check when bank_0_crc is zero, any image — including a corrupted one — was unconditionally accepted at boot.

Bug 2 — No hard reset after BLE OTA causes silent application initialisation failure.
After activating the new image, the bootloader jumped directly to the application without NVIC_SystemReset(). BLE applications depend on a clean hardware reset to initialise their radio, SoftDevice, and peripheral stack. The dirty post-DFU hardware state caused silent initialisation failure. This is the direct cause of the application never coming online.

Impact: Any board using BLE OTA DFU is affected. Verified on RAK4631 (nRF52840, S140 6.1.1)
with MeshCore 1.14.1. The serial/USB DFU path is unaffected.


Summary

After a successful BLE OTA firmware update via the iOS nRF DFU app, the newly flashed application
never starts. The device silently fails to boot and does not come online. This PR fixes two distinct
bugs that together cause this behaviour:

  1. m_image_crc is never set — the post-OTA CRC computed during image validation is
    discarded, so bank_0_crc = 0 is always written to bootloader settings. This permanently
    disables the boot-time integrity check, meaning a corrupted image would still be booted.
  2. No hard reset after BLE OTA — after activating the new image, the bootloader jumps
    directly to the application without issuing NVIC_SystemReset(). This leaves nRF52840
    peripheral registers, radio state, and NVIC configuration in a post-DFU dirty state that
    BLE applications such as MeshCore cannot recover from.

Both issues must be fixed together. Fix 1 ensures image integrity is enforced after every OTA. Fix 2 addresses the direct cause of the application failing to start.


Reproduction

  • Hardware: RAK4631 (nRF52840), bootloader 0.9.2-OTAFIX2.1-BP1.2, SoftDevice S140 6.1.1
  • Firmware: MeshCore 1.14.1 for RAK4631 (built against S140 6.1.1 — SD version mismatch ruled out as a cause)
  • DFU client: iOS nRF Device Firmware Update app (sends 20-byte BLE packets)
  • Symptom: iOS app reports the upload as successful. The device disconnects (expected). The application never starts; the device appears dead or re-advertises in DFU mode.

Root Cause Analysis

Bug 1 — m_image_crc is never assigned

dfu_init_postvalidate() in src/dfu_init.c correctly computes a CRC-16 over the received image in flash and compares it against the CRC supplied in the DFU init packet. If they match it returns NRF_SUCCESS. However, the computed CRC value was never communicated back to the caller.

The original call site in dfu_single_bank.c:

// Before fix — computed CRC is silently discarded
err_code = dfu_init_postvalidate((uint8_t *)mp_storage_handle_active->block_id, m_image_size);

The static variable m_image_crc (initialised to 0 in dfu_init()) was therefore never
updated:

// dfu_activate_app() — always writes 0
update_status.app_crc = m_image_crc;   // m_image_crc == 0 every time

This zero propagates into bootloader_settings_t.bank_0_crc, which is persisted to flash at BOOTLOADER_SETTINGS_ADDRESS (0x000FF000 on nRF52840).

At the next boot, bootloader_app_is_valid() reads bank_0_crc:

// bootloader.c — boot-time integrity check
if (p_bootloader_settings->bank_0_crc != 0) {
    image_crc = crc16_compute(...);
    success = (image_crc == p_bootloader_settings->bank_0_crc);
}
// When bank_0_crc == 0: image_crc stays 0, condition is vacuously true — check is skipped

Because bank_0_crc is always 0, the CRC check is unconditionally bypassed and any image, including a corrupted one, is considered valid. This is a silent integrity regression that takes effect whenever a BLE OTA is performed.
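
The bypass can be reproduced off-target with a small host-side model. The sketch below is not the bootloader's code: `settings_model_t`, `model_crc16()`, and `model_app_is_valid()` are hypothetical names introduced here, and the CRC routine is a generic bitwise CRC-16 (polynomial 0x1021, initial value 0xFFFF) used as a stand-in for the SDK's `crc16_compute()`. It only illustrates why a stored CRC of 0 lets a modified image pass.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the bank_0_crc field of the bootloader settings. */
typedef struct {
    uint16_t bank_0_crc;
} settings_model_t;

/* Generic bitwise CRC-16 (poly 0x1021, init 0xFFFF); illustrative only. */
uint16_t model_crc16(const uint8_t *data, size_t len)
{
    assert(data != NULL);
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000) {
                crc = (uint16_t)((crc << 1) ^ 0x1021);
            } else {
                crc = (uint16_t)(crc << 1);
            }
        }
    }
    return crc;
}

/* Same shape as the boot-time check: a stored CRC of 0 skips verification. */
bool model_app_is_valid(const settings_model_t *settings,
                        const uint8_t *image, size_t len)
{
    assert(settings != NULL);
    if (settings->bank_0_crc == 0) {
        return true;  /* nothing to compare against, image accepted as-is */
    }
    return model_crc16(image, len) == settings->bank_0_crc;
}
```

With a non-zero stored CRC, flipping a single bit in the image makes `model_app_is_valid()` return false. With `bank_0_crc == 0`, the same modified image is accepted.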


Bug 2 — No hard reset after BLE OTA (root cause of boot failure)

After dfu_image_activate() completes and bootloader settings are saved, execution in check_dfu_mode() (src/main.c) resumes:

// src/main.c — check_dfu_mode(), after bootloader_dfu_start() returns
if (_ota_dfu) {
    sd_softdevice_disable();
    usb_teardown();   // ← returns here, falls through to main()
}

Control then returns to main(), which calls bootloader_app_start():

// bootloader.c — bootloader_app_start()
fwd_ret = sd_softdevice_vector_table_base_set(app_addr);   // FAILS: SD is already disabled

if (fwd_ret != NRF_SUCCESS) {
    // Fallback: write forwarding address to first word of SRAM
    *(uint32_t *)(0x20000000) = app_addr;   // may be overwritten by app .bss init
}

bootloader_util_app_start(app_addr);   // direct jump, no reset

The direct jump is performed with the following hardware state:

| Resource | State after BLE OTA |
| --- | --- |
| nRF52840 peripheral registers | Post-DFU, not reset |
| NVIC (cleared by bootloader_app_start) | Cleared, but SD had configured it for DFU |
| BLE radio | Disabled by sd_softdevice_disable(), but register state not guaranteed clean |
| SoftDevice vector table forward | Unreliable: sd_softdevice_vector_table_base_set() fails when SD is disabled; SRAM fallback at 0x20000000 may be zeroed by the application's .bss initialisation before the SD is re-enabled |
| RAM | Bootloader stacks and globals still present |

Applications such as MeshCore initialise the SoftDevice, BLE stack, LoRa (SX1262 via SPI), and their mesh networking layer from scratch on startup. They depend on the hardware reset vector sequence to put all peripherals into a known state. A direct jump from a post-DFU bootloader provides no such guarantee and the application's initialisation silently fails.

Critically, NVIC_SystemReset() is the correct post-OTA action. GPREGRET is already cleared to 0 earlier in check_dfu_mode() (line 245), so a reset produces a clean boot: no magic value in GPREGRET → check_dfu_mode() does not enter DFU → bootloader_app_is_valid() returns true → bootloader_app_start() is called with a fully reset hardware state.

The direct-jump behaviour was introduced in OTAFIX 2.1 as "auto-boot after OTA" to avoid the BLE OTA timeout when USB is connected. However, it was applied unconditionally to the BLE OTA path, not only to the USB/UF2 path it was intended for. An NVIC_SystemReset() after BLE OTA costs approximately one additional boot cycle (~100–200 ms) and is the only reliable way to guarantee a clean hardware state for the incoming application.
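
The reset-then-boot sequence described above can be modelled on a host as a small decision function. This is a sketch under assumptions, not the bootloader's implementation: `boot_state_t`, `model_boot_decision()`, `model_finish_ota_with_reset()`, and the value of `MODEL_DFU_MAGIC` are all hypothetical, and the "reset" is represented simply by re-running the boot decision on the retained state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical magic value; the real bootloader defines its own constants. */
#define MODEL_DFU_MAGIC 0xA5u

typedef enum { BOOT_ENTER_DFU, BOOT_START_APP } boot_action_t;

typedef struct {
    uint8_t gpregret;   /* models the register retained across a soft reset */
    bool    app_valid;  /* models the result of the boot-time validity check */
} boot_state_t;

/* Decision taken on every boot, i.e. after a reset has run. */
boot_action_t model_boot_decision(const boot_state_t *state)
{
    assert(state != NULL);
    if (state->gpregret == MODEL_DFU_MAGIC) {
        return BOOT_ENTER_DFU;   /* explicit request to stay in DFU */
    }
    if (!state->app_valid) {
        return BOOT_ENTER_DFU;   /* no bootable image, fall back to DFU */
    }
    return BOOT_START_APP;
}

/* Models the fixed post-OTA path: register cleared, then reset and decide again. */
boot_action_t model_finish_ota_with_reset(boot_state_t *state)
{
    assert(state != NULL);
    state->gpregret = 0;                 /* cleared before the reset is issued */
    return model_boot_decision(state);   /* stands in for reset + next boot */
}
```

In this model, a state that entered DFU because of the magic value goes straight to the application after `model_finish_ota_with_reset()`, provided the validity check passes. If the image is invalid, the next boot falls back to DFU.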


Changes

src/dfu_init.c and lib/sdk11/components/libraries/bootloader_dfu/dfu_init.h

Added an output parameter uint16_t * p_crc_out to dfu_init_postvalidate(). When post-validation succeeds, the validated CRC is written through this pointer before returning NRF_SUCCESS.

// After fix
uint32_t dfu_init_postvalidate(uint8_t * p_image, uint32_t image_len, uint16_t * p_crc_out)
{
    ...
    if (image_crc != received_crc) {
        return NRF_ERROR_INVALID_DATA;
    }
    *p_crc_out = image_crc;   // ← new: caller receives the validated CRC
    return NRF_SUCCESS;
}

lib/sdk11/components/libraries/bootloader_dfu/dfu_single_bank.c

Updated the call to dfu_init_postvalidate() to pass &m_image_crc, so the validated CRC is captured and subsequently stored in bootloader_settings_t.bank_0_crc.

// After fix
err_code = dfu_init_postvalidate(
    (uint8_t *)mp_storage_handle_active->block_id,
    m_image_size,
    &m_image_crc);   // ← m_image_crc now holds the real CRC after validation

After this change, bootloader_app_is_valid() will compute a CRC over the application in flash on every boot and compare it against the stored value. A corrupted or incomplete OTA will be caught and the bootloader will re-enter DFU mode rather than jumping to a broken image.
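
The output-parameter contract can be exercised on a host with a minimal sketch. Everything here is an assumption made for illustration: `sketch_postvalidate()` takes the received CRC as an argument instead of reading it from the DFU init packet, `sketch_crc16()` is a generic bitwise CRC-16 rather than the SDK routine, and the `SKETCH_*` status codes are placeholders for `NRF_SUCCESS` and `NRF_ERROR_INVALID_DATA`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder status codes for this sketch only. */
#define SKETCH_SUCCESS            0u
#define SKETCH_ERROR_INVALID_DATA 1u

/* Generic bitwise CRC-16 (poly 0x1021, init 0xFFFF); illustrative only. */
uint16_t sketch_crc16(const uint8_t *data, size_t len)
{
    assert(data != NULL);
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000) {
                crc = (uint16_t)((crc << 1) ^ 0x1021);
            } else {
                crc = (uint16_t)(crc << 1);
            }
        }
    }
    return crc;
}

/* Writes the validated CRC through p_crc_out only when validation succeeds. */
uint32_t sketch_postvalidate(const uint8_t *image, size_t len,
                             uint16_t received_crc, uint16_t *p_crc_out)
{
    assert(image != NULL && p_crc_out != NULL);
    uint16_t image_crc = sketch_crc16(image, len);
    if (image_crc != received_crc) {
        return SKETCH_ERROR_INVALID_DATA;   /* *p_crc_out is left untouched */
    }
    *p_crc_out = image_crc;
    return SKETCH_SUCCESS;
}
```

A caller that starts with its CRC variable at 0 ends up with the real CRC after a successful validation and keeps 0 after a failed one, which mirrors how m_image_crc is expected to behave after this change.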

src/main.c

Added NVIC_SystemReset() in check_dfu_mode() immediately after BLE OTA teardown. This replaces the direct-jump path with a clean hardware reset, giving the application a fully initialised register and peripheral state on first boot.

// After fix
if (_ota_dfu) {
    sd_softdevice_disable();
    usb_teardown();
    NVIC_SystemReset();   // ← clean reset; GPREGRET is already 0, boots straight to app
}

Why both fixes are needed

| | Without either fix | With Fix 1 only | With Fix 2 only | With both fixes |
| --- | --- | --- | --- | --- |
| bank_0_crc after OTA | Always 0 | Real CRC | Always 0 | Real CRC |
| Boot-time CRC check | Always skipped | Active | Always skipped | Active |
| Hardware state on first app boot | Dirty | Dirty | Clean | Clean |
| App starts reliably | No | No | Yes | Yes |
| Corrupted OTA caught at boot | No | Yes | No | Yes |

Fix 2 alone would make the application start, but with the integrity check still bypassed.
Fix 1 alone restores the integrity check but does not fix the boot failure.
Both fixes together restore correct, safe, and reliable OTA behaviour.


Testing

  • BLE OTA of MeshCore 1.14.1 on RAK4631 via iOS nRF DFU app: application starts successfully after OTA completes.
  • Serial DFU path (adafruit-nrfutil dfu serial) unaffected; NVIC_SystemReset() is only added to the _ota_dfu branch.
  • A second OTA immediately after the first completes and boots correctly (GPREGRET state
    verified clean between cycles).
  • Deliberate corruption test: truncated firmware image fails post-validation CRC check during OTA (NRF_ERROR_INVALID_DATA), DFU is rejected before activation — image in flash is unchanged.

Note: This investigation and the resulting code changes were performed with the assistance of Claude Code.

4np added 2 commits March 22, 2026 15:21
  Two bugs caused BLE OTA to silently succeed while the application never
  booted:

  1. dfu_init_postvalidate() computed and validated the image CRC but
     discarded it without writing it back to the caller. m_image_crc
     remained 0 after every OTA, so bank_0_crc = 0 was persisted to
     bootloader settings. bootloader_app_is_valid() skips the CRC check
     when bank_0_crc is 0, meaning any image — including a corrupted one —
     was unconditionally accepted at boot.

     Fix: add uint16_t *p_crc_out to dfu_init_postvalidate() and write the
     validated CRC through it. Update the call site in dfu_single_bank.c to
     pass &m_image_crc so the value is captured and stored in
     bootloader_settings_t.bank_0_crc.

  2. After BLE OTA activation, check_dfu_mode() tore down the SoftDevice
     and USB and then returned to main(), which jumped directly to the
     application without issuing NVIC_SystemReset(). This left nRF52840
     peripheral registers and radio state in a post-DFU condition. BLE
     applications (e.g. MeshCore on RAK4631) depend on a clean hardware
     reset to initialise their radio, SoftDevice, and peripheral stack.
     The direct jump caused silent initialisation failure and the device
     never came online. Additionally, sd_softdevice_vector_table_base_set()
     fails when the SD is already disabled, falling back to writing the
     forwarding address to 0x20000000, which the application's .bss
     initialisation can overwrite before the SD is re-enabled.

     Fix: add NVIC_SystemReset() after BLE OTA teardown. GPREGRET is
     already cleared to 0 earlier in check_dfu_mode(), so the subsequent
     boot goes straight to the application with a fully reset hardware
     state. The serial/USB DFU path is unaffected.
@4np 4np force-pushed the fix-ota-crc-discard-and-no-reset-after-ble-dfu branch from efce1e7 to 1256ae2 Compare March 22, 2026 14:38

4np commented Mar 22, 2026

Closing in favor of #22 (some commits got borked).

@4np 4np closed this Mar 22, 2026