Fix: Null mContext crash when commissioning after background/foreground#43127
Conversation
There was a problem hiding this comment.
Code Review
This pull request addresses a critical crash that occurs during commissioning after an app background/foreground cycle. The root cause was correctly identified as a null pointer dereference on mContext due to clusters failing to re-register.
The changes introduced are robust and well-thought-out:
- The
ServerClusterInterfaceRegistry::RegisterandLazyRegisteredServerCluster::Createmethods are now idempotent, which correctly handles re-registration without causingCHIP_ERROR_DUPLICATE_KEY_IDerrors. - The
DefaultServerCluster::Startupmethod is also made idempotent, allowing clusters to be re-initialized without being fully destroyed, preserving important state likemDataVersion. - The shutdown logic in
CodeDrivenDataModelProvider::Shutdownhas been correctly modified to only clear cluster state (mContext) without unregistering the clusters, which is essential for the warm-start lifecycle. - The changes in the tv-casting app example demonstrate a more robust start/stop sequence for the application lifecycle.
- A new unit test,
WarmStartAfterShutdown, has been added to specifically validate the fix for the background/foreground scenario, and existing tests have been updated to reflect the new idempotent behavior. - Defensive null checks have been added in
GeneralCommissioningClusterto prevent similar crashes in the future.
Overall, this is an excellent fix that addresses the issue at its core and improves the robustness of the application lifecycle management. The changes are consistent, well-commented, and well-tested.
f3ff9d5 to
8bdfa18
Compare
|
PR #43127: Size comparison from 07f6efb to 8bdfa18 Full report (34 builds for bl602, bl702, bl702l, cc13x4_26x4, cc32xx, efr32, esp32, nxp, psoc6, qpg, realtek, stm32, telink)
|
8bdfa18 to
4437009
Compare
|
PR #43127: Size comparison from 58568d5 to 4437009 Full report (21 builds for bl602, bl702, bl702l, cc13x4_26x4, cc32xx, nrfconnect, nxp, psoc6, realtek, stm32)
|
4437009 to
5f6da83
Compare
|
PR #43127: Size comparison from 58568d5 to 5f6da83 Full report (22 builds for bl602, bl702, bl702l, cc13x4_26x4, cc32xx, nrfconnect, psoc6, qpg, realtek, stm32)
|
5f6da83 to
3fc953f
Compare
|
PR #43127: Size comparison from 58568d5 to 3fc953f Increases above 0.2%:
Full report (35 builds for bl602, bl702, bl702l, cc13x4_26x4, cc32xx, efr32, esp32, nrfconnect, nxp, psoc6, qpg, realtek, stm32, telink)
|
3fc953f to
1bbe14f
Compare
|
PR #43127: Size comparison from 7636760 to 1bbe14f Increases above 0.2%:
Full report (35 builds for bl602, bl702, bl702l, cc13x4_26x4, cc32xx, efr32, esp32, nrfconnect, nxp, psoc6, qpg, realtek, stm32, telink)
|
|
PR #43127: Size comparison from 7636760 to 22a0a5a Increases above 0.2%:
Full report (31 builds for bl602, bl702, bl702l, cc13x4_26x4, cc32xx, nrfconnect, nxp, psoc6, qpg, realtek, stm32, telink)
|
|
PR #43127: Size comparison from 7636760 to 133879d Increases above 0.2%:
Full report (35 builds for bl602, bl702, bl702l, cc13x4_26x4, cc32xx, efr32, esp32, nrfconnect, nxp, psoc6, qpg, realtek, stm32, telink)
|
|
PR #43127: Size comparison from 7636760 to c912e28 Increases above 0.2%:
Full report (27 builds for bl602, bl702, bl702l, cc13x4_26x4, cc32xx, efr32, esp32, nrfconnect, nxp, psoc6, qpg, realtek, stm32)
|
|
PR #43127: Size comparison from 7636760 to d9699fc Increases above 0.2%:
Full report (27 builds for bl602, bl702, bl702l, cc13x4_26x4, cc32xx, efr32, esp32, nrfconnect, nxp, psoc6, qpg, realtek, stm32)
|
|
PR #43127: Size comparison from 7636760 to 5acf3f6 Increases above 0.2%:
Full report (34 builds for bl602, bl702, bl702l, cc13x4_26x4, cc32xx, efr32, esp32, nrfconnect, nxp, psoc6, qpg, realtek, stm32, telink)
|
After backgrounding/foregrounding, commissioning crashes with null mContext. Root cause: - Background: ClearContext() sets mContext = null - Foreground: Register() fails with CHIP_ERROR_DUPLICATE_KEY_ID - SetContext() blocked, mContext stays null - Commission attempt accesses null mContext → crash Solution: - Make Register() and Shutdown() idempotent for app lifecycle - Unregister clusters in Shutdown() to clean state - SetContext() now succeeds, restoring mContext - Add 2-second delay in test framework after SIGTERM for socket cleanup Testing: Validated on iOS examples/tv-casting-app app and all unit tests pass. Test Fix: TC_TLSCERT_2_12 was failing because idempotent shutdown makes cleanup so fast that kernel doesn't finish releasing TCP sockets before next test starts. Added brief delay to allow proper socket cleanup. Applies to all platforms with app lifecycle (iOS, Android, tvOS, etc.) Apply restyle formatting (whitespace & clang-format)
5acf3f6 to
b8423a7
Compare
|
PR #43127: Size comparison from be18b91 to b8423a7 Increases above 0.2%:
Full report (35 builds for bl602, bl702, bl702l, cc13x4_26x4, cc32xx, efr32, esp32, nrfconnect, nxp, psoc6, qpg, realtek, stm32, telink)
|
andy31415
left a comment
There was a problem hiding this comment.
Consensur review checkmark: authored and reviewed by media maintainers.
…nd (project-chip#43127) After backgrounding/foregrounding, commissioning crashes with null mContext. Root cause: - Background: ClearContext() sets mContext = null - Foreground: Register() fails with CHIP_ERROR_DUPLICATE_KEY_ID - SetContext() blocked, mContext stays null - Commission attempt accesses null mContext → crash Solution: - Make Register() and Shutdown() idempotent for app lifecycle - Unregister clusters in Shutdown() to clean state - SetContext() now succeeds, restoring mContext - Add 2-second delay in test framework after SIGTERM for socket cleanup Testing: Validated on iOS examples/tv-casting-app app and all unit tests pass. Test Fix: TC_TLSCERT_2_12 was failing because idempotent shutdown makes cleanup so fast that kernel doesn't finish releasing TCP sockets before next test starts. Added brief delay to allow proper socket cleanup. Applies to all platforms with app lifecycle (iOS, Android, tvOS, etc.) Apply restyle formatting (whitespace & clang-format) (cherry picked from commit ead8174)
…nd (project-chip#43127) After backgrounding/foregrounding, commissioning crashes with null mContext. Root cause: - Background: ClearContext() sets mContext = null - Foreground: Register() fails with CHIP_ERROR_DUPLICATE_KEY_ID - SetContext() blocked, mContext stays null - Commission attempt accesses null mContext → crash Solution: - Make Register() and Shutdown() idempotent for app lifecycle - Unregister clusters in Shutdown() to clean state - SetContext() now succeeds, restoring mContext - Add 2-second delay in test framework after SIGTERM for socket cleanup Testing: Validated on iOS examples/tv-casting-app app and all unit tests pass. Test Fix: TC_TLSCERT_2_12 was failing because idempotent shutdown makes cleanup so fast that kernel doesn't finish releasing TCP sockets before next test starts. Added brief delay to allow proper socket cleanup. Applies to all platforms with app lifecycle (iOS, Android, tvOS, etc.) Apply restyle formatting (whitespace & clang-format) (cherry picked from commit ead8174)
…nd (project-chip#43127) After backgrounding/foregrounding, commissioning crashes with null mContext. Root cause: - Background: ClearContext() sets mContext = null - Foreground: Register() fails with CHIP_ERROR_DUPLICATE_KEY_ID - SetContext() blocked, mContext stays null - Commission attempt accesses null mContext → crash Solution: - Make Register() and Shutdown() idempotent for app lifecycle - Unregister clusters in Shutdown() to clean state - SetContext() now succeeds, restoring mContext - Add 2-second delay in test framework after SIGTERM for socket cleanup Testing: Validated on iOS examples/tv-casting-app app and all unit tests pass. Test Fix: TC_TLSCERT_2_12 was failing because idempotent shutdown makes cleanup so fast that kernel doesn't finish releasing TCP sockets before next test starts. Added brief delay to allow proper socket cleanup. Applies to all platforms with app lifecycle (iOS, Android, tvOS, etc.) Apply restyle formatting (whitespace & clang-format) (cherry picked from commit ead8174)
…nd (project-chip#43127) After backgrounding/foregrounding, commissioning crashes with null mContext. Root cause: - Background: ClearContext() sets mContext = null - Foreground: Register() fails with CHIP_ERROR_DUPLICATE_KEY_ID - SetContext() blocked, mContext stays null - Commission attempt accesses null mContext → crash Solution: - Make Register() and Shutdown() idempotent for app lifecycle - Unregister clusters in Shutdown() to clean state - SetContext() now succeeds, restoring mContext - Add 2-second delay in test framework after SIGTERM for socket cleanup Testing: Validated on iOS examples/tv-casting-app app and all unit tests pass. Test Fix: TC_TLSCERT_2_12 was failing because idempotent shutdown makes cleanup so fast that kernel doesn't finish releasing TCP sockets before next test starts. Added brief delay to allow proper socket cleanup. Applies to all platforms with app lifecycle (iOS, Android, tvOS, etc.) Apply restyle formatting (whitespace & clang-format) (cherry picked from commit ead8174)
…nd (#43127) (#43299) After backgrounding/foregrounding, commissioning crashes with null mContext. Root cause: - Background: ClearContext() sets mContext = null - Foreground: Register() fails with CHIP_ERROR_DUPLICATE_KEY_ID - SetContext() blocked, mContext stays null - Commission attempt accesses null mContext → crash Solution: - Make Register() and Shutdown() idempotent for app lifecycle - Unregister clusters in Shutdown() to clean state - SetContext() now succeeds, restoring mContext - Add 2-second delay in test framework after SIGTERM for socket cleanup Testing: Validated on iOS examples/tv-casting-app app and all unit tests pass. Test Fix: TC_TLSCERT_2_12 was failing because idempotent shutdown makes cleanup so fast that kernel doesn't finish releasing TCP sockets before next test starts. Added brief delay to allow proper socket cleanup. Applies to all platforms with app lifecycle (iOS, Android, tvOS, etc.) Apply restyle formatting (whitespace & clang-format) (cherry picked from commit ead8174)
| void DefaultServerCluster::Shutdown(ClusterShutdownType) | ||
| { | ||
| mContext = nullptr; | ||
| // Make shutdown idempotent - safe to call multiple times |
There was a problem hiding this comment.
It was safe before though ... you can set mContext to nullptr multiple times. This seems like trying to preserve mContext non-null if mShutdown is called.
This feels off.
| protected: | ||
| const ConcreteClusterPath mPath; | ||
| ServerClusterContext * mContext = nullptr; | ||
| // Tracks if shutdown has been called to make it idempotent |
There was a problem hiding this comment.
This comment does not make things clear.
andy31415
left a comment
There was a problem hiding this comment.
@pgregorr-amazon I thought I originally reviewed changes to casting apps, but now I see changes to clusters and DefaultServerCluster which I am not clear about. it seems to not enforce Startup/Shutdown symmetry which feels off.
| { | ||
| VerifyOrReturnError(mContext == nullptr, CHIP_ERROR_ALREADY_INITIALIZED); | ||
| // Reset shutdown state to allow restart after shutdown | ||
| mIsShutdown = false; |
There was a problem hiding this comment.
Is this not equivalent to using/checking mContext being null or not?
We should probably focus on mDataVersion updates as the differentiating factor.
At the same time, why do we allow startup called while startup was already called? The original API contract was that startup/shutdown are paired and documentation below seems to say the same.
| Span<const ConcreteClusterPath> paths = entry.serverClusterInterface->GetPaths(); | ||
| VerifyOrReturnError(!paths.empty(), CHIP_ERROR_INVALID_ARGUMENT); | ||
|
|
||
| // Check early if this cluster is already registered (idempotent case) |
There was a problem hiding this comment.
I had missed this change in the original review ... the intent for registration was to disallow double-registration.
| // Double-checking for duplicates makes the checks O(n^2) on the total number of registered | ||
| // items. We preserve this however we may want to make this optional at some point in time. | ||
| VerifyOrReturnError(Get(path) == nullptr, CHIP_ERROR_DUPLICATE_KEY_ID); | ||
| // A different cluster is already registered for this path |
There was a problem hiding this comment.
This seems equivalent to the previous code that did VerifyOrReturnError ... could we have made the delta smaller?
| // Handle idempotent Create() - if already constructed, this is a no-op. | ||
| // The cluster remains registered and functional. This supports the Stop() → Start() lifecycle | ||
| // where clusters are shutdown but remain constructed. | ||
| if (IsConstructed()) |
There was a problem hiding this comment.
We do not actually wanted this .... sorry for the review, my original assumption was that we only changed tv casting code.
…foreground (project-chip#43127)" This reverts commit ead8174.
…foreground (project-chip#43127)" (project-chip#43416) This reverts commit ead8174.
Summary
Fix null mContext crash when commissioning after app background/foreground cycle.
After backgrounding and foregrounding (e.g., switching to Photos app, locking device for 30+ seconds), commissioning crashes with null pointer dereference on mContext. Root cause: CHIP_ERROR_DUPLICATE_KEY_ID during cluster re-registration blocks SetContext() from completing, leaving mContext null. When commission attempt accesses mContext, app crashes.
Solution:
Architecture: Aligns registry state with cluster persistence - clusters remain registered across background/foreground cycles, only mContext cycles (cleared on background, restored on foreground).
Impact:
Attached Documentation:
0_Repro_Steps.mdfor detailed reproduction stepsTest Infrastructure Fix:
TC_TLSCERT_2_12 was failing due to socket cleanup timing. The idempotent shutdown optimization makes app termination so efficient that processes exit before the kernel completes TCP socket cleanup. This leaves ports actively bound (not in TIME_WAIT), causing "Address already in use" errors when the next test tries to bind 17 seconds later.
Investigation findings:
Fix: Added 2-second delay after SIGTERM in test framework to allow kernel time to complete socket cleanup. This is test-infrastructure specific and doesn't affect production behavior.
Related issues
Fixes iOS Matter Casting app crash after backgrounding/foregrounding.
Testing
Unit Tests:
Manual Testing:
Log Evidence:
Files Modified:
Readability checklist
The checklist below will help the reviewer finish PR review in time and keep the
code readable:
descriptive
"When in Rome…"
rule (coding style)
See: Pull Request Guidelines
0_Repro_Steps.md
1_iOS logs withOUT fix - Background Lock Unlock Foreground Connect.log.filterline.log
2_iOS logs WITH fix - Background Lock Unlock Foreground Connect.log.filterline.log
3_iOS Prod logs WITH fix - Background Lock Unlock Foreground Connect.log.filterline.log