[featured] fix stale FEATURE entry after concurrent package uninstall#13
Open
DavidZagury wants to merge 2 commits intomaster_RCfrom
Open
[featured] fix stale FEATURE entry after concurrent package uninstall#13DavidZagury wants to merge 2 commits intomaster_RCfrom
DavidZagury wants to merge 2 commits intomaster_RCfrom
Conversation
When sonic-package-manager uninstalls a package it deletes the FEATURE
entry from CONFIG_DB. Because featured unblocks spm (via STATE_DB) before
finishing its own handler, the entry can be gone by the time featured tries
to write back to it. Two code paths independently recreate the deleted
entry, leaving a ghost record that survives the uninstall and appears in
"show feature status":
1. sync_feature_scope(): called after a successful disable_feature(), it
calls mod_entry({has_per_asic_scope, has_global_scope}) on the already-
deleted key. Redis HSET recreates the hash with only those two fields
(no 'state'), so the next notification arrives with state=None and
featured logs "ERR: Unexpected state value 'None'".
There is also a TOCTOU gap: spm can delete the entry between the
mod_entry calls and the post-write read, producing the same partial
entry.
2. resync_feature_state(): called when update_feature_state() fails (e.g.
systemctl start fails because the service file was already removed by
the concurrent uninstall). get_entry() returns {} for a deleted key,
_feature_state_is_template(None) evaluates True, and mod_entry writes
{state: disabled} back - recreating the entry with no other fields.
The same function also completes partial entries left by (1): a non-
empty entry with no 'state' key passes the deleted-entry guard and
receives a state field, producing a 3-field ghost record.
Fix all paths:
- sync_feature_scope: after the two mod_entry writes, re-read the entry.
If it exists but has no 'state' key, the write raced with deregister;
delete the partial entry immediately.
- resync_feature_state: return early if the entry is fully absent. If
the entry exists but has no 'state' key, delete it rather than
completing the ghost record.
- update_feature_state: suppress the ERR log when state is None, since
that is a normal consequence of the race rather than a programming error.
Signed-off-by: david.zagury <davidza@nvidia.com>
593ffd1 to
7188b88
Compare
vivekrnv
approved these changes
May 5, 2026
bf7e5d4 to
a4c0934
Compare
test_feature_resync: the sub-case where get_entry returns None previously asserted that mod_entry would be called with the rendered state. That behaviour was the root of the bug -- resync_feature_state was recreating a FEATURE row that had been deleted by a concurrent deregister. Update the assertion to assert_not_called(), matching the new guard that returns early when the entry is absent. Add a further sub-case for a partial entry (row exists but has no 'state' key, the TOCTOU artifact left by sync_feature_scope racing with deregister): resync_feature_state must call set_entry(None) to remove the ghost record rather than completing it with mod_entry. test_sync_feature_scope_toctou: new test for the post-write TOCTOU guard in sync_feature_scope. Uses side_effect to return a live entry on the pre-write check and a partial entry (no 'state' key) on the post-write check, simulating the window where the FEATURE row is deleted and recreated by mod_entry during the race. sync_feature_scope must call set_entry(None) to clean it up. Signed-off-by: david.zagury <davidza@nvidia.com>
a4c0934 to
dd9a46e
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why I did it
After uninstalling a SONiC application package, the feature could reappear in show feature status with an incomplete entry (state: disabled, missing auto_restart, set_owner, etc.), even though the package was fully removed. This was caused by a race between featured and sonic-package-manager during uninstall.
sonic-package-manager unblocks as soon as featured writes disabled to STATE_DB, then immediately runs deregister() which deletes the FEATURE entry from CONFIG_DB. At that point featured is still inside its handler and has not finished writing back to CONFIG_DB. Two code paths independently recreate the deleted entry with partial data:
How I did it
Three guards added to featured:
Additionally, the ERR log for state=None in update_feature_state is suppressed since state=None means the FEATURE entry was deleted by a concurrent uninstall while the notification was queued — not a programming error. Any other unexpected non-None state still logs ERR.