Skip to content

[featured] fix stale FEATURE entry after concurrent package uninstall#13

Open
DavidZagury wants to merge 2 commits intomaster_RCfrom
masster_RC_stale_appextension
Open

[featured] fix stale FEATURE entry after concurrent package uninstall#13
DavidZagury wants to merge 2 commits intomaster_RCfrom
masster_RC_stale_appextension

Conversation

@DavidZagury
Copy link
Copy Markdown

Why I did it

After uninstalling a SONiC application package, the feature could reappear in show feature status with an incomplete entry (state: disabled, missing auto_restart, set_owner, etc.), even though the package was fully removed. This was caused by a race between featured and sonic-package-manager during uninstall.

sonic-package-manager unblocks as soon as featured writes disabled to STATE_DB, then immediately runs deregister() which deletes the FEATURE entry from CONFIG_DB. At that point featured is still inside its handler and has not finished writing back to CONFIG_DB. Two code paths independently recreate the deleted entry with partial data:

  1. sync_feature_scope — called after a successful disable_feature(), it writes has_per_asic_scope and has_global_scope back to CONFIG_DB. disable_feature() writes disabled to STATE_DB as its last step, which immediately unblocks spm; by the time sync_feature_scope runs, spm may have already deleted the entry. mod_entry on a missing Redis key recreates the hash with only those two fields — no state key. featured then receives a notification for this partial entry, reads state=None, and logs ERR featured: Unexpected state value 'None' for feature .
  2. resync_feature_state — called when update_feature_state() fails (e.g. systemctl start fails because the service file was already removed by the concurrent uninstall). The enable path is triggered by config feature state enabled followed immediately by sonic-package-manager uninstall. Spm unblocks on the earlier STATE_DB disabled written during install processing, removes the service file, and deletes the FEATURE entry — all while featured is still waiting for systemctl start to complete. When start fails, resync_feature_state is called and finds either a fully absent entry or the partial entry left by (1). In both cases current_feature_state is None, _feature_state_is_template(None) evaluates True, and mod_entry({state: disabled}) is called — completing or creating a ghost FEATURE record in CONFIG_DB that persists after the package is gone.

How I did it

Three guards added to featured:

  • sync_feature_scope — post-write check: after both mod_entry calls, re-read the entry. If it exists but has no state key, the write raced with deregister and recreated a deleted key; delete the partial entry immediately.
  • resync_feature_state — deleted-entry check: if get_entry() returns an empty dict, return without touching CONFIG_DB.
  • resync_feature_state — partial-entry check: if the entry is non-empty but has no state key, it is the artifact from sync_feature_scope; delete it rather than completing the ghost record by writing the state field back.

Additionally, the ERR log for state=None in update_feature_state is suppressed since state=None means the FEATURE entry was deleted by a concurrent uninstall while the notification was queued — not a programming error. Any other unexpected non-None state still logs ERR.

When sonic-package-manager uninstalls a package it deletes the FEATURE
entry from CONFIG_DB.  Because featured unblocks spm (via STATE_DB) before
finishing its own handler, the entry can be gone by the time featured tries
to write back to it.  Two code paths independently recreate the deleted
entry, leaving a ghost record that survives the uninstall and appears in
"show feature status":

1. sync_feature_scope(): called after a successful disable_feature(), it
   calls mod_entry({has_per_asic_scope, has_global_scope}) on the already-
   deleted key.  Redis HSET recreates the hash with only those two fields
   (no 'state'), so the next notification arrives with state=None and
   featured logs "ERR: Unexpected state value 'None'".
   There is also a TOCTOU gap: spm can delete the entry between the
   mod_entry calls and the post-write read, producing the same partial
   entry.

2. resync_feature_state(): called when update_feature_state() fails (e.g.
   systemctl start fails because the service file was already removed by
   the concurrent uninstall).  get_entry() returns {} for a deleted key,
   _feature_state_is_template(None) evaluates True, and mod_entry writes
   {state: disabled} back - recreating the entry with no other fields.
   The same function also completes partial entries left by (1): a non-
   empty entry with no 'state' key passes the deleted-entry guard and
   receives a state field, producing a 3-field ghost record.

Fix all paths:

- sync_feature_scope: after the two mod_entry writes, re-read the entry.
  If it exists but has no 'state' key, the write raced with deregister;
  delete the partial entry immediately.

- resync_feature_state: return early if the entry is fully absent.  If
  the entry exists but has no 'state' key, delete it rather than
  completing the ghost record.

- update_feature_state: suppress the ERR log when state is None, since
  that is a normal consequence of the race rather than a programming error.

Signed-off-by: david.zagury <davidza@nvidia.com>
@DavidZagury DavidZagury force-pushed the masster_RC_stale_appextension branch from 593ffd1 to 7188b88 Compare May 5, 2026 14:31
@DavidZagury DavidZagury force-pushed the masster_RC_stale_appextension branch 2 times, most recently from bf7e5d4 to a4c0934 Compare May 6, 2026 11:16
test_feature_resync: the sub-case where get_entry returns None previously
asserted that mod_entry would be called with the rendered state.  That
behaviour was the root of the bug -- resync_feature_state was recreating a
FEATURE row that had been deleted by a concurrent deregister.  Update the
assertion to assert_not_called(), matching the new guard that returns early
when the entry is absent.  Add a further sub-case for a partial entry (row
exists but has no 'state' key, the TOCTOU artifact left by sync_feature_scope
racing with deregister): resync_feature_state must call set_entry(None) to
remove the ghost record rather than completing it with mod_entry.

test_sync_feature_scope_toctou: new test for the post-write TOCTOU guard in
sync_feature_scope.  Uses side_effect to return a live entry on the pre-write
check and a partial entry (no 'state' key) on the post-write check, simulating
the window where the FEATURE row is deleted and recreated by mod_entry during
the race.  sync_feature_scope must call set_entry(None) to clean it up.

Signed-off-by: david.zagury <davidza@nvidia.com>
@DavidZagury DavidZagury force-pushed the masster_RC_stale_appextension branch from a4c0934 to dd9a46e Compare May 6, 2026 12:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants