From 2b8fa758be3a53b3537215e0bd5fd412c374f8e1 Mon Sep 17 00:00:00 2001 From: Sonic Build Admin Date: Thu, 7 May 2026 17:32:06 +0000 Subject: [PATCH] [featured]: Skip redundant CONFIG_DB writes in sync_feature_scope MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit #### Why I did it Fix https://github.com/sonic-net/sonic-host-services/issues/369 Partially address https://github.com/sonic-net/sonic-buildimage/issues/26350 `sync_feature_scope()` unconditionally writes to CONFIG_DB for every feature at startup and on feature state changes at runtime without checking whether the values have actually changed. We should avoid redundant writes that does not update CONFIG_DB. #### How I did it * Added a read-before-write check in `sync_feature_scope()`: read the current `FEATURE|` entry from CONFIG_DB via `get_entry()`, and only call `mod_entry()` when `has_per_asic_scope` or `has_global_scope` actually differs from the intended value. * Guarded the `get_entry()` return with `or {}` to safely handle `None` returns, consistent with the defensive pattern already used in `resync_feature_state()` and `sync_feature_delay_state()` in the same file. * The conditional write applies to both the host CONFIG_DB and per-namespace CONFIG_DB on multi-ASIC platforms. * Added unit tests covering three cases: values unchanged (no write), values changed (write), and missing/empty entry (write). #### How to verify it I used the following script to test the update on a virtual switch in GNS3. We can iterate quickly since the modified Python file is at `/usr/local/bin/featured` and can be updated on the fly. The virtual switch image is downloaded from [official pipeline build on 2026/03/30](https://dev.azure.com/mssonic/be1b070f-be15-4154-aade-b1d3bfb17054/_build/results?buildId=1074536) on commit [ab081ee](https://github.com/sonic-net/sonic-buildimage/commit/ab081eedf9ea32e769dbc537062ec39d7b40e72d).
Bash Script & Setup I had three files on the switch: ``` admin@sonic:~$ ls new.py old.py test_featured.sh ``` `old.py` is copied from `/usr/local/bin/featured` on the switch. It is almost identical to the [latest previous commit that modified the script](https://github.com/sonic-net/sonic-host-services/commit/2c5bf36175262a54b9c6aaa441b21c5e4c273578), except due to compilation the first line changed from `#!/usr/bin/env python3` to `#!/usr/bin/python3`. `new.py` is almost identical to the script in PR, except the first line also changed from `#!/usr/bin/env python3` to `#!/usr/bin/python3`. The bash script `test_featured.sh` use either `old.py` or `new.py` to overwrite `/usr/local/bin/featured`, use systemctl to reload featured, then run `config reload` and see how many feature writes there are & how long did it take until the last write. ```bash #!/bin/bash # test_featured.sh - Test featured daemon CONFIG_DB write behavior # Usage: ./test_featured.sh old|new # # Measures how many redundant FEATURE scope writes (has_per_asic_scope, # has_global_scope) the featured daemon makes after config reload, and # how long the CONFIG_DB mutation window stays open. set -e if [ -z "$1" ] || [[ "$1" != "old" && "$1" != "new" ]]; then echo "Usage: $0 old|new" exit 1 fi VERSION="$1" LOG="/tmp/monitor_${VERSION}.log" # Clean up any leftover redis-cli MONITOR processes from previous runs sudo pkill -f "redis-cli.*MONITOR" 2>/dev/null || true sudo pkill -f "docker exec.*redis-cli.*MONITOR" 2>/dev/null || true sleep 1 rm -f "$LOG" # Deploy the selected version of featured echo "==> Installing ${VERSION} version of featured..." sudo cp ~/${VERSION}.py /usr/local/bin/featured sudo systemctl restart featured echo "==> Waiting for featured to settle..." sleep 5 # Start capturing all CONFIG_DB operations echo "==> Starting CONFIG_DB monitor..." redis-cli -n 4 MONITOR < /dev/null > "$LOG" 2>&1 & MONITOR_PID=$! sleep 2 # Run config reload and measure wall-clock time echo "==> Running config reload..." RELOAD_START=$(date +%s.%N) sudo config reload -y -f RELOAD_END=$(date +%s.%N) RELOAD_DURATION=$(python3 -c "print(f'{$RELOAD_END - $RELOAD_START:.2f}')") # Wait for any background daemon writes to finish echo "==> Waiting 90 seconds for background writes to complete..." sleep 90 # Stop monitoring kill $MONITOR_PID 2>/dev/null || true sudo pkill -f "redis-cli.*MONITOR" 2>/dev/null || true sleep 1 # --- Analysis --- # Count: how many times featured wrote scope fields to FEATURE entries SCOPE_WRITE_COUNT=$(grep --text -c 'HSET.*FEATURE.*\(has_per_asic_scope\|has_global_scope\)' "$LOG" || true) # Timestamp of the first CONFIG_DB write (sonic-cfggen bulk load start) FIRST_DB_WRITE=$(grep --text '"HSET"' "$LOG" | head -1 | awk '{print $1}') # Timestamp of the last FEATURE scope field write by featured daemon LAST_SCOPE_WRITE=$(grep --text 'HSET.*FEATURE.*\(has_per_asic_scope\|has_global_scope\)' "$LOG" | tail -1 | awk '{print $1}') # Timestamp of the very last CONFIG_DB write of any kind LAST_ANY_WRITE=$(grep --text '"HSET"' "$LOG" | tail -1 | awk '{print $1}') # Calculate how long after config reload returned the scope writes continued if [ -n "$LAST_SCOPE_WRITE" ] && [ -n "$FIRST_DB_WRITE" ]; then SCOPE_WINDOW=$(python3 -c "print(f'{$LAST_SCOPE_WRITE - $RELOAD_END:.2f}')") SCOPE_DURATION=$(python3 -c "print(f'{$LAST_SCOPE_WRITE - $FIRST_DB_WRITE:.2f}')") else SCOPE_WINDOW="N/A (no scope writes detected)" SCOPE_DURATION="N/A" fi echo "" echo "============================================================" echo " Featured Daemon CONFIG_DB Write Test Results" echo " Version: ${VERSION}" echo "============================================================" echo "" echo " Config reload wall-clock duration: ${RELOAD_DURATION}s" echo "" echo " FEATURE scope writes (has_per_asic_scope / has_global_scope):" echo " Total write count: ${SCOPE_WRITE_COUNT}" echo " First CONFIG_DB write: ${FIRST_DB_WRITE:-N/A}" echo " Last scope write by featured: ${LAST_SCOPE_WRITE:-none}" echo " Last CONFIG_DB write (any): ${LAST_ANY_WRITE:-N/A}" echo "" echo " Seconds from first DB write to last scope write:" echo " ${SCOPE_DURATION}" echo "" echo " Seconds from config reload RETURNING to last scope write:" echo " ${SCOPE_WINDOW}" echo "" echo " Full log: ${LOG}" echo "============================================================" ```
Full Logs Running the Script on VS ``` admin@sonic:~$ chmod +x test_featured.sh # Run the old and new version alternatively, test each multiple times # This output is caputred after running `sudo config save-y` and running the script multiple times previously admin@sonic:~$ ./test_featured.sh old && \ sleep 20 && \ ./test_featured.sh new && \ sleep 20 && \ ./test_featured.sh old && \ sleep 20 && \ ./test_featured.sh new ==> Installing old version of featured... ==> Waiting for featured to settle... ==> Starting CONFIG_DB monitor... ==> Running config reload... Acquired lock on /etc/sonic/reload.lock Disabling container and routeCheck monitoring ... Running command: sudo systemctl stop aaastatsd.timer Running command: sudo systemctl stop featured.timer Running command: sudo systemctl stop hostcfgd.timer Running command: sudo systemctl stop tacacs-config.timer Stopping SONiC target ... Running command: /usr/local/bin/sonic-cfggen -j /etc/sonic/init_cfg.json -j /etc/sonic/config_db.json --write-to-db Running command: /usr/local/bin/db_migrator.py -o migrate Running command: /usr/local/bin/sonic-cfggen -d -y /etc/sonic/sonic_version.yml -t /usr/share/sonic/templates/sonic-environment.j2,/etc/sonic/sonic-environment Restarting SONiC target ... Enabling container and routeCheck monitoring ... Reloading Monit configuration ... Reinitializing monit daemon Released lock on /etc/sonic/reload.lock ==> Waiting 90 seconds for background writes to complete... ============================================================ Featured Daemon CONFIG_DB Write Test Results Version: old ============================================================ Config reload wall-clock duration: 30.63s FEATURE scope writes (has_per_asic_scope / has_global_scope): Total write count: 84 First CONFIG_DB write: 1774906714.590470 Last scope write by featured: 1774906746.737287 Last CONFIG_DB write (any): 1774906836.590132 Seconds from first DB write to last scope write: 32.15 Seconds from config reload RETURNING to last scope write: -0.30 Full log: /tmp/monitor_old.log ============================================================ ==> Installing new version of featured... ==> Waiting for featured to settle... ==> Starting CONFIG_DB monitor... ==> Running config reload... Acquired lock on /etc/sonic/reload.lock Disabling container and routeCheck monitoring ... Running command: sudo systemctl stop aaastatsd.timer Running command: sudo systemctl stop featured.timer Running command: sudo systemctl stop hostcfgd.timer Running command: sudo systemctl stop tacacs-config.timer Stopping SONiC target ... Running command: /usr/local/bin/sonic-cfggen -j /etc/sonic/init_cfg.json -j /etc/sonic/config_db.json --write-to-db Running command: /usr/local/bin/db_migrator.py -o migrate Running command: /usr/local/bin/sonic-cfggen -d -y /etc/sonic/sonic_version.yml -t /usr/share/sonic/templates/sonic-environment.j2,/etc/sonic/sonic-environment Restarting SONiC target ... Enabling container and routeCheck monitoring ... Reloading Monit configuration ... Reinitializing monit daemon Released lock on /etc/sonic/reload.lock ==> Waiting 90 seconds for background writes to complete... ============================================================ Featured Daemon CONFIG_DB Write Test Results Version: new ============================================================ Config reload wall-clock duration: 30.62s FEATURE scope writes (has_per_asic_scope / has_global_scope): Total write count: 38 First CONFIG_DB write: 1774906865.598661 Last scope write by featured: 1774906886.589453 Last CONFIG_DB write (any): 1774906987.320508 Seconds from first DB write to last scope write: 20.99 Seconds from config reload RETURNING to last scope write: -11.33 Full log: /tmp/monitor_new.log ============================================================ ==> Installing old version of featured... ==> Waiting for featured to settle... ==> Starting CONFIG_DB monitor... ==> Running config reload... Acquired lock on /etc/sonic/reload.lock Disabling container and routeCheck monitoring ... Running command: sudo systemctl stop aaastatsd.timer Running command: sudo systemctl stop featured.timer Running command: sudo systemctl stop hostcfgd.timer Running command: sudo systemctl stop tacacs-config.timer Stopping SONiC target ... Running command: /usr/local/bin/sonic-cfggen -j /etc/sonic/init_cfg.json -j /etc/sonic/config_db.json --write-to-db Running command: /usr/local/bin/db_migrator.py -o migrate Running command: /usr/local/bin/sonic-cfggen -d -y /etc/sonic/sonic_version.yml -t /usr/share/sonic/templates/sonic-environment.j2,/etc/sonic/sonic-environment Restarting SONiC target ... Enabling container and routeCheck monitoring ... Reloading Monit configuration ... Reinitializing monit daemon Released lock on /etc/sonic/reload.lock ==> Waiting 90 seconds for background writes to complete... ============================================================ Featured Daemon CONFIG_DB Write Test Results Version: old ============================================================ Config reload wall-clock duration: 30.14s FEATURE scope writes (has_per_asic_scope / has_global_scope): Total write count: 84 First CONFIG_DB write: 1774907016.330388 Last scope write by featured: 1774907046.978066 Last CONFIG_DB write (any): 1774907137.935647 Seconds from first DB write to last scope write: 30.65 Seconds from config reload RETURNING to last scope write: -1.36 Full log: /tmp/monitor_old.log ============================================================ ==> Installing new version of featured... ==> Waiting for featured to settle... ==> Starting CONFIG_DB monitor... ==> Running config reload... Acquired lock on /etc/sonic/reload.lock Disabling container and routeCheck monitoring ... Running command: sudo systemctl stop aaastatsd.timer Running command: sudo systemctl stop featured.timer Running command: sudo systemctl stop hostcfgd.timer Running command: sudo systemctl stop tacacs-config.timer Stopping SONiC target ... Running command: /usr/local/bin/sonic-cfggen -j /etc/sonic/init_cfg.json -j /etc/sonic/config_db.json --write-to-db Running command: /usr/local/bin/db_migrator.py -o migrate Running command: /usr/local/bin/sonic-cfggen -d -y /etc/sonic/sonic_version.yml -t /usr/share/sonic/templates/sonic-environment.j2,/etc/sonic/sonic-environment Restarting SONiC target ... Enabling container and routeCheck monitoring ... Reloading Monit configuration ... Reinitializing monit daemon Released lock on /etc/sonic/reload.lock ==> Waiting 90 seconds for background writes to complete... ============================================================ Featured Daemon CONFIG_DB Write Test Results Version: new ============================================================ Config reload wall-clock duration: 30.02s FEATURE scope writes (has_per_asic_scope / has_global_scope): Total write count: 38 First CONFIG_DB write: 1774907166.717689 Last scope write by featured: 1774907187.528028 Last CONFIG_DB write (any): 1774907288.738037 Seconds from first DB write to last scope write: 20.81 Seconds from config reload RETURNING to last scope write: -11.11 Full log: /tmp/monitor_new.log ============================================================ ```
TLDR: The number of scope writes dropped from 84 to 38. The time from first CONFIG_DB write to last scope write dropped from ~30–32s to ~21s, so scope synchronization completes about 10s earlier, shortening the potential race-condition window. #### Description for the changelog Skip redundant CONFIG_DB writes in featured daemon's sync_feature_scope to speed up config reload. Signed-off-by: Sonic Build Admin --- scripts/featured | 37 +++++++++-- tests/featured/featured_test.py | 109 ++++++++++++++++++++++++++++++++ 2 files changed, 140 insertions(+), 6 deletions(-) diff --git a/scripts/featured b/scripts/featured index 66fa1f56..a642c5b6 100644 --- a/scripts/featured +++ b/scripts/featured @@ -286,6 +286,28 @@ class FeatureHandler(object): return True + @staticmethod + def _conditional_update_scope(db, feature_name, new_per_asic, new_global): + """Write scope fields to a CONFIG_DB only when the values have actually changed. + + Args: + db: ConfigDBConnector instance (host or namespace). + feature_name (str): Name of the feature entry to update. + new_per_asic (str): Desired value for has_per_asic_scope ('True'/'False'). + new_global (str): Desired value for has_global_scope ('True'/'False'). + + Returns: + None. + """ + current_entry = db.get_entry(FEATURE_TBL, feature_name) or {} + update_fields = {} + if current_entry.get('has_per_asic_scope') != new_per_asic: + update_fields['has_per_asic_scope'] = new_per_asic + if current_entry.get('has_global_scope') != new_global: + update_fields['has_global_scope'] = new_global + if update_fields: + db.mod_entry(FEATURE_TBL, feature_name, update_fields) + def sync_feature_scope(self, feature_config): """Updates the has_global_scope or has_per_asic_scope field in the FEATURE|* tables as the field might have to be rendered based on DEVICE_METADATA table or Device Running configuration. @@ -321,13 +343,16 @@ class FeatureHandler(object): .format(feature_name, feature_suffixes[-1])) self.set_feature_state(feature_config, self.FEATURE_STATE_FAILED) return - self._config_db.mod_entry(FEATURE_TBL, feature_config.name, {'has_per_asic_scope': str(feature_config.has_per_asic_scope)}) - self._config_db.mod_entry(FEATURE_TBL, feature_config.name, {'has_global_scope': str(feature_config.has_global_scope)}) - - # sync has_per_asic_scope to CONFIG_DB in namespaces in multi-asic platform + # Only write scope fields to CONFIG_DB if the values have actually changed. + # This avoids redundant writes that can race with config reload consumers + # (e.g. sonic-mgmt YANG validation checks reading CONFIG_DB snapshots). + new_per_asic = str(feature_config.has_per_asic_scope) + new_global = str(feature_config.has_global_scope) + self._conditional_update_scope(self._config_db, feature_config.name, new_per_asic, new_global) + + # Sync both has_per_asic_scope and has_global_scope to CONFIG_DB in namespaces on multi-asic platforms for ns, db in self.ns_cfg_db.items(): - db.mod_entry(FEATURE_TBL, feature_config.name, {'has_per_asic_scope': str(feature_config.has_per_asic_scope)}) - db.mod_entry(FEATURE_TBL, feature_config.name, {'has_global_scope': str(feature_config.has_global_scope)}) + self._conditional_update_scope(db, feature_config.name, new_per_asic, new_global) def update_systemd_config(self, feature_config): """Updates `Restart=` field in feature's systemd configuration file diff --git a/tests/featured/featured_test.py b/tests/featured/featured_test.py index 25544eaf..ed6329fb 100644 --- a/tests/featured/featured_test.py +++ b/tests/featured/featured_test.py @@ -376,6 +376,115 @@ def test_enable_and_disable_feature_ExclusionList_skips_actions(self, mock_syslo mocked_set_state.assert_not_called() assert any("ExclusionList: skip disabling 'frr_bmp'" in str(c.args[1]) for c in mock_syslog.call_args_list) + @mock.patch('featured.FeatureHandler.update_systemd_config', mock.MagicMock()) + @mock.patch('featured.FeatureHandler.update_feature_state', mock.MagicMock()) + @mock.patch('featured.FeatureHandler.sync_feature_delay_state', mock.MagicMock()) + @mock.patch('featured.FeatureHandler.get_systemd_unit_state', mock.MagicMock(return_value="")) + def test_sync_feature_scope_conditional_write(self): + """Verify sync_feature_scope only writes when scope values differ or entry is missing.""" + mock_db = mock.MagicMock() + mock_db.get_entry = mock.MagicMock() + mock_db.mod_entry = mock.MagicMock() + mock_feature_state_table = mock.MagicMock() + + feature_handler = featured.FeatureHandler(mock_db, mock_feature_state_table, {}, False) + + feature = featured.Feature('sflow', { + 'state': 'enabled', + 'has_global_scope': 'True', + 'has_per_asic_scope': 'False', + }) + + # Values already match -> no write + mock_db.get_entry.return_value = { + 'has_per_asic_scope': 'False', + 'has_global_scope': 'True', + } + feature_handler.sync_feature_scope(feature) + mock_db.mod_entry.assert_not_called() + mock_db.mod_entry.reset_mock() + + # Only has_per_asic_scope differs -> write only changed field in single call + mock_db.get_entry.return_value = { + 'has_per_asic_scope': 'True', + 'has_global_scope': 'True', + } + feature_handler.sync_feature_scope(feature) + mock_db.mod_entry.assert_called_once_with(featured.FEATURE_TBL, 'sflow', {'has_per_asic_scope': 'False'}) + mock_db.mod_entry.reset_mock() + + # Entry missing (None) -> write both fields in single call + mock_db.get_entry.return_value = None + feature_handler.sync_feature_scope(feature) + mock_db.mod_entry.assert_called_once_with(featured.FEATURE_TBL, 'sflow', { + 'has_per_asic_scope': 'False', + 'has_global_scope': 'True', + }) + + @mock.patch('featured.FeatureHandler.update_systemd_config', mock.MagicMock()) + @mock.patch('featured.FeatureHandler.update_feature_state', mock.MagicMock()) + @mock.patch('featured.FeatureHandler.sync_feature_delay_state', mock.MagicMock()) + @mock.patch('featured.FeatureHandler.get_systemd_unit_state', mock.MagicMock(return_value="")) + def test_sync_feature_scope_namespace_dbs(self): + """Verify sync_feature_scope propagates writes to per-namespace DBs on multi-ASIC.""" + mock_db = mock.MagicMock() + mock_db.get_entry = mock.MagicMock() + mock_db.mod_entry = mock.MagicMock() + mock_feature_state_table = mock.MagicMock() + + feature_handler = featured.FeatureHandler(mock_db, mock_feature_state_table, {}, False) + mock_ns_db_0 = mock.MagicMock() + mock_ns_db_1 = mock.MagicMock() + feature_handler.ns_cfg_db = {'asic0': mock_ns_db_0, 'asic1': mock_ns_db_1} + + feature = featured.Feature('sflow', { + 'state': 'enabled', + 'has_global_scope': 'True', + 'has_per_asic_scope': 'True', + }) + + # Host differs -> both host and namespaces should be written + mock_db.get_entry.return_value = { + 'has_per_asic_scope': 'False', + 'has_global_scope': 'False', + } + for mock_ns_db in [mock_ns_db_0, mock_ns_db_1]: + mock_ns_db.get_entry.return_value = { + 'has_per_asic_scope': 'False', + 'has_global_scope': 'False', + } + feature_handler.sync_feature_scope(feature) + + for mock_ns_db in [mock_ns_db_0, mock_ns_db_1]: + mock_ns_db.mod_entry.assert_called_once_with(featured.FEATURE_TBL, 'sflow', { + 'has_per_asic_scope': 'True', + 'has_global_scope': 'True', + }) + + # Reset for next scenario + mock_db.mod_entry.reset_mock() + for mock_ns_db in [mock_ns_db_0, mock_ns_db_1]: + mock_ns_db.mod_entry.reset_mock() + + # Host matches but namespaces are stale -> namespaces should still be written + mock_db.get_entry.return_value = { + 'has_per_asic_scope': 'True', + 'has_global_scope': 'True', + } + for mock_ns_db in [mock_ns_db_0, mock_ns_db_1]: + mock_ns_db.get_entry.return_value = { + 'has_per_asic_scope': 'False', + 'has_global_scope': 'False', + } + feature_handler.sync_feature_scope(feature) + + mock_db.mod_entry.assert_not_called() + for mock_ns_db in [mock_ns_db_0, mock_ns_db_1]: + mock_ns_db.mod_entry.assert_called_once_with(featured.FEATURE_TBL, 'sflow', { + 'has_per_asic_scope': 'True', + 'has_global_scope': 'True', + }) + @mock.patch("syslog.syslog", side_effect=syslog_side_effect) @mock.patch('sonic_py_common.device_info.get_device_runtime_metadata')