[action] [PR:370] [caclmgrd]: Skip empty ACL tables#381
Merged
Conversation
This PR propagates a fix for timing issue caused by `caclmgrd` processing of ACL tables during Config DB record removal operation.
#### Evidence from code
* What `caclmgrd` does
```
self._tables_db_info = config_db_connector.get_table(self.ACL_TABLE)
...
if table_data["type"] != self.ACL_TABLE_TYPE_CTRLPLANE:
```
* `ConfigDBConnector.get_table()` is implemented as
```
const auto& keys = client.keys(pattern);
for (auto& key: keys)
{
auto const& entry = client.hgetall<map<string, string>>(key);
```
This is a two-step read:
1. list keys (`KEYS ACL_TABLE|*`)
2. per key `HGETALL`
If a key is deleted/changed between step 1 and step 2, `hgetall` can return `{}`.
Then `table_data` exists as an empty map in `_tables_db_info`, and `table_data["type"]` throws `KeyError`.
So missing `type` does NOT require field-only deletion; plain key deletion racing with `get_table()` is enough.
Also field-only deletion is possible in the stack:
```
client.hmset(_hash, data.begin(), data.end());
...
if (!found)
{
client.hdel(_hash, k);
}
```
`set_entry()` can remove extra fields with `HDEL` while keeping the record key alive.
#### Evidence from pytest
The log shows rapid table deletes:
```
sudo config acl remove table DATA_INGRESS_L3TEST
...
sudo config acl remove table DATA_EGRESS_L3V6TEST
```
That high churn exactly matches a race window for the non-atomic `get_table()` read in `caclmgrd`.
So missing `type` can happen via:
- field-level `HDEL` updates
- key-level delete race during `KEYS` + `HGETALL` snapshot collection
#### SYSLOG:
```
syslog:2026 Mar 27 03:57:03.486679 sonic INFO start-LogAnalyzer-test_acl_config.2026-03-27-00:57:02
syslog:2026 Mar 27 03:57:08.378308 sonic ERR caclmgrd: Exception occured at Thread-2 (check_and_update_control_plane_acls) thread for namespace '' due to KeyErro
r('type')
syslog:2026 Mar 27 03:57:08.874919 sonic ERR caclmgrd: Main thread detected exception in child thread for namespace '': KeyError('type')
syslog:2026 Mar 27 03:57:08.875092 sonic ERR caclmgrd: Full traceback from child thread:
syslog:2026 Mar 27 03:57:08.875239 sonic ERR caclmgrd: Traceback (most recent call last):
syslog:2026 Mar 27 03:57:08.875386 sonic ERR caclmgrd: File "/usr/local/bin/caclmgrd", line 970, in check_and_update_control_plane_acls
syslog:2026 Mar 27 03:57:08.875532 sonic ERR caclmgrd: self.update_control_plane_acls(namespace, new_config_db_connector)
syslog:2026 Mar 27 03:57:08.875668 sonic ERR caclmgrd: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
syslog:2026 Mar 27 03:57:08.875845 sonic ERR caclmgrd: File "/usr/local/bin/caclmgrd", line 907, in update_control_plane_acls
syslog:2026 Mar 27 03:57:08.875984 sonic ERR caclmgrd: iptables_cmds, service_to_source_ip_map = self.get_acl_rules_and_translate_to_iptables_commands(names
pace, config_db_connector)
syslog:2026 Mar 27 03:57:08.876104 sonic ERR caclmgrd: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^
syslog:2026 Mar 27 03:57:08.876176 sonic ERR caclmgrd: File "/usr/local/bin/caclmgrd", line 740, in get_acl_rules_and_translate_to_iptables_commands
syslog:2026 Mar 27 03:57:08.876239 sonic ERR caclmgrd: if table_data["type"] != self.ACL_TABLE_TYPE_CTRLPLANE:
syslog:2026 Mar 27 03:57:08.876302 sonic ERR caclmgrd: ~~~~~~~~~~^^^^^^^^
syslog:2026 Mar 27 03:57:08.876362 sonic ERR caclmgrd: KeyError: 'type'
syslog:2026 Mar 27 03:57:08.876420 sonic ERR caclmgrd: Detect exception in Child thread, generating SIGKILL for main thread
syslog:2026 Mar 27 03:57:21.252620 sonic INFO end-LogAnalyzer-test_acl_config.2026-03-27-00:57:02
syslog:2026 Mar 27 03:57:22.329663 sonic INFO python3.13[156612]: ansible-extract_log Invoked with directory=/var/log file_prefix=syslog start_string=start-LogAnalyzer-test_acl_config.2026-03-27-00:57:02 target_filename=/tmp/syslog
syslog:2026 Mar 27 03:57:22.329874 sonic DEBUG extract_log for start string LogAnalyzer-test_acl_config.2026-03-27-00:57:02
```
#### A picture of a cute animal (not mandatory but encouraged)
```
.---. .-----------
/ \ __ / ------
/ / \( )/ -----
////// ' \/ ` ---
//// / // : : ---
// / / /` '--
// //..\\
====UU====UU====
'//||\\`
''``
```
Signed-off-by: Sonic Build Admin <sonicbld@microsoft.com>
Author
|
Original PR: #370 |
Author
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR propagates a fix for timing issue caused by
caclmgrdprocessing of ACL tables during Config DB record removal operation.Evidence from code
caclmgrddoesConfigDBConnector.get_table()is implemented asThis is a two-step read:
KEYS ACL_TABLE|*)HGETALLIf a key is deleted/changed between step 1 and step 2,
hgetallcan return{}.Then
table_dataexists as an empty map in_tables_db_info, andtable_data["type"]throwsKeyError.So missing
typedoes NOT require field-only deletion; plain key deletion racing withget_table()is enough.Also field-only deletion is possible in the stack:
set_entry()can remove extra fields withHDELwhile keeping the record key alive.Evidence from pytest
The log shows rapid table deletes:
That high churn exactly matches a race window for the non-atomic
get_table()read incaclmgrd.So missing
typecan happen via:HDELupdatesKEYS+HGETALLsnapshot collectionSYSLOG:
A picture of a cute animal (not mandatory but encouraged)
Signed-off-by: Sonic Build Admin sonicbld@microsoft.com