fix(dhcp): re-apply KEA config when running config drifts back to boo…#23
fix(dhcp): re-apply KEA config when running config drifts back to boo…#23dmathur0 wants to merge 1 commit into
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (3)
📝 WalkthroughWalkthroughDetects KEA running-config drift for remote lease-db setups and forces a re-apply of the injected KEA config when divergence is observed; adds async pytest tests and a Kubernetes livenessProbe backstop. ChangesKEA Configuration Divergence Detection and Recovery
Sequence DiagramsequenceDiagram
participant RedisClient as Redis Poll
participant SyncLoop as Sync Loop
participant DivergenceCheck as Divergence Check
participant KeaClient as KEA Query
participant KeaControl as KEA Set Config
loop Every poll interval
RedisClient->>SyncLoop: Load config from Redis
SyncLoop->>DivergenceCheck: Check if KEA config diverged
alt Remote lease-db mode
DivergenceCheck->>KeaClient: Query running config
alt Lease-database present and correct
KeaClient-->>DivergenceCheck: Config OK
DivergenceCheck-->>SyncLoop: No divergence
else Lease-database missing/unexpected
KeaClient-->>DivergenceCheck: Config drift detected
DivergenceCheck-->>SyncLoop: Divergence found
else Query fails
DivergenceCheck-->>SyncLoop: No divergence (best-effort)
end
else Local lease-db mode
DivergenceCheck-->>SyncLoop: Skip check (local mode)
end
alt Divergence detected
SyncLoop->>KeaControl: Re-apply config to KEA
KeaControl-->>SyncLoop: Config applied
else Config changed in Redis
SyncLoop->>KeaControl: Apply updated config to KEA
KeaControl-->>SyncLoop: Config applied
else No change
SyncLoop-->>SyncLoop: Skip KEA update
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
2160eda to
79b0f65
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/nv_config_manager/dhcp/cli.py`:
- Line 259: The code reads ini_config["dhcp.lease_db"].getboolean("local") into
remote_lease_db without guarding for a missing section/option; change the access
to either check ini_config.has_section("dhcp.lease_db") and
ini_config.has_option("dhcp.lease_db", "local") before indexing, or call
ini_config["dhcp.lease_db"].getboolean("local", fallback=False) (or another
sensible default) and log a clear warning when the section/option is absent;
apply the same defensive change to the analogous lookups in api.py and
kea_dhcp_confgen.py (search for usages of ini_config[...] .getboolean("local")
and the remote_lease_db assignment) so startup/healthcheck won’t crash on
incomplete INI.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 547532aa-f2fc-48a6-b8b4-27a4f152fdb2
📒 Files selected for processing (3)
deploy/helm/templates/dhcp.yamlsrc/nv_config_manager/dhcp/cli.pysrc/tests/dhcp/test_cli.py
| ini_config = load_config() | ||
| kea_client = KeaClient.from_config(ini_config, attached=True) | ||
| redis_client = RedisClient.from_config(ini_config) | ||
| remote_lease_db = not ini_config["dhcp.lease_db"].getboolean("local") |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check how dhcp.lease_db / local is read elsewhere and whether a default exists.
rg -nP -C3 "lease_db|getboolean\(\s*['\"]local['\"]" --type=pyRepository: NVIDIA/nv-config-manager
Length of output: 15942
Guard dhcp.lease_db.local lookup in DHCP startup/sync
In src/nv_config_manager/dhcp/cli.py:259, this lookup happens before the try: block, so a missing [dhcp.lease_db] section or local option would crash startup:
remote_lease_db = not ini_config["dhcp.lease_db"].getboolean("local")The same direct access pattern is also used in src/nv_config_manager/dhcp/api.py and src/nv_config_manager/dhcp/kea_dhcp_confgen.py, so the app/healthcheck can fail similarly if the INI schema is incomplete. Tests/OpenAPI samples include [dhcp.lease_db]/local, but the runtime code doesn’t enforce it—add a guard (e.g., has_section/has_option) or a getboolean(..., fallback=...) with a clear error/log when absent.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/nv_config_manager/dhcp/cli.py` at line 259, The code reads
ini_config["dhcp.lease_db"].getboolean("local") into remote_lease_db without
guarding for a missing section/option; change the access to either check
ini_config.has_section("dhcp.lease_db") and
ini_config.has_option("dhcp.lease_db", "local") before indexing, or call
ini_config["dhcp.lease_db"].getboolean("local", fallback=False) (or another
sensible default) and log a clear warning when the section/option is absent;
apply the same defensive change to the analogous lookups in api.py and
kea_dhcp_confgen.py (search for usages of ini_config[...] .getboolean("local")
and the remote_lease_db assignment) so startup/healthcheck won’t crash on
incomplete INI.
…tstrap When the kea container is recycled it reloads the baked-in bootstrap config (no remote lease-database), but the sync loop only re-applied on Redis change, so the pod deadlocked with /healthcheck returning 500. Adds _kea_running_config_diverged() to detect the drift and force a re-apply, plus a looser backstop livenessProbe on config-sync-v4 so kubelet can recover if the in-loop check itself wedges. Signed-off-by: Davesh Mathur <daveshm@nvidia.com>
There was a problem hiding this comment.
♻️ Duplicate comments (1)
src/nv_config_manager/dhcp/cli.py (1)
259-259:⚠️ Potential issue | 🟠 Major | ⚡ Quick winGuard
dhcp.lease_db.locallookup to avoid startup crash.At Line 259, direct indexing can raise if
[dhcp.lease_db]orlocalis missing, and this is outside the looptry, so the process can fail before sync starts.Proposed defensive read
- remote_lease_db = not ini_config["dhcp.lease_db"].getboolean("local") + if ini_config.has_section("dhcp.lease_db"): + local_lease_db = ini_config["dhcp.lease_db"].getboolean("local", fallback=False) + else: + logger.warning("Missing [dhcp.lease_db] section; defaulting to remote lease-db mode.") + local_lease_db = False + remote_lease_db = not local_lease_db🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/nv_config_manager/dhcp/cli.py` at line 259, The direct access ini_config["dhcp.lease_db"]["local"] can raise if the section or option is missing; change the assignment to safely read the value by using ConfigParser's safe lookup (e.g., ini_config.getboolean("dhcp.lease_db", "local", fallback=...) or check ini_config.has_section("dhcp.lease_db") and ini_config.has_option("dhcp.lease_db", "local")) so remote_lease_db is computed defensively (default to False or a sensible fallback) and avoid raising before sync starts; update the code that sets remote_lease_db to use this guarded lookup instead of direct indexing.
🧹 Nitpick comments (1)
src/nv_config_manager/dhcp/cli.py (1)
226-228: ⚡ Quick winAlign docstring with actual probe-failure behavior.
At Line 226, the docstring says unreachable states can warrant re-apply, but the implementation returns
Falseon KEA query errors (Line 240). Please update the wording to match current behavior.Proposed docstring correction
- Returns True if the running config looks like the bootstrap (or is - otherwise unreachable in a way that warrants re-applying). + Returns True if the running config looks like bootstrap drift for + remote lease-db setups. Probe/query failures are treated as + non-divergence (best-effort) and return False.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/nv_config_manager/dhcp/cli.py` around lines 226 - 228, The docstring for the function that checks whether the running config looks like the bootstrap (the probe that can return True/False) is inaccurate: update its wording to state that probe failures (specifically KEA query errors) are treated as non-unreachable and result in False rather than triggering a re-apply; mention KEA query errors explicitly so the docstring matches the implementation that returns False on KEA probe failures.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@src/nv_config_manager/dhcp/cli.py`:
- Line 259: The direct access ini_config["dhcp.lease_db"]["local"] can raise if
the section or option is missing; change the assignment to safely read the value
by using ConfigParser's safe lookup (e.g.,
ini_config.getboolean("dhcp.lease_db", "local", fallback=...) or check
ini_config.has_section("dhcp.lease_db") and
ini_config.has_option("dhcp.lease_db", "local")) so remote_lease_db is computed
defensively (default to False or a sensible fallback) and avoid raising before
sync starts; update the code that sets remote_lease_db to use this guarded
lookup instead of direct indexing.
---
Nitpick comments:
In `@src/nv_config_manager/dhcp/cli.py`:
- Around line 226-228: The docstring for the function that checks whether the
running config looks like the bootstrap (the probe that can return True/False)
is inaccurate: update its wording to state that probe failures (specifically KEA
query errors) are treated as non-unreachable and result in False rather than
triggering a re-apply; mention KEA query errors explicitly so the docstring
matches the implementation that returns False on KEA probe failures.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 60adcd1c-8f9c-4b58-8efd-b15e7d0266b4
📒 Files selected for processing (3)
deploy/helm/templates/dhcp.yamlsrc/nv_config_manager/dhcp/cli.pysrc/tests/dhcp/test_cli.py
🚧 Files skipped from review as they are similar to previous changes (2)
- deploy/helm/templates/dhcp.yaml
- src/tests/dhcp/test_cli.py
79b0f65 to
82ee2c1
Compare
| asyncio.run(_refresh_loop_async(ip_version, check, refresh_interval)) | ||
|
|
||
|
|
||
| async def _kea_running_config_diverged( |
There was a problem hiding this comment.
I would rename this, diverged implies you're doing a direct comparison, which would be the cleanest approach but would always report divergence as kea has a lot of defaults and minor formatting changes between what we generate and what it reports back as its config. In the absence of being able to do a true divergence test, we should name this in a way that indicates it's only checking for the default configuration state
…tstrap
Ported from kiwi-platform MR !240. When the kea container is recycled it reloads the baked-in bootstrap config (no remote lease-database), but the sync loop only re-applied on Redis change, so the pod deadlocked with /healthcheck returning 500. Adds _kea_running_config_diverged() to detect the drift and force a re-apply, plus a looser backstop livenessProbe on config-sync-v4 so kubelet can recover if the in-loop check itself wedges.
Description
Validation
The kind integration test is manual due to taking ~30 min to complete. When the PR is ready for review,
run Actions -> Kind Integration -> Run workflow against the copy-pr-bot generated
pull-request/<PR_NUMBER>branch. Use the defaulttest_pathfor the full suite,or narrow it only while debugging.
Passing Kind Integration run:
Checklist
CONTRIBUTING.md.docs screenshots, or Helm/rendered outputs.
Summary by CodeRabbit
Bug Fixes
Improvements
Tests