Skip to content

fix(dhcp): re-apply KEA config when running config drifts back to boo…#23

Open
dmathur0 wants to merge 1 commit into
mainfrom
port/dhcp-config-drift-recovery-mr240
Open

fix(dhcp): re-apply KEA config when running config drifts back to boo…#23
dmathur0 wants to merge 1 commit into
mainfrom
port/dhcp-config-drift-recovery-mr240

Conversation

@dmathur0

@dmathur0 dmathur0 commented Jun 1, 2026

Copy link
Copy Markdown
Collaborator

…tstrap

Ported from kiwi-platform MR !240. When the kea container is recycled it reloads the baked-in bootstrap config (no remote lease-database), but the sync loop only re-applied on Redis change, so the pod deadlocked with /healthcheck returning 500. Adds _kea_running_config_diverged() to detect the drift and force a re-apply, plus a looser backstop livenessProbe on config-sync-v4 so kubelet can recover if the in-loop check itself wedges.

Description

Validation

  • Standard CI passes.
  • Kind integration passes, or this PR explains why it was not run.

The kind integration test is manual due to taking ~30 min to complete. When the PR is ready for review,
run Actions -> Kind Integration -> Run workflow against the copy-pr-bot generated
pull-request/<PR_NUMBER> branch. Use the default test_path for the full suite,
or narrow it only while debugging.

Passing Kind Integration run:

Checklist

  • I am familiar with the contributing guidelines in CONTRIBUTING.md.
  • Commits are signed off for DCO compliance.
  • New or existing tests cover these changes, or the PR explains why tests are not needed.
  • Documentation is updated for user-facing behavior changes.
  • Generated artifacts are updated when applicable, such as OpenAPI specs,
    docs screenshots, or Helm/rendered outputs.

Summary by CodeRabbit

  • Bug Fixes

    • DHCP sync now detects runtime configuration drift after service restarts and automatically re-applies the correct configuration to recover.
  • Improvements

    • Added a conservative “backstop” health check for the config-sync service to avoid unnecessary restarts while ensuring recovery from hard failures.
    • Sync loop now probes running KEA state and forces re-apply when divergence is detected.
  • Tests

    • Added unit and end-to-end tests covering drift detection, re-apply behavior, skipping checks for local mode, and error handling.

@coderabbitai

coderabbitai Bot commented Jun 1, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0800bce2-20e8-44ff-82ce-e80141b84ea0

📥 Commits

Reviewing files that changed from the base of the PR and between 79b0f65 and 82ee2c1.

📒 Files selected for processing (3)
  • deploy/helm/templates/dhcp.yaml
  • src/nv_config_manager/dhcp/cli.py
  • src/tests/dhcp/test_cli.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • deploy/helm/templates/dhcp.yaml
  • src/nv_config_manager/dhcp/cli.py
  • src/tests/dhcp/test_cli.py

📝 Walkthrough

Walkthrough

Detects KEA running-config drift for remote lease-db setups and forces a re-apply of the injected KEA config when divergence is observed; adds async pytest tests and a Kubernetes livenessProbe backstop.

Changes

KEA Configuration Divergence Detection and Recovery

Layer / File(s) Summary
Divergence detection helper
src/nv_config_manager/dhcp/cli.py
_kea_running_config_diverged async function queries KEA's running config and checks the DHCP lease-database type for postgresql; returns non-divergence on query failures or when local lease-db is selected.
Sync loop integration
src/nv_config_manager/dhcp/cli.py
Compute remote_lease_db from dhcp.lease_db.local (inverted) and call the divergence helper each Redis poll; on divergence re-apply injected KEA config and update previous_config, otherwise only update when config changes.
Divergence detection and sync loop tests
src/tests/dhcp/test_cli.py
Async pytest unit and end-to-end tests verify divergence detection (missing/present lease-db, local mode, error handling) and sync-loop behavior (re-apply on divergence, no-op when stable).
Kubernetes liveness probe recovery mechanism
deploy/helm/templates/dhcp.yaml
Adds a livenessProbe to config-sync-v4 that GETs the pod /healthcheck on port 9090 with delayed and infrequent polling as a secondary recovery mechanism if in-loop checks cannot self-heal.

Sequence Diagram

sequenceDiagram
  participant RedisClient as Redis Poll
  participant SyncLoop as Sync Loop
  participant DivergenceCheck as Divergence Check
  participant KeaClient as KEA Query
  participant KeaControl as KEA Set Config
  
  loop Every poll interval
    RedisClient->>SyncLoop: Load config from Redis
    SyncLoop->>DivergenceCheck: Check if KEA config diverged
    alt Remote lease-db mode
      DivergenceCheck->>KeaClient: Query running config
      alt Lease-database present and correct
        KeaClient-->>DivergenceCheck: Config OK
        DivergenceCheck-->>SyncLoop: No divergence
      else Lease-database missing/unexpected
        KeaClient-->>DivergenceCheck: Config drift detected
        DivergenceCheck-->>SyncLoop: Divergence found
      else Query fails
        DivergenceCheck-->>SyncLoop: No divergence (best-effort)
      end
    else Local lease-db mode
      DivergenceCheck-->>SyncLoop: Skip check (local mode)
    end
    
    alt Divergence detected
      SyncLoop->>KeaControl: Re-apply config to KEA
      KeaControl-->>SyncLoop: Config applied
    else Config changed in Redis
      SyncLoop->>KeaControl: Apply updated config to KEA
      KeaControl-->>SyncLoop: Config applied
    else No change
      SyncLoop-->>SyncLoop: Skip KEA update
    end
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A bootstrap slips when containers wake,
I check the leases for sanity's sake.
If Postgres' trace is gone from sight,
I reapply swiftly to make things right.
A liveness poke will finish the fight. 🥕

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: re-applying KEA config when running config drifts back to bootstrap, which directly addresses the core issue described in PR objectives.
Docstring Coverage ✅ Passed Docstring coverage is 81.82% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch port/dhcp-config-drift-recovery-mr240

Comment @coderabbitai help to get the list of available commands and usage tips.

@dmathur0 dmathur0 force-pushed the port/dhcp-config-drift-recovery-mr240 branch from 2160eda to 79b0f65 Compare June 1, 2026 22:52

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/nv_config_manager/dhcp/cli.py`:
- Line 259: The code reads ini_config["dhcp.lease_db"].getboolean("local") into
remote_lease_db without guarding for a missing section/option; change the access
to either check ini_config.has_section("dhcp.lease_db") and
ini_config.has_option("dhcp.lease_db", "local") before indexing, or call
ini_config["dhcp.lease_db"].getboolean("local", fallback=False) (or another
sensible default) and log a clear warning when the section/option is absent;
apply the same defensive change to the analogous lookups in api.py and
kea_dhcp_confgen.py (search for usages of ini_config[...] .getboolean("local")
and the remote_lease_db assignment) so startup/healthcheck won’t crash on
incomplete INI.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 547532aa-f2fc-48a6-b8b4-27a4f152fdb2

📥 Commits

Reviewing files that changed from the base of the PR and between 6bec6d7 and 2160eda.

📒 Files selected for processing (3)
  • deploy/helm/templates/dhcp.yaml
  • src/nv_config_manager/dhcp/cli.py
  • src/tests/dhcp/test_cli.py

ini_config = load_config()
kea_client = KeaClient.from_config(ini_config, attached=True)
redis_client = RedisClient.from_config(ini_config)
remote_lease_db = not ini_config["dhcp.lease_db"].getboolean("local")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check how dhcp.lease_db / local is read elsewhere and whether a default exists.
rg -nP -C3 "lease_db|getboolean\(\s*['\"]local['\"]" --type=py

Repository: NVIDIA/nv-config-manager

Length of output: 15942


Guard dhcp.lease_db.local lookup in DHCP startup/sync

In src/nv_config_manager/dhcp/cli.py:259, this lookup happens before the try: block, so a missing [dhcp.lease_db] section or local option would crash startup:

remote_lease_db = not ini_config["dhcp.lease_db"].getboolean("local")

The same direct access pattern is also used in src/nv_config_manager/dhcp/api.py and src/nv_config_manager/dhcp/kea_dhcp_confgen.py, so the app/healthcheck can fail similarly if the INI schema is incomplete. Tests/OpenAPI samples include [dhcp.lease_db]/local, but the runtime code doesn’t enforce it—add a guard (e.g., has_section/has_option) or a getboolean(..., fallback=...) with a clear error/log when absent.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/nv_config_manager/dhcp/cli.py` at line 259, The code reads
ini_config["dhcp.lease_db"].getboolean("local") into remote_lease_db without
guarding for a missing section/option; change the access to either check
ini_config.has_section("dhcp.lease_db") and
ini_config.has_option("dhcp.lease_db", "local") before indexing, or call
ini_config["dhcp.lease_db"].getboolean("local", fallback=False) (or another
sensible default) and log a clear warning when the section/option is absent;
apply the same defensive change to the analogous lookups in api.py and
kea_dhcp_confgen.py (search for usages of ini_config[...] .getboolean("local")
and the remote_lease_db assignment) so startup/healthcheck won’t crash on
incomplete INI.

…tstrap

When the kea container is recycled it reloads the baked-in bootstrap config
(no remote lease-database), but the sync loop only re-applied on Redis
change, so the pod deadlocked with /healthcheck returning 500. Adds
_kea_running_config_diverged() to detect the drift and force a re-apply, plus
a looser backstop livenessProbe on config-sync-v4 so kubelet can recover if
the in-loop check itself wedges.

Signed-off-by: Davesh Mathur <daveshm@nvidia.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
src/nv_config_manager/dhcp/cli.py (1)

259-259: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard dhcp.lease_db.local lookup to avoid startup crash.

At Line 259, direct indexing can raise if [dhcp.lease_db] or local is missing, and this is outside the loop try, so the process can fail before sync starts.

Proposed defensive read
-    remote_lease_db = not ini_config["dhcp.lease_db"].getboolean("local")
+    if ini_config.has_section("dhcp.lease_db"):
+        local_lease_db = ini_config["dhcp.lease_db"].getboolean("local", fallback=False)
+    else:
+        logger.warning("Missing [dhcp.lease_db] section; defaulting to remote lease-db mode.")
+        local_lease_db = False
+    remote_lease_db = not local_lease_db
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/nv_config_manager/dhcp/cli.py` at line 259, The direct access
ini_config["dhcp.lease_db"]["local"] can raise if the section or option is
missing; change the assignment to safely read the value by using ConfigParser's
safe lookup (e.g., ini_config.getboolean("dhcp.lease_db", "local", fallback=...)
or check ini_config.has_section("dhcp.lease_db") and
ini_config.has_option("dhcp.lease_db", "local")) so remote_lease_db is computed
defensively (default to False or a sensible fallback) and avoid raising before
sync starts; update the code that sets remote_lease_db to use this guarded
lookup instead of direct indexing.
🧹 Nitpick comments (1)
src/nv_config_manager/dhcp/cli.py (1)

226-228: ⚡ Quick win

Align docstring with actual probe-failure behavior.

At Line 226, the docstring says unreachable states can warrant re-apply, but the implementation returns False on KEA query errors (Line 240). Please update the wording to match current behavior.

Proposed docstring correction
-    Returns True if the running config looks like the bootstrap (or is
-    otherwise unreachable in a way that warrants re-applying).
+    Returns True if the running config looks like bootstrap drift for
+    remote lease-db setups. Probe/query failures are treated as
+    non-divergence (best-effort) and return False.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/nv_config_manager/dhcp/cli.py` around lines 226 - 228, The docstring for
the function that checks whether the running config looks like the bootstrap
(the probe that can return True/False) is inaccurate: update its wording to
state that probe failures (specifically KEA query errors) are treated as
non-unreachable and result in False rather than triggering a re-apply; mention
KEA query errors explicitly so the docstring matches the implementation that
returns False on KEA probe failures.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@src/nv_config_manager/dhcp/cli.py`:
- Line 259: The direct access ini_config["dhcp.lease_db"]["local"] can raise if
the section or option is missing; change the assignment to safely read the value
by using ConfigParser's safe lookup (e.g.,
ini_config.getboolean("dhcp.lease_db", "local", fallback=...) or check
ini_config.has_section("dhcp.lease_db") and
ini_config.has_option("dhcp.lease_db", "local")) so remote_lease_db is computed
defensively (default to False or a sensible fallback) and avoid raising before
sync starts; update the code that sets remote_lease_db to use this guarded
lookup instead of direct indexing.

---

Nitpick comments:
In `@src/nv_config_manager/dhcp/cli.py`:
- Around line 226-228: The docstring for the function that checks whether the
running config looks like the bootstrap (the probe that can return True/False)
is inaccurate: update its wording to state that probe failures (specifically KEA
query errors) are treated as non-unreachable and result in False rather than
triggering a re-apply; mention KEA query errors explicitly so the docstring
matches the implementation that returns False on KEA probe failures.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 60adcd1c-8f9c-4b58-8efd-b15e7d0266b4

📥 Commits

Reviewing files that changed from the base of the PR and between 2160eda and 79b0f65.

📒 Files selected for processing (3)
  • deploy/helm/templates/dhcp.yaml
  • src/nv_config_manager/dhcp/cli.py
  • src/tests/dhcp/test_cli.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • deploy/helm/templates/dhcp.yaml
  • src/tests/dhcp/test_cli.py

@dmathur0 dmathur0 force-pushed the port/dhcp-config-drift-recovery-mr240 branch from 79b0f65 to 82ee2c1 Compare June 1, 2026 22:59
asyncio.run(_refresh_loop_async(ip_version, check, refresh_interval))


async def _kea_running_config_diverged(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rename this, diverged implies you're doing a direct comparison, which would be the cleanest approach but would always report divergence as kea has a lot of defaults and minor formatting changes between what we generate and what it reports back as its config. In the absence of being able to do a true divergence test, we should name this in a way that indicates it's only checking for the default configuration state

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants