doublezerod: add periodic kernel route reconciliation#3672

Open
nikw9944 wants to merge 4 commits into main from nikw9944/doublezero-3669

@nikw9944 nikw9944 commented May 5, 2026

Resolves: #3669

Summary of Changes

  • Add a periodic route reconciliation goroutine to the liveness manager that scans the kernel routing table every 30s (configurable via --route-liveness-reconcile-interval), detects BGP routes that should already be installed but are missing, and reinstalls them
  • This mitigates the case where another process or an administrator mistakenly deletes a doublezero route from the kernel routing table
  • Add doublezero_liveness_route_reinstalls_total and doublezero_liveness_route_install_failures_total Prometheus metrics to track reinstalls and failures
  • Prevent a TOCTOU race in active mode: re-check the installed state under the lock before each reinstall, so that reconcileRoutes cannot resurrect a route that onSessionDown intentionally withdrew between the snapshot and the reinstall
  • Include source IP in the kernel route lookup key so routes with the same (table, dst, nexthop) but different source IPs are matched independently in multi-interface setups
  • Promote "session down (passive; keeping route)" log messages from Debug to Info; otherwise the Info-level log can show multiple 'liveness: session up' messages in a row with no session-down message between them

Diff Breakdown

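| Category    | Files | Lines (+/-) | Net  |
|-------------|-------|-------------|------|
| Core logic  | 1     | +116 / -2   | +114 |
| Scaffolding | 2     | +30 / -13   | +17  |
| Tests       | 1     | +124 / -0   | +124 |
| Docs        | 1     | +2 / -0     | +2   |
| Total       | 5     | +272 / -15  | +257 |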

Bulk of the change is the reconciliation loop and its tests.

Testing Verification

  • Unit tests cover: reinstalling a missing route, skipping a route present in kernel, skipping an uninstalled route (active mode, session never went Up), and incrementing the install failure metric on RouteAdd error
  • Tests use a mock Netlinker to simulate kernel route state; the reconciliation ticker is set to time.Hour in tests so the background loop does not interfere while reconcileRoutes() is called directly
  • Full liveness test suite passes (40s, all existing tests unaffected)
  • go vet and go build clean

@nikw9944 nikw9944 marked this pull request as ready for review May 5, 2026 21:07
@nikw9944 nikw9944 force-pushed the nikw9944/doublezero-3669 branch from f31a780 to bd203a8 Compare May 5, 2026 21:37
nikw9944 added 4 commits May 5, 2026 21:49
Add a reconciliation loop to the liveness manager that periodically
scans the kernel routing table for missing BGP routes and reinstalls
them, mitigating connectivity loss caused by external processes
removing routes.

Also promote liveness session down logs from DEBUG to INFO for
passive/peer-passive modes so operators can see the full up/down
lifecycle.
Increment RouteInstallFailures counter when a reconciliation reinstall
fails, matching the observability pattern in onSessionUp. Also
pre-allocate the toCheck slice.
- Re-check installed state under lock before RouteAdd to prevent
  resurrecting routes intentionally withdrawn by onSessionDown
- Add SrcIP to kernel route lookup key for tighter matching in
  multi-interface setups
- Reject negative RouteReconcileInterval in Validate()
- Use named const for reconcile interval flag default
- Log when route reconciliation is enabled at startup
@nikw9944 nikw9944 force-pushed the nikw9944/doublezero-3669 branch from bd203a8 to 99d373a Compare May 5, 2026 21:50

Development

Successfully merging this pull request may close these issues.

Route installed by doublezerod removed by unknown process