AROSLSRE-830: Add etcd EndpointSlice self-registration to fix DNS resolution delays#8613
AROSLSRE-830: Add etcd EndpointSlice self-registration to fix DNS resolution delays#8613cssjr wants to merge 7 commits into
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
Skipping CI for Draft Pull Request. |
|
@cssjr: This pull request references AROSLSRE-830 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR implements etcd self-registration: adds manifest constructors and YAML assets (ServiceAccount, etcd-self-register Role/RoleBinding; updates defrag RoleBinding subject), updates the StatefulSet initContainer to pass --self-register and POD_IP and uses the unified etcd ServiceAccount, updates component wiring, and extends dnsresolver with a --self-register flag plus selfRegisterEndpointSlice/ensureEndpointSlice/parseServiceName and unit tests. Sequence Diagram(s)sequenceDiagram
participant dnsresolver
participant KubeAPI
participant EndpointSlice
participant DNS
dnsresolver->>KubeAPI: ensureEndpointSlice(create/update EndpointSlice for pod IP and targetRef)
KubeAPI->>EndpointSlice: persist EndpointSlice
dnsresolver->>DNS: resolveDNS(dnsName)
DNS-->>dnsresolver: resolution results
Suggested reviewers
🚥 Pre-merge checks | ✅ 10 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (10 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@dnsresolver/cmd.go`:
- Around line 150-164: The Get call in selfRegisterEndpointSlice currently
treats any error as "not found" and proceeds to Create; change the error
handling to return immediately for non-NotFound errors from
client.DiscoveryV1().EndpointSlices(...).Get(ctx, sliceName, ...). Specifically,
import and use apierrors.IsNotFound(err) (k8s.io/apimachinery/pkg/api/errors)
and: if err == nil keep the Update path, else if apierrors.IsNotFound(err)
perform the Create path, otherwise return the original Get error (wrap with
fmt.Errorf). This prevents masking RBAC/timeout errors while preserving the
existing Update/Create logic and messages.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 64afd625-994c-4a36-ad76-4ecbbb5a19ca
📒 Files selected for processing (10)
control-plane-operator/controllers/hostedcontrolplane/manifests/etcd.gocontrol-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/defrag-rolebinding.yamlcontrol-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/etcd-self-register-role.yamlcontrol-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/etcd-self-register-rolebinding.yamlcontrol-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/etcd-serviceaccount.yamlcontrol-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/statefulset.yamlcontrol-plane-operator/controllers/hostedcontrolplane/v2/etcd/component.gocontrol-plane-operator/controllers/hostedcontrolplane/v2/etcd/statefulset.godnsresolver/cmd.godnsresolver/cmd_test.go
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #8613 +/- ##
==========================================
+ Coverage 40.61% 41.28% +0.67%
==========================================
Files 755 755
Lines 93227 93597 +370
==========================================
+ Hits 37864 38642 +778
+ Misses 52640 52221 -419
- Partials 2723 2734 +11
... and 40 files with indirect coverage changes
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
…ntainer Under high HCP density (~2,500 namespaces), the management cluster's kube-controller-manager EndpointSlice informer cache goes stale, delaying etcd-discovery headless Service DNS records by minutes to hours. This causes etcd pods to block in the ensure-dns init container, preventing cluster creation and triggering E2E test timeouts. Each etcd pod now self-registers its IP into a custom EndpointSlice (managed-by: control-plane-operator) before polling DNS. The standard EndpointSlice controller ignores slices with a different managed-by label, so there is no conflict. CoreDNS picks up the custom slice via its watch within ~100ms, making DNS resolution near-instant regardless of controller informer health. A new dedicated "etcd" service account replaces the conditional "etcd-defrag-controller" SA, with RBAC granting EndpointSlice and pod get permissions. The defrag Role/RoleBinding remain HA-only. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
87cf121 to
39ffc19
Compare
Use apierrors.IsNotFound to only fall through to Create when the EndpointSlice genuinely doesn't exist. RBAC, timeout, and other errors are now returned immediately instead of being masked by a subsequent Create attempt. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Refactor selfRegisterEndpointSlice to extract testable ensureEndpointSlice function that accepts a kubernetes.Interface, enabling tests with the fake client. Tests cover: create, update, IPv6, missing pod, invalid DNS name, multiple pods, and field correctness. Coverage: ensureEndpointSlice 88.9%, parseServiceName 100%. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/cc patjlm |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
dnsresolver/cmd.go (1)
177-182:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winReject malformed non-service DNS names before self-registering.
parseServiceNameonly checks the label count and then trustsparts[1]. With--self-register, an input likefoo.bar.bazwould still create or update an EndpointSlice for servicebarin the current namespace instead of failing fast. Please validate the documented<pod>.<service>.<namespace>.svc[.cluster.local]shape here before returning a service name.As per coding guidelines, "Validate at trust boundaries with allow-lists, not deny-lists".Proposed fix
func parseServiceName(dnsName string) (string, error) { parts := strings.Split(dnsName, ".") - if len(parts) < 3 { - return "", fmt.Errorf("expected at least 3 dot-separated components, got %d", len(parts)) + if len(parts) < 4 || parts[0] == "" || parts[1] == "" || parts[2] == "" || parts[3] != "svc" { + return "", fmt.Errorf("expected <pod>.<service>.<namespace>.svc[.cluster.local], got %q", dnsName) } return parts[1], nil }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@dnsresolver/cmd.go` around lines 177 - 182, parseServiceName currently only checks label count and trusts parts[1]; instead validate the full DNS shape pod.service.namespace.svc[.cluster.local] before returning the service. In parseServiceName, split on '.', ensure no empty labels, require parts[3] == "svc", allow either len(parts)==4 or len(parts)==6 with parts[4]=="cluster" && parts[5]=="local", and only then return parts[1]; otherwise return a descriptive error rejecting the malformed name.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@dnsresolver/cmd.go`:
- Around line 177-182: parseServiceName currently only checks label count and
trusts parts[1]; instead validate the full DNS shape
pod.service.namespace.svc[.cluster.local] before returning the service. In
parseServiceName, split on '.', ensure no empty labels, require parts[3] ==
"svc", allow either len(parts)==4 or len(parts)==6 with parts[4]=="cluster" &&
parts[5]=="local", and only then return parts[1]; otherwise return a descriptive
error rejecting the malformed name.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 11fd82fa-916a-486b-a98c-7ddd6ed5ad67
📒 Files selected for processing (2)
dnsresolver/cmd.godnsresolver/cmd_test.go
|
/cc @garrickdabbs |
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/test all |
|
/pipeline required |
|
Scheduling tests matching the |
Test Resultse2e-aws
e2e-aks
|
✅ All PR feedback addressed in commit 6597d3dAll blocking issues and suggestions from @bryan-cox have been implemented: Blocking Issues Fixed
Suggestions Implemented
Verification
All review comments have been resolved. Ready for re-review! 🚀 |
|
/retest |
|
/approve |
|
/test e2e-aks |
|
/lgtm |
|
@galchammat: changing LGTM is restricted to collaborators DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/retest-required |
Address feedback from @enxebre to document why --self-register is needed and improve IP address parsing. Changes: - Added comment in statefulset.yaml explaining --self-register flag purpose (bypasses stale EndpointSlice cache under high HCP density) - Added comment in dnsresolver/cmd.go documenting the flag - Replaced manual net.ParseIP().To4() check with netutil.IsIPv4Address() which properly handles ParseIP returning nil and returns an error instead of silently failing Signed-off-by: Cliff Schomburg <7424213+cssjr@users.noreply.github.com> Commit-Message-Assisted-by: Claude (via Claude Code)
✅ Addressed feedback from @enxebre in commit c8b0fa1All 4 comments have been addressed: 1. Added comment in statefulset.yaml (line 192) ✅Added a 3-line comment explaining why # Self-register this pod's IP into an EndpointSlice before waiting for DNS resolution.
# This bypasses stale kube-controller-manager EndpointSlice cache issues under high HCP density
# (2,500+ namespaces), where the standard controller can delay DNS updates by minutes to hours.2. Added comment in cmd.go (line 46) ✅Added a 3-line comment documenting the // --self-register creates an EndpointSlice for this pod before waiting for DNS resolution.
// This bypasses stale kube-controller-manager EndpointSlice cache issues under high HCP density,
// where the standard controller can delay DNS updates by minutes to hours.3. Fixed IP parsing to handle nil and avoid silent failure (line 112) ✅Replaced manual
Before: addressType := discoveryv1.AddressTypeIPv4
if net.ParseIP(podIP) != nil && net.ParseIP(podIP).To4() == nil {
addressType = discoveryv1.AddressTypeIPv6
}After: isIPv4, err := netutil.IsIPv4Address(podIP)
if err != nil {
return fmt.Errorf("failed to parse pod IP %s: %w", podIP, err)
}
addressType := discoveryv1.AddressTypeIPv6
if isIPv4 {
addressType = discoveryv1.AddressTypeIPv4
}4. Used
|
Add test to verify that ensureEndpointSlice properly returns an error when given a malformed IP address (not IPv4 or IPv6). This ensures the netutil.IsIPv4Address() error handling works correctly and doesn't silently fail on invalid input. Signed-off-by: Cliff Schomburg <7424213+cssjr@users.noreply.github.com> Commit-Message-Assisted-by: Claude (via Claude Code)
|
/test e2e-aks |
AI Test Failure AnalysisJob: Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6 |
|
/retest-required |
|
No symptom labels available. Now I have all the evidence needed for the report: Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryThe Root CauseThe failure chain is:
Why this is unrelated to PR #8613: The PR modifies only etcd-related files (etcd StatefulSet, ServiceAccount, RBAC roles/bindings, and Recommendations
Evidence
|
csrwng
left a comment
There was a problem hiding this comment.
LGTM — clean implementation of the EndpointSlice self-registration pattern.
RBAC is well-scoped (get/create/update EndpointSlices + get Pods), the custom managed-by label avoids conflict with the standard controller, and the OwnerReference to the Pod ensures proper GC. The backwards-compatible SA split (keeping etcd-defrag-controller for HA, adding etcd for all topologies) addresses the upgrade concern. Tests cover create, update, IPv6, error cases, and multi-pod scenarios.
|
Tests from second stage were triggered manually. Pipeline can be controlled only manually, until HEAD changes. Use command to trigger second stage. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bryan-cox, csrwng, cssjr The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/retest-required |
|
/test e2e-azure-v2-self-managed |
|
@cssjr: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Summary
ensure-dnsinit container, bypassing stale kube-controller-manager informer caches that delay DNS resolution under high HCP densityetcdservice account with minimal RBAC for EndpointSlice and pod get permissionsetcd-defrag-controllerSA with the always-presentetcdSA, binding both self-register and defrag (HA-only) roles to itProblem
Under high cluster density (~2,500 HCP namespaces), the management cluster's kube-controller-manager EndpointSlice informer cache goes stale, causing
etcd-discoveryheadless Service DNS records to be delayed by minutes to hours. Theensure-dnsinit container blocks waiting for DNS, preventing etcd from starting, which cascades into cluster creation timeouts.Evidence from Kusto (HostedControlPlaneLogs, hcp-dev-us-2.eastus2)
Scale: Hundreds to thousands of clusters affected daily across CI:
Breakdown (last 3 days): Of 2,299 etcd pods that experienced DNS failures, 89.5% resolved within 30 seconds, but 8.1% (187 pods) took >5 minutes — these are the ones that cause E2E test timeouts.
Root Cause: EndpointSlice informer cache staleness
Confirmed via
aksEventstable (categorykube-controller-manager) on May 14:This appeared for dozens of HCP namespaces — individual namespaces saw 136–227 sync errors. A secondary pattern (
"skipping Pod for Service, Node not found") shows the same informer staleness affecting node visibility, with top nodes seeing 330–557 skip events.What's ruled out:
Specific case timeline (etcd-0 in a CI namespace, May 26):
etcd-discoveryServiceensure-dnsstarts failing —"no such host"10.128.64.186The pod had its IP within 4 seconds, the Service existed 30 seconds before the pod, CoreDNS was healthy — the EndpointSlice controller simply hadn't updated the headless Service's endpoints.
Impact on E2E stability
All three stage E2E jobs show poor stability:
periodic-stage-e2e-parallelperiodic-stage-e2e-parallel-ocp-nightlybranch-...-stage-e2e-parallelThe Jira's specific failure was in
branch-ci-Azure-ARO-HCP-main-e2e-stage-e2e-parallelbuild2053695569343287296(May 11 04:36 UTC), where thepatch-name-clusterhit the 45-minute E2E timeout due to this etcd DNS issue.Solution: Etcd Pod Self-Registration
Each etcd pod creates its own EndpointSlice during the
ensure-dnsinit container phase:Why this works:
managed-by: endpointslice-controller.k8s.io— it completely ignores slices with a differentmanaged-bylabelPrecedent: The HCCO already uses this pattern for the
kubernetesEndpointSlice in the guest cluster (ReconcileKASEndpointSliceinhostedclusterconfigoperator/controllers/resources/kas/reconcile.go).Changed files
dnsresolver/cmd.go--self-registerflag andselfRegisterEndpointSlice()— creates EndpointSlice via K8s API before polling DNS, falls back to DNS-only on failurednsresolver/cmd_test.gov2/assets/etcd/statefulset.yamlPOD_IPenv var and--self-registerflag toensure-dnsinit containerv2/assets/etcd/etcd-serviceaccount.yamletcdservice account (always created)v2/assets/etcd/etcd-self-register-role.yamlv2/assets/etcd/etcd-self-register-rolebinding.yamletcdSAv2/assets/etcd/defrag-rolebinding.yamletcd-defrag-controllertoetcdSAv2/etcd/component.gov2/etcd/statefulset.goServiceAccountName = "etcd"(not conditionally for HA)manifests/etcd.goEtcdServiceAccount,EtcdSelfRegisterRole,EtcdSelfRegisterRoleBindinghelpersRisks and mitigations
Test plan
dnsresolver/cmd_test.go)make buildcompiles cleanlymake lintpasses with 0 issues🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Tests