Skip to content

AROSLSRE-830: Add etcd EndpointSlice self-registration to fix DNS resolution delays#8613

Open
cssjr wants to merge 7 commits into
openshift:mainfrom
cssjr:etcd-endpointslice-self-register
Open

AROSLSRE-830: Add etcd EndpointSlice self-registration to fix DNS resolution delays#8613
cssjr wants to merge 7 commits into
openshift:mainfrom
cssjr:etcd-endpointslice-self-register

Conversation

@cssjr
Copy link
Copy Markdown

@cssjr cssjr commented May 27, 2026

Summary

  • Adds EndpointSlice self-registration to the etcd ensure-dns init container, bypassing stale kube-controller-manager informer caches that delay DNS resolution under high HCP density
  • Creates a dedicated etcd service account with minimal RBAC for EndpointSlice and pod get permissions
  • Replaces the conditional etcd-defrag-controller SA with the always-present etcd SA, binding both self-register and defrag (HA-only) roles to it

Problem

Under high cluster density (~2,500 HCP namespaces), the management cluster's kube-controller-manager EndpointSlice informer cache goes stale, causing etcd-discovery headless Service DNS records to be delayed by minutes to hours. The ensure-dns init container blocks waiting for DNS, preventing etcd from starting, which cascades into cluster creation timeouts.

Evidence from Kusto (HostedControlPlaneLogs, hcp-dev-us-2.eastus2)

Scale: Hundreds to thousands of clusters affected daily across CI:

Date         Clusters   Failures   Prolonged(>5m)   Brief(<=5m)
2026-05-13        675     15,947           37            638
2026-05-14      1,210     24,869          142          1,063
2026-05-19      1,542     29,622          248          1,297
2026-05-20      1,228     22,563          180          1,039
2026-05-21      1,311     23,130          174          1,135
2026-05-26        980     17,884          138            842

Breakdown (last 3 days): Of 2,299 etcd pods that experienced DNS failures, 89.5% resolved within 30 seconds, but 8.1% (187 pods) took >5 minutes — these are the ones that cause E2E test timeouts.

Root Cause: EndpointSlice informer cache staleness

Confirmed via aksEvents table (category kube-controller-manager) on May 14:

"Error syncing endpoint slices for service, retrying"
  key="ocm-arohcpprow-<cluster-id>/etcd-discovery"
  err="EndpointSlice informer cache is out of date"

This appeared for dozens of HCP namespaces — individual namespaces saw 136–227 sync errors. A secondary pattern ("skipping Pod for Service, Node not found") shows the same informer staleness affecting node visibility, with top nodes seeing 330–557 skip events.

What's ruled out:

Hypothesis Evidence Verdict
CoreDNS unhealthy No CoreDNS errors in logs Ruled out
Node-specific issue Failures evenly distributed across all nodes Ruled out
CNI/IP assignment delay CNS assigned IPs within seconds Ruled out
Service creation race Framework guarantees Service before StatefulSet Ruled out

Specific case timeline (etcd-0 in a CI namespace, May 26):

Time Event
17:57:52 CPO reconciles etcd-discovery Service
17:57:53 CPO reconciles etcd StatefulSet
17:58:22 ensure-dns starts failing — "no such host"
17:58:26 CNS assigns pod IP 10.128.64.186
18:18:54 DNS finally resolves — 20 minutes after IP assignment

The pod had its IP within 4 seconds, the Service existed 30 seconds before the pod, CoreDNS was healthy — the EndpointSlice controller simply hadn't updated the headless Service's endpoints.

Impact on E2E stability

All three stage E2E jobs show poor stability:

Job Success Rate (last 20 runs)
periodic-stage-e2e-parallel 40%
periodic-stage-e2e-parallel-ocp-nightly 20%
branch-...-stage-e2e-parallel 60%

The Jira's specific failure was in branch-ci-Azure-ARO-HCP-main-e2e-stage-e2e-parallel build 2053695569343287296 (May 11 04:36 UTC), where the patch-name-cluster hit the 45-minute E2E timeout due to this etcd DNS issue.

Solution: Etcd Pod Self-Registration

Each etcd pod creates its own EndpointSlice during the ensure-dns init container phase:

etcd-discovery Service (headless, selector: app=etcd)
    |
    +-- etcd-discovery-xxxxx  (managed-by: endpointslice-controller.k8s.io)
    |   [Standard controller — may be slow/stale under high density]
    |
    +-- etcd-discovery-self-etcd-0  (managed-by: control-plane-operator)
    |   [Created by etcd-0 init container — immediate]
    |
    +-- etcd-discovery-self-etcd-1  (managed-by: control-plane-operator)
    +-- etcd-discovery-self-etcd-2  (managed-by: control-plane-operator)

Why this works:

  • The standard EndpointSlice controller only manages slices labeled managed-by: endpointslice-controller.k8s.io — it completely ignores slices with a different managed-by label
  • CoreDNS merges endpoints from ALL EndpointSlices for a given Service regardless of who created them
  • Once the self-registered slice exists, CoreDNS picks it up via its watch within ~100ms
  • When the standard controller catches up, duplicate IPs are harmless for DNS
  • OwnerReference to the Pod ensures cleanup on pod deletion

Precedent: The HCCO already uses this pattern for the kubernetes EndpointSlice in the guest cluster (ReconcileKASEndpointSlice in hostedclusterconfigoperator/controllers/resources/kas/reconcile.go).

Changed files

File Change
dnsresolver/cmd.go Added --self-register flag and selfRegisterEndpointSlice() — creates EndpointSlice via K8s API before polling DNS, falls back to DNS-only on failure
dnsresolver/cmd_test.go Unit tests for DNS name parsing
v2/assets/etcd/statefulset.yaml Added POD_IP env var and --self-register flag to ensure-dns init container
v2/assets/etcd/etcd-serviceaccount.yaml New etcd service account (always created)
v2/assets/etcd/etcd-self-register-role.yaml Role granting EndpointSlice get/create/update and pod get
v2/assets/etcd/etcd-self-register-rolebinding.yaml Binds role to etcd SA
v2/assets/etcd/defrag-rolebinding.yaml Updated subject from etcd-defrag-controller to etcd SA
v2/etcd/component.go Registered new RBAC assets, removed old defrag SA
v2/etcd/statefulset.go Always set ServiceAccountName = "etcd" (not conditionally for HA)
manifests/etcd.go Added EtcdServiceAccount, EtcdSelfRegisterRole, EtcdSelfRegisterRoleBinding helpers

Risks and mitigations

Risk Mitigation
Duplicate endpoints when controller catches up CoreDNS deduplicates; harmless for DNS resolution
Self-registered EndpointSlice outlives pod OwnerReference to Pod triggers GC on pod deletion
RBAC escalation Scoped to get/create/update EndpointSlices + get pods in own namespace only
Init container fails to create EndpointSlice Falls back to DNS-only behavior with a warning log

Test plan

  • Unit tests for DNS name parsing (dnsresolver/cmd_test.go)
  • Existing etcd component tests pass (14/14)
  • make build compiles cleanly
  • make lint passes with 0 issues
  • E2E validation in staging environment
  • Verify EndpointSlice creation in a live HCP namespace
  • Verify DNS resolution timing improvement

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Etcd pods can self-register EndpointSlices via an optional DNS-resolver self-register mode (warns and falls back to DNS-only on failure).
    • RBAC and ServiceAccount updates to support etcd self-registration and unify controller bindings; defrag controller now uses the standardized etcd ServiceAccount.
    • Init behavior updated to expose pod IP for self-registration.
  • Tests

    • Added unit tests for DNS service-name parsing and EndpointSlice reconciliation.

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 27, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 27, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 27, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented May 27, 2026

@cssjr: This pull request references AROSLSRE-830 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary

  • Adds EndpointSlice self-registration to the etcd ensure-dns init container, bypassing stale kube-controller-manager informer caches that delay DNS resolution under high HCP density
  • Creates a dedicated etcd service account with minimal RBAC for EndpointSlice and pod get permissions
  • Replaces the conditional etcd-defrag-controller SA with the always-present etcd SA, binding both self-register and defrag (HA-only) roles to it

Problem

Under high cluster density (~2,500 HCP namespaces), the management cluster's kube-controller-manager EndpointSlice informer cache goes stale, causing etcd-discovery headless Service DNS records to be delayed by minutes to hours. The ensure-dns init container blocks waiting for DNS, preventing etcd from starting, which cascades into cluster creation timeouts.

Evidence from Kusto (HostedControlPlaneLogs, hcp-dev-us-2.eastus2)

Scale: Hundreds to thousands of clusters affected daily across CI:

Date         Clusters   Failures   Prolonged(>5m)   Brief(<=5m)
2026-05-13        675     15,947           37            638
2026-05-14      1,210     24,869          142          1,063
2026-05-19      1,542     29,622          248          1,297
2026-05-20      1,228     22,563          180          1,039
2026-05-21      1,311     23,130          174          1,135
2026-05-26        980     17,884          138            842

Breakdown (last 3 days): Of 2,299 etcd pods that experienced DNS failures, 89.5% resolved within 30 seconds, but 8.1% (187 pods) took >5 minutes — these are the ones that cause E2E test timeouts.

Root Cause: EndpointSlice informer cache staleness

Confirmed via aksEvents table (category kube-controller-manager) on May 14:

"Error syncing endpoint slices for service, retrying"
 key="ocm-arohcpprow-<cluster-id>/etcd-discovery"
 err="EndpointSlice informer cache is out of date"

This appeared for dozens of HCP namespaces — individual namespaces saw 136–227 sync errors. A secondary pattern ("skipping Pod for Service, Node not found") shows the same informer staleness affecting node visibility, with top nodes seeing 330–557 skip events.

What's ruled out:

Hypothesis Evidence Verdict
CoreDNS unhealthy No CoreDNS errors in logs Ruled out
Node-specific issue Failures evenly distributed across all nodes Ruled out
CNI/IP assignment delay CNS assigned IPs within seconds Ruled out
Service creation race Framework guarantees Service before StatefulSet Ruled out

Specific case timeline (etcd-0 in a CI namespace, May 26):

Time Event
17:57:52 CPO reconciles etcd-discovery Service
17:57:53 CPO reconciles etcd StatefulSet
17:58:22 ensure-dns starts failing — "no such host"
17:58:26 CNS assigns pod IP 10.128.64.186
18:18:54 DNS finally resolves — 20 minutes after IP assignment

The pod had its IP within 4 seconds, the Service existed 30 seconds before the pod, CoreDNS was healthy — the EndpointSlice controller simply hadn't updated the headless Service's endpoints.

Impact on E2E stability

All three stage E2E jobs show poor stability:

Job Success Rate (last 20 runs)
periodic-stage-e2e-parallel 40%
periodic-stage-e2e-parallel-ocp-nightly 20%
branch-...-stage-e2e-parallel 60%

The Jira's specific failure was in branch-ci-Azure-ARO-HCP-main-e2e-stage-e2e-parallel build 2053695569343287296 (May 11 04:36 UTC), where the patch-name-cluster hit the 45-minute E2E timeout due to this etcd DNS issue.

Solution: Etcd Pod Self-Registration

Each etcd pod creates its own EndpointSlice during the ensure-dns init container phase:

etcd-discovery Service (headless, selector: app=etcd)
   |
   +-- etcd-discovery-xxxxx  (managed-by: endpointslice-controller.k8s.io)
   |   [Standard controller — may be slow/stale under high density]
   |
   +-- etcd-discovery-self-etcd-0  (managed-by: control-plane-operator)
   |   [Created by etcd-0 init container — immediate]
   |
   +-- etcd-discovery-self-etcd-1  (managed-by: control-plane-operator)
   +-- etcd-discovery-self-etcd-2  (managed-by: control-plane-operator)

Why this works:

  • The standard EndpointSlice controller only manages slices labeled managed-by: endpointslice-controller.k8s.io — it completely ignores slices with a different managed-by label
  • CoreDNS merges endpoints from ALL EndpointSlices for a given Service regardless of who created them
  • Once the self-registered slice exists, CoreDNS picks it up via its watch within ~100ms
  • When the standard controller catches up, duplicate IPs are harmless for DNS
  • OwnerReference to the Pod ensures cleanup on pod deletion

Precedent: The HCCO already uses this pattern for the kubernetes EndpointSlice in the guest cluster (ReconcileKASEndpointSlice in hostedclusterconfigoperator/controllers/resources/kas/reconcile.go).

Changed files

File Change
dnsresolver/cmd.go Added --self-register flag and selfRegisterEndpointSlice() — creates EndpointSlice via K8s API before polling DNS, falls back to DNS-only on failure
dnsresolver/cmd_test.go Unit tests for DNS name parsing
v2/assets/etcd/statefulset.yaml Added POD_IP env var and --self-register flag to ensure-dns init container
v2/assets/etcd/etcd-serviceaccount.yaml New etcd service account (always created)
v2/assets/etcd/etcd-self-register-role.yaml Role granting EndpointSlice get/create/update and pod get
v2/assets/etcd/etcd-self-register-rolebinding.yaml Binds role to etcd SA
v2/assets/etcd/defrag-rolebinding.yaml Updated subject from etcd-defrag-controller to etcd SA
v2/etcd/component.go Registered new RBAC assets, removed old defrag SA
v2/etcd/statefulset.go Always set ServiceAccountName = "etcd" (not conditionally for HA)
manifests/etcd.go Added EtcdServiceAccount, EtcdSelfRegisterRole, EtcdSelfRegisterRoleBinding helpers

Risks and mitigations

Risk Mitigation
Duplicate endpoints when controller catches up CoreDNS deduplicates; harmless for DNS resolution
Self-registered EndpointSlice outlives pod OwnerReference to Pod triggers GC on pod deletion
RBAC escalation Scoped to get/create/update EndpointSlices + get pods in own namespace only
Init container fails to create EndpointSlice Falls back to DNS-only behavior with a warning log

Test plan

  • Unit tests for DNS name parsing (dnsresolver/cmd_test.go)
  • Existing etcd component tests pass (14/14)
  • make build compiles cleanly
  • make lint passes with 0 issues
  • E2E validation in staging environment
  • Verify EndpointSlice creation in a live HCP namespace
  • Verify DNS resolution timing improvement

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 27, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR implements etcd self-registration: adds manifest constructors and YAML assets (ServiceAccount, etcd-self-register Role/RoleBinding; updates defrag RoleBinding subject), updates the StatefulSet initContainer to pass --self-register and POD_IP and uses the unified etcd ServiceAccount, updates component wiring, and extends dnsresolver with a --self-register flag plus selfRegisterEndpointSlice/ensureEndpointSlice/parseServiceName and unit tests.

Sequence Diagram(s)

sequenceDiagram
  participant dnsresolver
  participant KubeAPI
  participant EndpointSlice
  participant DNS

  dnsresolver->>KubeAPI: ensureEndpointSlice(create/update EndpointSlice for pod IP and targetRef)
  KubeAPI->>EndpointSlice: persist EndpointSlice
  dnsresolver->>DNS: resolveDNS(dnsName)
  DNS-->>dnsresolver: resolution results
Loading

Suggested reviewers

  • patjlm
🚥 Pre-merge checks | ✅ 10 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Test Structure And Quality ⚠️ Warning Tests lack meaningful assertion failure messages (criterion #4). 26+ assertions use bare Expect() calls without messages; custom check requires context messages per documented examples. Add failure context messages to all assertions: e.g., Expect(err).NotTo(HaveOccurred(), "failed to ensure EndpointSlice") and Expect(slice.Labels[...]).To(Equal(...), "service name label mismatch").
✅ Passed checks (10 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly describes the main change: adding etcd EndpointSlice self-registration to resolve DNS delays, which aligns with the core purpose of the PR.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All test names added in PR are stable, deterministic. No dynamic values in titles—pod names, UUIDs, timestamps, node names, IPs appear only in test bodies, not names.
Topology-Aware Scheduling Compatibility ✅ Passed No topology-unaware scheduling constraints found in statefulset, RBAC, PDB, or code. Defrag controller is topology-aware and only runs on HighlyAvailable clusters.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed PR adds standard Go unit tests (testing package), not Ginkgo e2e tests. Custom check applies only to Ginkgo e2e tests, which are absent.
No-Weak-Crypto ✅ Passed No weak cryptography found. PR adds etcd EndpointSlice self-registration using only standard Kubernetes APIs with no cryptographic operations, weak algorithms, or insecure comparisons.
Container-Privileges ✅ Passed No privileged containers, host access, elevated capabilities, or inappropriate privilege escalation settings found. All RBAC is properly scoped with minimal necessary permissions.
No-Sensitive-Data-In-Logs ✅ Passed No passwords, tokens, API keys, PII, session IDs, or customer data found in logging statements. Infrastructure metadata (pod IPs, DNS names) are logged but not classified as sensitive.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot added area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release and removed do-not-merge/needs-area labels May 27, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@dnsresolver/cmd.go`:
- Around line 150-164: The Get call in selfRegisterEndpointSlice currently
treats any error as "not found" and proceeds to Create; change the error
handling to return immediately for non-NotFound errors from
client.DiscoveryV1().EndpointSlices(...).Get(ctx, sliceName, ...). Specifically,
import and use apierrors.IsNotFound(err) (k8s.io/apimachinery/pkg/api/errors)
and: if err == nil keep the Update path, else if apierrors.IsNotFound(err)
perform the Create path, otherwise return the original Get error (wrap with
fmt.Errorf). This prevents masking RBAC/timeout errors while preserving the
existing Update/Create logic and messages.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 64afd625-994c-4a36-ad76-4ecbbb5a19ca

📥 Commits

Reviewing files that changed from the base of the PR and between 0aa20fc and 87cf121.

📒 Files selected for processing (10)
  • control-plane-operator/controllers/hostedcontrolplane/manifests/etcd.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/defrag-rolebinding.yaml
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/etcd-self-register-role.yaml
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/etcd-self-register-rolebinding.yaml
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/etcd-serviceaccount.yaml
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/statefulset.yaml
  • control-plane-operator/controllers/hostedcontrolplane/v2/etcd/component.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/etcd/statefulset.go
  • dnsresolver/cmd.go
  • dnsresolver/cmd_test.go

Comment thread dnsresolver/cmd.go
@codecov
Copy link
Copy Markdown

codecov Bot commented May 27, 2026

Codecov Report

❌ Patch coverage is 49.67320% with 77 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.28%. Comparing base (2f52041) to head (824480d).
⚠️ Report is 137 commits behind head on main.

Files with missing lines Patch % Lines
dnsresolver/cmd.go 65.51% 38 Missing and 2 partials ⚠️
...r/controllers/hostedcontrolplane/manifests/etcd.go 0.00% 21 Missing ⚠️
...ontrollers/hostedcontrolplane/v2/etcd/component.go 0.00% 13 Missing ⚠️
...trollers/hostedcontrolplane/v2/etcd/statefulset.go 0.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8613      +/-   ##
==========================================
+ Coverage   40.61%   41.28%   +0.67%     
==========================================
  Files         755      755              
  Lines       93227    93597     +370     
==========================================
+ Hits        37864    38642     +778     
+ Misses      52640    52221     -419     
- Partials     2723     2734      +11     
Files with missing lines Coverage Δ
...trollers/hostedcontrolplane/v2/etcd/statefulset.go 40.16% <0.00%> (-0.67%) ⬇️
...ontrollers/hostedcontrolplane/v2/etcd/component.go 10.20% <0.00%> (-3.69%) ⬇️
...r/controllers/hostedcontrolplane/manifests/etcd.go 45.83% <0.00%> (-7.83%) ⬇️
dnsresolver/cmd.go 53.52% <65.51%> (+53.52%) ⬆️

... and 40 files with indirect coverage changes

Flag Coverage Δ
cmd-support 34.86% <ø> (+0.16%) ⬆️
cpo-hostedcontrolplane 43.41% <0.00%> (+1.63%) ⬆️
cpo-other 42.79% <ø> (+1.72%) ⬆️
hypershift-operator 51.00% <ø> (+0.24%) ⬆️
other 32.04% <65.51%> (+0.45%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…ntainer

Under high HCP density (~2,500 namespaces), the management cluster's
kube-controller-manager EndpointSlice informer cache goes stale, delaying
etcd-discovery headless Service DNS records by minutes to hours. This
causes etcd pods to block in the ensure-dns init container, preventing
cluster creation and triggering E2E test timeouts.

Each etcd pod now self-registers its IP into a custom EndpointSlice
(managed-by: control-plane-operator) before polling DNS. The standard
EndpointSlice controller ignores slices with a different managed-by
label, so there is no conflict. CoreDNS picks up the custom slice via
its watch within ~100ms, making DNS resolution near-instant regardless
of controller informer health.

A new dedicated "etcd" service account replaces the conditional
"etcd-defrag-controller" SA, with RBAC granting EndpointSlice and pod
get permissions. The defrag Role/RoleBinding remain HA-only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cssjr cssjr force-pushed the etcd-endpointslice-self-register branch from 87cf121 to 39ffc19 Compare May 27, 2026 21:13
cssjr and others added 2 commits May 27, 2026 14:16
Use apierrors.IsNotFound to only fall through to Create when the
EndpointSlice genuinely doesn't exist. RBAC, timeout, and other errors
are now returned immediately instead of being masked by a subsequent
Create attempt.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Refactor selfRegisterEndpointSlice to extract testable
ensureEndpointSlice function that accepts a kubernetes.Interface,
enabling tests with the fake client. Tests cover: create, update,
IPv6, missing pod, invalid DNS name, multiple pods, and field
correctness.

Coverage: ensureEndpointSlice 88.9%, parseServiceName 100%.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cssjr
Copy link
Copy Markdown
Author

cssjr commented May 27, 2026

/cc patjlm

@openshift-ci openshift-ci Bot requested a review from patjlm May 27, 2026 21:22
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
dnsresolver/cmd.go (1)

177-182: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Reject malformed non-service DNS names before self-registering.

parseServiceName only checks the label count and then trusts parts[1]. With --self-register, an input like foo.bar.baz would still create or update an EndpointSlice for service bar in the current namespace instead of failing fast. Please validate the documented <pod>.<service>.<namespace>.svc[.cluster.local] shape here before returning a service name.

Proposed fix
 func parseServiceName(dnsName string) (string, error) {
 	parts := strings.Split(dnsName, ".")
-	if len(parts) < 3 {
-		return "", fmt.Errorf("expected at least 3 dot-separated components, got %d", len(parts))
+	if len(parts) < 4 || parts[0] == "" || parts[1] == "" || parts[2] == "" || parts[3] != "svc" {
+		return "", fmt.Errorf("expected <pod>.<service>.<namespace>.svc[.cluster.local], got %q", dnsName)
 	}
 	return parts[1], nil
 }
As per coding guidelines, "Validate at trust boundaries with allow-lists, not deny-lists".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@dnsresolver/cmd.go` around lines 177 - 182, parseServiceName currently only
checks label count and trusts parts[1]; instead validate the full DNS shape
pod.service.namespace.svc[.cluster.local] before returning the service. In
parseServiceName, split on '.', ensure no empty labels, require parts[3] ==
"svc", allow either len(parts)==4 or len(parts)==6 with parts[4]=="cluster" &&
parts[5]=="local", and only then return parts[1]; otherwise return a descriptive
error rejecting the malformed name.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@dnsresolver/cmd.go`:
- Around line 177-182: parseServiceName currently only checks label count and
trusts parts[1]; instead validate the full DNS shape
pod.service.namespace.svc[.cluster.local] before returning the service. In
parseServiceName, split on '.', ensure no empty labels, require parts[3] ==
"svc", allow either len(parts)==4 or len(parts)==6 with parts[4]=="cluster" &&
parts[5]=="local", and only then return parts[1]; otherwise return a descriptive
error rejecting the malformed name.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 11fd82fa-916a-486b-a98c-7ddd6ed5ad67

📥 Commits

Reviewing files that changed from the base of the PR and between b877681 and 30ffd78.

📒 Files selected for processing (2)
  • dnsresolver/cmd.go
  • dnsresolver/cmd_test.go

@cssjr
Copy link
Copy Markdown
Author

cssjr commented May 27, 2026

/cc @garrickdabbs

@openshift-ci openshift-ci Bot requested a review from garrickdabbs May 27, 2026 21:30
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cssjr
Copy link
Copy Markdown
Author

cssjr commented May 27, 2026

/test all

@cssjr
Copy link
Copy Markdown
Author

cssjr commented May 27, 2026

/pipeline required

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@cwbotbot
Copy link
Copy Markdown

cwbotbot commented May 28, 2026

Test Results

e2e-aws

e2e-aks

@cssjr cssjr marked this pull request as ready for review May 28, 2026 14:29
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 28, 2026
@openshift-ci openshift-ci Bot requested review from devguyio and sjenning May 28, 2026 14:30
@cssjr
Copy link
Copy Markdown
Author

cssjr commented May 29, 2026

✅ All PR feedback addressed in commit 6597d3d

All blocking issues and suggestions from @bryan-cox have been implemented:

Blocking Issues Fixed

  1. ServiceAccount backwards compatibility

    • Kept etcd-defrag-controller SA (HA-only) separate from etcd SA (always present)
    • Restored defrag-serviceaccount.yaml to avoid orphaned SAs on upgrade
    • Added etcd-self-register-rolebinding-defrag.yaml to bind self-register role to defrag SA in HA mode
    • Updated defrag-rolebinding.yaml to point to etcd-defrag-controller
    • Updated statefulset.go to conditionally use correct SA
  2. Fully qualified managed-by label

    • Changed to "etcd-self-register.hypershift.openshift.io"

Suggestions Implemented

  1. Explicit GC fields in OwnerReference

    • Added Controller: ptr.To(true) and BlockOwnerDeletion: ptr.To(true)
  2. SA usage documentation

    • Added comprehensive comment explaining HA vs non-HA SA usage
  3. Test context handling

    • Replaced all 12 instances of context.Background() with t.Context()
  4. Better error assertions

    • Used gomega's MatchError matcher (2 instances)
  5. Cleaner IP generation

    • Replaced string(rune('1'+i)) with fmt.Sprintf("10.128.64.%d", 186+i)

Verification

  • ✅ All tests pass
  • make build succeeds
  • make lint passes
  • make pre-commit passes
  • ✅ Commit message follows HyperShift conventions

All review comments have been resolved. Ready for re-review! 🚀

@cssjr
Copy link
Copy Markdown
Author

cssjr commented Jun 1, 2026

/retest

@bryan-cox
Copy link
Copy Markdown
Member

/approve

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 2, 2026
@cssjr
Copy link
Copy Markdown
Author

cssjr commented Jun 2, 2026

/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@galchammat
Copy link
Copy Markdown

/lgtm

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 2, 2026

@galchammat: changing LGTM is restricted to collaborators

Details

In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@cssjr
Copy link
Copy Markdown
Author

cssjr commented Jun 2, 2026

/retest-required

Comment thread dnsresolver/cmd.go
Comment thread dnsresolver/cmd.go Outdated
Address feedback from @enxebre to document why --self-register is
needed and improve IP address parsing.

Changes:
- Added comment in statefulset.yaml explaining --self-register flag
  purpose (bypasses stale EndpointSlice cache under high HCP density)
- Added comment in dnsresolver/cmd.go documenting the flag
- Replaced manual net.ParseIP().To4() check with
  netutil.IsIPv4Address() which properly handles ParseIP returning
  nil and returns an error instead of silently failing

Signed-off-by: Cliff Schomburg <7424213+cssjr@users.noreply.github.com>
Commit-Message-Assisted-by: Claude (via Claude Code)
@cssjr
Copy link
Copy Markdown
Author

cssjr commented Jun 3, 2026

✅ Addressed feedback from @enxebre in commit c8b0fa1

All 4 comments have been addressed:

1. Added comment in statefulset.yaml (line 192) ✅

Added a 3-line comment explaining why --self-register is needed:

# Self-register this pod's IP into an EndpointSlice before waiting for DNS resolution.
# This bypasses stale kube-controller-manager EndpointSlice cache issues under high HCP density
# (2,500+ namespaces), where the standard controller can delay DNS updates by minutes to hours.

2. Added comment in cmd.go (line 46) ✅

Added a 3-line comment documenting the --self-register flag:

// --self-register creates an EndpointSlice for this pod before waiting for DNS resolution.
// This bypasses stale kube-controller-manager EndpointSlice cache issues under high HCP density,
// where the standard controller can delay DNS updates by minutes to hours.

3. Fixed IP parsing to handle nil and avoid silent failure (line 112) ✅

Replaced manual net.ParseIP().To4() check with netutil.IsIPv4Address() which:

  • Returns an error if ParseIP returns nil (no more silent failures)
  • Properly handles the To4() check
  • Follows existing codebase patterns

Before:

addressType := discoveryv1.AddressTypeIPv4
if net.ParseIP(podIP) != nil && net.ParseIP(podIP).To4() == nil {
    addressType = discoveryv1.AddressTypeIPv6
}

After:

isIPv4, err := netutil.IsIPv4Address(podIP)
if err != nil {
    return fmt.Errorf("failed to parse pod IP %s: %w", podIP, err)
}
addressType := discoveryv1.AddressTypeIPv6
if isIPv4 {
    addressType = discoveryv1.AddressTypeIPv4
}

4. Used support/netutil.IsIPv4Address()

Implemented as part of #3 above.

Verification

  • ✅ All tests pass
  • make build succeeds
  • make lint-fix applied
  • make pre-commit passes

Add test to verify that ensureEndpointSlice properly returns an error
when given a malformed IP address (not IPv4 or IPv6).

This ensures the netutil.IsIPv4Address() error handling works correctly
and doesn't silently fail on invalid input.

Signed-off-by: Cliff Schomburg <7424213+cssjr@users.noreply.github.com>
Commit-Message-Assisted-by: Claude (via Claude Code)
@cssjr cssjr requested a review from enxebre June 3, 2026 15:33
@cssjr
Copy link
Copy Markdown
Author

cssjr commented Jun 3, 2026

/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@hypershift-jira-solve-ci
Copy link
Copy Markdown

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2062196402125017088 | Cost: $4.831960000000001 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@cssjr
Copy link
Copy Markdown
Author

cssjr commented Jun 3, 2026

/retest-required

@hypershift-jira-solve-ci
Copy link
Copy Markdown

hypershift-jira-solve-ci Bot commented Jun 3, 2026

No symptom labels available. Now I have all the evidence needed for the report:

Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

TestCreateCluster/Main/EnsureGlobalPullSecret/When_management-cluster_hostedCluster.Spec.PullSecret_is_updated_in-place_it_should_propagate_to_guest_without_rollout (1205.12s)

globalps.go:217:
    Expected success, but got an error:
        failed to wait for DaemonSet global-pull-secret-syncer to be ready: context deadline exceeded

Summary

The TestCreateCluster test failed because the global-pull-secret-syncer DaemonSet in the hosted cluster remained stuck at 2/3 pods ready for over 20 minutes (1205s), exceeding the test's timeout. This was caused by repeated DaemonSet rolling updates (the DaemonSet went through 6 generations with at least 12 distinct pod instances) triggered by the pull secret patch in the EnsureGlobalPullSecret subtest. The maxUnavailable: 1 rolling update strategy means only 2 of 3 pods can be ready at any point during a rollout, and back-to-back generation changes prevented the DaemonSet from ever reaching 3/3 within the test's deadline. All other 596 passing tests confirm the cluster was healthy. This failure is unrelated to PR #8613, which only modifies etcd StatefulSet, ServiceAccount, RBAC, and DNS resolver code — none of the global-pull-secret-syncer or EnsureGlobalPullSecret test code was changed.

Root Cause

The failure chain is:

  1. Primary failure: TestCreateCluster/Main/EnsureGlobalPullSecret/When_management-cluster_hostedCluster.Spec.PullSecret_is_updated_in-place_it_should_propagate_to_guest_without_rollout — The test patches the management-cluster pull secret with a dummy auth entry. This triggers the HyperShift control-plane-operator to detect the pull secret change, recompute the globalps-config-hash label, and update the global-pull-secret-syncer DaemonSet spec. The DaemonSet's generation climbed from 1 to 6 (6 spec updates), spawning at least 12 distinct pod instances (m8qr9, whnvd, rc7xl, jdsn6, d5gw9, lm584, qstsc, q7t89, 2dvzz, q95dk, b8lnz, hhfxk, ln5hm). With maxUnavailable: 1, at most 2/3 pods are ready during any rolling update cycle. Back-to-back rolling updates from repeated spec changes meant the DaemonSet never stabilized at 3/3 within the 20-minute test deadline.

  2. Cascade failure: TestCreateCluster/Main/EnsureGlobalPullSecret/Check_if_the_config.json_is_correct_in_all_of_the_nodes — This subtest tried to create the kubelet-config-verifier DaemonSet, but it already existed from the previous (timed-out) subtest that created it before the wait. The error daemonsets.apps "kubelet-config-verifier" already exists is a direct consequence of the first failure leaving the DaemonSet behind.

  3. Teardown condition message: The top-level TestCreateCluster reported ClusterVersionSucceeding=False: ClusterOperatorNotAvailable(Cluster operator monitoring is not available) — this was a transient 2-second condition check during post-test cleanup (hypershift_framework.go:179). The monitoring ClusterOperator was actually Available=True and Degraded=False at cluster dump time, confirming this was a brief transient during teardown, not a real monitoring failure.

Why this is unrelated to PR #8613: The PR modifies only etcd-related files (etcd StatefulSet, ServiceAccount, RBAC roles/bindings, and dnsresolver/cmd.go). It does not touch the global-pull-secret-syncer DaemonSet, the control-plane-operator's pull secret reconciliation logic, or any of the EnsureGlobalPullSecret test code. This is a pre-existing flaky test scenario where rapid DaemonSet rolling updates under the pull secret mutation test can exceed the hardcoded timeout.

Recommendations
  1. Rerun the CI job — This is a pre-existing flake unrelated to the PR's etcd changes. A /retest should pass.

  2. File a bug for the EnsureGlobalPullSecret test flake — The test's DaemonSet readiness wait does not account for back-to-back rolling updates triggered by the pull secret patch. The timeout (≈20 minutes) is insufficient when the DaemonSet goes through multiple generations. Potential fixes:

    • Wait for the DaemonSet's observedGeneration to match generation before checking pod readiness
    • Add a stabilization check that ensures no rolling update is in progress before starting the readiness poll
    • Clean up the kubelet-config-verifier DaemonSet between subtests to prevent the cascade failure
  3. No changes needed to PR AROSLSRE-830: Add etcd EndpointSlice self-registration to fix DNS resolution delays #8613 — The etcd EndpointSlice self-registration changes are completely orthogonal to this failure.

Evidence
Evidence Detail
Failing test TestCreateCluster/Main/EnsureGlobalPullSecret/When_management-cluster_hostedCluster.Spec.PullSecret_is_updated_in-place_it_should_propagate_to_guest_without_rollout
Failure duration 1205.12s (≈20 minutes) waiting for DaemonSet readiness
DaemonSet state during test global-pull-secret-syncer not ready: 2/3 pods ready — 368 consecutive poll attempts
DaemonSet generations 6 (indicating 6 spec updates / rolling updates)
Distinct pod instances 12+ pods created across rolling updates (m8qr9, whnvd, rc7xl, jdsn6, d5gw9, lm584, qstsc, q7t89, 2dvzz, q95dk, b8lnz, hhfxk, ln5hm)
DaemonSet state after test 3/3 ready (recovered after timeout)
Monitoring operator at dump time Available=True, Degraded=False (healthy)
Cascade failure kubelet-config-verifier DaemonSet already exists (409 Conflict)
Total test results 601 tests, 25 skipped, 5 failures (all in TestCreateCluster tree)
PR #8613 files changed etcd StatefulSet, ServiceAccount, RBAC, dnsresolver — none touch pull-secret or test code
Cluster nodes 3 nodes (ip-10-0-14-72, ip-10-0-29-245, ip-10-0-39-68), all Ready with nodepool-globalps-enabled: "true"

Copy link
Copy Markdown
Contributor

@csrwng csrwng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — clean implementation of the EndpointSlice self-registration pattern.

RBAC is well-scoped (get/create/update EndpointSlices + get Pods), the custom managed-by label avoids conflict with the standard controller, and the OwnerReference to the Pod ensures proper GC. The backwards-compatible SA split (keeping etcd-defrag-controller for HA, adding etcd for all topologies) addresses the upgrade concern. Tests cover create, update, IPv6, error cases, and multi-pod scenarios.

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 3, 2026
@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Tests from second stage were triggered manually. Pipeline can be controlled only manually, until HEAD changes. Use command to trigger second stage.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 3, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bryan-cox, csrwng, cssjr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@cssjr
Copy link
Copy Markdown
Author

cssjr commented Jun 3, 2026

/retest-required

@cssjr
Copy link
Copy Markdown
Author

cssjr commented Jun 4, 2026

/test e2e-azure-v2-self-managed

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 4, 2026

@cssjr: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants