AROSLSRE-830: Add etcd EndpointSlice self-registration to fix DNS resolution delays by cssjr · Pull Request #8613 · openshift/hypershift

cssjr · 2026-05-27T20:54:33Z

Summary

Adds EndpointSlice self-registration to the etcd ensure-dns init container, bypassing stale kube-controller-manager informer caches that delay DNS resolution under high HCP density
Creates a dedicated etcd service account with minimal RBAC for EndpointSlice and pod get permissions
Replaces the conditional etcd-defrag-controller SA with the always-present etcd SA, binding both self-register and defrag (HA-only) roles to it

Problem

Under high cluster density (~2,500 HCP namespaces), the management cluster's kube-controller-manager EndpointSlice informer cache goes stale, causing etcd-discovery headless Service DNS records to be delayed by minutes to hours. The ensure-dns init container blocks waiting for DNS, preventing etcd from starting, which cascades into cluster creation timeouts.

Evidence from Kusto (HostedControlPlaneLogs, hcp-dev-us-2.eastus2)

Scale: Hundreds to thousands of clusters affected daily across CI:

Date         Clusters   Failures   Prolonged(>5m)   Brief(<=5m)
2026-05-13        675     15,947           37            638
2026-05-14      1,210     24,869          142          1,063
2026-05-19      1,542     29,622          248          1,297
2026-05-20      1,228     22,563          180          1,039
2026-05-21      1,311     23,130          174          1,135
2026-05-26        980     17,884          138            842

Breakdown (last 3 days): Of 2,299 etcd pods that experienced DNS failures, 89.5% resolved within 30 seconds, but 8.1% (187 pods) took >5 minutes — these are the ones that cause E2E test timeouts.

Root Cause: EndpointSlice informer cache staleness

Confirmed via aksEvents table (category kube-controller-manager) on May 14:

"Error syncing endpoint slices for service, retrying"
  key="ocm-arohcpprow-<cluster-id>/etcd-discovery"
  err="EndpointSlice informer cache is out of date"

This appeared for dozens of HCP namespaces — individual namespaces saw 136–227 sync errors. A secondary pattern ("skipping Pod for Service, Node not found") shows the same informer staleness affecting node visibility, with top nodes seeing 330–557 skip events.

What's ruled out:

Hypothesis	Evidence	Verdict
CoreDNS unhealthy	No CoreDNS errors in logs	Ruled out
Node-specific issue	Failures evenly distributed across all nodes	Ruled out
CNI/IP assignment delay	CNS assigned IPs within seconds	Ruled out
Service creation race	Framework guarantees Service before StatefulSet	Ruled out

Specific case timeline (etcd-0 in a CI namespace, May 26):

Time	Event
17:57:52	CPO reconciles `etcd-discovery` Service
17:57:53	CPO reconciles etcd StatefulSet
17:58:22	`ensure-dns` starts failing — `"no such host"`
17:58:26	CNS assigns pod IP `10.128.64.186`
18:18:54	DNS finally resolves — 20 minutes after IP assignment

The pod had its IP within 4 seconds, the Service existed 30 seconds before the pod, CoreDNS was healthy — the EndpointSlice controller simply hadn't updated the headless Service's endpoints.

Impact on E2E stability

All three stage E2E jobs show poor stability:

Job	Success Rate (last 20 runs)
`periodic-stage-e2e-parallel`	40%
`periodic-stage-e2e-parallel-ocp-nightly`	20%
`branch-...-stage-e2e-parallel`	60%

The Jira's specific failure was in branch-ci-Azure-ARO-HCP-main-e2e-stage-e2e-parallel build 2053695569343287296 (May 11 04:36 UTC), where the patch-name-cluster hit the 45-minute E2E timeout due to this etcd DNS issue.

Solution: Etcd Pod Self-Registration

Each etcd pod creates its own EndpointSlice during the ensure-dns init container phase:

etcd-discovery Service (headless, selector: app=etcd)
    |
    +-- etcd-discovery-xxxxx  (managed-by: endpointslice-controller.k8s.io)
    |   [Standard controller — may be slow/stale under high density]
    |
    +-- etcd-discovery-self-etcd-0  (managed-by: control-plane-operator)
    |   [Created by etcd-0 init container — immediate]
    |
    +-- etcd-discovery-self-etcd-1  (managed-by: control-plane-operator)
    +-- etcd-discovery-self-etcd-2  (managed-by: control-plane-operator)

Why this works:

The standard EndpointSlice controller only manages slices labeled managed-by: endpointslice-controller.k8s.io — it completely ignores slices with a different managed-by label
CoreDNS merges endpoints from ALL EndpointSlices for a given Service regardless of who created them
Once the self-registered slice exists, CoreDNS picks it up via its watch within ~100ms
When the standard controller catches up, duplicate IPs are harmless for DNS
OwnerReference to the Pod ensures cleanup on pod deletion

Precedent: The HCCO already uses this pattern for the kubernetes EndpointSlice in the guest cluster (ReconcileKASEndpointSlice in hostedclusterconfigoperator/controllers/resources/kas/reconcile.go).

Changed files

File	Change
`dnsresolver/cmd.go`	Added `--self-register` flag and `selfRegisterEndpointSlice()` — creates EndpointSlice via K8s API before polling DNS, falls back to DNS-only on failure
`dnsresolver/cmd_test.go`	Unit tests for DNS name parsing
`v2/assets/etcd/statefulset.yaml`	Added `POD_IP` env var and `--self-register` flag to `ensure-dns` init container
`v2/assets/etcd/etcd-serviceaccount.yaml`	New `etcd` service account (always created)
`v2/assets/etcd/etcd-self-register-role.yaml`	Role granting EndpointSlice get/create/update and pod get
`v2/assets/etcd/etcd-self-register-rolebinding.yaml`	Binds role to `etcd` SA
`v2/assets/etcd/defrag-rolebinding.yaml`	Updated subject from `etcd-defrag-controller` to `etcd` SA
`v2/etcd/component.go`	Registered new RBAC assets, removed old defrag SA
`v2/etcd/statefulset.go`	Always set `ServiceAccountName = "etcd"` (not conditionally for HA)
`manifests/etcd.go`	Added `EtcdServiceAccount`, `EtcdSelfRegisterRole`, `EtcdSelfRegisterRoleBinding` helpers

Risks and mitigations

Risk	Mitigation
Duplicate endpoints when controller catches up	CoreDNS deduplicates; harmless for DNS resolution
Self-registered EndpointSlice outlives pod	OwnerReference to Pod triggers GC on pod deletion
RBAC escalation	Scoped to get/create/update EndpointSlices + get pods in own namespace only
Init container fails to create EndpointSlice	Falls back to DNS-only behavior with a warning log

Test plan

Unit tests for DNS name parsing (dnsresolver/cmd_test.go)
Existing etcd component tests pass (14/14)
make build compiles cleanly
make lint passes with 0 issues
E2E validation in staging environment
Verify EndpointSlice creation in a live HCP namespace
Verify DNS resolution timing improvement

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Etcd pods can self-register EndpointSlices via an optional DNS-resolver self-register mode (warns and falls back to DNS-only on failure).
- RBAC and ServiceAccount updates to support etcd self-registration and unify controller bindings; defrag controller now uses the standardized etcd ServiceAccount.
- Init behavior updated to expose pod IP for self-registration.
Tests
- Added unit tests for DNS service-name parsing and EndpointSlice reconciliation.

openshift-merge-bot · 2026-05-27T20:54:35Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

openshift-ci · 2026-05-27T20:54:37Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci-robot · 2026-05-27T20:54:37Z

@cssjr: This pull request references AROSLSRE-830 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary

Adds EndpointSlice self-registration to the etcd ensure-dns init container, bypassing stale kube-controller-manager informer caches that delay DNS resolution under high HCP density

Creates a dedicated etcd service account with minimal RBAC for EndpointSlice and pod get permissions

Replaces the conditional etcd-defrag-controller SA with the always-present etcd SA, binding both self-register and defrag (HA-only) roles to it

Problem

Under high cluster density (~2,500 HCP namespaces), the management cluster's kube-controller-manager EndpointSlice informer cache goes stale, causing etcd-discovery headless Service DNS records to be delayed by minutes to hours. The ensure-dns init container blocks waiting for DNS, preventing etcd from starting, which cascades into cluster creation timeouts.

Evidence from Kusto (HostedControlPlaneLogs, hcp-dev-us-2.eastus2)

Scale: Hundreds to thousands of clusters affected daily across CI:
Date         Clusters   Failures   Prolonged(>5m)   Brief(<=5m)
2026-05-13        675     15,947           37            638
2026-05-14      1,210     24,869          142          1,063
2026-05-19      1,542     29,622          248          1,297
2026-05-20      1,228     22,563          180          1,039
2026-05-21      1,311     23,130          174          1,135
2026-05-26        980     17,884          138            842
Breakdown (last 3 days): Of 2,299 etcd pods that experienced DNS failures, 89.5% resolved within 30 seconds, but 8.1% (187 pods) took >5 minutes — these are the ones that cause E2E test timeouts.

Root Cause: EndpointSlice informer cache staleness

Confirmed via aksEvents table (category kube-controller-manager) on May 14:
"Error syncing endpoint slices for service, retrying"
 key="ocm-arohcpprow-<cluster-id>/etcd-discovery"
 err="EndpointSlice informer cache is out of date"
This appeared for dozens of HCP namespaces — individual namespaces saw 136–227 sync errors. A secondary pattern ("skipping Pod for Service, Node not found") shows the same informer staleness affecting node visibility, with top nodes seeing 330–557 skip events.

What's ruled out:

Hypothesis Evidence Verdict

CoreDNS unhealthy No CoreDNS errors in logs Ruled out

Node-specific issue Failures evenly distributed across all nodes Ruled out

CNI/IP assignment delay CNS assigned IPs within seconds Ruled out

Service creation race Framework guarantees Service before StatefulSet Ruled out

Specific case timeline (etcd-0 in a CI namespace, May 26):

Time Event

17:57:52 CPO reconciles etcd-discovery Service

17:57:53 CPO reconciles etcd StatefulSet

17:58:22 ensure-dns starts failing — "no such host"

17:58:26 CNS assigns pod IP 10.128.64.186

18:18:54 DNS finally resolves — 20 minutes after IP assignment

The pod had its IP within 4 seconds, the Service existed 30 seconds before the pod, CoreDNS was healthy — the EndpointSlice controller simply hadn't updated the headless Service's endpoints.

Impact on E2E stability

All three stage E2E jobs show poor stability:

Job Success Rate (last 20 runs)

periodic-stage-e2e-parallel 40%

periodic-stage-e2e-parallel-ocp-nightly 20%

branch-...-stage-e2e-parallel 60%

The Jira's specific failure was in branch-ci-Azure-ARO-HCP-main-e2e-stage-e2e-parallel build 2053695569343287296 (May 11 04:36 UTC), where the patch-name-cluster hit the 45-minute E2E timeout due to this etcd DNS issue.

Solution: Etcd Pod Self-Registration

Each etcd pod creates its own EndpointSlice during the ensure-dns init container phase:
etcd-discovery Service (headless, selector: app=etcd)
   |
   +-- etcd-discovery-xxxxx  (managed-by: endpointslice-controller.k8s.io)
   |   [Standard controller — may be slow/stale under high density]
   |
   +-- etcd-discovery-self-etcd-0  (managed-by: control-plane-operator)
   |   [Created by etcd-0 init container — immediate]
   |
   +-- etcd-discovery-self-etcd-1  (managed-by: control-plane-operator)
   +-- etcd-discovery-self-etcd-2  (managed-by: control-plane-operator)
Why this works:

The standard EndpointSlice controller only manages slices labeled managed-by: endpointslice-controller.k8s.io — it completely ignores slices with a different managed-by label

CoreDNS merges endpoints from ALL EndpointSlices for a given Service regardless of who created them

Once the self-registered slice exists, CoreDNS picks it up via its watch within ~100ms

When the standard controller catches up, duplicate IPs are harmless for DNS

OwnerReference to the Pod ensures cleanup on pod deletion

Precedent: The HCCO already uses this pattern for the kubernetes EndpointSlice in the guest cluster (ReconcileKASEndpointSlice in hostedclusterconfigoperator/controllers/resources/kas/reconcile.go).

Changed files

File Change

dnsresolver/cmd.go Added --self-register flag and selfRegisterEndpointSlice() — creates EndpointSlice via K8s API before polling DNS, falls back to DNS-only on failure

dnsresolver/cmd_test.go Unit tests for DNS name parsing

v2/assets/etcd/statefulset.yaml Added POD_IP env var and --self-register flag to ensure-dns init container

v2/assets/etcd/etcd-serviceaccount.yaml New etcd service account (always created)

v2/assets/etcd/etcd-self-register-role.yaml Role granting EndpointSlice get/create/update and pod get

v2/assets/etcd/etcd-self-register-rolebinding.yaml Binds role to etcd SA

v2/assets/etcd/defrag-rolebinding.yaml Updated subject from etcd-defrag-controller to etcd SA

v2/etcd/component.go Registered new RBAC assets, removed old defrag SA

v2/etcd/statefulset.go Always set ServiceAccountName = "etcd" (not conditionally for HA)

manifests/etcd.go Added EtcdServiceAccount, EtcdSelfRegisterRole, EtcdSelfRegisterRoleBinding helpers

Risks and mitigations

Risk Mitigation

Duplicate endpoints when controller catches up CoreDNS deduplicates; harmless for DNS resolution

Self-registered EndpointSlice outlives pod OwnerReference to Pod triggers GC on pod deletion

RBAC escalation Scoped to get/create/update EndpointSlices + get pods in own namespace only

Init container fails to create EndpointSlice Falls back to DNS-only behavior with a warning log

Test plan

Unit tests for DNS name parsing (dnsresolver/cmd_test.go)

Existing etcd component tests pass (14/14)

make build compiles cleanly

make lint passes with 0 issues

E2E validation in staging environment

Verify EndpointSlice creation in a live HCP namespace

Verify DNS resolution timing improvement

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai · 2026-05-27T20:54:49Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

This PR implements etcd self-registration: adds manifest constructors and YAML assets (ServiceAccount, etcd-self-register Role/RoleBinding; updates defrag RoleBinding subject), updates the StatefulSet initContainer to pass --self-register and POD_IP and uses the unified etcd ServiceAccount, updates component wiring, and extends dnsresolver with a --self-register flag plus selfRegisterEndpointSlice/ensureEndpointSlice/parseServiceName and unit tests.

Sequence Diagram(s)

sequenceDiagram
  participant dnsresolver
  participant KubeAPI
  participant EndpointSlice
  participant DNS

  dnsresolver->>KubeAPI: ensureEndpointSlice(create/update EndpointSlice for pod IP and targetRef)
  KubeAPI->>EndpointSlice: persist EndpointSlice
  dnsresolver->>DNS: resolveDNS(dnsName)
  DNS-->>dnsresolver: resolution results

Suggested reviewers

patjlm

🚥 Pre-merge checks | ✅ 10 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Test Structure And Quality	⚠️ Warning	Tests lack meaningful assertion failure messages (criterion `#4`). 26+ assertions use bare Expect() calls without messages; custom check requires context messages per documented examples.	Add failure context messages to all assertions: e.g., Expect(err).NotTo(HaveOccurred(), "failed to ensure EndpointSlice") and Expect(slice.Labels[...]).To(Equal(...), "service name label mismatch").

✅ Passed checks (10 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly describes the main change: adding etcd EndpointSlice self-registration to resolve DNS delays, which aligns with the core purpose of the PR.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	All test names added in PR are stable, deterministic. No dynamic values in titles—pod names, UUIDs, timestamps, node names, IPs appear only in test bodies, not names.
Topology-Aware Scheduling Compatibility	✅ Passed	No topology-unaware scheduling constraints found in statefulset, RBAC, PDB, or code. Defrag controller is topology-aware and only runs on HighlyAvailable clusters.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	PR adds standard Go unit tests (testing package), not Ginkgo e2e tests. Custom check applies only to Ginkgo e2e tests, which are absent.
No-Weak-Crypto	✅ Passed	No weak cryptography found. PR adds etcd EndpointSlice self-registration using only standard Kubernetes APIs with no cryptographic operations, weak algorithms, or insecure comparisons.
Container-Privileges	✅ Passed	No privileged containers, host access, elevated capabilities, or inappropriate privilege escalation settings found. All RBAC is properly scoped with minimal necessary permissions.
No-Sensitive-Data-In-Logs	✅ Passed	No passwords, tokens, API keys, PII, session IDs, or customer data found in logging statements. Infrastructure metadata (pod IPs, DNS names) are logged but not classified as sensitive.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@dnsresolver/cmd.go`:
- Around line 150-164: The Get call in selfRegisterEndpointSlice currently
treats any error as "not found" and proceeds to Create; change the error
handling to return immediately for non-NotFound errors from
client.DiscoveryV1().EndpointSlices(...).Get(ctx, sliceName, ...). Specifically,
import and use apierrors.IsNotFound(err) (k8s.io/apimachinery/pkg/api/errors)
and: if err == nil keep the Update path, else if apierrors.IsNotFound(err)
perform the Create path, otherwise return the original Get error (wrap with
fmt.Errorf). This prevents masking RBAC/timeout errors while preserving the
existing Update/Create logic and messages.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 64afd625-994c-4a36-ad76-4ecbbb5a19ca

📥 Commits

Reviewing files that changed from the base of the PR and between 0aa20fc and 87cf121.

📒 Files selected for processing (10)

control-plane-operator/controllers/hostedcontrolplane/manifests/etcd.go
control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/defrag-rolebinding.yaml
control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/etcd-self-register-role.yaml
control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/etcd-self-register-rolebinding.yaml
control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/etcd-serviceaccount.yaml
control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/statefulset.yaml
control-plane-operator/controllers/hostedcontrolplane/v2/etcd/component.go
control-plane-operator/controllers/hostedcontrolplane/v2/etcd/statefulset.go
dnsresolver/cmd.go
dnsresolver/cmd_test.go

codecov · 2026-05-27T21:03:46Z

Codecov Report

❌ Patch coverage is 49.67320% with 77 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.28%. Comparing base (2f52041) to head (824480d).
⚠️ Report is 137 commits behind head on main.

Files with missing lines	Patch %	Lines
dnsresolver/cmd.go	65.51%	38 Missing and 2 partials ⚠️
...r/controllers/hostedcontrolplane/manifests/etcd.go	0.00%	21 Missing ⚠️
...ontrollers/hostedcontrolplane/v2/etcd/component.go	0.00%	13 Missing ⚠️
...trollers/hostedcontrolplane/v2/etcd/statefulset.go	0.00%	3 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #8613      +/-   ##
==========================================
+ Coverage   40.61%   41.28%   +0.67%     
==========================================
  Files         755      755              
  Lines       93227    93597     +370     
==========================================
+ Hits        37864    38642     +778     
+ Misses      52640    52221     -419     
- Partials     2723     2734      +11

Files with missing lines	Coverage Δ
...trollers/hostedcontrolplane/v2/etcd/statefulset.go	`40.16% <0.00%> (-0.67%)`	⬇️
...ontrollers/hostedcontrolplane/v2/etcd/component.go	`10.20% <0.00%> (-3.69%)`	⬇️
...r/controllers/hostedcontrolplane/manifests/etcd.go	`45.83% <0.00%> (-7.83%)`	⬇️
dnsresolver/cmd.go	`53.52% <65.51%> (+53.52%)`	⬆️

... and 40 files with indirect coverage changes

Flag	Coverage Δ
cmd-support	`34.86% <ø> (+0.16%)`	⬆️
cpo-hostedcontrolplane	`43.41% <0.00%> (+1.63%)`	⬆️
cpo-other	`42.79% <ø> (+1.72%)`	⬆️
hypershift-operator	`51.00% <ø> (+0.24%)`	⬆️
other	`32.04% <65.51%> (+0.45%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…ntainer Under high HCP density (~2,500 namespaces), the management cluster's kube-controller-manager EndpointSlice informer cache goes stale, delaying etcd-discovery headless Service DNS records by minutes to hours. This causes etcd pods to block in the ensure-dns init container, preventing cluster creation and triggering E2E test timeouts. Each etcd pod now self-registers its IP into a custom EndpointSlice (managed-by: control-plane-operator) before polling DNS. The standard EndpointSlice controller ignores slices with a different managed-by label, so there is no conflict. CoreDNS picks up the custom slice via its watch within ~100ms, making DNS resolution near-instant regardless of controller informer health. A new dedicated "etcd" service account replaces the conditional "etcd-defrag-controller" SA, with RBAC granting EndpointSlice and pod get permissions. The defrag Role/RoleBinding remain HA-only. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Use apierrors.IsNotFound to only fall through to Create when the EndpointSlice genuinely doesn't exist. RBAC, timeout, and other errors are now returned immediately instead of being masked by a subsequent Create attempt. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Refactor selfRegisterEndpointSlice to extract testable ensureEndpointSlice function that accepts a kubernetes.Interface, enabling tests with the fake client. Tests cover: create, update, IPv6, missing pod, invalid DNS name, multiple pods, and field correctness. Coverage: ensureEndpointSlice 88.9%, parseServiceName 100%. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

cssjr · 2026-05-27T21:22:57Z

/cc patjlm

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

dnsresolver/cmd.go (1)
177-182: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Reject malformed non-service DNS names before self-registering.

parseServiceName only checks the label count and then trusts parts[1]. With --self-register, an input like foo.bar.baz would still create or update an EndpointSlice for service bar in the current namespace instead of failing fast. Please validate the documented <pod>.<service>.<namespace>.svc[.cluster.local] shape here before returning a service name.
Proposed fix
 func parseServiceName(dnsName string) (string, error) {
 	parts := strings.Split(dnsName, ".")
-	if len(parts) < 3 {
-		return "", fmt.Errorf("expected at least 3 dot-separated components, got %d", len(parts))
+	if len(parts) < 4 || parts[0] == "" || parts[1] == "" || parts[2] == "" || parts[3] != "svc" {
+		return "", fmt.Errorf("expected <pod>.<service>.<namespace>.svc[.cluster.local], got %q", dnsName)
 	}
 	return parts[1], nil
 }
As per coding guidelines, "Validate at trust boundaries with allow-lists, not deny-lists".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@dnsresolver/cmd.go` around lines 177 - 182, parseServiceName currently only
checks label count and trusts parts[1]; instead validate the full DNS shape
pod.service.namespace.svc[.cluster.local] before returning the service. In
parseServiceName, split on '.', ensure no empty labels, require parts[3] ==
"svc", allow either len(parts)==4 or len(parts)==6 with parts[4]=="cluster" &&
parts[5]=="local", and only then return parts[1]; otherwise return a descriptive
error rejecting the malformed name.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@dnsresolver/cmd.go`:
- Around line 177-182: parseServiceName currently only checks label count and
trusts parts[1]; instead validate the full DNS shape
pod.service.namespace.svc[.cluster.local] before returning the service. In
parseServiceName, split on '.', ensure no empty labels, require parts[3] ==
"svc", allow either len(parts)==4 or len(parts)==6 with parts[4]=="cluster" &&
parts[5]=="local", and only then return parts[1]; otherwise return a descriptive
error rejecting the malformed name.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 11fd82fa-916a-486b-a98c-7ddd6ed5ad67

📥 Commits

Reviewing files that changed from the base of the PR and between b877681 and 30ffd78.

📒 Files selected for processing (2)

dnsresolver/cmd.go
dnsresolver/cmd_test.go

cssjr · 2026-05-27T21:30:18Z

/cc @garrickdabbs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

cssjr · 2026-05-27T21:55:48Z

/test all

cssjr · 2026-05-27T21:57:07Z

/pipeline required

openshift-merge-bot · 2026-05-27T21:57:10Z

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

cwbotbot · 2026-05-28T00:05:03Z

Test Results

e2e-aws

Status: ✅ PASS
Started: 2026-06-03T22:13:25Z
View Job
View Job History

e2e-aks

Status: ✅ PASS
Started: 2026-06-03T19:10:46Z
View Job
View Job History

cssjr · 2026-05-29T17:55:46Z

✅ All PR feedback addressed in commit `6597d3d`

All blocking issues and suggestions from @bryan-cox have been implemented:

Blocking Issues Fixed

ServiceAccount backwards compatibility ✅
- Kept etcd-defrag-controller SA (HA-only) separate from etcd SA (always present)
- Restored defrag-serviceaccount.yaml to avoid orphaned SAs on upgrade
- Added etcd-self-register-rolebinding-defrag.yaml to bind self-register role to defrag SA in HA mode
- Updated defrag-rolebinding.yaml to point to etcd-defrag-controller
- Updated statefulset.go to conditionally use correct SA
Fully qualified managed-by label ✅
- Changed to "etcd-self-register.hypershift.openshift.io"

Suggestions Implemented

Explicit GC fields in OwnerReference ✅
- Added Controller: ptr.To(true) and BlockOwnerDeletion: ptr.To(true)
SA usage documentation ✅
- Added comprehensive comment explaining HA vs non-HA SA usage
Test context handling ✅
- Replaced all 12 instances of context.Background() with t.Context()
Better error assertions ✅
- Used gomega's MatchError matcher (2 instances)
Cleaner IP generation ✅
- Replaced string(rune('1'+i)) with fmt.Sprintf("10.128.64.%d", 186+i)

Verification

✅ All tests pass
✅ make build succeeds
✅ make lint passes
✅ make pre-commit passes
✅ Commit message follows HyperShift conventions

All review comments have been resolved. Ready for re-review! 🚀

cssjr · 2026-06-01T13:34:29Z

/retest

bryan-cox · 2026-06-02T12:21:00Z

/approve

cssjr · 2026-06-02T14:14:42Z

/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

galchammat · 2026-06-02T18:51:03Z

/lgtm

openshift-ci · 2026-06-02T18:51:17Z

@galchammat: changing LGTM is restricted to collaborators

Details

In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

cssjr · 2026-06-02T20:34:59Z

/retest-required

@enxebre

Address feedback from @enxebre to document why --self-register is needed and improve IP address parsing. Changes: - Added comment in statefulset.yaml explaining --self-register flag purpose (bypasses stale EndpointSlice cache under high HCP density) - Added comment in dnsresolver/cmd.go documenting the flag - Replaced manual net.ParseIP().To4() check with netutil.IsIPv4Address() which properly handles ParseIP returning nil and returns an error instead of silently failing Signed-off-by: Cliff Schomburg <7424213+cssjr@users.noreply.github.com> Commit-Message-Assisted-by: Claude (via Claude Code)

cssjr · 2026-06-03T14:13:35Z

✅ Addressed feedback from @enxebre in commit `c8b0fa1`

All 4 comments have been addressed:

1. Added comment in statefulset.yaml (line 192) ✅

Added a 3-line comment explaining why --self-register is needed:

# Self-register this pod's IP into an EndpointSlice before waiting for DNS resolution.
# This bypasses stale kube-controller-manager EndpointSlice cache issues under high HCP density
# (2,500+ namespaces), where the standard controller can delay DNS updates by minutes to hours.

2. Added comment in cmd.go (line 46) ✅

Added a 3-line comment documenting the --self-register flag:

// --self-register creates an EndpointSlice for this pod before waiting for DNS resolution.
// This bypasses stale kube-controller-manager EndpointSlice cache issues under high HCP density,
// where the standard controller can delay DNS updates by minutes to hours.

3. Fixed IP parsing to handle nil and avoid silent failure (line 112) ✅

Replaced manual net.ParseIP().To4() check with netutil.IsIPv4Address() which:

Returns an error if ParseIP returns nil (no more silent failures)
Properly handles the To4() check
Follows existing codebase patterns

Before:

addressType := discoveryv1.AddressTypeIPv4
if net.ParseIP(podIP) != nil && net.ParseIP(podIP).To4() == nil {
    addressType = discoveryv1.AddressTypeIPv6
}

After:

isIPv4, err := netutil.IsIPv4Address(podIP)
if err != nil {
    return fmt.Errorf("failed to parse pod IP %s: %w", podIP, err)
}
addressType := discoveryv1.AddressTypeIPv6
if isIPv4 {
    addressType = discoveryv1.AddressTypeIPv4
}

4. Used `support/netutil.IsIPv4Address()` ✅

Implemented as part of #3 above.

Verification

✅ All tests pass
✅ make build succeeds
✅ make lint-fix applied
✅ make pre-commit passes

Add test to verify that ensureEndpointSlice properly returns an error when given a malformed IP address (not IPv4 or IPv6). This ensures the netutil.IsIPv4Address() error handling works correctly and doesn't silently fail on invalid input. Signed-off-by: Cliff Schomburg <7424213+cssjr@users.noreply.github.com> Commit-Message-Assisted-by: Claude (via Claude Code)

cssjr · 2026-06-03T15:34:29Z

/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

hypershift-jira-solve-ci · 2026-06-03T17:27:47Z

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2062196402125017088 | Cost: $4.831960000000001 | Failed step: hypershift-azure-run-e2e

View full analysis report

_{Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6}

cssjr · 2026-06-03T19:10:29Z

/retest-required

hypershift-jira-solve-ci · 2026-06-03T21:29:08Z

No symptom labels available. Now I have all the evidence needed for the report:

Test Failure Analysis Complete

Job Information

Prow Job: pull-ci-openshift-hypershift-main-e2e-aws
Build ID: 2062250617467834368
Target: e2e-aws
PR: AROSLSRE-830: Add etcd EndpointSlice self-registration to fix DNS resolution delays #8613 (AROSLSRE-830: Add etcd EndpointSlice self-registration to fix DNS resolution delays)
Result: 601 tests, 25 skipped, 5 failures

Test Failure Analysis

Error

TestCreateCluster/Main/EnsureGlobalPullSecret/When_management-cluster_hostedCluster.Spec.PullSecret_is_updated_in-place_it_should_propagate_to_guest_without_rollout (1205.12s)

globalps.go:217:
    Expected success, but got an error:
        failed to wait for DaemonSet global-pull-secret-syncer to be ready: context deadline exceeded

Summary

The TestCreateCluster test failed because the global-pull-secret-syncer DaemonSet in the hosted cluster remained stuck at 2/3 pods ready for over 20 minutes (1205s), exceeding the test's timeout. This was caused by repeated DaemonSet rolling updates (the DaemonSet went through 6 generations with at least 12 distinct pod instances) triggered by the pull secret patch in the EnsureGlobalPullSecret subtest. The maxUnavailable: 1 rolling update strategy means only 2 of 3 pods can be ready at any point during a rollout, and back-to-back generation changes prevented the DaemonSet from ever reaching 3/3 within the test's deadline. All other 596 passing tests confirm the cluster was healthy. This failure is unrelated to PR #8613, which only modifies etcd StatefulSet, ServiceAccount, RBAC, and DNS resolver code — none of the global-pull-secret-syncer or EnsureGlobalPullSecret test code was changed.

Root Cause

The failure chain is:

Primary failure: TestCreateCluster/Main/EnsureGlobalPullSecret/When_management-cluster_hostedCluster.Spec.PullSecret_is_updated_in-place_it_should_propagate_to_guest_without_rollout — The test patches the management-cluster pull secret with a dummy auth entry. This triggers the HyperShift control-plane-operator to detect the pull secret change, recompute the globalps-config-hash label, and update the global-pull-secret-syncer DaemonSet spec. The DaemonSet's generation climbed from 1 to 6 (6 spec updates), spawning at least 12 distinct pod instances (m8qr9, whnvd, rc7xl, jdsn6, d5gw9, lm584, qstsc, q7t89, 2dvzz, q95dk, b8lnz, hhfxk, ln5hm). With maxUnavailable: 1, at most 2/3 pods are ready during any rolling update cycle. Back-to-back rolling updates from repeated spec changes meant the DaemonSet never stabilized at 3/3 within the 20-minute test deadline.
Cascade failure: TestCreateCluster/Main/EnsureGlobalPullSecret/Check_if_the_config.json_is_correct_in_all_of_the_nodes — This subtest tried to create the kubelet-config-verifier DaemonSet, but it already existed from the previous (timed-out) subtest that created it before the wait. The error daemonsets.apps "kubelet-config-verifier" already exists is a direct consequence of the first failure leaving the DaemonSet behind.
Teardown condition message: The top-level TestCreateCluster reported ClusterVersionSucceeding=False: ClusterOperatorNotAvailable(Cluster operator monitoring is not available) — this was a transient 2-second condition check during post-test cleanup (hypershift_framework.go:179). The monitoring ClusterOperator was actually Available=True and Degraded=False at cluster dump time, confirming this was a brief transient during teardown, not a real monitoring failure.

Why this is unrelated to PR #8613: The PR modifies only etcd-related files (etcd StatefulSet, ServiceAccount, RBAC roles/bindings, and dnsresolver/cmd.go). It does not touch the global-pull-secret-syncer DaemonSet, the control-plane-operator's pull secret reconciliation logic, or any of the EnsureGlobalPullSecret test code. This is a pre-existing flaky test scenario where rapid DaemonSet rolling updates under the pull secret mutation test can exceed the hardcoded timeout.

Recommendations

Rerun the CI job — This is a pre-existing flake unrelated to the PR's etcd changes. A /retest should pass.
File a bug for the EnsureGlobalPullSecret test flake — The test's DaemonSet readiness wait does not account for back-to-back rolling updates triggered by the pull secret patch. The timeout (≈20 minutes) is insufficient when the DaemonSet goes through multiple generations. Potential fixes:
- Wait for the DaemonSet's observedGeneration to match generation before checking pod readiness
- Add a stabilization check that ensures no rolling update is in progress before starting the readiness poll
- Clean up the kubelet-config-verifier DaemonSet between subtests to prevent the cascade failure
No changes needed to PR AROSLSRE-830: Add etcd EndpointSlice self-registration to fix DNS resolution delays #8613 — The etcd EndpointSlice self-registration changes are completely orthogonal to this failure.

Evidence

Evidence	Detail
Failing test	`TestCreateCluster/Main/EnsureGlobalPullSecret/When_management-cluster_hostedCluster.Spec.PullSecret_is_updated_in-place_it_should_propagate_to_guest_without_rollout`
Failure duration	1205.12s (≈20 minutes) waiting for DaemonSet readiness
DaemonSet state during test	`global-pull-secret-syncer not ready: 2/3 pods ready` — 368 consecutive poll attempts
DaemonSet generations	6 (indicating 6 spec updates / rolling updates)
Distinct pod instances	12+ pods created across rolling updates (m8qr9, whnvd, rc7xl, jdsn6, d5gw9, lm584, qstsc, q7t89, 2dvzz, q95dk, b8lnz, hhfxk, ln5hm)
DaemonSet state after test	3/3 ready (recovered after timeout)
Monitoring operator at dump time	`Available=True`, `Degraded=False` (healthy)
Cascade failure	`kubelet-config-verifier` DaemonSet already exists (409 Conflict)
Total test results	601 tests, 25 skipped, 5 failures (all in TestCreateCluster tree)
PR #8613 files changed	etcd StatefulSet, ServiceAccount, RBAC, dnsresolver — none touch pull-secret or test code
Cluster nodes	3 nodes (ip-10-0-14-72, ip-10-0-29-245, ip-10-0-39-68), all Ready with `nodepool-globalps-enabled: "true"`

csrwng

LGTM — clean implementation of the EndpointSlice self-registration pattern.

RBAC is well-scoped (get/create/update EndpointSlices + get Pods), the custom managed-by label avoids conflict with the standard controller, and the OwnerReference to the Pod ensures proper GC. The backwards-compatible SA split (keeping etcd-defrag-controller for HA, adding etcd for all topologies) addresses the upgrade concern. Tests cover create, update, IPv6, error cases, and multi-pod scenarios.

openshift-merge-bot · 2026-06-03T21:54:07Z

Tests from second stage were triggered manually. Pipeline can be controlled only manually, until HEAD changes. Use command to trigger second stage.

openshift-ci · 2026-06-03T21:54:32Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bryan-cox, csrwng, cssjr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [bryan-cox,csrwng]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

cssjr · 2026-06-03T22:13:11Z

/retest-required

cssjr · 2026-06-04T14:14:15Z

/test e2e-azure-v2-self-managed

openshift-ci · 2026-06-04T17:12:48Z

@cssjr: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 27, 2026

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 27, 2026

openshift-ci Bot added the do-not-merge/needs-area label May 27, 2026

openshift-ci Bot added area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release and removed do-not-merge/needs-area labels May 27, 2026

coderabbitai Bot reviewed May 27, 2026

View reviewed changes

Comment thread dnsresolver/cmd.go

cssjr force-pushed the etcd-endpointslice-self-register branch from 87cf121 to 39ffc19 Compare May 27, 2026 21:13

cssjr and others added 2 commits May 27, 2026 14:16

openshift-ci Bot requested a review from patjlm May 27, 2026 21:22

coderabbitai Bot reviewed May 27, 2026

View reviewed changes

openshift-ci Bot requested a review from garrickdabbs May 27, 2026 21:30

fix(etcd): replace deprecated fake.NewSimpleClientset with NewClientset

8db6372

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

hypershift-jira-solve-ci Bot mentioned this pull request May 28, 2026

OCPBUGS-86661: Konnectivity retry proxy connection on timeout #8579

Open

cssjr marked this pull request as ready for review May 28, 2026 14:29

openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 28, 2026

openshift-ci Bot requested review from devguyio and sjenning May 28, 2026 14:30

cssjr requested a review from bryan-cox May 29, 2026 18:06

hypershift-jira-solve-ci Bot mentioned this pull request Jun 1, 2026

CNTRLPLANE-2207: Upgrade to CAPI 1.11 #8594

Merged

12 tasks

hypershift-jira-solve-ci Bot mentioned this pull request Jun 1, 2026

CNTRLPLANE-3524: feat: add etcd data re-encryption after encryption key rotation #8219

Open

7 tasks

openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 2, 2026

enxebre reviewed Jun 3, 2026

View reviewed changes

Comment thread control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/statefulset.yaml

enxebre reviewed Jun 3, 2026

View reviewed changes

Comment thread dnsresolver/cmd.go

enxebre reviewed Jun 3, 2026

View reviewed changes

Comment thread dnsresolver/cmd.go Outdated

cssjr requested a review from enxebre June 3, 2026 15:33

csrwng approved these changes Jun 3, 2026

View reviewed changes

openshift-ci Bot assigned csrwng Jun 3, 2026

openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 3, 2026

Conversation

cssjr commented May 27, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Evidence from Kusto (HostedControlPlaneLogs, hcp-dev-us-2.eastus2)

Root Cause: EndpointSlice informer cache staleness

Impact on E2E stability

Solution: Etcd Pod Self-Registration

Changed files

Risks and mitigations

Test plan

Summary by CodeRabbit

Uh oh!

openshift-merge-bot Bot commented May 27, 2026

Uh oh!

openshift-ci Bot commented May 27, 2026

Uh oh!

openshift-ci-robot commented May 27, 2026 • edited by openshift-ci Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Evidence from Kusto (HostedControlPlaneLogs, hcp-dev-us-2.eastus2)

Root Cause: EndpointSlice informer cache staleness

Impact on E2E stability

Solution: Etcd Pod Self-Registration

Changed files

Risks and mitigations

Test plan

Uh oh!

coderabbitai Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Sequence Diagram(s)

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

codecov Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

cssjr commented May 27, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cssjr commented May 27, 2026

Uh oh!

cssjr commented May 27, 2026

Uh oh!

cssjr commented May 27, 2026

Uh oh!

openshift-merge-bot Bot commented May 27, 2026

Uh oh!

cwbotbot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test Results

e2e-aws

e2e-aks

Uh oh!

cssjr commented May 29, 2026

✅ All PR feedback addressed in commit 6597d3d

Blocking Issues Fixed

Suggestions Implemented

Verification

Uh oh!

cssjr commented Jun 1, 2026

Uh oh!

bryan-cox commented Jun 2, 2026

Uh oh!

cssjr commented Jun 2, 2026

Uh oh!

galchammat commented Jun 2, 2026

Uh oh!

openshift-ci Bot commented Jun 2, 2026

cssjr commented May 27, 2026 •

edited by coderabbitai Bot

Loading

openshift-ci-robot commented May 27, 2026 •

edited by openshift-ci Bot

Loading

coderabbitai Bot commented May 27, 2026 •

edited

Loading

codecov Bot commented May 27, 2026 •

edited

Loading

cwbotbot commented May 28, 2026 •

edited

Loading

✅ All PR feedback addressed in commit `6597d3d`

✅ Addressed feedback from @enxebre in commit `c8b0fa1`

4. Used `support/netutil.IsIPv4Address()` ✅

hypershift-jira-solve-ci Bot commented Jun 3, 2026 •

edited by openshift-ci Bot

Loading