Skip to content

OCPBUGS-78557: Fix admin context race condition in parallel tests#1256

Merged
openshift-merge-bot[bot] merged 1 commit intoopenshift:mainfrom
kuiwang02:enhancerace
Mar 17, 2026
Merged

OCPBUGS-78557: Fix admin context race condition in parallel tests#1256
openshift-merge-bot[bot] merged 1 commit intoopenshift:mainfrom
kuiwang02:enhancerace

Conversation

@kuiwang02
Copy link
Contributor

/cc @jianzhangbjz @Xia-Zhao-rh @bandrade

Problem

Intermittent test failures where AsAdmin() uses wrong service account instead of cluster-admin credentials.

Affected tests: PolarionID:25757, PolarionID:24027

Error symptom:
User "system:serviceaccount:ci-op-xxx:xxx" cannot get/delete resource

Root Cause

Race condition: concurrent tests share the same admin kubeconfig file. When one test changes current-context, it affects other tests.

Solution

Double insurance mechanism:

  1. Per-test temporary kubeconfig - Each CLI instance gets isolated copy
  2. Explicit --context=admin - Ensures correct context regardless of current-context

@openshift-ci-robot
Copy link

@kuiwang02: An error was encountered searching for bug OCPQE-31804 on the Jira server at https://issues.redhat.com. No known errors were detected, please see the full error message for details.

Full error message. request failed. Please analyze the request body for more details. Status code: 403:

Please contact an administrator to resolve this issue, then request a bug refresh with /jira refresh.

Details

In response to this:

/cc @jianzhangbjz @Xia-Zhao-rh @bandrade

Problem

Intermittent test failures where AsAdmin() uses wrong service account instead of cluster-admin credentials.

Affected tests: PolarionID:25757, PolarionID:24027

Error symptom:
User "system:serviceaccount:ci-op-xxx:xxx" cannot get/delete resource

Root Cause

Race condition: concurrent tests share the same admin kubeconfig file. When one test changes current-context, it affects other tests.

Solution

Double insurance mechanism:

  1. Per-test temporary kubeconfig - Each CLI instance gets isolated copy
  2. Explicit --context=admin - Ensures correct context regardless of current-context

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Copy link

coderabbitai bot commented Mar 16, 2026

Walkthrough

Adds per-test temporary admin kubeconfig handling and global lazy caching for the admin context; updates CLI flows to use the temp config and cached context. Introduces safe file-duplication helpers that return errors (and a fallback returning empty string) and refactors public duplication functions to delegate to them.

Changes

Cohort / File(s) Summary
Admin Context Caching & CLI changes
tests-extension/test/qe/util/client.go
Adds package-level globalAdmContextName and globalAdmContextOnce; extends CLI with tempAdmConfPath, admContextName, isUseAdminCxt; adds getAdminContextName and getAdminContextNameCached; updates NewCLI*, AsAdmin, SetAdminKubeconf, SetupProject, TeardownProject, AdminConfig, Run and command construction to prefer per-test temp admin kubeconfig and use cached admin context; ensures temp file creation and cleanup.
Safe File Operations
tests-extension/test/qe/util/tools.go
Adds unexported safe helpers duplicateFileToPathSafe, duplicateFileToTempSafe, duplicateFileToTempOrEmpty that return errors/empty string for initialization-safe usage; refactors DuplicateFileToPath and DuplicateFileToTemp to call safe variants and assert via Gomega in test execution paths.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
📝 Coding Plan
  • Generate coding plan for human review comments

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests-extension/test/qe/util/client.go`:
- Around line 549-555: The teardown currently removes c.tempAdmConfPath and
clears it, but SetupProject does not recreate that per-test temp admin
kubeconfig, so subsequent AsAdmin() calls fall back to shared adminConfigPath
and reintroduce the race; update SetupProject to always create and assign a
fresh temp admin kubeconfig when c.tempAdmConfPath is empty (or missing) so each
spec gets its own temp file, and ensure the code paths that rely on
c.tempAdmConfPath (e.g., AsAdmin(), any temp file creation helpers) use that
fresh value; alternatively, stop clearing c.tempAdmConfPath in the teardown and
ensure SetupProject overwrites it on each setup to guarantee per-test isolation.
- Around line 1218-1224: The code currently uses a single global cache guarded
by globalAdmContextOnce and globalAdmContextName which ignores the
kubeConfigFile parameter; change this to cache per kubeconfig path (keyed by
kubeConfigFile) instead of a single global value—replace the single
sync.Once/globalAdmContextName with a safe map (e.g., map[string]string)
protected by a sync.RWMutex or a sync.Map and store/look up the admin context by
kubeConfigFile inside the function that calls
getAdminContextName(kubeConfigFile); update callers like NewCLI and
SetAdminKubeconf to rely on the per-path cache so each distinct adminConfigPath
returns its correct context and keep the existing logging (e2e.Logf) when
caching a new entry.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 591a24f1-8c47-4026-bb83-c97e09060772

📥 Commits

Reviewing files that changed from the base of the PR and between bf71e98 and e2089cdfb86cbc7718f1b1c9107eb2038498cacf.

📒 Files selected for processing (2)
  • tests-extension/test/qe/util/client.go
  • tests-extension/test/qe/util/tools.go

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (1)
tests-extension/test/qe/util/client.go (1)

143-146: ⚠️ Potential issue | 🟠 Major

Recreate the temp admin kubeconfig for every NewCLIWithoutNamespace spec.

TeardownProject() removes tempAdmConfPath after each spec, but this constructor only creates it once. From the second spec onward, AdminConfig() and AsAdmin() fall back to the shared admin kubeconfig again unless every caller manually runs SetupProject().

💡 Possible fix
 	client.adminConfigPath = KubeConfigPath()
 	client.showInfo = true
 	// Get cached admin context name (parsed only once globally for efficiency)
 	client.admContextName = getAdminContextNameCached(client.adminConfigPath)
-	client.tempAdmConfPath = duplicateFileToTempOrEmpty(client.adminConfigPath, "admin-config-")
+
+	g.BeforeEach(func() {
+		client.tempAdmConfPath = DuplicateFileToTemp(client.adminConfigPath, "admin-config-")
+	})
 
 	return client
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests-extension/test/qe/util/client.go` around lines 143 - 146, The code
currently creates client.tempAdmConfPath once (via duplicateFileToTempOrEmpty)
so when TeardownProject deletes that file subsequent NewCLIWithoutNamespace
specs see the original shared admin kubeconfig; to fix, stop reusing a single
global tempAdmConfPath and instead create the temp admin kubeconfig inside the
NewCLIWithoutNamespace constructor: move the call to duplicateFileToTempOrEmpty
into NewCLIWithoutNamespace (or have NewCLIWithoutNamespace call a helper that
sets client.tempAdmConfPath) so each spec gets a fresh temp file, keep
client.admContextName cached if desired, and ensure TeardownProject still
removes the per-spec client.tempAdmConfPath.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests-extension/test/qe/util/client.go`:
- Around line 552-558: The cleanup that removes and clears c.tempAdmConfPath
should not run before AdminDynamicClient() is used; move the block that deletes
the temporary admin kubeconfig (the os.Remove + setting c.tempAdmConfPath = "")
out of its current location and place it at the very end of the teardown
routine, after the resource-deletion loop and after any calls that create or use
AdminDynamicClient(), so the teardown continues to use the temp admin kubeconfig
for context until all resources are deleted.
- Around line 195-202: SetAdminKubeconf currently only updates adminConfigPath
which leaves derived fields (tempAdmConfPath, admContextName and any cached
nc.configPath/nc.isUseAdminCxt) stale; update SetAdminKubeconf to also refresh
tempAdmConfPath and admContextName and recompute derived state so code paths
that choose nc.configPath = tempAdmConfPath or set
nc.isUseAdminCxt/admContextName reflect the new admin kubeconfig. Ensure the
same refresh logic is applied to the other occurrences noted (around the other
blocks referencing tempAdmConfPath, admContextName and nc.isUseAdminCxt) so
callers use the updated temp file and context.
- Around line 361-363: In SetupProject, don't let duplicateFileToTempOrEmpty
silently leave c.tempAdmConfPath empty; if
duplicateFileToTempOrEmpty(adminConfigPath, "admin-config-") returns an empty
string, fail SetupProject immediately (e.g., return an error or call the
test-failure helper) instead of leaving tempAdmConfPath unset, so AdminConfig()
won't fall back to the shared kubeconfig and reopen the race; update the
SetupProject code path that sets c.tempAdmConfPath to check the returned value
and abort the test on failure.

---

Duplicate comments:
In `@tests-extension/test/qe/util/client.go`:
- Around line 143-146: The code currently creates client.tempAdmConfPath once
(via duplicateFileToTempOrEmpty) so when TeardownProject deletes that file
subsequent NewCLIWithoutNamespace specs see the original shared admin
kubeconfig; to fix, stop reusing a single global tempAdmConfPath and instead
create the temp admin kubeconfig inside the NewCLIWithoutNamespace constructor:
move the call to duplicateFileToTempOrEmpty into NewCLIWithoutNamespace (or have
NewCLIWithoutNamespace call a helper that sets client.tempAdmConfPath) so each
spec gets a fresh temp file, keep client.admContextName cached if desired, and
ensure TeardownProject still removes the per-spec client.tempAdmConfPath.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ff9757e8-dc59-4212-90bd-08b524eeaf15

📥 Commits

Reviewing files that changed from the base of the PR and between e2089cd and 87fd105.

📒 Files selected for processing (2)
  • tests-extension/test/qe/util/client.go
  • tests-extension/test/qe/util/tools.go

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
tests-extension/test/qe/util/client.go (2)

314-319: ⚠️ Potential issue | 🟠 Major

Remove prior temp admin kubeconfig before replacing it.

Line 318 overwrites c.tempAdmConfPath without deleting the previous file, leaving credential-bearing temp kubeconfigs behind.

Proposed fix
 func (c *CLI) SetAdminKubeconf(adminKubeconf string) *CLI {
+	if c.tempAdmConfPath != "" {
+		if err := os.Remove(c.tempAdmConfPath); err != nil {
+			e2e.Logf("Failed to remove old temporary admin config file %s: %v", c.tempAdmConfPath, err)
+		}
+		c.tempAdmConfPath = ""
+	}
 	c.adminConfigPath = adminKubeconf
 	// Refresh derived fields to match new admin kubeconfig
 	c.admContextName = getAdminContextNameCached(adminKubeconf)
 	c.tempAdmConfPath = duplicateFileToTempOrEmpty(adminKubeconf, "admin-config-")
 	return c
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests-extension/test/qe/util/client.go` around lines 314 - 319, In
SetAdminKubeconf, before overwriting c.tempAdmConfPath call out the existing
temp file and remove it if non-empty to avoid leaving credential-bearing files
behind; specifically, check c.tempAdmConfPath (the prior temp file path) and
delete it (handle errors appropriately) before assigning the new value from
duplicateFileToTempOrEmpty(adminKubeconf, "admin-config-"), while still updating
c.adminConfigPath and c.admContextName as currently implemented.

1224-1231: ⚠️ Potential issue | 🟠 Major

Global admin-context cache is not keyed by kubeconfig path.

Line 1225 uses a single sync.Once, so the first resolved context is reused for all future kubeConfigFile values. That can apply the wrong --context after SetAdminKubeconf(...) changes the admin kubeconfig.

Proposed fix
-var (
-	// Global cache for admin context name to avoid repeated parsing across all test suites
-	globalAdmContextName string
-	globalAdmContextOnce sync.Once
-)
+var (
+	// Cache admin context per kubeconfig path.
+	globalAdmContextCache sync.Map // map[string]string
+)

 func getAdminContextNameCached(kubeConfigFile string) string {
-	globalAdmContextOnce.Do(func() {
-		globalAdmContextName = getAdminContextName(kubeConfigFile)
-		if globalAdmContextName != "" {
-			e2e.Logf("Cached admin context for all test suites: %s", globalAdmContextName)
-		}
-	})
-	return globalAdmContextName
+	if kubeConfigFile == "" {
+		return ""
+	}
+	if v, ok := globalAdmContextCache.Load(kubeConfigFile); ok {
+		return v.(string)
+	}
+	ctx := getAdminContextName(kubeConfigFile)
+	globalAdmContextCache.Store(kubeConfigFile, ctx)
+	if ctx != "" {
+		e2e.Logf("Cached admin context for kubeconfig %s: %s", kubeConfigFile, ctx)
+	}
+	return ctx
 }
#!/bin/bash
set -euo pipefail

# Verify singleton cache implementation and call sites
rg -n -C3 --type=go 'globalAdmContextName|globalAdmContextOnce|getAdminContextNameCached\s*\(' tests-extension/test/qe/util/client.go

# Verify mutable admin kubeconfig path is supported by API
rg -n -C3 --type=go 'func \(c \*CLI\) SetAdminKubeconf|c\.adminConfigPath =|getAdminContextNameCached\(' tests-extension/test/qe/util/client.go
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests-extension/test/qe/util/client.go` around lines 1224 - 1231, The cache
for admin context must be keyed by kubeConfigFile instead of a single
sync.Once/globalAdmContextName; replace
globalAdmContextOnce/globalAdmContextName with a threadsafe map (e.g.,
package-level map[string]string guarded by a sync.Mutex or sync.RWMutex) named
something like globalAdmContextCache, and change
getAdminContextNameCached(kubeConfigFile string) to look up the cache key, call
getAdminContextName(kubeConfigFile) only on cache-miss, store the result, and
log as before; also ensure SetAdminKubeconf (method c.SetAdminKubeconf)
invalidates or updates the corresponding cache entry (or clears the cache) when
the admin kubeconfig path changes so subsequent calls use the correct context.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@tests-extension/test/qe/util/client.go`:
- Around line 314-319: In SetAdminKubeconf, before overwriting c.tempAdmConfPath
call out the existing temp file and remove it if non-empty to avoid leaving
credential-bearing files behind; specifically, check c.tempAdmConfPath (the
prior temp file path) and delete it (handle errors appropriately) before
assigning the new value from duplicateFileToTempOrEmpty(adminKubeconf,
"admin-config-"), while still updating c.adminConfigPath and c.admContextName as
currently implemented.
- Around line 1224-1231: The cache for admin context must be keyed by
kubeConfigFile instead of a single sync.Once/globalAdmContextName; replace
globalAdmContextOnce/globalAdmContextName with a threadsafe map (e.g.,
package-level map[string]string guarded by a sync.Mutex or sync.RWMutex) named
something like globalAdmContextCache, and change
getAdminContextNameCached(kubeConfigFile string) to look up the cache key, call
getAdminContextName(kubeConfigFile) only on cache-miss, store the result, and
log as before; also ensure SetAdminKubeconf (method c.SetAdminKubeconf)
invalidates or updates the corresponding cache entry (or clears the cache) when
the admin kubeconfig path changes so subsequent calls use the correct context.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bd454d36-c581-4fcd-b110-b091651bce78

📥 Commits

Reviewing files that changed from the base of the PR and between 87fd105 and 8e0301f.

📒 Files selected for processing (2)
  • tests-extension/test/qe/util/client.go
  • tests-extension/test/qe/util/tools.go

@kuiwang02
Copy link
Contributor Author

/payload-job periodic-ci-openshift-operator-framework-olm-release-4.22-periodics-e2e-aws-ovn-proxy-extended-f2

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 16, 2026

@kuiwang02: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-operator-framework-olm-release-4.22-periodics-e2e-aws-ovn-proxy-extended-f2

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/cc56ea70-2104-11f1-8db3-624666920e17-0

@kuiwang02
Copy link
Contributor Author

/payload-job periodic-ci-openshift-operator-framework-olm-release-4.22-periodics-e2e-aws-ovn-fips-techpreview-extended-f1

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 16, 2026

@kuiwang02: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-operator-framework-olm-release-4.22-periodics-e2e-aws-ovn-fips-techpreview-extended-f1

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/ddd7cfd0-2104-11f1-8049-25ecc6f28b0c-0

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 16, 2026

@kuiwang02: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@kuiwang02
Copy link
Contributor Author

/verified by @kuiwang02

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Mar 17, 2026
@openshift-ci-robot
Copy link

@kuiwang02: This PR has been marked as verified by @kuiwang02.

Details

In response to this:

/verified by @kuiwang02

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@jianzhangbjz
Copy link
Member

/approve
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 17, 2026
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 17, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jianzhangbjz, kuiwang02

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 17, 2026
@kuiwang02
Copy link
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 17, 2026
@openshift-ci-robot
Copy link

openshift-ci-robot commented Mar 17, 2026

@kuiwang02: This pull request references OCPQE-31804 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the sub-task to target the "4.22.0" version, but no target version was set.

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-merge-bot openshift-merge-bot bot merged commit c6dfbbd into openshift:main Mar 17, 2026
16 checks passed
@kuiwang02
Copy link
Contributor Author

/jira refresh

@kuiwang02
Copy link
Contributor Author

/retitle OCPBUGS-78557: Fix admin context race condition in parallel tests

@openshift-ci openshift-ci bot changed the title OCPQE-31804: Fix admin context race condition in parallel tests OCPBUGS-78557: Fix admin context race condition in parallel tests Mar 17, 2026
@openshift-ci-robot
Copy link

@kuiwang02: Jira Issue Verification Checks: Jira Issue OCPBUGS-78557
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-78557 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

Details

In response to this:

/cc @jianzhangbjz @Xia-Zhao-rh @bandrade

Problem

Intermittent test failures where AsAdmin() uses wrong service account instead of cluster-admin credentials.

Affected tests: PolarionID:25757, PolarionID:24027

Error symptom:
User "system:serviceaccount:ci-op-xxx:xxx" cannot get/delete resource

Root Cause

Race condition: concurrent tests share the same admin kubeconfig file. When one test changes current-context, it affects other tests.

Solution

Double insurance mechanism:

  1. Per-test temporary kubeconfig - Each CLI instance gets isolated copy
  2. Explicit --context=admin - Ensures correct context regardless of current-context

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@kuiwang02 kuiwang02 deleted the enhancerace branch March 17, 2026 01:47
@kuiwang02
Copy link
Contributor Author

/cherry-pick release-4.21

@openshift-cherrypick-robot

@kuiwang02: new pull request created: #1257

Details

In response to this:

/cherry-pick release-4.21

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants