OCPBUGS-78557: Fix admin context race condition in parallel tests#1256
OCPBUGS-78557: Fix admin context race condition in parallel tests#1256openshift-merge-bot[bot] merged 1 commit intoopenshift:mainfrom
Conversation
|
@kuiwang02: An error was encountered searching for bug OCPQE-31804 on the Jira server at https://issues.redhat.com. No known errors were detected, please see the full error message for details. Full error message.
request failed. Please analyze the request body for more details. Status code: 403:
Please contact an administrator to resolve this issue, then request a bug refresh with DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
WalkthroughAdds per-test temporary admin kubeconfig handling and global lazy caching for the admin context; updates CLI flows to use the temp config and cached context. Introduces safe file-duplication helpers that return errors (and a fallback returning empty string) and refactors public duplication functions to delegate to them. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests-extension/test/qe/util/client.go`:
- Around line 549-555: The teardown currently removes c.tempAdmConfPath and
clears it, but SetupProject does not recreate that per-test temp admin
kubeconfig, so subsequent AsAdmin() calls fall back to shared adminConfigPath
and reintroduce the race; update SetupProject to always create and assign a
fresh temp admin kubeconfig when c.tempAdmConfPath is empty (or missing) so each
spec gets its own temp file, and ensure the code paths that rely on
c.tempAdmConfPath (e.g., AsAdmin(), any temp file creation helpers) use that
fresh value; alternatively, stop clearing c.tempAdmConfPath in the teardown and
ensure SetupProject overwrites it on each setup to guarantee per-test isolation.
- Around line 1218-1224: The code currently uses a single global cache guarded
by globalAdmContextOnce and globalAdmContextName which ignores the
kubeConfigFile parameter; change this to cache per kubeconfig path (keyed by
kubeConfigFile) instead of a single global value—replace the single
sync.Once/globalAdmContextName with a safe map (e.g., map[string]string)
protected by a sync.RWMutex or a sync.Map and store/look up the admin context by
kubeConfigFile inside the function that calls
getAdminContextName(kubeConfigFile); update callers like NewCLI and
SetAdminKubeconf to rely on the per-path cache so each distinct adminConfigPath
returns its correct context and keep the existing logging (e2e.Logf) when
caching a new entry.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 591a24f1-8c47-4026-bb83-c97e09060772
📥 Commits
Reviewing files that changed from the base of the PR and between bf71e98 and e2089cdfb86cbc7718f1b1c9107eb2038498cacf.
📒 Files selected for processing (2)
tests-extension/test/qe/util/client.gotests-extension/test/qe/util/tools.go
There was a problem hiding this comment.
Actionable comments posted: 3
♻️ Duplicate comments (1)
tests-extension/test/qe/util/client.go (1)
143-146:⚠️ Potential issue | 🟠 MajorRecreate the temp admin kubeconfig for every
NewCLIWithoutNamespacespec.
TeardownProject()removestempAdmConfPathafter each spec, but this constructor only creates it once. From the second spec onward,AdminConfig()andAsAdmin()fall back to the shared admin kubeconfig again unless every caller manually runsSetupProject().💡 Possible fix
client.adminConfigPath = KubeConfigPath() client.showInfo = true // Get cached admin context name (parsed only once globally for efficiency) client.admContextName = getAdminContextNameCached(client.adminConfigPath) - client.tempAdmConfPath = duplicateFileToTempOrEmpty(client.adminConfigPath, "admin-config-") + + g.BeforeEach(func() { + client.tempAdmConfPath = DuplicateFileToTemp(client.adminConfigPath, "admin-config-") + }) return client }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests-extension/test/qe/util/client.go` around lines 143 - 146, The code currently creates client.tempAdmConfPath once (via duplicateFileToTempOrEmpty) so when TeardownProject deletes that file subsequent NewCLIWithoutNamespace specs see the original shared admin kubeconfig; to fix, stop reusing a single global tempAdmConfPath and instead create the temp admin kubeconfig inside the NewCLIWithoutNamespace constructor: move the call to duplicateFileToTempOrEmpty into NewCLIWithoutNamespace (or have NewCLIWithoutNamespace call a helper that sets client.tempAdmConfPath) so each spec gets a fresh temp file, keep client.admContextName cached if desired, and ensure TeardownProject still removes the per-spec client.tempAdmConfPath.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests-extension/test/qe/util/client.go`:
- Around line 552-558: The cleanup that removes and clears c.tempAdmConfPath
should not run before AdminDynamicClient() is used; move the block that deletes
the temporary admin kubeconfig (the os.Remove + setting c.tempAdmConfPath = "")
out of its current location and place it at the very end of the teardown
routine, after the resource-deletion loop and after any calls that create or use
AdminDynamicClient(), so the teardown continues to use the temp admin kubeconfig
for context until all resources are deleted.
- Around line 195-202: SetAdminKubeconf currently only updates adminConfigPath
which leaves derived fields (tempAdmConfPath, admContextName and any cached
nc.configPath/nc.isUseAdminCxt) stale; update SetAdminKubeconf to also refresh
tempAdmConfPath and admContextName and recompute derived state so code paths
that choose nc.configPath = tempAdmConfPath or set
nc.isUseAdminCxt/admContextName reflect the new admin kubeconfig. Ensure the
same refresh logic is applied to the other occurrences noted (around the other
blocks referencing tempAdmConfPath, admContextName and nc.isUseAdminCxt) so
callers use the updated temp file and context.
- Around line 361-363: In SetupProject, don't let duplicateFileToTempOrEmpty
silently leave c.tempAdmConfPath empty; if
duplicateFileToTempOrEmpty(adminConfigPath, "admin-config-") returns an empty
string, fail SetupProject immediately (e.g., return an error or call the
test-failure helper) instead of leaving tempAdmConfPath unset, so AdminConfig()
won't fall back to the shared kubeconfig and reopen the race; update the
SetupProject code path that sets c.tempAdmConfPath to check the returned value
and abort the test on failure.
---
Duplicate comments:
In `@tests-extension/test/qe/util/client.go`:
- Around line 143-146: The code currently creates client.tempAdmConfPath once
(via duplicateFileToTempOrEmpty) so when TeardownProject deletes that file
subsequent NewCLIWithoutNamespace specs see the original shared admin
kubeconfig; to fix, stop reusing a single global tempAdmConfPath and instead
create the temp admin kubeconfig inside the NewCLIWithoutNamespace constructor:
move the call to duplicateFileToTempOrEmpty into NewCLIWithoutNamespace (or have
NewCLIWithoutNamespace call a helper that sets client.tempAdmConfPath) so each
spec gets a fresh temp file, keep client.admContextName cached if desired, and
ensure TeardownProject still removes the per-spec client.tempAdmConfPath.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ff9757e8-dc59-4212-90bd-08b524eeaf15
📒 Files selected for processing (2)
tests-extension/test/qe/util/client.gotests-extension/test/qe/util/tools.go
There was a problem hiding this comment.
♻️ Duplicate comments (2)
tests-extension/test/qe/util/client.go (2)
314-319:⚠️ Potential issue | 🟠 MajorRemove prior temp admin kubeconfig before replacing it.
Line 318 overwrites
c.tempAdmConfPathwithout deleting the previous file, leaving credential-bearing temp kubeconfigs behind.Proposed fix
func (c *CLI) SetAdminKubeconf(adminKubeconf string) *CLI { + if c.tempAdmConfPath != "" { + if err := os.Remove(c.tempAdmConfPath); err != nil { + e2e.Logf("Failed to remove old temporary admin config file %s: %v", c.tempAdmConfPath, err) + } + c.tempAdmConfPath = "" + } c.adminConfigPath = adminKubeconf // Refresh derived fields to match new admin kubeconfig c.admContextName = getAdminContextNameCached(adminKubeconf) c.tempAdmConfPath = duplicateFileToTempOrEmpty(adminKubeconf, "admin-config-") return c }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests-extension/test/qe/util/client.go` around lines 314 - 319, In SetAdminKubeconf, before overwriting c.tempAdmConfPath call out the existing temp file and remove it if non-empty to avoid leaving credential-bearing files behind; specifically, check c.tempAdmConfPath (the prior temp file path) and delete it (handle errors appropriately) before assigning the new value from duplicateFileToTempOrEmpty(adminKubeconf, "admin-config-"), while still updating c.adminConfigPath and c.admContextName as currently implemented.
1224-1231:⚠️ Potential issue | 🟠 MajorGlobal admin-context cache is not keyed by kubeconfig path.
Line 1225 uses a single
sync.Once, so the first resolved context is reused for all futurekubeConfigFilevalues. That can apply the wrong--contextafterSetAdminKubeconf(...)changes the admin kubeconfig.Proposed fix
-var ( - // Global cache for admin context name to avoid repeated parsing across all test suites - globalAdmContextName string - globalAdmContextOnce sync.Once -) +var ( + // Cache admin context per kubeconfig path. + globalAdmContextCache sync.Map // map[string]string +) func getAdminContextNameCached(kubeConfigFile string) string { - globalAdmContextOnce.Do(func() { - globalAdmContextName = getAdminContextName(kubeConfigFile) - if globalAdmContextName != "" { - e2e.Logf("Cached admin context for all test suites: %s", globalAdmContextName) - } - }) - return globalAdmContextName + if kubeConfigFile == "" { + return "" + } + if v, ok := globalAdmContextCache.Load(kubeConfigFile); ok { + return v.(string) + } + ctx := getAdminContextName(kubeConfigFile) + globalAdmContextCache.Store(kubeConfigFile, ctx) + if ctx != "" { + e2e.Logf("Cached admin context for kubeconfig %s: %s", kubeConfigFile, ctx) + } + return ctx }#!/bin/bash set -euo pipefail # Verify singleton cache implementation and call sites rg -n -C3 --type=go 'globalAdmContextName|globalAdmContextOnce|getAdminContextNameCached\s*\(' tests-extension/test/qe/util/client.go # Verify mutable admin kubeconfig path is supported by API rg -n -C3 --type=go 'func \(c \*CLI\) SetAdminKubeconf|c\.adminConfigPath =|getAdminContextNameCached\(' tests-extension/test/qe/util/client.go🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests-extension/test/qe/util/client.go` around lines 1224 - 1231, The cache for admin context must be keyed by kubeConfigFile instead of a single sync.Once/globalAdmContextName; replace globalAdmContextOnce/globalAdmContextName with a threadsafe map (e.g., package-level map[string]string guarded by a sync.Mutex or sync.RWMutex) named something like globalAdmContextCache, and change getAdminContextNameCached(kubeConfigFile string) to look up the cache key, call getAdminContextName(kubeConfigFile) only on cache-miss, store the result, and log as before; also ensure SetAdminKubeconf (method c.SetAdminKubeconf) invalidates or updates the corresponding cache entry (or clears the cache) when the admin kubeconfig path changes so subsequent calls use the correct context.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@tests-extension/test/qe/util/client.go`:
- Around line 314-319: In SetAdminKubeconf, before overwriting c.tempAdmConfPath
call out the existing temp file and remove it if non-empty to avoid leaving
credential-bearing files behind; specifically, check c.tempAdmConfPath (the
prior temp file path) and delete it (handle errors appropriately) before
assigning the new value from duplicateFileToTempOrEmpty(adminKubeconf,
"admin-config-"), while still updating c.adminConfigPath and c.admContextName as
currently implemented.
- Around line 1224-1231: The cache for admin context must be keyed by
kubeConfigFile instead of a single sync.Once/globalAdmContextName; replace
globalAdmContextOnce/globalAdmContextName with a threadsafe map (e.g.,
package-level map[string]string guarded by a sync.Mutex or sync.RWMutex) named
something like globalAdmContextCache, and change
getAdminContextNameCached(kubeConfigFile string) to look up the cache key, call
getAdminContextName(kubeConfigFile) only on cache-miss, store the result, and
log as before; also ensure SetAdminKubeconf (method c.SetAdminKubeconf)
invalidates or updates the corresponding cache entry (or clears the cache) when
the admin kubeconfig path changes so subsequent calls use the correct context.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: bd454d36-c581-4fcd-b110-b091651bce78
📒 Files selected for processing (2)
tests-extension/test/qe/util/client.gotests-extension/test/qe/util/tools.go
|
/payload-job periodic-ci-openshift-operator-framework-olm-release-4.22-periodics-e2e-aws-ovn-proxy-extended-f2 |
|
@kuiwang02: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/cc56ea70-2104-11f1-8db3-624666920e17-0 |
|
/payload-job periodic-ci-openshift-operator-framework-olm-release-4.22-periodics-e2e-aws-ovn-fips-techpreview-extended-f1 |
|
@kuiwang02: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/ddd7cfd0-2104-11f1-8049-25ecc6f28b0c-0 |
|
@kuiwang02: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/verified by @kuiwang02 |
|
@kuiwang02: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jianzhangbjz, kuiwang02 The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/jira refresh |
|
@kuiwang02: This pull request references OCPQE-31804 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the sub-task to target the "4.22.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/jira refresh |
|
/retitle OCPBUGS-78557: Fix admin context race condition in parallel tests |
|
@kuiwang02: Jira Issue Verification Checks: Jira Issue OCPBUGS-78557 Jira Issue OCPBUGS-78557 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/cherry-pick release-4.21 |
|
@kuiwang02: new pull request created: #1257 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
/cc @jianzhangbjz @Xia-Zhao-rh @bandrade
Problem
Intermittent test failures where
AsAdmin()uses wrong service account instead of cluster-admin credentials.Affected tests: PolarionID:25757, PolarionID:24027
Error symptom:
User "system:serviceaccount:ci-op-xxx:xxx" cannot get/delete resource
Root Cause
Race condition: concurrent tests share the same admin kubeconfig file. When one test changes current-context, it affects other tests.
Solution
Double insurance mechanism:
--context=admin- Ensures correct context regardless of current-context