
CNTRLPLANE-3329: Extend EFS-backed build cache to lint, verify, and envtest workflows#8495

Open
vismishr wants to merge 3 commits into
openshift:mainfrom
vismishr:CNTRLPLANE-3329/extend-efs-cache-to-all-workflows

Conversation

@vismishr
Contributor

@vismishr vismishr commented May 12, 2026

What this PR does / why we need it:

Extends the EFS-backed Go build cache beyond unit tests to the remaining
Go-based CI workflows. Each workflow now includes a conditional step that
copies the shared cache from /cache/go-build to a local tmpdir and sets
GOCACHE to point at it. This avoids cold compilation in these jobs whenever
the shared cache is available.

Workflows updated:

  • lint-reusable.yaml
  • verify-reusable.yaml
  • envtest-ocp-reusable.yaml
  • envtest-kube-reusable.yaml

If the EFS mount is not present, the step silently skips and CI runs
without a cache, so there is no functional change from today's behavior.

Which issue(s) this PR fixes:

Fixes https://redhat.atlassian.net/browse/CNTRLPLANE-3329

Special notes for your reviewer:

Checklist:

  • Subject and description added to both commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • Chores
    • Added a reusable step to warm a job-local Go build cache and integrated it into verification, linting, and envtest workflows. Also added logic to pre-seed the job cache from shared storage when available. These CI updates result in fewer repeated Go rebuilds and generally shorter test, lint, and verification run times.

…test

Add conditional EFS cache warming step to lint, verify, envtest-ocp,
and envtest-kube workflows. Each job copies /cache/go-build to a local
tmpdir when available, falling back gracefully when the mount is absent.

Part-of: CNTRLPLANE-3329
@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 12, 2026
@openshift-ci
Contributor

openshift-ci Bot commented May 12, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 12, 2026
@openshift-ci-robot

openshift-ci-robot commented May 12, 2026

@vismishr: This pull request references CNTRLPLANE-3329 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Contributor

coderabbitai Bot commented May 12, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: eec45aa7-46f5-438e-884f-c8d8bc667247

📥 Commits

Reviewing files that changed from the base of the PR and between 4341d0c and cc7216b.

📒 Files selected for processing (5)
  • .github/actions/warm-go-cache/action.yaml
  • .github/workflows/envtest-kube-reusable.yaml
  • .github/workflows/envtest-ocp-reusable.yaml
  • .github/workflows/lint-reusable.yaml
  • .github/workflows/verify-reusable.yaml

📝 Walkthrough

Walkthrough

A new composite action .github/actions/warm-go-cache was added to set GOCACHE=/tmp/go-build-cache and, if /cache/go-build exists, copy its contents into /tmp/go-build-cache (warnings on failure). Four reusable workflows were updated to run this action immediately after actions/checkout: envtest-kube-reusable.yaml, envtest-ocp-reusable.yaml, lint-reusable.yaml, and verify-reusable.yaml.
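
The flow described in this walkthrough can be sketched in shell roughly as follows. This is a minimal sketch under stated assumptions, not the contents of action.yaml: the function name `warm_go_cache` and its two parameters are hypothetical, and the demo uses throwaway directories in place of /cache/go-build and /tmp/go-build-cache.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the cache-warming flow; not the actual action.yaml.
# warm_go_cache SRC DST: point GOCACHE at DST and seed DST from SRC if present.
warm_go_cache() {
  local src="$1" dst="$2"
  mkdir -p "$dst"
  export GOCACHE="$dst"
  if [ -d "$src" ]; then
    # Copy the contents of SRC (note the trailing "/.") into DST.
    if ! cp -a "$src/." "$dst/"; then
      echo "Warning: failed to copy cache from $src, proceeding without cache"
    fi
  fi
  # A missing or unreadable shared cache never fails the job.
  return 0
}

# Demo with temporary directories standing in for the real paths.
demo_root="$(mktemp -d)"
mkdir -p "$demo_root/shared/ab"
echo "object" > "$demo_root/shared/ab/entry"

# Shared cache present: the local cache is seeded from it.
warm_go_cache "$demo_root/shared" "$demo_root/local-hit"
hit_gocache="$GOCACHE"

# Shared cache absent: the local cache stays empty and the job continues.
warm_go_cache "$demo_root/missing" "$demo_root/local-miss"
miss_gocache="$GOCACHE"
```

In both cases GOCACHE ends up pointing at a job-local directory, so later Go commands behave the same way and only the amount of recompilation differs.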

Sequence Diagram(s)

sequenceDiagram
  participant Runner
  participant Checkout
  participant Warm
  participant Cache
  participant Build
  Runner->>Checkout: actions/checkout
  Runner->>Warm: run warm-go-cache
  Warm->>Cache: check /cache/go-build
  alt cache exists
    Cache-->>Warm: cached files
    Warm->>Warm: copy files to /tmp/go-build-cache
  else no cache
    Cache-->>Warm: no cache
    Warm-->>Warm: continue with warning
  end
  Runner->>Build: run make test / lint / verify (uses GOCACHE)
  Build-->>Runner: results

Possibly related PRs

  • openshift/hypershift#8493: Adds/mounts the shared EFS-backed go-cache PV at /cache/go-build, which the new warm-go-cache action copies from.
🚥 Pre-merge checks | ✅ 12
✅ Passed checks (12 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately and specifically describes the main change: extending an EFS-backed Go build cache to multiple CI workflows (lint, verify, envtest). |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | PR modifies only CI workflow YAML files and a composite GitHub Action. No Ginkgo test files are modified and no test names appear in modified files. Check is not applicable. |
| Test Structure And Quality | ✅ Passed | The custom check for "Test Structure and Quality" is not applicable to this PR. The PR modifies only GitHub workflow YAML files and adds a GitHub Action YAML file; no Ginkgo test code is present. |
| Microshift Test Compatibility | ✅ Passed | No new Ginkgo e2e tests added. PR only modifies CI workflows to extend Go build cache. Existing tests use standard Go testing, not Ginkgo. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | Check not applicable: PR makes no changes to Ginkgo e2e tests. Changes are limited to GitHub Actions workflow files for Go cache warming, which is CI infrastructure. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR modifies only GitHub Actions workflows and CI/CD composite actions, not deployment manifests, operator code, or Kubernetes controllers. The topology-aware scheduling check does not apply. |
| Ote Binary Stdout Contract | ✅ Passed | This PR only modifies GitHub Actions workflow YAML and adds a bash script composite action. No Go source code is changed, so the OTE Binary Stdout Contract check does not apply. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | Check is not applicable. PR only modifies GitHub Actions workflow YAML files and adds a Go cache warming composite action. No new Ginkgo e2e tests (It, Describe, Context, When) are added. |


@openshift-ci
Contributor

openshift-ci Bot commented May 12, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: vismishr
Once this PR has been reviewed and has the lgtm label, please assign devguyio for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@vismishr vismishr marked this pull request as ready for review May 12, 2026 19:02
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 12, 2026
@openshift-ci openshift-ci Bot requested review from cblecker and clebs May 12, 2026 19:03
Use cp -a /cache/go-build/. to copy the contents of the cache
directory rather than the directory itself. This prevents a nested
go-build subdirectory if /tmp/go-build-cache already exists.
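
The nesting behavior this commit message describes can be reproduced with a short experiment. The sketch below is an illustration only: temporary directories stand in for /cache/go-build and /tmp/go-build-cache, and the variable `root` and the directory names `nested-dst` and `flat-dst` are made up for the demo.

```shell
#!/usr/bin/env bash
# Illustration only: temporary directories stand in for /cache/go-build
# and /tmp/go-build-cache.
root="$(mktemp -d)"
mkdir -p "$root/go-build/ab"
echo "object" > "$root/go-build/ab/entry"

# Case 1: destination already exists, source given without "/.".
# cp places the source directory itself inside the destination.
mkdir -p "$root/nested-dst"
cp -a "$root/go-build" "$root/nested-dst"

# Case 2: same starting point, source given as "dir/.".
# cp copies only the directory's contents into the destination.
mkdir -p "$root/flat-dst"
cp -a "$root/go-build/." "$root/flat-dst"

ls "$root/nested-dst"   # lists: go-build
ls "$root/flat-dst"     # lists: ab
```

With the nested layout, Go would look for cache entries directly under the destination and find only an unrelated go-build subdirectory, so the copied cache would go unused.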
@hypershift-jira-solve-ci

hypershift-jira-solve-ci Bot commented May 13, 2026

I now have the complete picture. Here is the final report:

Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

"e2e-azure-v2-self-managed" pod "e2e-azure-v2-self-managed-destroy-management-cluster" failed:
pod pending for more than 1h0m0s: containers have not started in 1h0m0.000176828s: inject-cli, sidecar, test:
* Container sidecar is not ready with reason PodInitializing
* Container test is not ready with reason PodInitializing

Summary

All 8 e2e tests (public, private, OAuth LB) passed successfully in 35 seconds. The job failure is entirely caused by a CI infrastructure issue in the post phase: the destroy-management-cluster cleanup pod was unable to start because the inject-cli container image (quay-proxy.ci.openshift.org/openshift/ci@sha256:077c1fe97d5...) hung indefinitely during pull on node ip-10-0-181-30.ec2.internal. The pod also experienced 3 sandbox creation failures (DeadlineExceeded, RST_STREAM/CANCEL) in the 12 minutes before init containers could even start. This is a CI cluster node-level issue — not related to the PR changes or the product under test.

Root Cause

The root cause is a CI infrastructure problem on node ip-10-0-181-30.ec2.internal in the build01 CI cluster, manifesting as two distinct issues:

  1. Pod sandbox creation failures (21:27–21:40): The pod was assigned to the node at ~21:27:44 but the kubelet failed to create the pod sandbox 3 times over ~12 minutes, with errors context deadline exceeded and stream terminated by RST_STREAM with error code: CANCEL. This indicates CRI-O or the container runtime on that node was under pressure, possibly from resource exhaustion or network issues with the container runtime socket.

  2. Indefinite image pull stall (21:40–22:40): After init containers finally started at ~21:40, the kubelet began pulling the inject-cli image from quay-proxy.ci.openshift.org at 21:40:46. This pull never completed — no "Successfully pulled" event was ever recorded. The pod was killed exactly 1 hour later at 22:40:45 by the ci-operator pending timeout.

The failing step is destroy-management-cluster, a post-phase cleanup step that tears down the management cluster. The actual test step (tests) completed successfully at 20:37:12 with all 8 tests passing (4 public, 2 private, 2 OAuth LB). The PR code changes (EFS-backed build cache for lint, verify, and envtest workflows) are entirely unrelated to this failure.

This is a flaky CI infrastructure failure — the node had connectivity or runtime issues that prevented container image pulls from the CI registry proxy.

Recommendations
  1. Rerun the job — This is a transient CI infrastructure issue unrelated to the PR. A rerun on a different node will likely succeed.
  2. No code changes needed — All 8 e2e tests passed. The PR changes (EFS-backed build cache) do not touch any code paths exercised by this job.
  3. If rerun fails identically — Report a CI infrastructure issue with node ip-10-0-181-30.ec2.internal on build01 cluster, noting the image pull stall from quay-proxy.ci.openshift.org.

Evidence

| Evidence | Detail |
| --- | --- |
| Failed step | e2e-azure-v2-self-managed-destroy-management-cluster (post phase, cleanup only) |
| Failed phase | Post phase, not pre (install) or test |
| Test results | All 8 tests PASSED (4 public + 2 private + 2 OAuth LB) in 35s |
| Node | ip-10-0-181-30.ec2.internal (build01 CI cluster) |
| Sandbox failures | 3× sandbox creation failures between 21:27–21:40 (DeadlineExceeded, RST_STREAM/CANCEL) |
| Stuck image | quay-proxy.ci.openshift.org/openshift/ci@sha256:077c1fe97d5dbc01f8ff417f209817eeda26e284ebd8c70bb24a621d5f46c126 (inject-cli) |
| Pull duration | Started 21:40:46; never completed; pod killed at 22:40:45 (1h timeout) |
| Init containers | ci-scheduling-dns-wait, place-entrypoint, cp-entrypoint-wrapper all started OK |
| Blocked containers | inject-cli, sidecar, test: never started (stuck in PodInitializing) |
| CI reason code | executing_graph:step_failed:utilizing_lease:executing_test:utilizing_ip_pool:executing_test:executing_multi_stage_test:pod_pending |
| Relation to PR | None. PR changes EFS build cache for lint/verify/envtest; failure is in CI cleanup infra |

…ndling

Extract the EFS cache-warm block into a reusable composite action at
.github/actions/warm-go-cache/action.yaml. This adds graceful error
handling so copy failures log a warning instead of failing the job,
and sets GOCACHE via GITHUB_ENV to eliminate job-level env vars.

Commit-Message-Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
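
Setting GOCACHE via GITHUB_ENV, as this commit message describes, might look like the sketch below. It is an assumption about the shape of the step rather than the actual action.yaml; the variable `cache_dir` is hypothetical, and the fallback to a temporary file exists only so the snippet also runs outside a GitHub Actions runner.

```shell
#!/usr/bin/env bash
# Sketch only. On a GitHub Actions runner GITHUB_ENV is set by the runner;
# the fallback to a temporary file lets the snippet run locally as well.
GITHUB_ENV="${GITHUB_ENV:-$(mktemp)}"

cache_dir="/tmp/go-build-cache"

# Appending NAME=value to the file named by GITHUB_ENV makes the variable
# available to later steps in the same job (not to the current step).
echo "GOCACHE=$cache_dir" >> "$GITHUB_ENV"

# Show what later steps would receive.
grep '^GOCACHE=' "$GITHUB_ENV"
```

Because the variable is exported from inside the composite action, the calling workflows do not need their own job-level `env:` entry for GOCACHE.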
Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
.github/actions/warm-go-cache/action.yaml (1)

12-12: ⚡ Quick win

Use a GitHub warning annotation for copy failures.

Plain echo is easy to miss in logs; ::warning:: makes this visible in the Checks UI.

Proposed change
-          echo "Warning: failed to copy EFS cache, proceeding without cache"
+          echo "::warning::Failed to copy EFS cache; continuing without warmed cache"
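
For context, a copy step that emits this annotation without failing the job could be sketched as follows. The helper name `copy_or_warn` and the demo paths are hypothetical and not taken from the PR; only the `::warning::` message text comes from the proposed change above.

```shell
#!/usr/bin/env bash
# Sketch only: copy_or_warn is a hypothetical helper, not code from the PR.
# It copies SRC's contents into DST and, on failure, prints a GitHub Actions
# warning annotation instead of failing the step.
copy_or_warn() {
  local src="$1" dst="$2"
  if ! cp -a "$src/." "$dst/" 2>/dev/null; then
    echo "::warning::Failed to copy EFS cache; continuing without warmed cache"
  fi
  return 0
}

# Force a failure by copying from a path that does not exist.
work="$(mktemp -d)"
output="$(copy_or_warn "$work/does-not-exist" "$work")"
status=$?
echo "$output"
```

On a runner, a line beginning with `::warning::` is picked up as a workflow command and shown as an annotation; outside a runner it is just ordinary text on stdout.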
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/actions/warm-go-cache/action.yaml at line 12, Replace the plain echo
warning used when EFS cache copy fails (the line containing echo "Warning:
failed to copy EFS cache, proceeding without cache") with a GitHub Actions log
annotation using the ::warning:: prefix so the message surfaces in the Checks
UI; keep the same message text but emit it as ::warning::Warning: failed to copy
EFS cache, proceeding without cache so failures become visible in the workflow
annotations.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 565e34dc-771c-44da-bb89-840ed9f31415

📥 Commits

Reviewing files that changed from the base of the PR and between babbae3 and cc7216b.

📒 Files selected for processing (5)
  • .github/actions/warm-go-cache/action.yaml
  • .github/workflows/envtest-kube-reusable.yaml
  • .github/workflows/envtest-ocp-reusable.yaml
  • .github/workflows/lint-reusable.yaml
  • .github/workflows/verify-reusable.yaml

@vismishr
Contributor Author

@coderabbitai full review

@coderabbitai
Contributor

coderabbitai Bot commented May 14, 2026

✅ Actions performed

Full review triggered.

@vismishr
Contributor Author

/area ci-tooling

@openshift-ci openshift-ci Bot added area/ci-tooling Indicates the PR includes changes for CI or tooling and removed do-not-merge/needs-area labels May 14, 2026
@openshift-ci
Contributor

openshift-ci Bot commented May 14, 2026

@vismishr: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

area/ci-tooling Indicates the PR includes changes for CI or tooling jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.


2 participants