OCPSTRAT-3250: Konflux release gating pipeline for HyperShift Operator #2016
bryan-cox wants to merge 3 commits into
Conversation
@bryan-cox: This pull request references OCPSTRAT-3250 which is a valid jira issue.
| ## Proposal |
| A parallel, gated promotion path is added alongside the existing Konflux auto-release. A nightly pipeline resolves the latest HO Snapshot, runs e2e tests against the corresponding image, and promotes it to a verified repository only if tests pass. Each managed service platform receives its own independent promotion path with its own test suite and verified repository. |
There was a problem hiding this comment.
Bringing over @deads2k comment from Slack:
Add a parallel, gated promotion path alongside the existing auto-release — a nightly pipeline tests the latest HO image and only promotes it to a verified repository if e2e tests pass.
"I like this, the final solution should definitely include it"
There was a problem hiding this comment.
Bringing over @deads2k comment from Slack:
Each managed service platform (ARO HCP, ROSA HCP, GCP HCP) gets its own independent promotion path, so a failure on one platform does not block others.
"I'm ok with this, but I don't see that as a hard requirement. If y'all want to take the perspective that you want them unified, I'm ok with that too."
There was a problem hiding this comment.
Leaving as-is for now: we prefer independent paths but acknowledge it's not a hard requirement.
| | Phase 2 | Full CPO version matrix (every supported 4.y.z and 4.y.0) | |
| | Phase 3 | Platform-specific e2e (ARO HCP Azure ARM, platform QE co-authored tests) | |
| Subsequent phases add broader version coverage and platform-specific tests co-authored with platform QE teams. |
There was a problem hiding this comment.
Bringing over @deads2k comment from Slack:
Subsequent phases add broader version coverage and platform-specific tests co-authored with platform QE teams.
"I'm not certain this release controller is actually coupled to specific platforms. I see this release controller as encapsulating and automating hypershift's promise to platforms of phase 1 and phase 2 as you've laid them out. Keeping it at that level, plus informing per-platform would leave accountability and responsibility for failing promotion extremely clear."
There was a problem hiding this comment.
Updated the phase table to add ownership. Phases 1 and 2 are marked as required for completion, owned by HCP team. Phase 3 is reframed as informing jobs owned by platform teams — the release controller's responsibility ends with demonstrating it's possible to create such a job.
| | ----- | -------- | |
| | Phase 1 (MVP) | Cluster lifecycle, NodePool scaling, one upgrade path | |
| | Phase 2 | Full CPO version matrix (every supported 4.y.z and 4.y.0) | |
| | Phase 3 | Platform-specific e2e (ARO HCP Azure ARM, platform QE co-authored tests) | |
There was a problem hiding this comment.
Bringing over @deads2k comment from Slack:
David: Let's play out your phase 3 platform-specific jobs. Who would watch, and how would we decide about responsibility?
David: Well, maybe back up to phase 2. Do you agree that with phase 1 and phase 2 only, it's very clear that HCP owns "we haven't promoted a release, we must fix"?
David: And that when we introduce phase 3, that becomes muddier: "it hasn't passed phase 3, but it's ARO-HCP's fault" (similar to our frequent failures with ROSA release-blocking jobs)?
Bryan: re:phase 1 & 2 - yeah that seems reasonable to me.
Bryan: phase 3 - Agree it's not as clear. I think it would be a joint or shared responsibility between the teams to figure out why the tests are failing and how to resolve that.
David: can we make that explicit for phase 1 and phase 2, indicate that they are critical for considering this complete. and add the concept of informing jobs that would include phase 3, with the responsibility lying with platform teams for creating and watching their signal. The release controller responsibility ends with demonstrating it is possible to create such a job.
There was a problem hiding this comment.
Addressed in the same update as above. Phase 1-2 ownership is explicit. Phase 3 notes that HCP team can help debug and fix failing tests in coordination with platform teams.
e8dd6e1 to 68fbb5f
While we are implementing this effort for ARO HCP first, we are expecting to onboard ROSA HCP and GCP in the future. I wanted to make sure y'all were aware of this enhancement; please feel free to unsubscribe if you wish - @deads2k @joshbranham @cblecker
5f6b081 to a135a21
| ## Summary |
| This enhancement introduces a nightly, platform-independent gating system that validates HyperShift Operator (HO) images against end-to-end test suites before promoting them to verified repositories. The pipeline operates alongside the existing Konflux auto-release mechanism, adding a parallel promotion path that only advances images which have passed real-world e2e validation. |
There was a problem hiding this comment.
can we define "real-world e2e validation"? is this specific consumer owned e2e test suites / gates?
There was a problem hiding this comment.
Updated — clarified that "e2e validation" means test suites agreed upon between the HyperShift and managed service (HCM) teams. Tests may vary by platform.
AI-assisted response via Claude Code
| 3. Keep the existing auto-release to ACMD completely unchanged; the new pipeline is purely additive. |
| 4. Enable independent promotion paths per platform so that one platform's failure does not block others. |
| 5. Make the pipeline extensible to new platforms with only new Konflux resource definitions and no pipeline code changes. |
There was a problem hiding this comment.
is there any goal for per platform speed / granularity to ship? Why was 24h chosen?
There was a problem hiding this comment.
Is there any goal / non goal for alerting and/or troubleshooting failed pipelines?
There was a problem hiding this comment.
24h (nightly) was chosen to balance validation confidence with cloud infrastructure cost — each run provisions real clusters with cloud credentials. Per-commit gating is addressed in the Alternatives section: it's cost-prohibitive and would slow the development feedback loop. Per-platform cadence can differ if a platform team wants more frequent runs — each platform can have its own CronJob schedule (noted in the Platform Extensibility section).
AI-assisted response via Claude Code
There was a problem hiding this comment.
Alerting and troubleshooting are covered in a few places: the Error Handling table (every failure type triggers a Slack alert), the Stale Promotion Alert section (alerts if no successful promotion in N days, default 3), and the Support Procedures section (detection commands + remediation steps including manual re-trigger). These are tracked in CNTRLPLANE-3451 (stale alerting) and CNTRLPLANE-3450 (manual re-trigger). Let me know if you'd like more detail or if something specific is missing.
AI-assisted response via Claude Code
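To make the stale promotion rule from the reply above concrete (alert if there has been no successful promotion in N days, default 3), here is a minimal Python sketch. The function name `is_promotion_stale` and its parameters are hypothetical and are not taken from the enhancement; only the 3-day default comes from the thread. Treating "never promoted" as stale is also an assumption.

```python
from datetime import datetime, timedelta, timezone


def is_promotion_stale(last_success, now, max_age_days=3):
    """Return True if the last successful promotion is older than max_age_days.

    last_success: datetime of the last successful promotion, or None if there
                  has never been one (assumed here to count as stale).
    now:          current time, passed in explicitly so the check is testable.
    """
    if last_success is None:
        return True
    return now - last_success > timedelta(days=max_age_days)


if __name__ == "__main__":
    now = datetime(2025, 1, 10, tzinfo=timezone.utc)
    last = datetime(2025, 1, 6, tzinfo=timezone.utc)
    # Four days without a promotion exceeds the 3-day default.
    print(is_promotion_stale(last, now))
```

A per-platform threshold could be passed through `max_age_days` if platforms end up with different cadences.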
There was a problem hiding this comment.
can we capture these responses in the proposal?
There was a problem hiding this comment.
Done. Added a "Design Rationale" subsection after Non-Goals covering the nightly cadence choice and alerting/troubleshooting coverage, as discussed in earlier thread comments.
AI-assisted response via Claude Code
| ## Proposal |
| Add a parallel, gated promotion path alongside the existing auto-release. A nightly pipeline tests the latest HO image against platform-specific e2e suites and only promotes tested images to a verified repository. Each platform's promotion is independent — a failure on one does not block others. |
There was a problem hiding this comment.
A nightly pipeline tests the latest HO image
Can we clarify what this "latest HO image" is, e.g. who builds it and how?
There was a problem hiding this comment.
Clarified — "latest HO image" is the most recent image produced by Konflux's push build pipeline, triggered on every merge to main. Updated the Proposal paragraph to make this explicit.
AI-assisted response via Claude Code
| 1. **Trigger:** A Kubernetes CronJob in the `crt-redhat-acm-tenant` namespace runs nightly. |
| 2. **Resolve:** The CronJob queries Konflux Snapshots labeled with the push build's PipelineRun name, selects the most recent, and extracts the HO container image reference. |
| 3. **Launch:** The CronJob creates a Tekton `PipelineRun` referencing the e2e test pipeline (`.tekton/pipelines/ho-release-gate.yaml`), passing the snapshot name and HO image as parameters. |
| 4. **Test:** The pipeline launches Prow jobs that deploy the resolved HO image and run HyperShift e2e tests against it. Konflux orchestrates the run and consumes pass/fail results and links. |
There was a problem hiding this comment.
can we articulate how this happens per platform?
There was a problem hiding this comment.
Updated step 4 to articulate the per-platform mechanism: each platform has its own IntegrationTestScenario defining the test suite and target infrastructure (e.g., Azure for ARO HCP, AWS for ROSA HCP). Konflux orchestrates the run and consumes pass/fail results.
AI-assisted response via Claude Code
| #### ReleasePlan (per-platform) |
| A per-platform resource. The YAML below shows the ARO HCP pilot instance. Future platforms (ROSA HCP, GCP HCP) will each get their own ReleasePlan. All platforms push to the same verified repository, tagged differently per managed service. Auto-release is disabled (`auto-release: 'false'`), meaning images only reach the verified repo through explicit Release objects created after tests pass. |
There was a problem hiding this comment.
Done. Added explicit ownership — these resources are created by the HCP team in the `crt-redhat-acm-tenant` namespace.
AI-assisted response via Claude Code
| #### IntegrationTestScenario (per-platform) |
| A per-platform resource. This wires the e2e test Tekton pipeline as a gate on Snapshots. It references a pipeline definition stored in the HyperShift repository, allowing the test pipeline to evolve alongside the code it validates. |
There was a problem hiding this comment.
Done. Added explicit ownership — these resources are created by the HCP team in the `crt-redhat-acm-tenant` namespace.
AI-assisted response via Claude Code
Introduces a nightly, platform-independent gating system that validates HyperShift Operator images against e2e test suites before promoting them to verified repositories. The pipeline operates alongside the existing Konflux auto-release, adding a parallel promotion path per managed service platform (ARO HCP pilot, ROSA HCP and GCP HCP future).
OCPSTRAT-3250
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| - name: application |
| description: HyperShift e2e tests for ARO HCP promotion gating |
| ``` |
There was a problem hiding this comment.
Question about the interaction between the CronJob and this IntegrationTestScenario.
Looking at the existing ITS resources in crt-redhat-acm-tenant (e.g. hypershift-operator-main-enterprise-contract), they all use contexts: [{name: application}] and are triggered automatically by Konflux on every new Snapshot.
This ITS also uses contexts: [{name: application}] — wouldn't this cause Konflux to run the e2e test pipeline on every push build (i.e. every new Snapshot), rather than only on the nightly cadence the CronJob provides?
The CronJob already resolves the latest Snapshot and creates a PipelineRun directly via git resolver, bypassing the ITS entirely. So these two mechanisms seem to overlap.
Could you clarify how these are meant to interact? Specifically:
- Is the ITS needed for Konflux to consider a Snapshot "valid" before allowing a Release to be created from it?
- Or is the CronJob the sole trigger, and the ITS can be dropped?
There was a problem hiding this comment.
Good catch — you're right that these overlap. The documented Konflux periodic test pattern (https://konflux-ci.dev/docs/testing/integration/periodic-integration-tests/) uses contexts: [{name: disabled}] on the ITS so it doesn't trigger on every push, and the CronJob triggers it by labeling the latest snapshot with test.appstudio.openshift.io/run=<scenario-name>. Updated both the ITS (now uses disabled context) and CronJob (now labels snapshots instead of creating PipelineRuns directly) to follow this pattern. Also updated the RBAC, workflow diagrams, and step descriptions to match.
AI-assisted response via Claude Code
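The trigger described in the reply above (pick the latest Snapshot, then label it with `test.appstudio.openshift.io/run=<scenario-name>`) can be sketched in Python as follows. This is only an illustration under stated assumptions: the snapshot dicts mirror the standard Kubernetes `metadata.name` and `metadata.creationTimestamp` fields as returned by `kubectl get ... -o json`, the helper names are hypothetical, and the scenario name is a placeholder. The real CronJob may be implemented quite differently.

```python
from datetime import datetime

# Label key quoted in the thread; the value is the IntegrationTestScenario name.
RUN_LABEL = "test.appstudio.openshift.io/run"


def latest_snapshot(snapshots):
    """Return the snapshot with the newest creationTimestamp, or None if empty."""
    def created(snapshot):
        ts = snapshot["metadata"]["creationTimestamp"]
        # Kubernetes serializes timestamps as RFC 3339 in UTC, e.g. 2025-01-02T02:00:00Z.
        return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")
    return max(snapshots, key=created, default=None)


def label_command(snapshot, scenario, namespace):
    """Build the kubectl argv that asks Konflux to run `scenario` on `snapshot`."""
    name = snapshot["metadata"]["name"]
    return [
        "kubectl", "label", "snapshot", name,
        f"{RUN_LABEL}={scenario}",
        "--overwrite",
        "-n", namespace,
    ]


if __name__ == "__main__":
    snaps = [
        {"metadata": {"name": "snap-old", "creationTimestamp": "2025-01-01T02:00:00Z"}},
        {"metadata": {"name": "snap-new", "creationTimestamp": "2025-01-02T02:00:00Z"}},
    ]
    # "example-scenario" is a placeholder, not a name from the enhancement.
    print(" ".join(label_command(latest_snapshot(snaps), "example-scenario", "crt-redhat-acm-tenant")))
```

Because the ITS uses the `disabled` context, nothing runs until this label is applied, which is what keeps the e2e suite on the nightly cadence rather than on every push.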
| **Konflux build pipeline** is the existing push build pipeline that creates Snapshots for every merged commit. |
| **e2e test pipeline** is a Tekton Pipeline defined at `.tekton/pipelines/ho-release-gate.yaml` in the HyperShift repository. |
There was a problem hiding this comment.
Should this file be named after the consumer, e.g. ho-aro-release-gate.yaml? Will we have one per platform?
Can we include the YAML example for ARO?
There was a problem hiding this comment.
The MVP uses a single pipeline file since ARO HCP is the only platform. The Konflux ITS spec.params field supports passing custom pipeline parameters, so a shared pipeline with per-platform params is viable if the task structure stays the same across platforms. If platforms need different task sequences or infrastructure setup, per-platform files (e.g., ho-aro-release-gate.yaml) would be the right call. Updated the doc to note both options with the decision deferred until a second platform is onboarded.
AI-assisted response via Claude Code
| 2. **Resolve:** The CronJob queries Konflux Snapshots labeled with the push build's PipelineRun name, selects the most recent, and extracts the HO container image reference. |
| 3. **Launch:** The CronJob creates a Tekton `PipelineRun` referencing the e2e test pipeline (`.tekton/pipelines/ho-release-gate.yaml`), passing the snapshot name and HO image as parameters. |
| 4. **Test:** The pipeline launches Prow jobs that deploy the resolved HO image and run e2e tests against it. Each platform defines its own `IntegrationTestScenario` that specifies the test suite and infrastructure — for example, ARO HCP tests run against Azure-provisioned clusters, while ROSA HCP tests would use AWS. Konflux orchestrates the run and consumes pass/fail results and links. | ||
| 5. **Promote:** On pass, the pipeline's `finally` block creates a Konflux Release object referencing the tested Snapshot and a platform-specific ReleasePlan. Konflux's release pipeline pushes the image to the verified repository. |
There was a problem hiding this comment.
what's the "verified repository"? Is there one per consumer? should this be in glossary?
There was a problem hiding this comment.
Added "Verified Repository" to the Glossary — single shared quay.io repo with per-platform image tags (e.g., aro-hcp-<digest>, rosa-hcp-<digest>).
AI-assisted response via Claude Code
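For illustration, the per-platform tag shape mentioned in the reply above (`aro-hcp-<digest>`, `rosa-hcp-<digest>`) could be derived as in this Python sketch. The function name `verified_tag` is hypothetical, and stripping the `sha256:` prefix is an assumption made here because `:` is not a valid character in an image tag; the enhancement does not say exactly how `<digest>` is rendered.

```python
def verified_tag(platform, digest):
    """Build a per-platform tag of the form <platform>-<digest>.

    platform: platform prefix such as "aro-hcp" or "rosa-hcp" (from the thread).
    digest:   image digest, with or without an algorithm prefix like "sha256:".
    """
    # Assumption: keep only the hex part, since ':' cannot appear in a tag.
    hex_part = digest.split(":", 1)[-1]
    return f"{platform}-{hex_part}"


if __name__ == "__main__":
    # Placeholder digest, shortened for readability.
    print(verified_tag("aro-hcp", "sha256:abc123"))
```

With a single shared repository, the platform prefix is what lets each managed service pull only the images that passed its own suite.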
| #### ReleasePlan (per-platform) |
| A per-platform resource created by the HCP team in the `crt-redhat-acm-tenant` namespace. The YAML below shows the ARO HCP pilot instance. Future platforms (ROSA HCP, GCP HCP) will each get their own ReleasePlan. All platforms push to the same verified repository, tagged differently per managed service. Auto-release is disabled (`auto-release: 'false'`), meaning images only reach the verified repo through explicit Release objects created after tests pass. |
There was a problem hiding this comment.
Are all these resources created manually? Will this be managed via GitOps somehow?
There was a problem hiding this comment.
The pipeline definition lives in the HyperShift repo at .tekton/pipelines/, referenced by ITS via git resolver. Konflux namespace resources (ITS, ReleasePlan, CronJob, RBAC) are defined in contrib/konflux/ in the HyperShift repo and applied to the crt-redhat-acm-tenant namespace, following the same pattern used for existing Konflux config. Changes go through the standard PR review process.
AI-assisted response via Claude Code
| - name: revision |
| value: main |
| - name: pathInRepo |
| value: .tekton/pipelines/ho-release-gate.yaml |
There was a problem hiding this comment.
should this yaml have a consumer specific name?
There was a problem hiding this comment.
This is the same question addressed in the pipeline naming thread — for the MVP with only ARO HCP, a single ho-release-gate.yaml is used. When additional platforms are onboarded, this may become per-consumer (e.g., ho-aro-release-gate.yaml) if test suites differ enough, or stay shared with platform-specific params via ITS spec.params. Decision deferred until a second platform is added.
AI-assisted response via Claude Code
| A per-platform resource created by the HCP team in the `crt-redhat-acm-tenant` namespace. This wires the e2e test Tekton pipeline as a gate on Snapshots. It references a pipeline definition stored in the HyperShift repository, allowing the test pipeline to evolve alongside the code it validates. |
| ```yaml |
| apiVersion: appstudio.redhat.com/v1beta2 |
There was a problem hiding this comment.
It'd be nice to have a diagram showing how the CronJob, IntegrationTestScenario, ReleasePlan, ReleasePlanAdmission... CRs interact.
There was a problem hiding this comment.
Done. Added a CR interaction diagram in the Implementation Details section showing how CronJob, Snapshot, IntegrationTestScenario, PipelineRun, Release, ReleasePlan, and ReleasePlanAdmission relate to each other.
AI-assisted response via Claude Code
| The nightly cadence means there is up to a 24-hour delay between a merge and its appearance in a verified repository. This is acceptable for production consumption but may require teams to continue using ACMD for rapid iteration. |
| ## Alternatives (Not Implemented) |
There was a problem hiding this comment.
do we want to include considerations to move HO into OLM? maybe beyond scope
There was a problem hiding this comment.
Agreed this is beyond scope for this enhancement — the gating pipeline is delivery-mechanism-agnostic and would work regardless of whether HO is delivered via OLM or the current direct image push. If HO moves to OLM in the future, the promotion step would change (OLM bundle vs raw image push) but the test-then-promote pattern stays the same.
AI-assisted response via Claude Code
| 4. **Platform e2e test integration:** Bryan is working with the ARO HCP team to integrate their platform-specific e2e tests into the HyperShift repo, following the same pattern used for HyperShift's existing presubmit e2e tests. |
| 5. **Regression analysis:** deads2k raised that this release, decoupled from OCP releases, needs its own regression analysis in component readiness — comparing current HO against a sliding baseline to track the trajectory of the project. This needs further discussion to determine what that mechanism looks like and how it integrates with existing component readiness tooling. |
There was a problem hiding this comment.
Is there a ticket for this, or is anyone from the SHIP team aware of it?
There was a problem hiding this comment.
Not yet — we haven't coordinated with the SHIP team on this. The regression analysis mechanism for a release decoupled from OCP is still undefined. Adding a note here to track the need for SHIP team engagement.
AI-assisted response via Claude Code
dropped some more questions, lgtm
- Add Design Rationale section capturing nightly cadence and alerting rationale from PR discussion threads (enxebre)
- Clarify per-platform pipeline naming strategy with TBD for shared vs separate files when second platform onboards (enxebre)
- Add Verified Repository to glossary as single shared quay.io repo with per-platform image tags (enxebre)
- Document resource management: pipeline in .tekton/pipelines/, Konflux namespace resources in contrib/konflux/ (enxebre)
- Fix CronJob/ITS interaction to follow Konflux periodic test pattern: ITS uses disabled context, CronJob labels snapshots instead of creating PipelineRuns directly (Nirshal)
- Update RBAC, diagrams, and workflow steps to match new pattern
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Shows how CronJob, Snapshot, IntegrationTestScenario, PipelineRun, Release, ReleasePlan, and ReleasePlanAdmission interact during the nightly gating flow.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@bryan-cox: all tests passed!
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: enxebre
JoelSpeed left a comment
There was a problem hiding this comment.
One thing I'm hoping we solve as part of this EP is the ability to have confidence in the supported matrix of CPO to guest and CPO to management version skews. There is only a very light mention of cross version testing here, was that something you were considering in/out of scope?
| ## Summary |
| This enhancement introduces a nightly, platform-independent gating system that validates HyperShift Operator (HO) images against end-to-end test suites before promoting them to verified repositories. The pipeline operates alongside the existing Konflux auto-release mechanism, adding a parallel promotion path that only advances images which have passed e2e test suites agreed upon between the HyperShift and managed service (HCM) teams. Tests may vary by platform. |
There was a problem hiding this comment.
Does promotion require all tests to pass across all platforms? Or are there separate promotion destinations, such that we might see promotion succeed on ARO but not ROSA?
There was a problem hiding this comment.
There will be different promotion paths for each managed service since each one needs a different set of tests to pass. If ARO HCP tests fail but GCP and ROSA tests pass, they should still get a tagged HO for their managed services respectively.
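The independence described above (an ARO HCP failure does not stop ROSA HCP or GCP HCP from getting a tagged HO) amounts to evaluating each platform's result separately. A minimal Python sketch, assuming a hypothetical `results` mapping from platform name to pass/fail; the function name and the platform keys are illustrative only:

```python
def platforms_to_promote(results):
    """Return the platforms whose own e2e suite passed, sorted by name.

    results: dict mapping platform name -> True if that platform's tests passed.
    A failure on one platform never removes another platform from the result.
    """
    return sorted(platform for platform, passed in results.items() if passed)


if __name__ == "__main__":
    nightly = {"aro-hcp": False, "rosa-hcp": True, "gcp-hcp": True}
    # ARO HCP failed, so only the other two platforms are promoted.
    print(platforms_to_promote(nightly))
```

A unified gate, by contrast, would promote only when every value in `results` is true, which is the trade-off raised earlier in the thread.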
| #### Design Rationale |
| **Nightly cadence (24h):** Each pipeline run provisions real cloud infrastructure with platform-specific credentials (e.g., Azure for ARO HCP). A nightly cadence balances validation confidence with cloud infrastructure cost. Per-commit gating is cost-prohibitive and would slow the development feedback loop (see Alternatives). Per-platform cadence can differ — each platform can have its own CronJob schedule. |
There was a problem hiding this comment.
OpenShift CI and nightly builds happen every 6h, have you considered making this more frequent than once per day? Is there enough change in a day to warrant more than once per day?
There was a problem hiding this comment.
Once a day might actually be too much. Some managed services update the HO more than others, but I do not think any of them are in a place to do more than one update within a 24h period.
| ## Proposal |
| Add a parallel, gated promotion path alongside the existing auto-release. A nightly pipeline resolves the most recent HO image built by Konflux's push build pipeline (triggered on every merge to `main`) and tests it against platform-specific e2e suites. Only tested images are promoted to a verified repository. Each platform's promotion is independent — a failure on one does not block others. |
There was a problem hiding this comment.
Are there any retest mechanisms here should it fail? Or is it then a case of wait until the next day?
Having this per platform makes the concept of a "green nightly" more elusive, is tracking the failures and escalation something you plan when there are consecutive failures?
There was a problem hiding this comment.
It's a case of waiting until the next day, but retest is something we plan to follow up on later. It was not seen as a must-have for an MVP.
| | Phase | Coverage | Ownership | |
| | ----- | -------- | --------- | |
| | Phase 1 (MVP) | Cluster lifecycle, NodePool scaling, one upgrade path | HCP team — required for completion | | ||
| | Phase 2 | Full CPO version matrix (every supported 4.y.z and 4.y.0) | HCP team — required for completion | |
There was a problem hiding this comment.
What does this actually mean? Is this "run CPO on lots of 4.Y management clusters" or "CPO can create lots of 4.Y workload clusters"
There was a problem hiding this comment.
There is some CPO testing being done outside this effort but those tests will be included in the promotion process of the image. @clebs could point you to that effort.
Summary
OCPSTRAT-3250 / CNTRLPLANE-3434
Test plan
🤖 Generated with Claude Code