OCPSTRAT-3250: Konflux release gating pipeline for HyperShift Operator #2016

Open
bryan-cox wants to merge 3 commits into openshift:master from bryan-cox:OCPSTRAT-3250

Conversation

@bryan-cox
Member

@bryan-cox bryan-cox commented May 19, 2026

Summary

  • Adds a new enhancement proposal for a nightly Konflux-based release gating pipeline that validates HyperShift Operator images against e2e test suites before promoting them to verified repositories.
  • The pipeline operates alongside the existing auto-release to ACMD, adding a parallel, per-platform promotion path (ARO HCP pilot, ROSA HCP and GCP HCP future).
  • Includes full Konflux resource definitions (CronJob, ReleasePlan, IntegrationTestScenario, RBAC, Release), e2e test pipeline structure, error handling, strategy alignment, and related Jira issue tracking.

OCPSTRAT-3250 / CNTRLPLANE-3434

Test plan

  • markdownlint passes cleanly
  • All required enhancement template headings present
  • YAML frontmatter validates

🤖 Generated with Claude Code

@openshift-ci openshift-ci Bot requested review from csrwng and enxebre May 19, 2026 17:26
@bryan-cox bryan-cox changed the title Enhancement: Konflux release gating pipeline for HyperShift Operator OCPSTRAT-3250: Enhancement: Konflux release gating pipeline for HyperShift Operator May 19, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 19, 2026

openshift-ci-robot commented May 19, 2026

@bryan-cox: This pull request references OCPSTRAT-3250 which is a valid jira issue.

@bryan-cox bryan-cox changed the title OCPSTRAT-3250: Enhancement: Konflux release gating pipeline for HyperShift Operator OCPSTRAT-3250: Konflux release gating pipeline for HyperShift Operator May 19, 2026

## Proposal

A parallel, gated promotion path is added alongside the existing Konflux auto-release. A nightly pipeline resolves the latest HO Snapshot, runs e2e tests against the corresponding image, and promotes it to a verified repository only if tests pass. Each managed service platform receives its own independent promotion path with its own test suite and verified repository.

Member Author

@bryan-cox bryan-cox May 19, 2026

Bringing over @deads2k comment from Slack:

Add a parallel, gated promotion path alongside the existing auto-release — a nightly pipeline tests the latest HO image and only promotes it to a verified repository if e2e tests pass.

"I like this, the final solution should definitely include it"


## Proposal

A parallel, gated promotion path is added alongside the existing Konflux auto-release. A nightly pipeline resolves the latest HO Snapshot, runs e2e tests against the corresponding image, and promotes it to a verified repository only if tests pass. Each managed service platform receives its own independent promotion path with its own test suite and verified repository.

Member Author

@bryan-cox bryan-cox May 19, 2026

Bringing over @deads2k comment from Slack:

Each managed service platform (ARO HCP, ROSA HCP, GCP HCP) gets its own independent promotion path, so a failure on one platform does not block others.

"I'm ok with this, but I don't see that as a hard requirement.  If y'all want to take the perspective that you want them unified, I'm ok with that too."

Member Author

Leaving as-is for now — we prefer independent paths but acknowledging it's not a hard requirement.

| Phase 2 | Full CPO version matrix (every supported 4.y.z and 4.y.0) |
| Phase 3 | Platform-specific e2e (ARO HCP Azure ARM, platform QE co-authored tests) |

Subsequent phases add broader version coverage and platform-specific tests co-authored with platform QE teams.

Member Author

Bringing over @deads2k comment from Slack:

Subsequent phases add broader version coverage and platform-specific tests co-authored with platform QE teams.

"I'm not certain this release controller is actually coupled to specific platforms.  I see this release controller as encapsulating and automating hypershift's promise to platforms of phase 1 and phase 2 as you've laid them out.  Keeping it at that level, plus informing per-platform would leave accountability and responsibility for failing promotion extremely clear."

Member Author

Updated the phase table to add ownership. Phases 1 and 2 are marked as required for completion, owned by HCP team. Phase 3 is reframed as informing jobs owned by platform teams — the release controller's responsibility ends with demonstrating it's possible to create such a job.

| Phase | Coverage |
| ----- | -------- |
| Phase 1 (MVP) | Cluster lifecycle, NodePool scaling, one upgrade path |
| Phase 2 | Full CPO version matrix (every supported 4.y.z and 4.y.0) |
| Phase 3 | Platform-specific e2e (ARO HCP Azure ARM, platform QE co-authored tests) |

Member Author

Bringing over @deads2k comment from Slack:

David: Let's play out your phase 3 platform-specific jobs. Who would watch them, and how would we decide about responsibility?
David: well, maybe back up to phase 2. Do you agree that with phase 1 and phase 2 only, it's very clear that HCP owns "we haven't promoted a release, we must fix"?
David: and that when we introduce phase 3, that becomes muddier: "it hasn't passed phase 3, but it's ARO-HCP's fault" (similar to our frequent failures with ROSA release-blocking jobs)?
Bryan: re:phase 1 & 2 - yeah that seems reasonable to me.
Bryan: phase 3 - Agree it's not as clear. I think it would be a joint or shared responsibility between the teams to figure out why the tests are failing and how to resolve that.
David: can we make that explicit for phase 1 and phase 2, indicate that they are critical for considering this complete. and add the concept of informing jobs that would include phase 3, with the responsibility lying with platform teams for creating and watching their signal. The release controller responsibility ends with demonstrating it is possible to create such a job.

Member Author

Addressed in the same update as above. Phase 1-2 ownership is explicit. Phase 3 notes that HCP team can help debug and fix failing tests in coordination with platform teams.

Comment thread enhancements/hypershift/konflux-release-gating-pipeline.md
@bryan-cox bryan-cox force-pushed the OCPSTRAT-3250 branch 2 times, most recently from e8dd6e1 to 68fbb5f Compare May 19, 2026 18:05
@bryan-cox
Member Author

While we are implementing this effort for ARO HCP first, we are expecting to onboard ROSA HCP and GCP in the future. I wanted to make sure y'all were aware of this enhancement; please feel free to unsubscribe if you wish - @deads2k @joshbranham @cblecker

@bryan-cox bryan-cox force-pushed the OCPSTRAT-3250 branch 2 times, most recently from 5f6b081 to a135a21 Compare May 20, 2026 14:39

## Summary

This enhancement introduces a nightly, platform-independent gating system that validates HyperShift Operator (HO) images against end-to-end test suites before promoting them to verified repositories. The pipeline operates alongside the existing Konflux auto-release mechanism, adding a parallel promotion path that only advances images which have passed real-world e2e validation.

Member

Can we define "real-world e2e validation"? Does this mean specific consumer-owned e2e test suites / gates?

Member Author

Updated — clarified that "e2e validation" means test suites agreed upon between the HyperShift and managed service (HCM) teams. Tests may vary by platform.


AI-assisted response via Claude Code

Comment thread enhancements/hypershift/konflux-release-gating-pipeline.md
3. Keep the existing auto-release to ACMD completely unchanged; the new pipeline is purely additive.
4. Enable independent promotion paths per platform so that one platform's failure does not block others.
5. Make the pipeline extensible to new platforms with only new Konflux resource definitions and no pipeline code changes.

Member

is there any goal for per platform speed / granularity to ship? Why was 24h chosen?

Member

Is there any goal / non goal for alerting and/or troubleshooting failed pipelines?

Member Author

24h (nightly) was chosen to balance validation confidence with cloud infrastructure cost — each run provisions real clusters with cloud credentials. Per-commit gating is addressed in the Alternatives section: it's cost-prohibitive and would slow the development feedback loop. Per-platform cadence can differ if a platform team wants more frequent runs — each platform can have its own CronJob schedule (noted in the Platform Extensibility section).


AI-assisted response via Claude Code

Member Author

Alerting and troubleshooting are covered in a few places: the Error Handling table (every failure type triggers a Slack alert), the Stale Promotion Alert section (alerts if no successful promotion in N days, default 3), and the Support Procedures section (detection commands + remediation steps including manual re-trigger). These are tracked in CNTRLPLANE-3451 (stale alerting) and CNTRLPLANE-3450 (manual re-trigger). Let me know if you'd like more detail or if something specific is missing.


AI-assisted response via Claude Code
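
To illustrate the stale-promotion check mentioned in the reply above (alert if there has been no successful promotion in N days, default 3), here is a minimal Python sketch. The function name `promotion_is_stale` and its parameters are hypothetical and do not come from the enhancement; only the 3-day default is taken from the thread. It also assumes that "never promoted" should count as stale.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def promotion_is_stale(
    last_success: Optional[datetime],
    now: datetime,
    max_age_days: int = 3,  # default of 3 days, as stated in the thread
) -> bool:
    """Return True when an alert should fire for a stale promotion."""
    if last_success is None:
        # Assumption: a platform that has never been promoted is treated as stale.
        return True
    return now - last_success > timedelta(days=max_age_days)


# Example: the last successful promotion was four days ago, so the check fires.
now = datetime(2026, 5, 28, tzinfo=timezone.utc)
alert = promotion_is_stale(now - timedelta(days=4), now)
```

A check like this would presumably run once per platform, since each platform has its own promotion path.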

Member

can we capture these responses in the proposal?

Member Author

Done. Added a "Design Rationale" subsection after Non-Goals covering the nightly cadence choice and alerting/troubleshooting coverage, as discussed in earlier thread comments.


AI-assisted response via Claude Code


## Proposal

Add a parallel, gated promotion path alongside the existing auto-release. A nightly pipeline tests the latest HO image against platform-specific e2e suites and only promotes tested images to a verified repository. Each platform's promotion is independent — a failure on one does not block others.

Member

A nightly pipeline tests the latest HO image

Can we clarify what this "latest HO image" is, e.g. who builds it and how?

Member Author

Clarified — "latest HO image" is the most recent image produced by Konflux's push build pipeline, triggered on every merge to main. Updated the Proposal paragraph to make this explicit.


AI-assisted response via Claude Code

1. **Trigger:** A Kubernetes CronJob in the `crt-redhat-acm-tenant` namespace runs nightly.
2. **Resolve:** The CronJob queries Konflux Snapshots labeled with the push build's PipelineRun name, selects the most recent, and extracts the HO container image reference.
3. **Launch:** The CronJob creates a Tekton `PipelineRun` referencing the e2e test pipeline (`.tekton/pipelines/ho-release-gate.yaml`), passing the snapshot name and HO image as parameters.
4. **Test:** The pipeline launches Prow jobs that deploy the resolved HO image and run HyperShift e2e tests against it. Konflux orchestrates the run and consumes pass/fail results and links.

Member

can we articulate how this happens per platform?

Member Author

Updated step 4 to articulate the per-platform mechanism: each platform has its own IntegrationTestScenario defining the test suite and target infrastructure (e.g., Azure for ARO HCP, AWS for ROSA HCP). Konflux orchestrates the run and consumes pass/fail results.


AI-assisted response via Claude Code


#### ReleasePlan (per-platform)

A per-platform resource. The YAML below shows the ARO HCP pilot instance. Future platforms (ROSA HCP, GCP HCP) will each get their own ReleasePlan. All platforms push to the same verified repository, tagged differently per managed service. Auto-release is disabled (`auto-release: 'false'`), meaning images only reach the verified repo through explicit Release objects created after tests pass.

Member

who creates this?

Member Author

Done. Added explicit ownership — these resources are created by the HCP team in the `crt-redhat-acm-tenant` namespace.


AI-assisted response via Claude Code


#### IntegrationTestScenario (per-platform)

A per-platform resource. This wires the e2e test Tekton pipeline as a gate on Snapshots. It references a pipeline definition stored in the HyperShift repository, allowing the test pipeline to evolve alongside the code it validates.

Member

who creates this?

Member Author

Done. Added explicit ownership — these resources are created by the HCP team in the `crt-redhat-acm-tenant` namespace.


AI-assisted response via Claude Code

Introduces a nightly, platform-independent gating system that validates
HyperShift Operator images against e2e test suites before promoting them
to verified repositories. The pipeline operates alongside the existing
Konflux auto-release, adding a parallel promotion path per managed service
platform (ARO HCP pilot, ROSA HCP and GCP HCP future).

OCPSTRAT-3250

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

```yaml
- name: application
  description: HyperShift e2e tests for ARO HCP promotion gating
```

Question about the interaction between the CronJob and this IntegrationTestScenario.

Looking at the existing ITS resources in crt-redhat-acm-tenant (e.g. hypershift-operator-main-enterprise-contract), they all use contexts: [{name: application}] and are triggered automatically by Konflux on every new Snapshot.

This ITS also uses contexts: [{name: application}] — wouldn't this cause Konflux to run the e2e test pipeline on every push build (i.e. every new Snapshot), rather than only on the nightly cadence the CronJob provides?

The CronJob already resolves the latest Snapshot and creates a PipelineRun directly via git resolver, bypassing the ITS entirely. So these two mechanisms seem to overlap.

Could you clarify how these are meant to interact? Specifically:

  • Is the ITS needed for Konflux to consider a Snapshot "valid" before allowing a Release to be created from it?
  • Or is the CronJob the sole trigger, and the ITS can be dropped?

Member Author

Good catch — you're right that these overlap. The documented Konflux periodic test pattern (https://konflux-ci.dev/docs/testing/integration/periodic-integration-tests/) uses contexts: [{name: disabled}] on the ITS so it doesn't trigger on every push, and the CronJob triggers it by labeling the latest snapshot with test.appstudio.openshift.io/run=<scenario-name>. Updated both the ITS (now uses disabled context) and CronJob (now labels snapshots instead of creating PipelineRuns directly) to follow this pattern. Also updated the RBAC, workflow diagrams, and step descriptions to match.


AI-assisted response via Claude Code
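
The periodic-test pattern described in the reply above (pick the latest Snapshot, then label it so the disabled-context scenario runs) can be sketched as follows. This is a rough Python illustration, not the CronJob's actual script: the helper names, the sample Snapshot names, and the scenario name `ho-aro-hcp-e2e` are hypothetical. Only the label key `test.appstudio.openshift.io/run` comes from the thread. The sketch builds the patch body and leaves the API call out.

```python
from datetime import datetime

# Label key quoted in the reply above; its value is the scenario name to run.
RUN_LABEL = "test.appstudio.openshift.io/run"


def parse_ts(ts: str) -> datetime:
    # Kubernetes timestamps are RFC 3339 with a trailing "Z"; normalise it
    # so datetime.fromisoformat accepts it on older Python versions too.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def latest_snapshot(snapshots: list[dict]) -> dict:
    """Return the Snapshot with the newest creationTimestamp."""
    if not snapshots:
        raise ValueError("no Snapshots found")
    return max(snapshots, key=lambda s: parse_ts(s["metadata"]["creationTimestamp"]))


def run_label_patch(scenario_name: str) -> dict:
    """Build a merge-patch body that requests a run of the given scenario."""
    return {"metadata": {"labels": {RUN_LABEL: scenario_name}}}


# Hypothetical Snapshot metadata, shaped like a Kubernetes list response.
snapshots = [
    {"metadata": {"name": "ho-snap-a", "creationTimestamp": "2026-05-18T02:00:00Z"}},
    {"metadata": {"name": "ho-snap-b", "creationTimestamp": "2026-05-19T02:00:00Z"}},
]
target = latest_snapshot(snapshots)
patch = run_label_patch("ho-aro-hcp-e2e")  # hypothetical scenario name
```

In this sketch `target` is the Snapshot that would receive `patch`. Applying the patch requires permission to patch Snapshots, which is why the RBAC changed along with the CronJob.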


**Konflux build pipeline** is the existing push build pipeline that creates Snapshots for every merged commit.

**e2e test pipeline** is a Tekton Pipeline defined at `.tekton/pipelines/ho-release-gate.yaml` in the HyperShift repository.

Member

Should this file be named after the consumer, e.g. ho-aro-release-gate.yaml? Will we have one per platform?
Can we include the YAML example for ARO?

Member Author

The MVP uses a single pipeline file since ARO HCP is the only platform. The Konflux ITS spec.params field supports passing custom pipeline parameters, so a shared pipeline with per-platform params is viable if the task structure stays the same across platforms. If platforms need different task sequences or infrastructure setup, per-platform files (e.g., ho-aro-release-gate.yaml) would be the right call. Updated the doc to note both options with the decision deferred until a second platform is onboarded.


AI-assisted response via Claude Code

2. **Resolve:** The CronJob queries Konflux Snapshots labeled with the push build's PipelineRun name, selects the most recent, and extracts the HO container image reference.
3. **Launch:** The CronJob creates a Tekton `PipelineRun` referencing the e2e test pipeline (`.tekton/pipelines/ho-release-gate.yaml`), passing the snapshot name and HO image as parameters.
4. **Test:** The pipeline launches Prow jobs that deploy the resolved HO image and run e2e tests against it. Each platform defines its own `IntegrationTestScenario` that specifies the test suite and infrastructure — for example, ARO HCP tests run against Azure-provisioned clusters, while ROSA HCP tests would use AWS. Konflux orchestrates the run and consumes pass/fail results and links.
5. **Promote:** On pass, the pipeline's `finally` block creates a Konflux Release object referencing the tested Snapshot and a platform-specific ReleasePlan. Konflux's release pipeline pushes the image to the verified repository.

Member

what's the "verified repository"? Is there one per consumer? should this be in glossary?

Member Author

Added "Verified Repository" to the Glossary — single shared quay.io repo with per-platform image tags (e.g., aro-hcp-<digest>, rosa-hcp-<digest>).


AI-assisted response via Claude Code
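
As a small illustration of the per-platform tag scheme mentioned above (`aro-hcp-<digest>`, `rosa-hcp-<digest>`), the Python sketch below builds such a tag. The thread does not say how `<digest>` is rendered. Because a container image tag cannot contain a colon, the sketch assumes the hex part of an `algorithm:hex` digest is used. The function name `verified_tag` is hypothetical.

```python
def verified_tag(platform: str, digest: str) -> str:
    """Build a per-platform tag such as 'aro-hcp-<hex>' from an image digest.

    Assumption: `digest` has the form 'sha256:<hex>' and only the hex part
    goes into the tag, because ':' is not allowed in image tags.
    """
    _algo, sep, hex_part = digest.partition(":")
    if not sep or not hex_part:
        raise ValueError(f"unexpected digest format: {digest!r}")
    return f"{platform}-{hex_part}"


# Example with a made-up digest value.
example_digest = "sha256:" + "ab" * 32
aro_tag = verified_tag("aro-hcp", example_digest)
rosa_tag = verified_tag("rosa-hcp", example_digest)
```

Because the platform prefix is the only difference between the two tags, the same image can be promoted for one platform and still be absent for another.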


#### ReleasePlan (per-platform)

A per-platform resource created by the HCP team in the `crt-redhat-acm-tenant` namespace. The YAML below shows the ARO HCP pilot instance. Future platforms (ROSA HCP, GCP HCP) will each get their own ReleasePlan. All platforms push to the same verified repository, tagged differently per managed service. Auto-release is disabled (`auto-release: 'false'`), meaning images only reach the verified repo through explicit Release objects created after tests pass.

Member

Are all these resources created manually? Will this be managed via GitOps somehow?

Member Author

The pipeline definition lives in the HyperShift repo at .tekton/pipelines/, referenced by ITS via git resolver. Konflux namespace resources (ITS, ReleasePlan, CronJob, RBAC) are defined in contrib/konflux/ in the HyperShift repo and applied to the crt-redhat-acm-tenant namespace, following the same pattern used for existing Konflux config. Changes go through the standard PR review process.


AI-assisted response via Claude Code

Comment thread enhancements/hypershift/konflux-release-gating-pipeline.md
- name: revision
  value: main
- name: pathInRepo
  value: .tekton/pipelines/ho-release-gate.yaml

Member

Should this YAML file have a consumer-specific name?

Member Author

This is the same question addressed in the pipeline naming thread — for the MVP with only ARO HCP, a single ho-release-gate.yaml is used. When additional platforms are onboarded, this may become per-consumer (e.g., ho-aro-release-gate.yaml) if test suites differ enough, or stay shared with platform-specific params via ITS spec.params. Decision deferred until a second platform is added.


AI-assisted response via Claude Code

A per-platform resource created by the HCP team in the `crt-redhat-acm-tenant` namespace. This wires the e2e test Tekton pipeline as a gate on Snapshots. It references a pipeline definition stored in the HyperShift repository, allowing the test pipeline to evolve alongside the code it validates.

```yaml
apiVersion: appstudio.redhat.com/v1beta2
```

Member

It'd be nice to have a diagram showing how the CronJob, IntegrationTestScenario, ReleasePlan, ReleasePlanAdmission... CRs interact.

Member Author

Done. Added a CR interaction diagram in the Implementation Details section showing how CronJob, Snapshot, IntegrationTestScenario, PipelineRun, Release, ReleasePlan, and ReleasePlanAdmission relate to each other.


AI-assisted response via Claude Code

The nightly cadence means there is up to a 24-hour delay between a merge and its appearance in a verified repository. This is acceptable for production consumption but may require teams to continue using ACMD for rapid iteration.

## Alternatives (Not Implemented)

Member

Do we want to include considerations for moving HO into OLM? Maybe that's beyond scope.

Member Author

Agreed this is beyond scope for this enhancement — the gating pipeline is delivery-mechanism-agnostic and would work regardless of whether HO is delivered via OLM or the current direct image push. If HO moves to OLM in the future, the promotion step would change (OLM bundle vs raw image push) but the test-then-promote pattern stays the same.


AI-assisted response via Claude Code


4. **Platform e2e test integration:** Bryan is working with the ARO HCP team to integrate their platform-specific e2e tests into the HyperShift repo, following the same pattern used for HyperShift's existing presubmit e2e tests.

5. **Regression analysis:** deads2k raised that this release, decoupled from OCP releases, needs its own regression analysis in component readiness — comparing current HO against a sliding baseline to track the trajectory of the project. This needs further discussion to determine what that mechanism looks like and how it integrates with existing component readiness tooling.

Member

Is there a ticket for this, or is anyone from the SHIP team aware of it?

Member Author

Not yet — we haven't coordinated with the SHIP team on this. The regression analysis mechanism for a release decoupled from OCP is still undefined. Adding a note here to track the need for SHIP team engagement.


AI-assisted response via Claude Code

@enxebre
Member

enxebre commented May 28, 2026

dropped some more questions, lgtm

bryan-cox and others added 2 commits May 28, 2026 09:10
- Add Design Rationale section capturing nightly cadence and alerting
  rationale from PR discussion threads (enxebre)
- Clarify per-platform pipeline naming strategy with TBD for shared vs
  separate files when second platform onboards (enxebre)
- Add Verified Repository to glossary as single shared quay.io repo
  with per-platform image tags (enxebre)
- Document resource management: pipeline in .tekton/pipelines/,
  Konflux namespace resources in contrib/konflux/ (enxebre)
- Fix CronJob/ITS interaction to follow Konflux periodic test pattern:
  ITS uses disabled context, CronJob labels snapshots instead of
  creating PipelineRuns directly (Nirshal)
- Update RBAC, diagrams, and workflow steps to match new pattern

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Shows how CronJob, Snapshot, IntegrationTestScenario, PipelineRun,
Release, ReleasePlan, and ReleasePlanAdmission interact during
the nightly gating flow.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@openshift-ci
Contributor

openshift-ci Bot commented May 28, 2026

@bryan-cox: all tests passed!

@enxebre
Member

enxebre commented May 28, 2026

/approve

@openshift-ci
Contributor

openshift-ci Bot commented May 28, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 28, 2026

Contributor

@JoelSpeed JoelSpeed left a comment

One thing I'm hoping we solve as part of this EP is the ability to have confidence in the supported matrix of CPO to guest and CPO to management version skews. There is only a very light mention of cross version testing here, was that something you were considering in/out of scope?


## Summary

This enhancement introduces a nightly, platform-independent gating system that validates HyperShift Operator (HO) images against end-to-end test suites before promoting them to verified repositories. The pipeline operates alongside the existing Konflux auto-release mechanism, adding a parallel promotion path that only advances images which have passed e2e test suites agreed upon between the HyperShift and managed service (HCM) teams. Tests may vary by platform.

Contributor

Does promotion require all tests to pass across all platforms? Or are there separate promotion destinations, such that we might see promotion succeed on ARO but not ROSA?

Member Author

There will be different promotion paths for each managed service since each one needs a different set of tests to pass. If ARO HCP tests fail but GCP and ROSA tests pass, they should still get a tagged HO for their managed services respectively.
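
The independence described in the reply above can be sketched in a few lines of Python: each platform's result decides only that platform's promotion. The function name `platforms_to_promote` and the platform keys are hypothetical and chosen for illustration. The real pipeline expresses this through separate per-platform resources rather than a single function.

```python
def platforms_to_promote(results: dict[str, bool]) -> list[str]:
    """Return the platforms whose own e2e suite passed, in sorted order.

    A failure on one platform has no effect on the others.
    """
    return sorted(platform for platform, passed in results.items() if passed)


# Scenario from the reply: ARO HCP fails while GCP and ROSA pass.
nightly_results = {"aro-hcp": False, "rosa-hcp": True, "gcp-hcp": True}
promoted = platforms_to_promote(nightly_results)
```

Here `promoted` contains the GCP and ROSA entries only, so those two platforms still receive a tagged HO image for that night.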


#### Design Rationale

**Nightly cadence (24h):** Each pipeline run provisions real cloud infrastructure with platform-specific credentials (e.g., Azure for ARO HCP). A nightly cadence balances validation confidence with cloud infrastructure cost. Per-commit gating is cost-prohibitive and would slow the development feedback loop (see Alternatives). Per-platform cadence can differ — each platform can have its own CronJob schedule.

Contributor

OpenShift CI and nightly builds happen every 6h, have you considered making this more frequent than once per day? Is there enough change in a day to warrant more than once per day?

Member Author

Once a day might actually be too much. Some managed services update the HO more often than others, but I do not think any of them are in a position to do more than one update within a 24h period.


## Proposal

Add a parallel, gated promotion path alongside the existing auto-release. A nightly pipeline resolves the most recent HO image built by Konflux's push build pipeline (triggered on every merge to `main`) and tests it against platform-specific e2e suites. Only tested images are promoted to a verified repository. Each platform's promotion is independent — a failure on one does not block others.

Contributor

Are there any retest mechanisms here should it fail? Or is it then a case of wait until the next day?

Having this per platform makes the concept of a "green nightly" more elusive, is tracking the failures and escalation something you plan when there are consecutive failures?

Member Author

For now it's wait until the next day, but retest is something we plan to follow up on later. It was not seen as a must-have for an MVP.

| Phase | Coverage | Ownership |
| ----- | -------- | --------- |
| Phase 1 (MVP) | Cluster lifecycle, NodePool scaling, one upgrade path | HCP team — required for completion |
| Phase 2 | Full CPO version matrix (every supported 4.y.z and 4.y.0) | HCP team — required for completion |

Contributor

What does this actually mean? Is this "run CPO on lots of 4.Y management clusters" or "CPO can create lots of 4.Y workload clusters"

Member Author

There is some CPO testing being done outside this effort but those tests will be included in the promotion process of the image. @clebs could point you to that effort.
