Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
393 changes: 393 additions & 0 deletions stps/sig-virt/dual-stream-cluster-rhcos9-rhcos10/stp.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,393 @@
# Openshift-virtualization-tests Test plan
Comment thread
akri3i marked this conversation as resolved.

## **Dual-Stream RHCOS Support (RHCOS9.8 + RHCOS10.2 Worker Nodes) — Quality Engineering Plan**

### **Metadata & Tracking**

- **Enhancement(s):** https://redhat.atlassian.net/browse/VIRTSTRAT-83
- **Feature Tracking:** N/A
Comment thread
akri3i marked this conversation as resolved.
- **Epic Tracking:** https://redhat.atlassian.net/browse/CNV-49964
- **QE Owner(s):** Akriti Gupta
- **Readiness Tracking:** [RHCOS 10 Readiness Tracking](https://docs.google.com/spreadsheets/d/1mD1mVkiIqrdyaDIqB0fhRgmywWp4sI9OEy0IVBD6pDc/edit?gid=0#gid=0)
- **Owning SIG:** sig-virt
- **Participating SIGs:** sig-virt, sig-network, sig-storage, sig-iuo, sig-infra
Comment thread
akri3i marked this conversation as resolved.

- **Target Release(s):**
- DP: N/A
- TP: v4.22 — RHCOS 10.2 worker node support is Tech Preview; test failures are not release blockers
- GA: v5.0 — RHCOS 10.2 support GA; unresolved Critical/High bugs block readiness

> **This is a Parent STP.** It defines the overall dual-stream RHCOS testing strategy and cross-cutting
> requirements. Each participating SIG is expected to create a child STP that extends this document with
> SIG-specific test scenarios. Child STPs should reference this parent STP and must not duplicate content
> defined here.
Comment thread
akri3i marked this conversation as resolved.

**Document Conventions:**

- **RHCOS9.8:** Red Hat CoreOS 9.8 worker nodes.
- **RHCOS10.2:** Red Hat CoreOS 10.2 worker nodes.
- **Dual-stream cluster:** An OCP cluster running both RHCOS9.8 and RHCOS10.2 worker nodes simultaneously.

### **Feature Overview**
Comment thread
akri3i marked this conversation as resolved.

Starting with OCP 4.22, OpenShift Virtualization supports dual-stream clusters running both RHCOS 9.8 and RHCOS 10.2 worker nodes.
This enables customers to gradually migrate their worker nodes to RHCOS 10 while maintaining VM workload availability through live migration across node types.
RHCOS 10.2 worker node support is Tech Preview in OCP 4.22 and is targeted for GA in OCP 5.0.

---

### **I. Motivation and Requirements Review (QE Review Guidelines)**

#### **1. Requirement & User Story Review Checklist**

- [x] **Review Requirements**
- *Key requirements reviewed:*
- CNV 4.22: RHCOS9.8 is GA/default; RHCOS10.2 is Tech Preview.
- CNV 4.22.0 release checklist testing must happen with RHCOS9.8.
- OCPSTRAT-1150: Support dual-stream clusters (RHCOS9.8 + RHCOS10.2 worker nodes).
- VM live migration must succeed across RHCOS9.8 ↔ RHCOS10.2 worker nodes.
- Successful LiveMigration of VM from RHCOS9.8 to RHCOS10.2 back to RHCOS9.8
(i.e., VM created first on RHCOS9.8 worker node)
- Successful LiveMigration of VM from RHCOS10.2 to RHCOS9.8 back to RHCOS10.2
(i.e., VM created first on RHCOS10.2 worker node)

- [x] **Understand Value and Customer Use Cases**
- *Feature value to customers:* Customers can move their existing RHCOS 9 worker nodes to RHCOS 10
incrementally as part of the OCP 5.x minor upgrades, ensuring workload compatibility during the transition.
- *Customer use cases:*
- As a cluster administrator, I want to add RHCOS10.2 worker nodes to my existing
cluster.
- As a VM operator, I want to live-migrate VMs from RHCOS9.8 worker nodes to RHCOS10.2 worker
nodes.
- As a platform team, I want to validate that CNV behaves correctly on both RHCOS9.8 and RHCOS10.2
nodes.

- [x] **Acceptance Criteria**
Comment thread
akri3i marked this conversation as resolved.
- *Acceptance criteria:*
- Existing Tier 1, Tier 2, and Tier 3 test suites pass on an RHCOS10.2-only cluster with no new
regressions compared to an RHCOS9.8 baseline.
- Existing Tier 1, Tier 2, and Tier 3 test suites pass on a dual-stream cluster
(RHCOS9.8 + RHCOS10.2 workers) with no new regressions compared to an RHCOS9.8 baseline.
- A VM running on an RHCOS9.8 worker node can be live-migrated to an RHCOS10.2 worker node and
back without disruption.
- A VM running on an RHCOS10.2 worker node can be live-migrated to an RHCOS9.8 worker node and
back without disruption.
- CNV components deploy and report ready status on RHCOS 10.2 worker nodes with the same behavior as on RHCOS 9.8 nodes

- [x] **Non-Functional Requirements (NFRs)**
- *Applicable NFRs:*
- **Performance:** N/A — no new workloads introduced; existing Tier 1/2/3 results serve as the baseline.
- **Security:** Required — all testing must be performed with FIPS enabled (see Test Environment, Section II.3).
- **Monitoring/Observability:** N/A — no new alerts or metrics introduced; existing CNV monitoring applies unchanged.
Comment thread
akri3i marked this conversation as resolved.
- **Scalability:** N/A — no new scale requirements; existing cluster-level live migration parallelism limits apply.
- **UI:** N/A — no UI code changes introduced; UI testing adds no customer value for this feature.
- **Documentation:** Release notes must document RHCOS10.2 Tech Preview status in 4.22 and GA status timeline.
- **Compatibility:** N/A — validated via Tier 1/2/3 runs on both RHCOS9.8 and RHCOS10.2.


#### **2. Technology and Design Review**

- [x] **Developer Handoff/QE Kickoff**
- *Key takeaways and concerns:*
- The dual-stream support approach was reviewed and signed off by PM, Engineering, Platform,
and Product Operations per the [recommendation document](https://docs.google.com/document/d/1MMNmUbhGPymnJDrbqbcq_KGi_FN9jxTvAdnzed1cwWw/edit?tab=t.0#heading=h.54747tl9m7p8).
- CNV component teams must update their STPs for dual-stream scenarios.
- Bugs found during 4.22 RHCOS10.2 Tech Preview testing must be resolved before 5.0 GA.

- [x] **Technology Challenges**
- *Identified challenges:*
- **CNV compatibility on RHCOS 10.2:** RHCOS 10.2 is a new platform configuration that may
surface unexpected failures. Tier 1, Tier 2, and Tier 3 testing is the primary mechanism for
finding these issues.
- **Dual-stream cluster provisioning:** Clusters with mixed RHCOS9.8 and RHCOS10.2 worker
nodes require specific provisioning tooling. DevOps QE to provide this capability.
- **Bug triage complexity:** Failures on RHCOS10.2 must be clearly attributed to RHCOS 10.2 vs. pre-existing issues.

- *Impact on testing approach:*
- Tier1, Tier 2, Tier 3 and ad-hoc testing on an RHCOS10.2-only cluster and on a dual-stream cluster are both mandatory for
all component teams — this is the primary testing vehicle.
- Live migration across RHCOS versions must be tested explicitly.

- [x] **API Extensions**
- *New or modified user-facing APIs:* No Upstream or Downstream changes in CNV.
- *Testing impact:* No new API tests required.

Comment thread
akri3i marked this conversation as resolved.
- [x] **Test Environment Needs**
- *See Section II.3 for environment requirements.*

- [x] **Topology Considerations**
- *Topology requirements:*
- FOR RHCOS 10.2:
- Either high-availability (HA) or a compact cluster
- Requires both control plane and worker nodes to be on RHCOS 10.2
- For OCPSTRAT-1150:
- Dual-stream testing requires only a high-availability (HA)
cluster — at least 1 worker running RHCOS10.2 alongside RHCOS9.8 workers
within the same OCP cluster.

---

#### **3. Known Limitations**

- **CNV on RHCOS 10.2 worker nodes uses the same software stack as on RHCOS 9.8
through OCP 5.2.** A fully RHCOS 10-native CNV stack is planned for OCP 5.3
and is out of scope for this STP.
- *Sign-off:* [Name/Date]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

CRITICAL: Unresolved [Name/Date] placeholders block approval

These placeholder sign-offs leave required evidence incomplete. Replace each with concrete name/date or remove the section item if not applicable.

As per coding guidelines, “Every claim in an STP must have evidence — sign-offs, Jira links, dates. No empty placeholders in approved STPs”.

Also applies to: 185-185, 194-194, 355-355, 371-371

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@stps/sig-virt/dual-stream-cluster-rhcos9-rhcos10/virt.md` at line 148,
Replace the unresolved placeholder entries that read "*Sign-off:* [Name/Date]"
with concrete evidence: either a real signer name and date (e.g., "Sign-off:
Alice Smith / 2026-05-04") or remove the bullet if no sign-off is applicable;
ensure you update every occurrence of the exact placeholder string "*Sign-off:*
[Name/Date]" found in this document so no empty placeholders remain (also
address the other occurrences flagged with the same placeholder).


---

### **II. Software Test Plan (STP)**

#### **1. Scope of Testing**

**Testing Goals**

- **[P0]** Run Tier1, Tier 2 and Tier 3 test suites against an RHCOS10.2-only worker-node cluster.
Required for sig-virt and all other participating SIGs (via child STPs).
- **[P1]** Run Tier 1, Tier 2 and Tier 3 test suites against a dual-stream cluster (RHCOS9.8 + RHCOS10.2
workers); triage failures to distinguish cross-kernel issues from pre-existing bugs.
Required for sig-virt and all other participating SIGs (via child STPs).
- **[P1]** For OCPSTRAT-1150: VM live migration must succeed across RHCOS9.8 ↔ RHCOS10.2 worker nodes.
- Successful LiveMigration of VM from RHCOS9.8 to RHCOS10.2 back to RHCOS9.8
(i.e., VM created first on RHCOS9.8 worker node)
- Successful LiveMigration of VM from RHCOS10.2 to RHCOS9.8 back to RHCOS10.2
(i.e., VM created first on RHCOS10.2 worker node)
- **[P1]** File and track all RHCOS10.2 failures found during 4.22 testing. Per-SIG readiness status is
tracked in the [RHCOS 10 Readiness Tracking](https://docs.google.com/spreadsheets/d/1mD1mVkiIqrdyaDIqB0fhRgmywWp4sI9OEy0IVBD6pDc/edit?gid=0#gid=0) spreadsheet.

**Release Readiness Note:**

RHCOS10.2 test failures are **not a blocker for OCP 4.22** (as Tech Preview). They **will block
readiness for CNV 5.0** — for 4.22, all e2e and functional tests must run on RHCOS9.8 by
default for release checklist.

**Out of Scope (Testing Scope Exclusions)**

- **Fully RHCOS 10-native CNV stack**
- *Rationale:* A fully RHCOS 10-native CNV stack is planned for OCP 5.3 and is a separate
testing effort. This STP covers only dual-stream support through OCP 5.2.
- *PM/Lead Agreement:* [Name/Date]


Comment thread
akri3i marked this conversation as resolved.
**Test Limitations**

- **Dual-stream cluster provisioning depends on QE DevOps tooling.** The ability to deploy
a cluster with mixed RHCOS9.8 and RHCOS10.2 worker nodes relies on tooling provided by the
QE DevOps team. If this tooling is unavailable or unstable, the dual-stream migration
scenarios cannot be executed.
- *Sign-off:* [Name/Date]


#### **2. Test Strategy**

**Functional**

- [x] **Functional Testing**
- Validates that the full CNV feature set operates correctly on RHCOS10.2-only clusters
- *Details:* The primary mechanism is running existing Tier1, Tier 2 and Tier 3 test suites and triaging failures.
- All other SIGs (sig-network, sig-storage, sig-iuo, sig-infra) must document
their Tier1, Tier 2 and Tier 3 Test results and bugs in their own Jira Stories.
- For OCPSTRAT-1150: Validates that the full CNV feature set operates correctly on dual-stream clusters, with el9.8 userspace.
- VM live migration must succeed across RHCOS9.8 ↔ RHCOS10.2 worker nodes.
- Successful LiveMigration of VM from RHCOS9.8 to RHCOS10.2 worker node and backto RHCOS9.8
i.e VM created first on RHCOS9.8 Worker Node
- Successful LiveMigration of VM from RHCOS10.2 to RHCOS9.8 worker node and backto RHCOS10.2
i.e VM created first on RHCOS10.2 Worker Node

- [ ] **Automation Testing**
- For RHCOS 10.2: No new automation tests needed — existing Tier 1, Tier 2 and Tier 3 suites
are run as-is.
- For OCPSTRAT-1150 (dual-stream live migration): Manual ad-hoc runs for 4.22 Tech Preview;
node-affinity-based automation in Tier 2 by 5.0 GA.

- [x] **Regression Testing** — sig-virt and all participating SIGs must run Tier1, Tier 2 and Tier 3
regression
- On RHCOS10.2-only cluster.
- *Details:* The strategy relies on running existing Tier 1, Tier 2 and Tier 3 test suites against RHCOS 10.2
Comment thread
akri3i marked this conversation as resolved.
- For OCPSTRAT-1150: See Automation Testing above for the decision and rationale.

**Non-Functional**

- [x] **Performance Testing** — N/A

- [x] **Scale Testing** — N/A

- [x] **Security Testing** — N/A

- [x] **Usability Testing** — N/A

- [x] **Monitoring** — N/A

**Integration & Compatibility**

- [x] **Compatibility Testing** — CNV must remain compatible with both RHCOS9.8 and RHCOS10.2
across the supported version range (4.22 through 5.2).
- *Details:* Compatibility is validated through Tier 1, Tier 2 and Tier 3 runs on both RHCOS9.8 and RHCOS10.2
clusters. Component teams validate their specific feature areas in child STPs.

- [x] **Upgrade Testing** — Out of scope for this 4.22 STP.
- *Details:* Upgrade testing (4.22 → 5.0, EUS-to-EUS, etc.) will be covered in the 5.0 .
Not planned as part of 4.22 testing.

- [x] **Dependencies** — Testing is blocked on specific deliverables from other teams.
- *Details:* QE DevOps team must provide a stable dual-stream cluster provisioning option
(RHCOS9.8 + RHCOS10.2 workers in the same cluster).

- [x] **Cross Integrations** — Other SIGs must create a child STP extending this one to cover adhoc testing for OCPSTRAT-1150.
- *Details:* sig-network, sig-storage, sig-iuo, and sig-infra must each create child STPs
specifying their Tier 1, Tier 2 and Tier 3 test coverage on RHCOS10.2 nodes and dual-stream
clusters. Each SIG is responsible for triaging failures in their area and filing bugs
with clear RHCOS-version attribution.

**Infrastructure**

- [x] **Cloud Testing**
— For 4.22 Tech Preview we must test with FIPS enabled.
- Use cloud setup if it supports FIPS enabled
Comment thread
akri3i marked this conversation as resolved.

#### **3. Test Environment**

- **FIPS:** enabled

- **Cluster Topology:**
- **Dual-stream testing:** High-availability (HA) bare-metal cluster required —
3-control-plane / 3-worker minimum, with at least 1 worker running RHCOS10.2 alongside
RHCOS9.8 workers. SNO or compact clusters are not supported for dual-stream testing.
- **RHCOS10.2-only testing:** Standard 3-control-plane / 3-worker bare-metal cluster with all
workers on RHCOS10.2.

- **OCP & OpenShift Virtualization Version(s):** OCP 4.22 with CNV 4.22; RHCOS10.2 as Tech Preview.

- **CPU Virtualization:** Standard (VT-x / AMD-V enabled)

- **Compute Resources:** Standard

- **Special Hardware:** N/A

- **Storage:** ocs-storagecluster-ceph-rbd-virtualization

- **Network:** Standard

- **Required Operators:** Standard.

- **Platform:** Bare metal (dual-stream cluster provisioned by DevOps QE tooling).

- **Special Configurations:** Mixed RHCOS9.8 + RHCOS10.2 worker node cluster (dual-stream). DevOps QE
provides tooling to deploy this configuration.

#### **3.1. Testing Tools & Frameworks**

- **Test Framework:**
- Standard (openshift-virtualization-tests) for RHCOS 10.2
- For OCPSTRAT-1150 (dual-stream): See Automation Testing in Section II.2 for the decision and rationale.

- **CI/CD:** Two cluster configurations are required, both available from CNV 4.22:
- **RHCOS10.2-only cluster:** Existing Tier 1, Tier 2 and Tier 3 jobs run by all component teams.
- **Dual-stream cluster (RHCOS9.8 + RHCOS10.2):** Manual ad-hoc live migration runs by component teams.

- **Other Tools:** N/A

#### **4. Entry Criteria**

The following conditions must be met before testing can begin:

- [x] Requirements and design documents are **approved and merged**
(recommendation doc sign-offs from PM, Engineering, Platform, Product Ops confirmed)
- [x] Test environment can be **set up and configured** (dual-stream cluster available via
QE DevOps tooling)
- [x] **CNV 4.22 builds based on RHCOS 9.8 are available and validated**
- [x] QE DevOps dual-stream cluster provisioning is validated and available

#### **5. Risks**

**Timeline/Schedule**
Comment thread
akri3i marked this conversation as resolved.

- **Risk:** None identified.
- **Mitigation:** N/A
- *Estimated impact on schedule:* N/A

**Test Coverage**

- **Risk:** None identified.
- **Mitigation:** All intentional coverage exclusions are documented in Out of Scope (Section II.1).
Each participating SIG covers their own feature area via child STPs.
- *Areas with reduced coverage:* N/A

**Test Environment**

- **Risk:**
- HA resource shortage.
- RDU2 to RDU3 migration, bare metal cluster outages.
- Deploying RHCOS 10.2 and Dual-Stream cluster on cloud fails (PSI, IBM-BM or other clouds with FIPS fails).
- QE DevOps dual-stream cluster provisioning tooling may be unavailable or unstable, blocking
Tier 1, Tier 2 and Tier 3 runs.
- **Mitigation:**
- Use bare metal cluster with FIPS enabled.
- Engage QE DevOps team early to confirm dual-stream cluster availability timeline. Identify a
fallback of manually provisioning a mixed-node cluster if tooling is delayed. Track
provisioning readiness as an entry criterion.
- *Missing or unavailable environments:* Dual-stream cluster if QE DevOps tooling is not ready.
- *Sign-off:* [Name/Date]

**Untestable Aspects**

- **Risk:** None identified.
- **Mitigation:** N/A
- *Reason untestable and mitigation approach:* N/A

**Resource Constraints**

- **Risk:** Manual ad-hoc testing for dual-stream live migration scenarios (OCPSTRAT-1150) requires
component team bandwidth through OCP 5.2, until Tier 2 automation (Option 3) is completed.
- **Mitigation:** Each participating SIG allocates time for ad-hoc runs in their test cycle.
Tier 2 automation will be added progressively. See Automation Testing in Section II.2 for the
full options analysis and decision rationale.
- *Missing resources or infrastructure:* N/A
- *Sign-off:* [Name/Date]

**Dependencies**

- **Risk:** None identified.
- **Mitigation:** N/A
- *Third-party services or blockers:* N/A

**Other**

- **Risk:** None identified.
- **Mitigation:** N/A

Comment thread
coderabbitai[bot] marked this conversation as resolved.
---

### **III. Test Scenarios & Traceability**


- **[CNV-81251](https://redhat.atlassian.net/browse/CNV-81251)** — As a VM operator, I want to live-migrate VMs between RHCOS9.8 and
RHCOS10.2 worker nodes within the same cluster so my workloads remain available during
node maintenance.
- *Test Scenario:* [Tier 2] **Scenario 1** — VM created on an RHCOS9.8 worker node is live-migrated
to an RHCOS10.2 worker node without disruption; then migrate back to RHCOS9.8 without disruption.
- *Priority:* P1

- **[CNV-81251](https://redhat.atlassian.net/browse/CNV-81251)** — As a VM operator, I want to live-migrate VMs between RHCOS9.8 and
RHCOS10.2 worker nodes within the same cluster so my workloads remain available during
node maintenance.
- *Test Scenario:* [Tier 2] **Scenario 2** — VM created on an RHCOS10.2 worker node is live-migrated
to an RHCOS9.8 worker node without disruption; then migrate back to RHCOS10.2 without disruption.
- *Priority:* P1
Comment on lines +364 to +376
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for high level STP it can be merged into 1 scenario.


---

### **IV. Sign-off and Approval**

This Software Test Plan requires approval from the following stakeholders:

- **Reviewers:**
Comment thread
akri3i marked this conversation as resolved.
- QE Members (sig-iuo): @hmeir @OhadRevah @rlobillo @albarker-rh
- QE Members (sig-network): @yossisegev @Anatw @EdDev @servolkov @azhivovk
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please keep here @azhivovk as the network T2 owner of this feature.
I'm not sure if we need SIG approvers in this list as well, that is up to @rnetser to decide.

- QE Members (sig-storage): @duyanyan @jpeimer @josemacassan @kgoldbla @dalia-frank @Ahmad-Hafe @kshvaika @ema-aka-young @acinko-rh
- QE Members (sig-virt): @dshchedr @vsibirsk @SamAlber
- QE Members (sig-infra): @geetikakay @RoniKishner
Comment on lines +385 to +389
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rnetser should we include every team member here?

- **Approvers:**
- QE Architect: [Ruth Netser](@rnetser)
- Principal Developer (sig-virt): Luboslav Pivarc @xpivarc
- Product Manager: [Martin Tessun] @mtessun
Comment thread
coderabbitai[bot] marked this conversation as resolved.
Loading