
Feature branch sync - pub/q2_upgrade to staging#4740

Merged
abhishek-sa1 merged 28 commits into staging from pub/q2_upgrade
Jun 11, 2026

Conversation

@abhishek-sa1


Feature branch sync - pub/q2_upgrade to staging

abhishek-sa1 and others added 23 commits June 9, 2026 17:27
Signed-off-by: Abhishek S A <abhishek.sa3@dell.com>

Revert "task failure ansible.cfg update"

This reverts commit 7b2a70b.

callback plugin update

Signed-off-by: Abhishek S A <abhishek.sa3@dell.com>

Update omnia_default.py

Signed-off-by: Abhishek S A <abhishek.sa3@dell.com>

Update omnia_default.py

Signed-off-by: Abhishek S A <abhishek.sa3@dell.com>

Update omnia_default.py

Signed-off-by: Abhishek S A <abhishek.sa3@dell.com>

Update omnia_default.py

Signed-off-by: Abhishek S A <abhishek.sa3@dell.com>

Update omnia_default.py

Signed-off-by: Abhishek S A <abhishek.sa3@dell.com>
…ubnet (OMN01D-2534)

In multi-subnet deployments, service K8s control plane nodes may
reside in an additional_subnet (e.g. 10.40.2.0/24) rather than the
primary admin subnet (e.g. 10.40.1.0/24). The VIP for K8s HA must
be in the same subnet as the control plane nodes, not the OIM admin
NIC subnet.

The fix:
1. In validate_service_k8s_cluster_ha(), extract control plane node
   IPs from PXE mapping (FUNCTIONAL_GROUP_NAME starts with
   service_kube_control_plane) and determine their subnet by
   checking the primary admin subnet and additional_subnets.
2. Pass the control plane subnet (kcp_subnet_ip, kcp_subnet_bits)
   to validate_vip_address().
3. In validate_vip_address(), validate the VIP against the control
   plane subnet if provided, otherwise fall back to the primary
   admin subnet for backward compatibility.
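
The subnet lookup and fallback described in the steps above can be sketched as follows. This is a minimal illustration using Python's standard `ipaddress` module, not the project's code: the helper names `find_subnet` and `vip_in_expected_subnet` and the sample addresses are assumptions, and the real `validate_vip_address()` takes `kcp_subnet_ip` and `kcp_subnet_bits` rather than a CIDR string.

```python
import ipaddress

def find_subnet(ip, subnets):
    """Return the first network in `subnets` that contains `ip`, else None."""
    addr = ipaddress.ip_address(ip)
    for cidr in subnets:
        network = ipaddress.ip_network(cidr, strict=False)
        if addr in network:
            return network
    return None

def vip_in_expected_subnet(vip, admin_subnet, kcp_subnet=None):
    """Check the VIP against the control plane subnet when one is given,
    otherwise against the primary admin subnet (backward-compatible fallback)."""
    target = ipaddress.ip_network(kcp_subnet or admin_subnet, strict=False)
    return ipaddress.ip_address(vip) in target

# Hypothetical values mirroring the example subnets in the commit message.
admin = "10.40.1.0/24"
additional = ["10.40.2.0/24"]

# Primary admin subnet is checked first, then the additional subnets.
kcp = find_subnet("10.40.2.11", [admin] + additional)
print(kcp)                                                      # 10.40.2.0/24
print(vip_in_expected_subnet("10.40.2.100", admin, str(kcp)))   # True
print(vip_in_expected_subnet("10.40.2.100", admin))             # False
```

With no control plane subnet supplied, the same VIP is rejected because it is checked against the primary admin subnet, which matches the pre-fix behaviour.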

Fixes: OMN01D-2534
Signed-off-by: Sujit Jadhav <sujit.jadhav@dell.com>
Signed-off-by: Katakam-Rakesh <katakam.rakesh@dell.com>
Signed-off-by: Katakam-Rakesh <katakam.rakesh@dell.com>
Signed-off-by: Katakam-Rakesh <katakam.rakesh@dell.com>
Add a wait for the kube controller pod to be created, then check that the pod is running
feat: Add custom callback plugin to suppress duplicate error output in ansible-core 2.20
…3_secret_key

Signed-off-by: venu <236371043+Venu-p1@users.noreply.github.com>
Signed-off-by: Jagadeesh N V <jagadeesh_n_v@dell.com>
…led is true

Two issues prevent nid hostname resolution on slurm and login nodes:

1. OIM firewall blocks port 53 (DNS) for external access
   CoreDNS on the OIM binds to admin_nic_ip:53, but firewalld only
   opens ports for DHCP/TFTP/HTTP/etc. Nodes querying 10.x.x.x:53
   get their packets dropped. From the OIM itself, DNS works because
   podman interfaces are in the trusted zone (local traffic bypasses
   the firewall).

   Fix: Open port 53/tcp and 53/udp in the OIM firewall when
   dns_enabled is true.

2. NetworkManager overwrites /etc/resolv.conf after cloud-init
   set-ssh.sh runs nmcli con add/up which triggers NetworkManager
   to overwrite /etc/resolv.conf with DHCP-provided DNS servers,
   removing the CoreDNS nameserver entry.

   Fix: After set-ssh.sh completes, restore /etc/resolv.conf and
   lock it with chattr +i. Matches existing K8s template protection.
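
   The commit message does not say how the file is restored, so the following is only a sketch of one possible approach: a hypothetical helper, `restore_nameserver`, that puts the CoreDNS nameserver entry back at the top of the resolv.conf content. The sample addresses are assumptions, and the sketch neither writes the file nor applies `chattr +i`.

```python
def restore_nameserver(resolv_text, coredns_ip):
    """Return resolv.conf content with the CoreDNS nameserver listed first.

    Any existing copy of the entry is dropped before it is re-added, so the
    function is idempotent.
    """
    entry = f"nameserver {coredns_ip}"
    kept = [line for line in resolv_text.splitlines() if line.strip() != entry]
    return "\n".join([entry, *kept]) + "\n"

# Hypothetical content left behind after NetworkManager rewrites the file.
after_nm = "nameserver 8.8.8.8\nsearch example.com\n"
print(restore_nameserver(after_nm, "10.40.1.1"), end="")
```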

Files changed:
- prepare_oim/.../openchami/tasks/configs/firewall.yml (port 53)
- ci-group-slurm_control_node_x86_64.yaml.j2
- ci-group-slurm_node_x86_64.yaml.j2
- ci-group-slurm_node_aarch64.yaml.j2
- ci-group-login_node_x86_64.yaml.j2
- ci-group-login_node_aarch64.yaml.j2
- ci-group-login_compiler_node_x86_64.yaml.j2
- ci-group-login_compiler_node_aarch64.yaml.j2

Only active when dns_enabled is true (no impact on non-DNS deployments).

Signed-off-by: Sujit Jadhav <sujit.jadhav@dell.com>
Signed-off-by: sakshi-singla-1735 <sakshi.s@dell.com>
fix(provision): fix DNS resolution on slurm/login nodes when dns_enabled is true
…plate (OMN01D-2533) (#4729)

The cloud-init template has two YAML literal block scalar levels:
1. Outer content: | (base indent 6sp) - strips 6 spaces
2. Inner runcmd - | (base indent 4sp after outer) - strips 4 spaces
Total: 10 spaces stripped from template lines.

Previous heredoc fix used 12sp indent with spaces embedded in the
delimiter string ('            PYEOF'). After YAML stripping, the
terminator line became '  PYEOF' (2sp) but the shell expected
'            PYEOF' (12sp literal) — heredoc never terminated.

Fix: Place Python code and PYEOF terminator at 10sp in the template.
After both YAML levels strip their indentation, these lines land at
column 0 in the shell script. The simple delimiter 'PYEOF' matches
the column-0 terminator exactly. Python receives column-0 code with
correct relative indentation for with/if/else blocks.

All lines >= 10sp > 6sp, so the outer YAML content: | block stays
intact (lines at < 6sp would prematurely terminate it).
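
The indentation arithmetic above can be checked with a small model. This is a simplified sketch, not the template itself: `strip_indent` and `heredoc_terminates` are hypothetical helpers, each YAML level is modelled as removing a fixed number of leading characters, and the shell is modelled as ending a plain heredoc only on a line that equals the delimiter exactly.

```python
OUTER, INNER = 6, 4  # indents stripped by the two literal block scalars

def strip_indent(lines, n):
    """Simplified model of a YAML literal block removing its n-space base
    indent (assumes every line starts with at least n spaces)."""
    return [line[n:] for line in lines]

def heredoc_terminates(template_lines, delimiter):
    """True if, after both YAML levels strip their indentation, some line
    equals the heredoc delimiter exactly."""
    script = strip_indent(strip_indent(template_lines, OUTER), INNER)
    return delimiter in script

# Previous fix: body at 12 spaces, delimiter string with 12 embedded spaces.
old = [" " * 12 + "print('hi')", " " * 12 + "PYEOF"]
# Current fix: body at 10 spaces, plain delimiter.
new = [" " * 10 + "print('hi')", " " * 10 + "PYEOF"]

print(heredoc_terminates(old, " " * 12 + "PYEOF"))  # False
print(heredoc_terminates(new, "PYEOF"))             # True
```

In the old layout the terminator line is left with 2 spaces while the shell waits for 12, so the heredoc never closes. In the new layout both the code and the terminator land at column 0.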

Signed-off-by: Sujit Jadhav <sujit.jadhav@dell.com>
…ction (OMN01D-2532) (#4724)

In multi-subnet deployments, service K8s control plane nodes may
reside in an additional_subnet (e.g. 10.40.2.0/24) while the OIM
admin NIC is in the primary subnet (e.g. 10.40.1.0/24). Calico's
IP_AUTODETECTION_METHOD was hardcoded to admin_nic_cidr (the OIM
subnet), causing Calico to fail IP auto-detection on nodes in
different subnets with:
  'Unable to auto-detect an IPv4 address using interface cidr
   [10.40.1.0/24]: no valid IPv4 addresses found'

The fix:
1. In create_k8s_config_nfs.yml, read the PXE mapping to find the
   first service_kube_control_plane node's ADMIN_IP and determine
   which subnet (primary or additional) it belongs to. Set
   calico_cidr to that subnet's CIDR.
2. Update the cloud-init template to use calico_cidr instead of
   admin_nic_cidr for Calico's IP_AUTODETECTION_METHOD.
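
The CIDR selection in step 1 can be sketched in Python as follows. This is an illustration under assumptions, not the logic in create_k8s_config_nfs.yml: `select_calico_cidr` is a hypothetical helper, the PXE mapping is modelled as a list of dicts, the sample group names and IPs are invented, and falling back to the admin CIDR when no match is found is an assumption.

```python
import ipaddress

def select_calico_cidr(pxe_rows, admin_cidr, additional_cidrs):
    """Return the CIDR of the subnet that holds the first
    service_kube_control_plane node's ADMIN_IP."""
    for row in pxe_rows:
        if row["FUNCTIONAL_GROUP_NAME"].startswith("service_kube_control_plane"):
            addr = ipaddress.ip_address(row["ADMIN_IP"])
            for cidr in [admin_cidr, *additional_cidrs]:
                if addr in ipaddress.ip_network(cidr, strict=False):
                    return cidr
            break  # only the first control plane node is considered
    return admin_cidr  # assumption: fall back to the primary admin subnet

# Hypothetical PXE mapping rows.
rows = [
    {"FUNCTIONAL_GROUP_NAME": "slurm_node_x86_64", "ADMIN_IP": "10.40.1.20"},
    {"FUNCTIONAL_GROUP_NAME": "service_kube_control_plane_x86_64", "ADMIN_IP": "10.40.2.11"},
]
print(select_calico_cidr(rows, "10.40.1.0/24", ["10.40.2.0/24"]))  # 10.40.2.0/24
```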

The upgrade path is intentionally left unchanged (uses
admin_nic_cidr) since multi-subnet is a fresh deployment feature
and changing the upgrade flow could impact existing deployments.

Fixes: OMN01D-2532

Signed-off-by: Sujit Jadhav <sujit.jadhav@dell.com>
fix(validation): validate HA VIP against service_kube_control_plane subnet
Signed-off-by: Kratika_Patidar <Kratika.Patidar@dell.com>
Signed-off-by: Kratika_Patidar <Kratika.Patidar@dell.com>
Defect fix for input validation and PXE mapping check
Push software_config.json from artifacts during deploy
@abhishek-sa1 abhishek-sa1 marked this pull request as ready for review June 11, 2026 10:45
snarthan and others added 5 commits June 11, 2026 16:23
* Update container tag for vulnerability

Signed-off-by: Abhishek S A <abhishek.sa3@dell.com>

* Update requirements.txt

Signed-off-by: Abhishek S A <abhishek.sa3@dell.com>

* tag update

Signed-off-by: Abhishek S A <abhishek.sa3@dell.com>

* Buildstream upgrade validation

Signed-off-by: Abhishek S A <abhishek.sa3@dell.com>

* Update upgrade.yml

Signed-off-by: Abhishek S A <abhishek.sa3@dell.com>

* Update main.yml

Signed-off-by: Abhishek S A <abhishek.sa3@dell.com>

* Update catalog_rhel.json

Signed-off-by: Abhishek S A <abhishek.sa3@dell.com>

* Update requirements.txt

Signed-off-by: Abhishek S A <abhishek.sa3@dell.com>

* Update provision_config.yml

Signed-off-by: Abhishek S A <abhishek.sa3@dell.com>

* Update provision_config.j2

Signed-off-by: Abhishek S A <abhishek.sa3@dell.com>

---------

Signed-off-by: Abhishek S A <abhishek.sa3@dell.com>
…ralized Python L2 validation (#4735)

* add cloud init

Signed-off-by: Abhishek S A <abhishek.sa3@dell.com>

* Update container tag for vulnerability

Signed-off-by: Abhishek S A <abhishek.sa3@dell.com>

* additional cloud init group

Signed-off-by: Abhishek S A <abhishek.sa3@dell.com>

* Update validate_additional_cloud_init.yml

Signed-off-by: Abhishek S A <abhishek.sa3@dell.com>

* logic update

Signed-off-by: Abhishek S A <abhishek.sa3@dell.com>

* Update requirements.txt

Signed-off-by: Abhishek S A <abhishek.sa3@dell.com>

* Update provision config

Signed-off-by: Abhishek S A <abhishek.sa3@dell.com>

* Update validate_additional_cloud_init_section.yml

Signed-off-by: Abhishek S A <abhishek.sa3@dell.com>

* cloud init update

Signed-off-by: Abhishek S A <abhishek.sa3@dell.com>

* tag update

Signed-off-by: Abhishek S A <abhishek.sa3@dell.com>

* minimal os group update

Signed-off-by: Abhishek S A <abhishek.sa3@dell.com>

* Buildstream upgrade validation

Signed-off-by: Abhishek S A <abhishek.sa3@dell.com>

* Update upgrade.yml

Signed-off-by: Abhishek S A <abhishek.sa3@dell.com>

* moving packages as prohibited

Signed-off-by: Abhishek S A <abhishek.sa3@dell.com>

* Update main.yml

Signed-off-by: Abhishek S A <abhishek.sa3@dell.com>

* Update catalog_rhel.json

Signed-off-by: Abhishek S A <abhishek.sa3@dell.com>

* Update requirements.txt

Signed-off-by: Abhishek S A <abhishek.sa3@dell.com>

* Update provision_config.yml

Signed-off-by: Abhishek S A <abhishek.sa3@dell.com>

* Update provision_config.j2

Signed-off-by: Abhishek S A <abhishek.sa3@dell.com>

---------

Signed-off-by: Abhishek S A <abhishek.sa3@dell.com>
Set PXE boot: replace LC check module with POST call
…s during rollback (#4738)

* fix rollback

Signed-off-by: Katakam-Rakesh <katakam.rakesh@dell.com>

* fix(rollback): remove 'skipped' from build_stream_terminal condition

Signed-off-by: Katakam-Rakesh <katakam.rakesh@dell.com>

* Rollback conditions for slurm and k8s

Signed-off-by: Jagadeesh N V <jagadeesh_n_v@dell.com>

* Update rollback.yml

Signed-off-by: Katakam Rakesh Naga Sai <125246792+Katakam-Rakesh@users.noreply.github.com>

* Lint fixes

Signed-off-by: Jagadeesh N V <jagadeesh_n_v@dell.com>

---------

Signed-off-by: Katakam-Rakesh <katakam.rakesh@dell.com>
Signed-off-by: Jagadeesh N V <jagadeesh_n_v@dell.com>
Signed-off-by: Katakam Rakesh Naga Sai <125246792+Katakam-Rakesh@users.noreply.github.com>
Co-authored-by: Jagadeesh N V <jagadeesh_n_v@dell.com>
@abhishek-sa1 abhishek-sa1 merged commit 1d84266 into staging Jun 11, 2026
7 checks passed

9 participants