Skip to content

[ansible/vm_set]: Refactor host setup and Python environment management#24512

Open
wangxin wants to merge 6 commits into
sonic-net:masterfrom
wangxin:xiwang5/vm-set-refactor-host-setup
Open

[ansible/vm_set]: Refactor host setup and Python environment management#24512
wangxin wants to merge 6 commits into
sonic-net:masterfrom
wangxin:xiwang5/vm-set-refactor-host-setup

Conversation

@wangxin
Copy link
Copy Markdown
Collaborator

@wangxin wangxin commented May 11, 2026

Description of PR

Summary:
Refactor the ansible/roles/vm_set role to address two problems:

  1. Heavy host-setup tasks (apt installs, Docker setup, sysctl tuning) run redundantly every time any of ~10 playbooks invoke the vm_set role.
  2. Python packages are installed globally via python3-pip, which on Ubuntu 24.04 pulls in python3-wheel with a known S360 vulnerability.

This PR extracts host setup into a flag-gated include, replaces pip with uv + a persistent venv, drops support for Ubuntu < 22.04, and fixes a libvirt-qemu home-traversal issue on hosts with the modern DIR_MODE=0750 adduser default via a targeted POSIX ACL.

Type of change

  • Testbed and Framework(new/improvement)

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505
  • 202511

Approach

What is the motivation for this PR?

  1. Redundant host setup: The vm_set role is invoked by ~10 playbooks (add-topo, remove-topo, restart-ptf, start-VMs, stop-VMs, etc.). Each invocation re-runs all host-setup tasks (apt package installation, Docker setup, sysctl tuning, kernel module loading), even though these only need to run once per host. This wastes significant time on repeated testbed operations.

  2. Global pip installs and S360 vulnerability: Python packages (requests, docker, flask, grpcio, etc.) are installed globally via python3-pip. On Ubuntu 24.04, installing python3-pip pulls in python3-wheel as a dependency, which has a known S360 security vulnerability. Additionally, PEP 668 on Ubuntu 24.04 marks the system Python as externally managed, making global pip installs fail without --break-system-packages.

  3. Dead code: Support for Ubuntu 16.04, 17.04, 18.04, and 20.04 is no longer needed but adds complexity with version-conditional blocks and Python 2 fallbacks.

  4. libvirt-qemu home traversal: On freshly provisioned Ubuntu hosts the default DIR_MODE shipped in /etc/adduser.conf is 0750, which blocks the unprivileged libvirt-qemu user from traversing /home/<user> to open VM disk images stored under ~/sonic-vm/disks/. The result is a Permission denied (as uid:64055, gid:108) error when starting the SONiC VM.

How did you do it?

New files:

  • defaults/main.yml — Declares sonic_testbed_venv: "/opt/sonic-testbed/venv" so the venv path is defined in exactly one place.
  • host_setup.yml — Extracted all heavy one-time setup tasks (apt packages, Docker, sysctl, br_netfilter). Gated by a version-stamped flag file at /opt/sonic-testbed/.host_setup_done. Subsequent invocations skip setup unless the version string is bumped or FORCE_HOST_SETUP=true is set. Adds acl to the apt package list.
  • ensure_python_env.yml — Idempotent tasks to install uv and create a persistent venv at {{ sonic_testbed_venv }} with --system-site-packages (to inherit apt-provided python3-libvirt and python3-lxml). Installs requests==2.33.1 (CVE fix in v2.33.0) and docker==7.1.0. Pre-checks installed versions and skips uv pip install when the venv already matches.

Modified files:

  • main.yml — Slimmed down from ~477 to ~300 lines. Added fail-fast for non-Ubuntu or Ubuntu < 22.04. Fixed package_installation env var lookup to not override playbook-set values. Moved user group membership outside the flag gate (per-user, not per-host). Added an acl task granting libvirt-qemu the execute (traverse) bit on home_path; gated on package_installation since both the acl package and the libvirt-qemu user are installed via host_setup.yml.
  • docker.yml — Removed Ubuntu 16.04–20.04 docker repo blocks, all pip/virtualenv Python package installs, and the trailing ansible_python_interpreter override.
  • control_mux_simulator.yml / control_nic_simulator.yml — Replaced pip installs with uv pip install into the venv; pre-check installed versions to skip re-install when already present.
  • templates/mux-simulator.service.j2 / nic-simulator.service.j2 — Use {{ sonic_testbed_venv }}/bin/python3 directly; the redundant python_command set_fact is removed.

Key design decisions:

  • --system-site-packages is mandatory because python3-libvirt and python3-lxml are C extensions installed via apt that must match the system's shared libraries.
  • ensure_python_env.yml and the libvirt-qemu ACL task are both gated by package_installation to respect playbooks (e.g. testbed_renumber_vm_topology.yml) that intentionally skip first-time provisioning.
  • ansible_python_interpreter is only set to venv python if the venv actually exists, preventing failures when package_installation=false.
  • libvirt-qemu disk access: rather than rewriting /etc/libvirt/qemu.conf to run QEMU as root (which goes against the distro security model and broadens the attack surface), we grant the unprivileged libvirt-qemu user the minimum permission it needs — a POSIX ACL granting --x (traverse only, not read) on home_path. This keeps QEMU running as libvirt-qemu, applies uniformly to Ubuntu 22.04 and 24.04, survives subsequent chmod operations on the home directory, and does not require restarting libvirtd. See the discussion thread on PR for context.

How did you verify/test it?

Tested on both Ubuntu 22.04 and Ubuntu 24.04 with a VS testbed:

Test Result on 22.04 Result on 24.04
testbed-cli.sh add-topo vms-kvm-t0 (fresh host) ok=302 changed=49 failed=0, VM vlab-01 running ok=283 changed=35 failed=0, VM vlab-01 running
QEMU process user libvirt-qemu (unprivileged) libvirt-qemu (unprivileged)
/etc/libvirt/qemu.conf user/group lines unchanged (using libvirt defaults) unchanged (using libvirt defaults)
ACL on home_path user:libvirt-qemu:--x applied user:libvirt-qemu:--x applied
Repeat add-topo (idempotence) ✅ pip install tasks skipped (pkg_versions.stdout_lines != ['2.33.1', '7.1.0'] false-condition), ACL re-applied after manual strip ✅ same
restart-ptf (package_installation=false style path) ✅ host_setup, python env, and ACL tasks correctly skipped ✅ same

Any platform specific information?

  • Minimum Ubuntu version: 22.04
  • Tested end-to-end on both Ubuntu 22.04 and Ubuntu 24.04.
  • Drops support for Ubuntu 16.04, 17.04, 18.04, 20.04 and Python 2.

Caveat for hosts that previously ran an earlier iteration of this branch

An earlier iteration of this PR (now reverted) unconditionally rewrote /etc/libvirt/qemu.conf to run QEMU as root on Ubuntu 24.04. The final approach in this PR does not touch qemu.conf and instead uses an ACL. Any host that pulled an earlier xiwang5/vm-set-refactor-host-setup HEAD will still have user = "root" / group = "root" in qemu.conf. To revert manually after this PR lands:

sudo sed -i -e 's|^user = "root"|#user = "libvirt-qemu"|' \
            -e 's|^group = "root"|#group = "libvirt-qemu"|' /etc/libvirt/qemu.conf
sudo systemctl restart libvirtd

Supported testbed topology if it's a new test case?

N/A — this is a testbed infrastructure improvement, not a new test case.

Documentation

N/A — internal ansible role change, no user-facing documentation impact.

- Extract heavy host-setup tasks (apt, docker, sysctl) into host_setup.yml,
  gated by a version-stamped flag file at /opt/sonic-testbed/.host_setup_done.
  Subsequent vm_set invocations skip setup unless the version changes or
  FORCE_HOST_SETUP=true is set.

- Replace python3-pip/pip with uv + persistent venv at /opt/sonic-testbed/venv.
  Install uv 0.7.2 via standalone installer. Venv uses --system-site-packages
  to inherit apt-provided python3-libvirt and python3-lxml.

- Drop Ubuntu < 22.04 support with explicit fail-early error messages.
  Remove all 16.04/17.04/18.04/20.04 code paths (python2, old docker repos,
  pip_executable logic, old virtualenv workarounds).

- Simplify docker.yml, control_mux_simulator.yml, control_nic_simulator.yml
  to use uv pip and venv python instead of system pip.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Xin Wang <xiwang5@microsoft.com>
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Comment thread ansible/roles/vm_set/tasks/ensure_python_env.yml Outdated
Comment thread ansible/roles/vm_set/tasks/control_mux_simulator.yml Outdated
@StormLiangMS
Copy link
Copy Markdown
Collaborator

@wangxin — nice refactor overall, the flag-gated host setup + uv venv is the right shape and the motivation is clear. One thing I'd like to see addressed before merging:

🔴 Blocker — qemu.conf user/group change is no longer gated on Ubuntu version

In the old main.yml, the libvirt qemu.conf rewrite was:

- name: Update libvirt qemu configuration
  block:
    - name: Set user to root in qemu.conf
      ...
    - name: Set group to root in qemu.conf
      ...
    - name: Restart libvirtd to apply qemu.conf changes
      ...
  when: host_distribution_version.stdout >= "24.04"

In the new host_setup.yml the same block is now unconditional. The PR description acknowledges this is intentional ("set to root for all supported Ubuntu versions to avoid permission denied"), but the rollout impact on existing Ubuntu 22.04 testbeds isn't called out:

  • Existing 22.04 hosts that have been running fine with the default libvirt-qemu:libvirt-qemu user/group will silently have /etc/libvirt/qemu.conf rewritten and libvirtd restarted on the next play run.
  • VM disk images, sockets, log files, and any other artifacts under /var/lib/libvirt/ that were created by the old libvirt-qemu user can become inaccessible to the new root-owned qemu processes (or vice versa for files written after the switch), depending on the directory perms.
  • This will hit every existing 22.04 testbed the first time someone runs add-topo / restart-ptf / etc. after this PR lands — there's no rollback path inside the role.

Suggested options

Either of these would unblock from my side:

  1. Keep the original gate and document that 22.04 users must opt in via a separate flag if they hit the disk-image permission error you encountered:
    when: host_distribution_version.stdout >= "24.04" or force_qemu_root | default(false) | bool
  2. Keep the unconditional change but add a one-time pre-flight task that chown -R root:root (or chmod -R g+rwX) the relevant libvirt directories, and bump host_setup_version so existing testbeds re-run the host_setup block instead of relying on the stale .host_setup_done flag.

Either way, please call this out explicitly in the PR description / commit message so reviewers and operators rolling this out know to expect a libvirtd restart and to check VM-image permissions on existing 22.04 testbeds.

I have a few smaller follow-ups (idempotent uv pip install on every play invocation, unpinned grpcio, connection_db still using global pip, etc.) but those are all non-blocking and fine for a follow-up PR — happy to share them once this one is sorted. Thanks!

- Use bare 'uv' command instead of absolute path /usr/local/bin/uv,
  allowing pre-existing uv installations in PATH to be used.
- Remove explicit blinker install from mux simulator; Flask's dependency
  resolver handles it automatically in the venv.
- Re-gate qemu.conf user/group=root on Ubuntu >= 24.04, with opt-in
  'force_qemu_root' variable for 22.04 users who hit disk-image
  permission errors.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Xin Wang <xiwang5@microsoft.com>
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@wangxin
Copy link
Copy Markdown
Collaborator Author

wangxin commented May 12, 2026

@StormLiangMS Thanks for the review! Agreed on the qemu.conf concern — I've re-gated it to only apply on Ubuntu >= 24.04. For 22.04 users who hit libvirt disk-image permission errors, there's now an opt-in variable force_qemu_root that can be set to true to trigger the same fix. This avoids unexpected libvirtd restarts on existing 22.04 testbeds.

Comment thread ansible/roles/vm_set/tasks/ensure_python_env.yml Outdated
Comment thread ansible/roles/vm_set/tasks/main.yml Outdated
@saiarcot895
Copy link
Copy Markdown
Contributor

saiarcot895 commented May 12, 2026

@StormLiangMS Thanks for the review! Agreed on the qemu.conf concern — I've re-gated it to only apply on Ubuntu >= 24.04. For 22.04 users who hit libvirt disk-image permission errors, there's now an opt-in variable force_qemu_root that can be set to true to trigger the same fix. This avoids unexpected libvirtd restarts on existing 22.04 testbeds.

Regarding this, I haven't seen this permission issue when I was using Ubuntu 24.04. I also don't think qemu (when started as libvirt) should be running as root when the intention of the distro (and maybe libvirt themselves) is to run it as a non-root user. Is this the best fix to make? What was the actual original issue?

Edit, from this:

Libvirt system VMs do not run as root. They run as the QEMU system user. They have only access to host resources that were previously set up by libvirt.

System VMs require a number of helpers from libvirt, which run with elevated privileges, to attach to host storage and networking. That means system VMs are not entirely rootless. Those helpers reduce the attach surface of libvirt and avoid the need of granting root privileges directly to any QEMU process or VM process.

… 2.33.1, document qemu.conf rationale

- Extract hardcoded /opt/sonic-testbed/venv into sonic_testbed_venv variable
  in defaults/main.yml (DRY across ensure_python_env, control_mux_simulator,
  control_nic_simulator, and main.yml)
- Bump requests from 2.32.3 to 2.33.1 (CVE fix in v2.33.0)
- Add comment explaining why qemu.conf sets user=root on Ubuntu 24.04+
  (AppArmor denies libvirt-qemu access to disk images outside
  /var/lib/libvirt/images)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Xin Wang <xiwang5@microsoft.com>
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@wangxin wangxin marked this pull request as draft May 14, 2026 02:40
@wangxin
Copy link
Copy Markdown
Collaborator Author

wangxin commented May 14, 2026

Need some re-work. Close for now.

@wangxin wangxin closed this May 14, 2026
wangxin and others added 3 commits May 14, 2026 11:11
The python_command set_fact in control_mux_simulator.yml and
control_nic_simulator.yml was just an alias for
{{ sonic_testbed_venv }}/bin/python3, used only by the systemd
service templates. Having two variables for the same path
defeats the purpose of the sonic_testbed_venv refactor.

- Remove python_command from both set_fact blocks
- Reference {{ sonic_testbed_venv }}/bin/python3 directly in
  mux-simulator.service.j2 and nic-simulator.service.j2

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Xin Wang <xiwang5@microsoft.com>
The previous approach unconditionally rewrote /etc/libvirt/qemu.conf to
run QEMU as root on Ubuntu 24.04+ (and via force_qemu_root opt-in on
22.04), which goes against the distro security model and broadens the
attack surface.

The actual root cause is that newer Ubuntu releases default the home
directory mode to 0750 (DIR_MODE in /etc/adduser.conf), which prevents
the unprivileged libvirt-qemu user from traversing /home/<user> to
reach VM disk images stored under {{ home_path }}/sonic-vm/disks/.

Replace the qemu.conf rewrite with a targeted POSIX ACL granting only
libvirt-qemu the execute (traverse) bit on home_path:

- Drop the qemu.conf rewrite block and the force_qemu_root opt-in
- Add 'acl' to the apt package list in host_setup.yml
- Add an 'acl' task in main.yml after home_path is computed by getent
- Bump host_setup_version to 2 so existing testbeds re-run host_setup
  to pick up the acl package install

This keeps QEMU running as the unprivileged libvirt-qemu user, only
grants the minimum permission needed (execute, not read), survives
chmod operations, and applies uniformly across Ubuntu 22.04 and 24.04+.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Xin Wang <xiwang5@microsoft.com>
…n up

- main.yml: gate the libvirt-qemu ACL task on package_installation, since
  the acl package and libvirt-qemu user are both installed by host_setup;
  with package_installation=false the role no longer assumes either exists
- main.yml: revert speculative host_setup_version bump (PR not merged yet)
- ensure_python_env.yml: switch to standard uv installer URL instead of
  the version-pinned one (uv version is invisible to playbook behavior;
  we already pin the actual Python packages we install)
- ensure_python_env.yml, control_mux_simulator.yml, control_nic_simulator.yml:
  pre-check installed package versions and skip 'uv pip install' when the
  venv already matches the pinned versions; eliminates redundant work and
  spurious 'changed' status on every play invocation
- ensure_python_env.yml: fix misleading comment that claimed the file
  'always runs (not gated by the host_setup flag)'; it is in fact gated by
  package_installation in main.yml
- host_setup.yml: drop dangling comment that documented an ACL task
  living in a different file

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Xin Wang <xiwang5@microsoft.com>
@wangxin wangxin reopened this May 14, 2026
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

1 similar comment
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

1 similar comment
@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@wangxin
Copy link
Copy Markdown
Collaborator Author

wangxin commented May 14, 2026

@saiarcot895 @StormLiangMS — wanted to consolidate the qemu.conf discussion since the approach has evolved meaningfully across iterations of this PR.

Original symptom: on freshly provisioned Ubuntu hosts (especially 24.04) the SONiC VM fails to boot with:

libvirt.libvirtError: Cannot access storage file
  ''/home/<user>/sonic-vm/disks/sonic_vlab-01.img'' (as uid:64055, gid:108): Permission denied

Root cause (which I had to actually verify rather than guess): /etc/adduser.conf on newer Ubuntu releases sets DIR_MODE=0750 instead of 0755. The unprivileged libvirt-qemu user is in groups {kvm, libvirt-qemu} — neither matches the user''s group on /home/<user> — so it cannot traverse the home directory to reach the disk image. Comparing two of my hosts confirmed this is the only difference: vm0 (created in 2024, DIR_MODE=0755) works out of the box; vm1 (created in 2026, DIR_MODE=0750) does not. No AppArmor denial is involved on 24.04 — the AppArmor libvirt-<vm-uuid> profile dynamically allows the configured disk path; it''s purely a filesystem traversal issue.

What previous iterations did and why they were wrong:

  • Iteration 1: unconditionally rewrote /etc/libvirt/qemu.conf to run QEMU as root on Ubuntu 24.04+. @StormLiangMS correctly flagged that this would silently flip behavior on every existing 22.04 testbed too if they ever upgraded.
  • Iteration 2: re-gated to >= 24.04 and added a force_qemu_root=true opt-in flag for 22.04. @saiarcot895 correctly pushed back that running QEMU as root goes against the distro security model — libvirt explicitly designs system VMs to run as libvirt-qemu with a small set of privileged helpers, and root QEMU broadens the attack surface for what should be an unprivileged process.

Final approach in this PR: drop the qemu.conf rewrite and the force_qemu_root flag entirely. Grant the libvirt-qemu user the minimum permission it actually needs via a targeted POSIX ACL on home_path:

- name: Allow libvirt-qemu to traverse home directory
  acl:
    path: "{{ home_path }}"
    entity: libvirt-qemu
    etype: user
    permissions: x
    state: present

Properties:

  • QEMU continues to run as the unprivileged libvirt-qemu user — distro security model preserved.
  • ACL grants --x (traverse) only, not r-x. libvirt-qemu cannot ls or read anything else in the home directory.
  • Mask resolves to r-x, so existing group permissions are not silently shrunk.
  • Survives subsequent chmod operations on the home dir.
  • Same code path works on Ubuntu 22.04 and 24.04 — no version gates.
  • No libvirtd restart needed.
  • Other users on the host are unaffected; each user''s home gets its own scoped ACL.
  • On hosts where DIR_MODE=0755 (e.g. older Ubuntu releases) the ACL is redundant-but-harmless — the world bit was already sufficient for traversal.

Verified end-to-end with testbed-cli.sh add-topo vms-kvm-t0 on both Ubuntu 22.04 and Ubuntu 24.04. In both cases qemu.conf is left at its libvirt defaults, the QEMU process runs as libvirt-qemu, and sudo -u libvirt-qemu test -r /home/<user>/sonic-vm/disks/sonic_vlab-01.img returns 0.

One caveat (also called out in the updated PR description): hosts that ran an earlier iteration of this branch will still have user = "root" in their qemu.conf. The role no longer touches that file, so the stale state persists. A one-time sed + systemctl restart libvirtd reverts it. Given that no one outside this PR review thread has run those earlier iterations, the impact should be minimal.

Thanks to both of you for pushing back on the original approach — the final result is meaningfully better than where this started.

@wangxin wangxin marked this pull request as ready for review May 14, 2026 07:22
@wangxin
Copy link
Copy Markdown
Collaborator Author

wangxin commented May 15, 2026

/azpw retry

@mssonicbld
Copy link
Copy Markdown
Collaborator

Retrying failed(or canceled) jobs...

@wangxin wangxin closed this May 15, 2026
@wangxin wangxin reopened this May 15, 2026
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants