[ansible/vm_set]: Refactor host setup and Python environment management#24512
[ansible/vm_set]: Refactor host setup and Python environment management#24512wangxin wants to merge 6 commits into
Conversation
- Extract heavy host-setup tasks (apt, docker, sysctl) into host_setup.yml, gated by a version-stamped flag file at /opt/sonic-testbed/.host_setup_done. Subsequent vm_set invocations skip setup unless the version changes or FORCE_HOST_SETUP=true is set. - Replace python3-pip/pip with uv + persistent venv at /opt/sonic-testbed/venv. Install uv 0.7.2 via standalone installer. Venv uses --system-site-packages to inherit apt-provided python3-libvirt and python3-lxml. - Drop Ubuntu < 22.04 support with explicit fail-early error messages. Remove all 16.04/17.04/18.04/20.04 code paths (python2, old docker repos, pip_executable logic, old virtualenv workarounds). - Simplify docker.yml, control_mux_simulator.yml, control_nic_simulator.yml to use uv pip and venv python instead of system pip. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Xin Wang <xiwang5@microsoft.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
@wangxin — nice refactor overall, the flag-gated host setup + uv venv is the right shape and the motivation is clear. One thing I'd like to see addressed before merging: 🔴 Blocker —
|
- Use bare 'uv' command instead of absolute path /usr/local/bin/uv, allowing pre-existing uv installations in PATH to be used. - Remove explicit blinker install from mux simulator; Flask's dependency resolver handles it automatically in the venv. - Re-gate qemu.conf user/group=root on Ubuntu >= 24.04, with opt-in 'force_qemu_root' variable for 22.04 users who hit disk-image permission errors. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Xin Wang <xiwang5@microsoft.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
@StormLiangMS Thanks for the review! Agreed on the qemu.conf concern — I've re-gated it to only apply on Ubuntu >= 24.04. For 22.04 users who hit libvirt disk-image permission errors, there's now an opt-in variable |
Regarding this, I haven't seen this permission issue when I was using Ubuntu 24.04. I also don't think qemu (when started as libvirt) should be running as root when the intention of the distro (and maybe libvirt themselves) is to run it as a non-root user. Is this the best fix to make? What was the actual original issue? Edit, from this:
|
… 2.33.1, document qemu.conf rationale - Extract hardcoded /opt/sonic-testbed/venv into sonic_testbed_venv variable in defaults/main.yml (DRY across ensure_python_env, control_mux_simulator, control_nic_simulator, and main.yml) - Bump requests from 2.32.3 to 2.33.1 (CVE fix in v2.33.0) - Add comment explaining why qemu.conf sets user=root on Ubuntu 24.04+ (AppArmor denies libvirt-qemu access to disk images outside /var/lib/libvirt/images) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Xin Wang <xiwang5@microsoft.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
Need some re-work. Close for now. |
The python_command set_fact in control_mux_simulator.yml and
control_nic_simulator.yml was just an alias for
{{ sonic_testbed_venv }}/bin/python3, used only by the systemd
service templates. Having two variables for the same path
defeats the purpose of the sonic_testbed_venv refactor.
- Remove python_command from both set_fact blocks
- Reference {{ sonic_testbed_venv }}/bin/python3 directly in
mux-simulator.service.j2 and nic-simulator.service.j2
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Xin Wang <xiwang5@microsoft.com>
The previous approach unconditionally rewrote /etc/libvirt/qemu.conf to
run QEMU as root on Ubuntu 24.04+ (and via force_qemu_root opt-in on
22.04), which goes against the distro security model and broadens the
attack surface.
The actual root cause is that newer Ubuntu releases default the home
directory mode to 0750 (DIR_MODE in /etc/adduser.conf), which prevents
the unprivileged libvirt-qemu user from traversing /home/<user> to
reach VM disk images stored under {{ home_path }}/sonic-vm/disks/.
Replace the qemu.conf rewrite with a targeted POSIX ACL granting only
libvirt-qemu the execute (traverse) bit on home_path:
- Drop the qemu.conf rewrite block and the force_qemu_root opt-in
- Add 'acl' to the apt package list in host_setup.yml
- Add an 'acl' task in main.yml after home_path is computed by getent
- Bump host_setup_version to 2 so existing testbeds re-run host_setup
to pick up the acl package install
This keeps QEMU running as the unprivileged libvirt-qemu user, only
grants the minimum permission needed (execute, not read), survives
chmod operations, and applies uniformly across Ubuntu 22.04 and 24.04+.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Xin Wang <xiwang5@microsoft.com>
…n up - main.yml: gate the libvirt-qemu ACL task on package_installation, since the acl package and libvirt-qemu user are both installed by host_setup; with package_installation=false the role no longer assumes either exists - main.yml: revert speculative host_setup_version bump (PR not merged yet) - ensure_python_env.yml: switch to standard uv installer URL instead of the version-pinned one (uv version is invisible to playbook behavior; we already pin the actual Python packages we install) - ensure_python_env.yml, control_mux_simulator.yml, control_nic_simulator.yml: pre-check installed package versions and skip 'uv pip install' when the venv already matches the pinned versions; eliminates redundant work and spurious 'changed' status on every play invocation - ensure_python_env.yml: fix misleading comment that claimed the file 'always runs (not gated by the host_setup flag)'; it is in fact gated by package_installation in main.yml - host_setup.yml: drop dangling comment that documented an ACL task living in a different file Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Xin Wang <xiwang5@microsoft.com>
|
/azp run |
1 similar comment
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
1 similar comment
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
@saiarcot895 @StormLiangMS — wanted to consolidate the qemu.conf discussion since the approach has evolved meaningfully across iterations of this PR. Original symptom: on freshly provisioned Ubuntu hosts (especially 24.04) the SONiC VM fails to boot with: Root cause (which I had to actually verify rather than guess): What previous iterations did and why they were wrong:
Final approach in this PR: drop the qemu.conf rewrite and the - name: Allow libvirt-qemu to traverse home directory
acl:
path: "{{ home_path }}"
entity: libvirt-qemu
etype: user
permissions: x
state: presentProperties:
Verified end-to-end with One caveat (also called out in the updated PR description): hosts that ran an earlier iteration of this branch will still have Thanks to both of you for pushing back on the original approach — the final result is meaningfully better than where this started. |
|
/azpw retry |
|
Retrying failed(or canceled) jobs... |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Description of PR
Summary:
Refactor the
ansible/roles/vm_setrole to address two problems:python3-pip, which on Ubuntu 24.04 pulls inpython3-wheelwith a known S360 vulnerability.This PR extracts host setup into a flag-gated include, replaces
pipwithuv+ a persistent venv, drops support for Ubuntu < 22.04, and fixes a libvirt-qemu home-traversal issue on hosts with the modernDIR_MODE=0750adduser default via a targeted POSIX ACL.Type of change
Back port request
Approach
What is the motivation for this PR?
Redundant host setup: The vm_set role is invoked by ~10 playbooks (
add-topo,remove-topo,restart-ptf,start-VMs,stop-VMs, etc.). Each invocation re-runs all host-setup tasks (apt package installation, Docker setup, sysctl tuning, kernel module loading), even though these only need to run once per host. This wastes significant time on repeated testbed operations.Global pip installs and S360 vulnerability: Python packages (
requests,docker,flask,grpcio, etc.) are installed globally viapython3-pip. On Ubuntu 24.04, installingpython3-pippulls inpython3-wheelas a dependency, which has a known S360 security vulnerability. Additionally, PEP 668 on Ubuntu 24.04 marks the system Python as externally managed, making global pip installs fail without--break-system-packages.Dead code: Support for Ubuntu 16.04, 17.04, 18.04, and 20.04 is no longer needed but adds complexity with version-conditional blocks and Python 2 fallbacks.
libvirt-qemu home traversal: On freshly provisioned Ubuntu hosts the default
DIR_MODEshipped in/etc/adduser.confis0750, which blocks the unprivilegedlibvirt-qemuuser from traversing/home/<user>to open VM disk images stored under~/sonic-vm/disks/. The result is aPermission denied (as uid:64055, gid:108)error when starting the SONiC VM.How did you do it?
New files:
defaults/main.yml— Declaressonic_testbed_venv: "/opt/sonic-testbed/venv"so the venv path is defined in exactly one place.host_setup.yml— Extracted all heavy one-time setup tasks (apt packages, Docker, sysctl, br_netfilter). Gated by a version-stamped flag file at/opt/sonic-testbed/.host_setup_done. Subsequent invocations skip setup unless the version string is bumped orFORCE_HOST_SETUP=trueis set. Addsaclto the apt package list.ensure_python_env.yml— Idempotent tasks to installuvand create a persistent venv at{{ sonic_testbed_venv }}with--system-site-packages(to inherit apt-providedpython3-libvirtandpython3-lxml). Installsrequests==2.33.1(CVE fix in v2.33.0) anddocker==7.1.0. Pre-checks installed versions and skipsuv pip installwhen the venv already matches.Modified files:
main.yml— Slimmed down from ~477 to ~300 lines. Added fail-fast for non-Ubuntu or Ubuntu < 22.04. Fixedpackage_installationenv var lookup to not override playbook-set values. Moved user group membership outside the flag gate (per-user, not per-host). Added anacltask grantinglibvirt-qemuthe execute (traverse) bit onhome_path; gated onpackage_installationsince both theaclpackage and thelibvirt-qemuuser are installed viahost_setup.yml.docker.yml— Removed Ubuntu 16.04–20.04 docker repo blocks, all pip/virtualenv Python package installs, and the trailingansible_python_interpreteroverride.control_mux_simulator.yml/control_nic_simulator.yml— Replaced pip installs withuv pip installinto the venv; pre-check installed versions to skip re-install when already present.templates/mux-simulator.service.j2/nic-simulator.service.j2— Use{{ sonic_testbed_venv }}/bin/python3directly; the redundantpython_commandset_fact is removed.Key design decisions:
--system-site-packagesis mandatory becausepython3-libvirtandpython3-lxmlare C extensions installed via apt that must match the system's shared libraries.ensure_python_env.ymland the libvirt-qemu ACL task are both gated bypackage_installationto respect playbooks (e.g.testbed_renumber_vm_topology.yml) that intentionally skip first-time provisioning.ansible_python_interpreteris only set to venv python if the venv actually exists, preventing failures whenpackage_installation=false./etc/libvirt/qemu.confto run QEMU as root (which goes against the distro security model and broadens the attack surface), we grant the unprivilegedlibvirt-qemuuser the minimum permission it needs — a POSIX ACL granting--x(traverse only, not read) onhome_path. This keeps QEMU running aslibvirt-qemu, applies uniformly to Ubuntu 22.04 and 24.04, survives subsequentchmodoperations on the home directory, and does not require restartinglibvirtd. See the discussion thread on PR for context.How did you verify/test it?
Tested on both Ubuntu 22.04 and Ubuntu 24.04 with a VS testbed:
testbed-cli.sh add-topo vms-kvm-t0(fresh host)ok=302 changed=49 failed=0, VMvlab-01runningok=283 changed=35 failed=0, VMvlab-01runninglibvirt-qemu(unprivileged)libvirt-qemu(unprivileged)/etc/libvirt/qemu.confuser/group lineshome_pathuser:libvirt-qemu:--xapplieduser:libvirt-qemu:--xappliedadd-topo(idempotence)pkg_versions.stdout_lines != ['2.33.1', '7.1.0']false-condition), ACL re-applied after manual striprestart-ptf(package_installation=falsestyle path)Any platform specific information?
Caveat for hosts that previously ran an earlier iteration of this branch
An earlier iteration of this PR (now reverted) unconditionally rewrote
/etc/libvirt/qemu.confto run QEMU as root on Ubuntu 24.04. The final approach in this PR does not touchqemu.confand instead uses an ACL. Any host that pulled an earlierxiwang5/vm-set-refactor-host-setupHEAD will still haveuser = "root"/group = "root"inqemu.conf. To revert manually after this PR lands:Supported testbed topology if it's a new test case?
N/A — this is a testbed infrastructure improvement, not a new test case.
Documentation
N/A — internal ansible role change, no user-facing documentation impact.