diff --git a/CLAUDE.md b/CLAUDE.md index afca8ee..6404a89 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -4,455 +4,280 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co ## Project Overview -ARM64 Type-1 bare-metal hypervisor written in Rust (no_std) with ARM64 assembly. Runs at EL2 (hypervisor exception level) and manages guest VMs at EL1. Targets QEMU virt machine. Boots Linux 6.12.12 to BusyBox shell with 4 vCPUs, virtio-blk storage, and virtio-net inter-VM networking. Supports multi-VM with per-VM Stage-2, VMID-tagged TLBs, two-level scheduling, and L2 virtual switch. Includes FF-A v1.1 proxy with stub SPMC, page ownership validation via Stage-2 PTE SW bits, FF-A v1.1 descriptor parsing, SMC forwarding to EL3, and VM-to-VM memory sharing (MEM_RETRIEVE/RELINQUISH with dynamic Stage-2 page mapping). Android boot with PL031 RTC emulation, Binder IPC, binderfs, minimal init, 1GB guest RAM. Dual boot modes: NS-EL2 hypervisor via `make run-tfa-linux` (BL33, `tfa_boot` feature) and S-EL2 SPMC via `make run-spmc` (BL32). TF-A boot chain: BL1→BL2→BL31(SPMD)→BL32(SPMC)→BL33 with manifest FDT parsing. SPMC boots SP1 (Hello) + SP2 (IRQ) at S-EL1 via ERET with per-SP Secure Stage-2, dispatches NWd→SP DIRECT_REQ/RESP messaging to multiple SPs. End-to-end FF-A DIRECT_REQ: NS proxy → SPMD → SPMC → SP1/SP2 (SP modifies x4 += 0x1000 as proof). SP-to-SP DIRECT_REQ with CallStack cycle detection, recursive dispatch, and chain preemption. E2E memory sharing: NWd SHARE → SP RETRIEVE → SP write → SP RELINQUISH → NWd verify → NWd RECLAIM (SP-initiated FF-A calls via `handle_sp_exit()` loop). SP-to-SP MEM_SHARE: SP1 shares Secure DRAM page with SP2 via SPMC (MEM_SHARE→RETRIEVE→read/write→RELINQUISH→RECLAIM). MEM_DONATE: irrevocable ownership transfer (RECLAIM/RELINQUISH blocked). 20/20 BL33 integration tests pass. FFA_CONSOLE_LOG (SP debug logging to UART), SRI/NPI feature IDs (donated SGI INTIDs), MEM_FRAG_TX/RX (descriptor fragmentation). 
NS interrupt preemption: IRQ during SP → FFA_INTERRUPT → NWd calls FFA_RUN → SPMC resumes SP (CNTHP timer, SP_IRQ_PREEMPTED flag, Preempted state). Secure virtual interrupt injection: per-SP INTID ownership, CNTHP poll timer at S-EL2, HCR_EL2.VI + HF_INTERRUPT_GET paravirt (Hafnium-compatible), cross-SP preemption. SPMC manages NWd RXTX state (SPMD forwards RXTX_MAP/UNMAP/RX_RELEASE from NWd to SPMC per TF-A v2.12), NS proxy registers its own RXTX with SPMD, PARTITION_INFO_GET writes 24-byte FF-A v1.1 descriptors to NWd's RX buffer, Linux FF-A driver support (`CONFIG_ARM_FFA_TRANSPORT`, guest DTB `arm,ffa` node). +ARM64 Type-1 bare-metal hypervisor in Rust (`no_std`) + ARM64 assembly. One codebase, two compile-time personalities: + +- **NS-EL2 hypervisor** — boots Linux 6.12.12 to BusyBox (4 vCPUs, virtio-blk/net, multi-VM with per-VM Stage-2 + VMID TLBs + L2 vswitch), Android (PL031 RTC, Binder, 1GB RAM), and a FF-A v1.1 proxy (pKVM-compatible). +- **S-EL2 SPMC** (`sel2`) — runs as TF-A BL32, manages Secure Partitions at S-EL1 (SP1 Hello, SP2 IRQ, SP3 Relay), full FF-A v1.1: DIRECT_REQ (incl. SP↔SP with cycle detection), memory sharing (SHARE/LEND/DONATE/RETRIEVE/RELINQUISH/RECLAIM, NWd↔SP and SP↔SP), secure vIRQ injection, NS interrupt preemption. + +End state target: TF-A BL31+SPMD @ EL3 → our SPMC @ S-EL2 → SPs @ S-EL1, alongside pKVM @ NS-EL2 → Linux/Android @ NS-EL1. Validated E2E against real pKVM (`ffa_test.ko`: 35/35 PASS). See Roadmap and `DEVELOPMENT_PLAN.md`. 
## Build Commands ```bash make # Build hypervisor -make run # Build + run in QEMU — runs 33 test suites automatically (exit: Ctrl+A then X) -make run-linux # Build + boot Linux guest (--features linux_guest, 4 vCPUs on 1 pCPU, virtio-blk) -make run-linux-smp # Build + boot Linux guest (--features multi_pcpu, 4 vCPUs on 4 pCPUs) -make run-multi-vm # Build + boot 2 Linux VMs time-sliced (--features multi_vm) -make run-android # Build + boot Android-configured kernel (PL031 RTC, Binder, minimal init, 1GB RAM) +make run # Build + run in QEMU — runs 34 test suites automatically (exit: Ctrl+A then X) +make run-linux # Boot Linux guest (--features linux_guest, 4 vCPUs on 1 pCPU, virtio-blk) +make run-linux-smp # Boot Linux guest (--features multi_pcpu, 4 vCPUs on 4 pCPUs) +make run-multi-vm # Boot 2 Linux VMs time-sliced (--features multi_vm) +make run-android # Boot Android-configured kernel (PL031 RTC, Binder, minimal init, 1GB RAM) make run-guest GUEST_ELF=/path/to/zephyr.elf # Boot Zephyr guest (--features guest) -make run-sel2 # Boot TF-A with trivial BL32 at S-EL2 (requires build-tfa first) -make run-tfa-linux # Boot TF-A → hypervisor (BL33) → Linux (requires build-tfa-bl33 first) -make run-spmc # Boot TF-A → our SPMC (BL32) at S-EL2 (requires build-tfa-spmc first) -make build-tfa-full # Build TF-A with real SPMC (BL32) + preloaded BL33 hypervisor -make run-tfa-linux-ffa # Boot TF-A → SPMC → hypervisor (BL33) → Linux (FF-A discovery) -make build-qemu # Build QEMU 9.2.3 from source (one-time, Docker) -make build-tfa # Build TF-A flash.bin with SPD=spmd (Docker) -make build-tfa-bl33 # Build TF-A flash.bin with PRELOADED_BL33_BASE=0x40200000 -make build-spmc # Build hypervisor as S-EL2 SPMC binary (--features sel2) -make build-sp-hello # Build SP Hello binary (S-EL1 Secure Partition) -make build-sp-irq # Build SP IRQ binary (S-EL1, interrupt handling) -make build-sp-relay # Build SP Relay binary (S-EL1, SP-to-SP DIRECT_REQ relay) -make build-tfa-spmc # Build TF-A with 
real SPMC as BL32 + SP Hello + SP IRQ -make build-pkvm-kernel # Build AOSP android16-6.12 kernel for pKVM (Docker, ~15-30min first time) -make build-tfa-pkvm # Build TF-A flash-pkvm.bin (ARM_LINUX_KERNEL_AS_BL33, Linux as BL33) -make run-pkvm # Boot pKVM (NS-EL2) + our SPMC (S-EL2) — AOSP kernel as BL33 (requires build-pkvm-kernel + build-tfa-pkvm) -make run-pkvm-ffa-test # Boot pKVM with FF-A test module (35/35 PASS) -make build-crosvm # Build crosvm VMM for aarch64 (Docker, ~5-10min first time) -make build-crosvm-initramfs # Build pKVM initramfs with crosvm + pVM kernel -make run-crosvm # Boot pKVM (nVHE) + crosvm pVM (AVF validation, requires ARM64 host for KVM accel) make debug # Build + run with GDB server on port 1234 -make clean # Clean build artifacts -make check # Check code without building -make clippy # Run linter -make fmt # Format code +make check / clippy / fmt / clean + +# Secure-world / TF-A chain (all Docker-based, TCG-only — KVM cannot virtualize EL3/Secure) +make build-qemu # Build QEMU 9.2.3 from source (one-time) +make build-tfa-bl33 # TF-A flash.bin, PRELOADED_BL33_BASE=0x40200000 +make run-tfa-linux # TF-A → hypervisor (BL33) → Linux (needs build-tfa-bl33) +make build-tfa-spmc # TF-A + real SPMC (BL32) + SP Hello/IRQ/Relay +make run-spmc # TF-A → our SPMC (BL32) at S-EL2 +make build-tfa-full # TF-A real SPMC (BL32) + preloaded BL33 hypervisor +make run-tfa-linux-ffa # TF-A → SPMC → hypervisor (BL33) → Linux (FF-A discovery) +make build-spmc / build-sp-hello / build-sp-irq / build-sp-relay # individual binaries + +# pKVM integration (AOSP kernel as BL33) +make build-pkvm-kernel # AOSP android16-6.12 kernel for pKVM (Docker, ~15-30min first time) +make build-tfa-pkvm # flash-pkvm.bin (ARM_LINUX_KERNEL_AS_BL33) +make run-pkvm # pKVM (NS-EL2) + our SPMC (S-EL2), AOSP kernel as BL33 +make run-pkvm-ffa-test # pKVM with FF-A test module (35/35 PASS) + +# AVF / crosvm (needs ARM64 host + /dev/kvm for full validation) +make build-crosvm / 
build-crosvm-initramfs / run-crosvm ``` -**Feature flags** (Cargo features, selected via Makefile targets): +**Feature flags** (Cargo, selected by Makefile targets): - `(default)` — unit tests only, no guest boot -- `guest` — Zephyr guest loading -- `linux_guest` — Linux guest with DynamicIdentityMapper, GICR trap-and-emulate, virtio-blk, virtio-net -- `multi_pcpu` — Multi-pCPU support (implies `linux_guest`): 1:1 vCPU-to-pCPU affinity, PSCI boot, TPIDR_EL2 context, SpinLock devices -- `multi_vm` — Multi-VM support (implies `linux_guest`): 2 VMs time-sliced on 1 pCPU, per-VM Stage-2/VMID, per-VM DeviceManager -- `sel2` — S-EL2 SPMC mode: hypervisor as BL32 (SPMC role), separate boot_sel2.S entry, linker base 0x0e100000 (secure DRAM), manifest parsing, FFA_MSG_WAIT handshake, secondary CPU warm-boot via FFA_SECONDARY_EP_REGISTER, boots SP1 (Hello) + SP2 (IRQ) + SP3 (Relay) -- `tfa_boot` — TF-A boot mode (implies `linux_guest`): sets SPMC_PRESENT=true at compile time, NS proxy registers RXTX with SPMD, forwards DIRECT_REQ, PARTITION_INFO_GET, and MEM_SHARE/LEND/RECLAIM (SP receivers) to real SPMC via 8-register SMC +- `guest` — Zephyr loading +- `linux_guest` — Linux guest: DynamicIdentityMapper, GICR trap-and-emulate, virtio-blk/net +- `multi_pcpu` (⊃ linux_guest) — 1:1 vCPU↔pCPU affinity, PSCI boot, TPIDR_EL2 context, SpinLock devices +- `multi_vm` (⊃ linux_guest) — 2 VMs time-sliced on 1 pCPU, per-VM Stage-2/VMID/DeviceManager +- `sel2` — S-EL2 SPMC mode: BL32, `boot_sel2.S`, linker base 0x0e100000, manifest parse, secondary warm-boot, boots SP1+SP2+SP3 +- `tfa_boot` (⊃ linux_guest) — SPMC_PRESENT=true, NS proxy forwards DIRECT_REQ/PARTITION_INFO/MEM_SHARE/LEND/RECLAIM to real SPMC via SPMD -**Note**: `multi_pcpu` and `multi_vm` are mutually exclusive — both imply `linux_guest` but use different scheduling models. `sel2` is mutually exclusive with all others. `tfa_boot` is used with `run-tfa-linux` when a real SPMC is available at S-EL2. 
+**Mutual exclusivity**: `multi_pcpu` ⊥ `multi_vm`; `sel2` ⊥ all others. `tfa_boot` pairs with `run-tfa-linux`. -**Toolchain requirements**: Rust nightly, `aarch64-linux-gnu-gcc`, `aarch64-linux-gnu-ar`, `aarch64-linux-gnu-objcopy`, `qemu-system-aarch64` +**Toolchain**: Rust nightly, `aarch64-linux-gnu-{gcc,ar,objcopy}`, `qemu-system-aarch64`. ## Architecture ### Privilege Model -- **EL2**: Hypervisor — exception handling, Stage-2 page tables, GIC virtual interface -- **EL1**: Guest — Linux kernel or Zephyr RTOS -- **Stage-2 Translation**: Identity mapping (GPA == HPA), 2MB blocks + 4KB pages +- **EL2** (or **S-EL2** in `sel2`): hypervisor — exceptions, Stage-2 tables, GIC virtual interface +- **EL1**: guest (Linux/Zephyr) or Secure Partition +- **Stage-2**: identity map (GPA==HPA), 2MB blocks + 4KB pages ### Core Abstractions | Type | File | Role | |------|------|------| | `Vm` | `src/vm.rs` | VM lifecycle, Stage-2 setup, `run_smp()` scheduler loop | -| `Vcpu` | `src/vcpu.rs` | State machine (Uninitialized→Ready→Running→Stopped), context save/restore | -| `VcpuContext` | `src/arch/aarch64/regs.rs` | Guest registers (x0-x30, SP, PC, SPSR, system regs) | -| `VcpuArchState` | `src/arch/aarch64/vcpu_arch_state.rs` | Per-vCPU GIC LRs, timer, EL1 sysregs, PAC keys | -| `DeviceManager` | `src/devices/mod.rs` | Enum-dispatch MMIO routing to emulated devices | -| `Scheduler` | `src/scheduler.rs` | Round-robin vCPU scheduler with block/unblock | -| `ExitReason` | `src/arch/aarch64/regs.rs` | VM exit causes: WfiWfe, HvcCall, SmcCall, DataAbort, etc. 
| -| `FfaProxy` | `src/ffa/proxy.rs` | FF-A v1.1 proxy: intercepts guest SMC, handles VERSION/ID_GET/FEATURES/RXTX/messaging/memory | -| `Stage2Walker` | `src/ffa/stage2_walker.rs` | Stage-2 page table walker from VTTBR_EL2: PTE SW bits, S2AP, map_page/unmap_page for cross-VM sharing | +| `Vcpu` | `src/vcpu.rs` | State machine, context save/restore | +| `VcpuContext` / `VcpuArchState` | `src/arch/aarch64/regs.rs`, `vcpu_arch_state.rs` | Guest regs; per-vCPU GIC LRs, timer, EL1 sysregs, PAC keys | +| `DeviceManager` | `src/devices/mod.rs` | Enum-dispatch MMIO routing | +| `Scheduler` | `src/scheduler.rs` | Round-robin vCPU scheduler, block/unblock | +| `ExitReason` | `src/arch/aarch64/regs.rs` | VM exit causes (WfiWfe, HvcCall, SmcCall, DataAbort, …) | +| `FfaProxy` | `src/ffa/proxy.rs` | NS-EL2 FF-A v1.1 proxy (VERSION/ID/FEATURES/RXTX/messaging/memory) | +| `Stage2Walker` | `src/ffa/stage2_walker.rs` | Walker from VTTBR_EL2: PTE SW bits, S2AP, map/unmap for cross-VM sharing | | `FfaDescriptors` | `src/ffa/descriptors.rs` | FF-A v1.1 composite memory region descriptor parsing | -| `SmcForward` | `src/ffa/smc_forward.rs` | SMC forwarding to EL3 + SPMC probe | -| `PlatformInfo` | `src/dtb.rs` | Runtime DTB parsing: UART, GIC, RAM, CPU count discovery | -| `VSwitch` | `src/vswitch.rs` | L2 virtual switch with MAC learning, inter-VM frame forwarding | -| `NetRxRing` | `src/vswitch.rs` | Per-port SPSC ring buffer for async RX frame delivery | -| `VirtualPl031` | `src/devices/pl031.rs` | PL031 RTC emulation: counter-based time, PrimeCell ID | -| `SpMcManifest` | `src/manifest.rs` | SPMC manifest parser: TOS_FW_CONFIG DTB (spmc_id, version) | -| `SpmcHandler` | `src/spmc_handler.rs` | S-EL2 SPMC event loop + FF-A dispatch, multi-SP DIRECT_REQ routing via `dispatch_to_sp()` + `enter_guest()` ERET, SP-initiated FF-A call loop in `handle_sp_exit()` (MEM_RETRIEVE_REQ/MEM_RELINQUISH/CONSOLE_LOG → handle locally → re-enter SP), SP→SP DIRECT_REQ routing (CallStack cycle 
detection, recursive dispatch_to_sp, chain preemption via handle_sp_exit sentinel pattern), NS interrupt preemption (SP_IRQ_PREEMPTED flag, CNTHP timer, FFA_INTERRUPT return), `resume_preempted_sp()` via FFA_RUN, secure vIRQ injection via `inject_pending_virq()` (HCR_EL2.VI), cross-SP preemption via `dispatch_interrupt_to_sp()`, NWd RXTX management, PARTITION_INFO_GET writes 24-byte descriptors to NWd RX buffer, SPMC-side memory sharing (MEM_SHARE/LEND/DONATE/RETRIEVE/RELINQUISH/RECLAIM with SpmcShareRecord storage, dynamic Secure Stage-2 mapping via Stage2Walker), SP-to-SP MEM_SHARE/LEND/DONATE/RECLAIM (SP-initiated sharing via handle_sp_exit), MSG_SEND2/MSG_WAIT indirect messaging (per-SP SpMailbox), CONSOLE_LOG (extracts packed characters to UART), SRI/NPI feature IDs | -| `SpContext` | `src/sp_context.rs` | Per-SP state machine (Reset→Idle→Running→Blocked→Preempted, incl. Blocked→Preempted for chain preemption), wraps VcpuContext, per-SP `owned_intids[4]` + `pending_irq`, global SpStore, `for_each_sp()`/`find_sp_for_intid()`/`find_sp_with_pending_irq()` iterators | -| `SecureStage2Config` | `src/secure_stage2.rs` | VSTTBR_EL2/VSTCR_EL2 config for SP isolation, `build_sp_stage2()` identity-maps SP code + UART | -| `Sel2Mmu` | `src/sel2_mmu.rs` | S-EL2 Stage-1 identity map: static L0/L1/L2 tables, NS=1 for NWd DRAM, Device for GIC/UART, Normal Secure for SPMC/SPs; `install_sel2_stage1_secondary()` for secondary CPU warm-boot | +| `PlatformInfo` | `src/dtb.rs` | Runtime DTB parsing (UART/GIC/RAM/CPU discovery) | +| `VSwitch` / `NetRxRing` | `src/vswitch.rs` | L2 vswitch + MAC learning; per-port SPSC RX ring | +| `VirtualPl031` | `src/devices/pl031.rs` | PL031 RTC emulation | +| `SpmcHandler` | `src/spmc_handler.rs` | S-EL2 SPMC event loop + FF-A dispatch (see SPMC section) | +| `SpContext` | `src/sp_context.rs` | Per-SP state machine (Reset→Idle→Running→Blocked→Preempted), `owned_intids[4]`, `pending_irq`, global SpStore | +| `SecureStage2Config` | 
`src/secure_stage2.rs` | VSTTBR/VSTCR config for SP isolation | +| `Sel2Mmu` | `src/sel2_mmu.rs` | S-EL2 Stage-1 identity map (NS=1 for NWd DRAM) | ### Exception Handling Flow ``` -Guest @ EL1 - ↓ trap (Data Abort, HVC, SMC, WFI, MSR/MRS) -Exception Vector (arch/aarch64/exception.S) — save context - ↓ -handle_exception() (src/arch/aarch64/hypervisor/exception.rs) +Guest @ EL1 ─trap→ exception.S (save ctx) ─→ handle_exception() (arch/aarch64/hypervisor/exception.rs) ├─ WFI → return false (exit to scheduler) - ├─ HVC → handle_psci() (PSCI v1.0: CPU_ON, CPU_OFF, SYSTEM_RESET) or HF_INTERRUPT_GET (sel2: returns pending INTID) - ├─ SMC → handle_smc() → PSCI or FF-A proxy or forward to EL3 - ├─ Data Abort → HPFAR_EL2 for IPA → decode instruction → MMIO dispatch - ├─ MSR/MRS trap → handle ICC_SGI1R_EL1 (SGI emulation), sysreg emulation - └─ IRQ → handle INTID 26 (preemption/poll timer), 27 (vtimer), 33 (UART RX); sel2: per-SP INTID routing via HCR_EL2.VI - ↓ advance PC, restore context -ERET back to guest + ├─ HVC → handle_psci() (PSCI v1.0) or HF_INTERRUPT_GET (sel2) + ├─ SMC → handle_smc() → PSCI / FF-A proxy / forward to EL3 + ├─ Data Abort → HPFAR_EL2 for IPA → decode instr → MMIO dispatch + ├─ MSR/MRS → ICC_SGI1R_EL1 (SGI emul), sysreg emul + └─ IRQ → INTID 26 (preempt/poll timer), 27 (vtimer), 33 (UART RX); sel2: per-SP routing via HCR_EL2.VI +─→ advance PC, restore ctx, ERET ``` ### SMP / Multi-vCPU +`run_smp()` loops `run_one_iteration()`, each running one vCPU on one pCPU (cooperative + preemptive): +check `pending_cpu_on` → wake vCPUs w/ pending SGIs/SPIs → round-robin pick → drain UART RX (SPI 33) → inject SGIs/SPIs into `ich_lr[]` → arm CNTHP preempt timer (10ms, INTID 26, only if 2+ online) → `vcpu.run()`/`enter_guest()` → handle exit. +- `vcpu_online_mask` **must include vCPU 0 at boot** or preempt timer never arms. +- SGI/IPI: ICC_SGI1R_EL1 trapped via ICH_HCR_EL2.TALL1=1 → decoded → `PENDING_SGIS[vcpu_id]` atomics → injected before entry. 
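The SGI/IPI path above (trap on ICC_SGI1R_EL1, decode, record in `PENDING_SGIS[vcpu_id]`) can be illustrated with a minimal sketch. The field positions (TargetList[15:0], Aff1[23:16], INTID[27:24]) are the architectural ones; the `SgiRequest` struct, `decode_sgi1r()`, `queue_sgi()`, the 4-entry array, and the "Aff1 must be 0, target bit n means vCPU n" mapping are assumptions made for illustration, not the project's actual code.

```rust
use core::sync::atomic::{AtomicU32, Ordering};

/// Number of vCPUs modeled here (assumption: the 4-vCPU Linux guest).
const NUM_VCPUS: usize = 4;

/// Hypothetical stand-in for the per-vCPU pending-SGI bitmasks (bit n = SGI n pending).
static PENDING_SGIS: [AtomicU32; NUM_VCPUS] = [
    AtomicU32::new(0),
    AtomicU32::new(0),
    AtomicU32::new(0),
    AtomicU32::new(0),
];

/// Decoded fields of a trapped ICC_SGI1R_EL1 write (hypothetical helper type).
struct SgiRequest {
    target_list: u16, // bits [15:0]: one bit per target CPU within the cluster
    aff1: u8,         // bits [23:16]: cluster affinity
    intid: u8,        // bits [27:24]: SGI number 0-15
}

fn decode_sgi1r(value: u64) -> SgiRequest {
    SgiRequest {
        target_list: (value & 0xFFFF) as u16,
        aff1: ((value >> 16) & 0xFF) as u8,
        intid: ((value >> 24) & 0xF) as u8,
    }
}

/// Marks the SGI pending for every targeted vCPU and returns the mask of vCPUs hit.
/// Sketch-only simplifications: a single cluster (Aff1 == 0) and no IRM broadcast.
fn queue_sgi(value: u64) -> u32 {
    let req = decode_sgi1r(value);
    if req.aff1 != 0 {
        return 0;
    }
    let mut targeted = 0u32;
    for vcpu in 0..NUM_VCPUS {
        if req.target_list & (1 << vcpu) != 0 {
            // Release pairs with an Acquire load on the injection side.
            PENDING_SGIS[vcpu].fetch_or(1 << req.intid, Ordering::Release);
            targeted |= 1 << vcpu;
        }
    }
    targeted
}
```

A scheduler loop would then read and clear each vCPU's mask before guest entry and turn the set bits into List Register entries.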
-`run_smp()` calls `run_one_iteration()` in a loop. Each iteration runs one vCPU on a single physical CPU via cooperative + preemptive scheduling: - -1. Check per-VM `pending_cpu_on` → `boot_secondary_vcpu()` (PSCI CPU_ON) -2. Wake vCPUs with pending SGIs/SPIs → `scheduler.unblock()` -3. Pick next vCPU (round-robin) → set `current_vcpu_id` -4. Drain UART RX ring → inject SPI 33 -5. Inject pending SGIs/SPIs into `arch_state.ich_lr[]` -6. Arm CNTHP preemption timer (10ms, INTID 26) — only when 2+ vCPUs online -7. `vcpu.run()` → save/restore arch state → `enter_guest()` → ERET -8. Handle exit: terminal→remove, CPU_ON/preemption→yield, WFI→block, other→yield - -**Important**: `vcpu_online_mask` must include vCPU 0 at boot — without it, preemption timer never activates. - -**SGI/IPI emulation**: ICC_SGI1R_EL1 trapped via ICH_HCR_EL2.TALL1=1 → decoded (TargetList[15:0], Aff1[23:16], INTID[27:24]) → `PENDING_SGIS[vcpu_id]` atomics → injected before next entry. - -### Multi-pCPU (4 vCPUs on 4 Physical CPUs) - -Feature: `multi_pcpu` (implies `linux_guest`). Target: `make run-linux-smp`. - -**Architecture**: 1:1 vCPU-to-pCPU affinity. Each physical CPU runs one vCPU exclusively — no scheduler needed. - -**Secondary pCPU Boot**: QEMU virt keeps secondary CPUs powered off. `wake_secondary_pcpus()` issues real PSCI CPU_ON SMC calls (`smc #0`, function_id=0xC4000003) to QEMU's EL3 firmware with `secondary_entry` as the entry point. - -**Per-CPU Context Pointer**: `TPIDR_EL2` (hardware-banked per physical CPU) replaces the global `current_vcpu_context` variable in `exception.S`. Set by `enter_guest()`, read by exception/IRQ handlers. - -**Physical GICR Programming**: `ensure_vtimer_enabled(cpu_id)` programs physical GICR ISENABLER0 for SGIs 0-15 + PPI 27 (vtimer) before every guest entry. Guest GICR writes only update the shadow `VirtualGicr` state. 
- -**Cross-pCPU SPI Delivery**: `inject_spi()` reads physical GICD_IROUTER directly (EL2 bypasses Stage-2) to avoid deadlock with the `DEVICES` SpinLock. If the target is a remote pCPU, sends physical SGI 0 via `msr icc_sgi1r_el1` to wake it. - -**WFI Passthrough**: TWI cleared in multi-pCPU mode — real WFI on physical CPU, woken by physical interrupts. - -### Multi-VM (2 Linux VMs Time-Sliced) - -Feature: `multi_vm` (implies `linux_guest`). Target: `make run-multi-vm`. - -**Architecture**: 2 VMs round-robin time-sliced on a single pCPU. Each VM has 4 vCPUs scheduled via the inner `run_one_iteration()` loop. - -**Per-VM Global State**: `VmGlobalState` struct (indexed by `CURRENT_VM_ID`) replaces flat globals. Each VM has its own `pending_sgis`, `pending_spis`, `vcpu_online_mask`, `current_vcpu_id`, and `preemption_exit`. - -**Per-VM DeviceManager**: `DEVICES: [GlobalDeviceManager; MAX_VMS]` array. Exception handler uses `CURRENT_VM_ID` to dispatch MMIO to the correct VM's devices. - -**VMID-Tagged Stage-2**: `Stage2Config::new_with_vmid()` encodes VMID in VTTBR_EL2 bits [63:48] for TLB isolation. `Vm::activate_stage2()` writes VTTBR_EL2/VTCR_EL2 before guest entry. +### Multi-pCPU (`multi_pcpu`, `run-linux-smp`) +1:1 vCPU↔pCPU affinity, no scheduler. Secondary pCPUs are powered off — `wake_secondary_pcpus()` issues real PSCI CPU_ON SMC. `TPIDR_EL2` (HW-banked) holds per-pCPU context. `ensure_vtimer_enabled()` programs physical GICR ISENABLER0 before each entry. `inject_spi()` reads physical GICD_IROUTER directly (avoids DEVICES-lock deadlock); cross-pCPU wake via physical SGI 0. WFI passthrough (TWI cleared). -**Two-Level Scheduler**: `run_multi_vm()` → outer VM round-robin → `CURRENT_VM_ID.store()` → `activate_stage2()` → `run_one_iteration()` → inner vCPU round-robin. - -**Memory Partitioning**: VM 0 at 0x48000000 (256MB), VM 1 at 0x68000000 (256MB). Each VM gets separate kernel, DTB, initramfs, and virtio-blk disk image loaded by QEMU. 
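The VMID tagging described above (VMID in VTTBR_EL2 bits [63:48]) can be sketched as plain bit packing. The function names below are hypothetical and do not reproduce `Stage2Config::new_with_vmid()`; the sketch also assumes the table base occupies bits [47:1] and that 16-bit VMIDs are enabled (with 8-bit VMIDs only bits [55:48] are usable).

```rust
/// VMID field position in VTTBR_EL2.
const VTTBR_VMID_SHIFT: u32 = 48;
/// Translation table base address field, bits [47:1] (bit 0 is CnP).
const VTTBR_BADDR_MASK: u64 = 0x0000_FFFF_FFFF_FFFE;

/// Hypothetical helper: combine a Stage-2 L0 table address with a VMID.
fn vttbr_with_vmid(l0_table_pa: u64, vmid: u16) -> u64 {
    ((vmid as u64) << VTTBR_VMID_SHIFT) | (l0_table_pa & VTTBR_BADDR_MASK)
}

/// Extract the VMID back out of a VTTBR_EL2 value.
fn vmid_of(vttbr: u64) -> u16 {
    (vttbr >> VTTBR_VMID_SHIFT) as u16
}

/// Extract the table base address back out of a VTTBR_EL2 value.
fn table_pa_of(vttbr: u64) -> u64 {
    vttbr & VTTBR_BADDR_MASK
}
```

Because TLB entries are tagged with the VMID, two VMs whose tables map the same IPA never hit each other's cached translations, which is what lets the outer scheduler switch VMs without a full TLB flush.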
+### Multi-VM (`multi_vm`, `run-multi-vm`) +2 VMs round-robin time-sliced on 1 pCPU, each with 4 vCPUs. `VmGlobalState[CURRENT_VM_ID]` replaces flat globals (per-VM SGIs/SPIs/online_mask/current_vcpu/preempt). `DEVICES: [_; MAX_VMS]` per-VM. VMID encoded in VTTBR_EL2[63:48] for TLB isolation. Two-level scheduler: outer VM round-robin → `activate_stage2()` → inner `run_one_iteration()`. Memory: VM0 @ 0x48000000, VM1 @ 0x68000000 (256MB each). ### GIC Emulation - -| Component | Address | Mode | Implementation | -|-----------|---------|------|----------------| -| GICD | 0x08000000 | Trap + write-through | `VirtualGicd` shadow state + write-through to physical GICD | -| GICR 0-3 | 0x080A0000+ | Trap-and-emulate | `VirtualGicr` (Stage-2 unmapped, 4KB pages) | -| ICC regs | System regs | Virtual | ICH_HCR_EL2.En=1 redirects to ICV_* at EL1 | -| ICC_SGI1R | System reg | Trapped | TALL1=1, decoded for IPI emulation | - -**List Register injection**: 4 LRs (ICH_LR0-3_EL2). HW=1 for vtimer (INTID 27) enables physical-virtual EOI linkage. EOImode=1 for proper priority drop / deactivation split. - -### Virtio-blk - -``` -VirtioMmioTransport @ 0x0a000000 (SPI 16 = INTID 48) - ├─ MMIO registers (virtio-mmio spec) - ├─ Virtqueue (descriptor table + available ring + used ring) - └─ VirtioBlk backend (disk image at 0x58000000, loaded by QEMU) -``` - -Guest writes QueueNotify → `process_request()` → read/write disk image via `copy_nonoverlapping` (identity-mapped) → update used ring → `inject_spi(48)` → `flush_pending_spis_to_hardware()`. - -### Virtio-net + VSwitch - -``` -VirtioMmioTransport @ 0x0a000200 (SPI 17 = INTID 49) - ├─ MMIO registers (virtio-mmio spec) - ├─ 2 virtqueues: RX (queue 0) + TX (queue 1) - └─ VirtioNet backend (device_id=1, MAC 52:54:00:00:00:{vm_id+1}) -``` - -**TX path**: Guest writes QueueNotify → `process_tx()` → strip 12-byte `virtio_net_hdr_v1` → `vswitch_forward(src_port, frame)` → VSwitch MAC learning + L2 forwarding → `PORT_RX[dst].store(frame)`. 
- -**RX path**: `drain_net_rx(vm_id)` in run loop → `PORT_RX[vm_id].take()` → `inject_net_rx()` → `inject_rx(frame)` → write 12-byte header (num_buffers=1) + frame into RX descriptor chain via `copy_nonoverlapping` → `inject_spi(49)`. - -**VSwitch** (`src/vswitch.rs`): L2 virtual switch with 16-entry MAC learning table. Broadcasts/multicasts flood all ports (excluding source). Unknown unicasts also flood. MAC entries are learned on TX (source MAC → source port). - -**NetRxRing**: SPSC ring buffer (9 slots, 8 usable + 1 sentinel) per VM port. Atomic head/tail with Acquire/Release ordering. Stores up to 1514-byte Ethernet frames. - -**MMIO slot abstraction**: `platform::virtio_slot(n)` returns `(base_addr, intid)` for slot n. Slot 0 = virtio-blk, slot 1 = virtio-net. Stride = 0x200. - -**Auto-IP**: Initramfs `/init` reads MAC from sysfs, extracts last octet, assigns `10.0.0.{octet}/24` via `ifconfig`. VM 0 → `10.0.0.1`, VM 1 → `10.0.0.2`. +| Component | Address | Mode | +|-----------|---------|------| +| GICD | 0x08000000 | Trap + write-through (`VirtualGicd` shadow) | +| GICR 0-3 | 0x080A0000+ | Trap-and-emulate (`VirtualGicr`, Stage-2 unmapped, 4KB pages) | +| ICC regs | sysregs | Virtual (ICH_HCR_EL2.En=1 → ICV_* at EL1) | +| ICC_SGI1R | sysreg | Trapped (TALL1=1, decoded for IPI) | + +LR injection: 4 LRs (ICH_LR0-3_EL2). HW=1 for vtimer (INTID 27) → physical-virtual EOI linkage. EOImode=1 for priority-drop/deactivation split. + +### Virtio +- **virtio-blk** @ 0x0a000000 (SPI 16/INTID 48): QueueNotify → `process_request()` → r/w disk image @ 0x58000000 (identity-mapped) → used ring → `inject_spi(48)`. +- **virtio-net** @ 0x0a000200 (SPI 17/INTID 49): 2 vqueues (RX q0, TX q1). TX: strip 12B `virtio_net_hdr_v1` → `vswitch_forward()`. RX: `drain_net_rx()` → `PORT_RX[vm].take()` → write header+frame into RX chain → `inject_spi(49)`. +- **VSwitch**: 16-entry MAC learning, flood broadcast/multicast/unknown-unicast (excl. source), learn src MAC on TX. 
+- **NetRxRing**: SPSC ring (8 usable slots), up to 1514B frames. +- `platform::virtio_slot(n)` → `(base, intid)`; slot 0 = blk, slot 1 = net, stride 0x200. +- **Auto-IP**: initramfs `/init` reads MAC → assigns `10.0.0.{last_octet}/24`. ### FF-A v1.1 Proxy (`src/ffa/`) - -Implements the FF-A (Firmware Framework for Arm) v1.1 hypervisor proxy role (pKVM-compatible). Guest SMC calls trapped via `HCR_EL2.TSC=1` (bit 19) are routed through `handle_smc()` → `ffa::proxy::handle_ffa_call()`. - -**Supported calls**: FFA_VERSION, FFA_ID_GET, FFA_SPM_ID_GET, FFA_FEATURES, FFA_RXTX_MAP/UNMAP, FFA_RX_RELEASE, FFA_PARTITION_INFO_GET, FFA_MSG_SEND_DIRECT_REQ, FFA_MSG_SEND2, FFA_MSG_WAIT, FFA_RUN, FFA_MEM_SHARE/LEND/RETRIEVE_REQ/RELINQUISH/RECLAIM, FFA_MEM_FRAG_TX/FRAG_RX, FFA_NOTIFICATION_BITMAP_CREATE/DESTROY/BIND/UNBIND/SET/GET/INFO_GET, FFA_CONSOLE_LOG_32/64. FFA_MEM_DONATE is blocked (returns NOT_SUPPORTED). FFA_FEATURES returns donated SGI INTIDs for SRI (INTID 9) and NPI (INTID 8) feature queries. VM-to-VM memory sharing: sender shares pages via MEM_SHARE, receiver maps them via MEM_RETRIEVE_REQ (dynamic Stage-2 page mapping), receiver unmaps via MEM_RELINQUISH, sender reclaims via MEM_RECLAIM. When `tfa_boot` + SP receiver (part_id >= 0x8000): MEM_SHARE/LEND/RECLAIM forwarded to real SPMC via SPMD, with dual record (local SW bits + SPMC handle via `record_share_with_handle()`). PARTITION_INFO_GET: when SPMC_PRESENT, forwards to SPMD and copies 24-byte descriptors from proxy RX to guest RX; otherwise uses 8-byte stub descriptors. Notification bitmaps support FFA_HOST_ID (0x0000) for pKVM host scheduler — `endpoint_index()` maps it to slot `FFA_MAX_VMS + 2`. - -**Stub SPMC** (`src/ffa/stub_spmc.rs`): Simulates 2 Secure Partitions (SP1=0x8001, SP2=0x8002) for testing without a real Secure World. Direct messaging echoes x4-x7 back. Memory sharing tracks multi-range records with `MemShareRecord` (up to 4 ranges per share, `ShareInfo`/`ShareInfoFull` for reclaim/retrieve). 
`mark_retrieved()`/`mark_relinquished()` track retrieve state; `MEM_RECLAIM` blocked while retrieved. - -**RXTX Mailbox** (`src/ffa/mailbox.rs`): Per-VM TX/RX buffer IPAs registered via FFA_RXTX_MAP. Used by PARTITION_INFO_GET to return SP descriptors. TX buffer used for FF-A v1.1 composite memory region descriptors. - -**Page Ownership** (`src/ffa/memory.rs`): Stage-2 PTE software bits [56:55] track page state: Owned(0b00), SharedOwned(0b01), SharedBorrowed(0b10), Donated(0b11). Validated during MEM_SHARE/LEND (Owned required), transitioned to SharedOwned, restored on MEM_RECLAIM. S2AP bits [7:6] restrict access: SHARE→RO, LEND→NONE. Matches pKVM page ownership model. - -**Stage-2 Walker** (`src/ffa/stage2_walker.rs`): Lightweight page table walker reconstructed from `VTTBR_EL2` at SMC handling time. Reads/writes PTE SW bits and S2AP without owning page table memory. Used by MEM_SHARE/LEND/RECLAIM for ownership validation. `map_page()` creates 4KB page entries in a target VM's Stage-2 (allocates L2/L3 tables from heap), used by MEM_RETRIEVE_REQ for cross-VM sharing. `unmap_page()` zeroes L3 PTEs, used by MEM_RELINQUISH. `PER_VM_VTTBR` global stores each VM's L0 table PA for constructing walkers for non-active VMs. Gated by `#[cfg(feature = "linux_guest")]` — unit tests skip Stage-2 validation (stale VTTBR from earlier page table tests). - -**Descriptor Parsing** (`src/ffa/descriptors.rs`): Parses FF-A v1.1 composite memory region descriptors (DEN0077A Table 5.19-5.25): `FfaMemRegion`(48B) → `FfaMemAccessDesc`(16B) → `FfaCompositeMemRegion`(16B) → `FfaMemRegionAddrRange`(16B). Uses `core::ptr::read_unaligned` for packed struct safety. `build_retrieve_resp_descriptor()` constructs response descriptors (reverse of `parse_mem_region()`). Falls back to register-based protocol (x3=IPA, x4=count, x5=receiver) when no mailbox is mapped. 
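The page-ownership rules above (SW bits [56:55] for state, S2AP bits [7:6] for access, Owned required before SHARE/LEND) can be sketched as pure functions over a PTE value. All names here are hypothetical and this is not the code in `src/ffa/memory.rs`; restoring read/write access on reclaim is an assumption, since the text only says the state is restored.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PageState {
    Owned,          // 0b00
    SharedOwned,    // 0b01
    SharedBorrowed, // 0b10
    Donated,        // 0b11
}

impl PageState {
    fn from_bits(bits: u64) -> PageState {
        match bits & 0b11 {
            0b00 => PageState::Owned,
            0b01 => PageState::SharedOwned,
            0b10 => PageState::SharedBorrowed,
            _ => PageState::Donated,
        }
    }

    fn bits(self) -> u64 {
        match self {
            PageState::Owned => 0b00,
            PageState::SharedOwned => 0b01,
            PageState::SharedBorrowed => 0b10,
            PageState::Donated => 0b11,
        }
    }
}

const SW_SHIFT: u32 = 55; // PTE software bits [56:55]
const S2AP_SHIFT: u32 = 6; // Stage-2 access permissions [7:6]
const S2AP_NONE: u64 = 0b00;
const S2AP_RO: u64 = 0b01;
const S2AP_RW: u64 = 0b11;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ShareKind {
    Share,
    Lend,
}

fn page_state(pte: u64) -> PageState {
    PageState::from_bits(pte >> SW_SHIFT)
}

fn s2ap(pte: u64) -> u64 {
    (pte >> S2AP_SHIFT) & 0b11
}

/// Rewrite only the ownership and permission fields, leaving the rest of the PTE intact.
fn with_fields(pte: u64, state: PageState, ap: u64) -> u64 {
    let cleared = pte & !(0b11 << SW_SHIFT) & !(0b11 << S2AP_SHIFT);
    cleared | (state.bits() << SW_SHIFT) | (ap << S2AP_SHIFT)
}

/// SHARE leaves the owner read-only, LEND removes access; both require Owned.
/// On failure the current state is returned so the caller can report it.
fn share_page(pte: u64, kind: ShareKind) -> Result<u64, PageState> {
    match page_state(pte) {
        PageState::Owned => {
            let ap = match kind {
                ShareKind::Share => S2AP_RO,
                ShareKind::Lend => S2AP_NONE,
            };
            Ok(with_fields(pte, PageState::SharedOwned, ap))
        }
        other => Err(other),
    }
}

/// Assumption: reclaim returns the page to Owned with read/write access.
fn reclaim_page(pte: u64) -> u64 {
    with_fields(pte, PageState::Owned, S2AP_RW)
}
```

Keeping the state in the PTE itself means the check and the permission change happen on the same word the hardware walks, so there is no separate ownership table to keep in sync.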
- -**Fragmentation** (`FragmentState`/`FragRxState` in proxy.rs): MEM_FRAG_TX handles sender-side descriptor fragments (large MEM_SHARE descriptors split across multiple calls). MEM_FRAG_RX handles receiver-side fragments (MEM_RETRIEVE_RESP too large for RX buffer, receiver calls FRAG_RX for subsequent chunks). Per-VM state tracks active handle, buffer, total/delivered lengths. - -**Console Log**: FFA_CONSOLE_LOG_32/64 (FF-A v1.2) extracts packed characters from x2-x7 (8 bytes per register, little-endian, up to 48 chars) and writes to UART. Supported in both NS proxy and SPMC (including `handle_sp_exit()` for SP-initiated logging). - -**SMC Forwarding** (`src/ffa/smc_forward.rs`): `forward_smc()` uses inline `smc #0` to forward calls to EL3 (HCR_EL2.TSC only traps EL1 SMC). `probe_spmc()` sends FFA_VERSION to detect a real SPMC at EL3. `ffa::proxy::init()` called at boot (linux_guest only) to set `SPMC_PRESENT` flag. Unknown SMCs in `handle_smc()` catch-all are forwarded to EL3 instead of returning -1. - -**SMC routing**: `is_ffa_function(fid)` checks for SMC32/64 function IDs in the 0x84/0xC4 range with low byte >= 0x60. PSCI functions (0x84000000-0x8400001F, 0xC4000000-0xC4000003) are handled separately. - -### UART (PL011) Emulation - -Full trap-and-emulate (Stage-2 unmapped). TX: guest writes UARTDR → `output_char()` to physical UART. RX: physical IRQ (INTID 33) → `UART_RX` ring buffer → `VirtualUart.push_rx()` → inject SPI 33. Linux amba-pl011 probe requires PeriphID/PrimeCellID registers. - -### PL031 RTC Emulation (`src/devices/pl031.rs`) - -Trap-and-emulate at `0x09010000` (SPI 2 = INTID 34). Counter-based time: `RTCDR = load_value + (CNTVCT_EL0 / CNTFRQ_EL0)` when enabled (RTCCR bit 0). Registers: RTCDR (0x000, read), RTCLR (0x008, write), RTCCR (0x00C, control), RTCIMSC/RTCRIS/RTCMIS/RTCICR (0x010-0x01C, stubs). PrimeCell ID registers (0xFE0-0xFFC) required for Linux amba bus probe. 4 unit tests in `tests/test_pl031.rs`. 
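The SMC routing rule above (PSCI ranges handled separately, FF-A IDs recognized by the 0x84/0xC4 prefix with low byte >= 0x60, everything else forwarded to EL3) can be sketched as a small classifier. The `SmcRoute` enum and `route_smc()` are hypothetical names, and requiring bits [23:8] to be zero is one reading of "the 0x84/0xC4 range", not something the text states.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SmcRoute {
    Psci,
    Ffa,
    ForwardToEl3,
}

/// PSCI ranges as listed in the text.
fn is_psci_function(fid: u32) -> bool {
    (0x8400_0000..=0x8400_001F).contains(&fid) || (0xC400_0000..=0xC400_0003).contains(&fid)
}

/// FF-A function IDs: SMC32 (0x84) or SMC64 (0xC4) prefix, low byte >= 0x60.
/// Assumption: the middle bits [23:8] are zero for all FF-A IDs of interest.
fn is_ffa_function(fid: u32) -> bool {
    let prefix = fid & 0xFFFF_FF00;
    (prefix == 0x8400_0000 || prefix == 0xC400_0000) && (fid & 0xFF) >= 0x60
}

/// Hypothetical dispatcher: PSCI first, then FF-A, otherwise pass through to EL3.
fn route_smc(fid: u32) -> SmcRoute {
    if is_psci_function(fid) {
        SmcRoute::Psci
    } else if is_ffa_function(fid) {
        SmcRoute::Ffa
    } else {
        SmcRoute::ForwardToEl3
    }
}
```

The two ranges do not overlap (PSCI low bytes stop at 0x1F, FF-A starts at 0x60), so the order of the first two checks does not affect the result; it only documents intent.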
+NS-EL2 hypervisor proxy role (pKVM-compatible). Guest SMC trapped via `HCR_EL2.TSC=1` → `handle_smc()` → `ffa::proxy::handle_ffa_call()`. + +**Calls**: VERSION, ID_GET, SPM_ID_GET, FEATURES, RXTX_MAP/UNMAP, RX_RELEASE, PARTITION_INFO_GET, MSG_SEND_DIRECT_REQ, MSG_SEND2, MSG_WAIT, RUN, MEM_SHARE/LEND/RETRIEVE_REQ/RELINQUISH/RECLAIM, MEM_FRAG_TX/RX, NOTIFICATION_*, CONSOLE_LOG_32/64. MEM_DONATE blocked (NOT_SUPPORTED). FEATURES returns donated SGI INTIDs for SRI(9)/NPI(8). + +- **VM-to-VM sharing**: sender MEM_SHARE → receiver MEM_RETRIEVE_REQ (dynamic Stage-2 page map) → MEM_RELINQUISH (unmap) → sender MEM_RECLAIM. With `tfa_boot` + SP receiver (part_id≥0x8000): SHARE/LEND/RECLAIM forwarded to real SPMC (dual record: local SW bits + SPMC handle). +- **Stub SPMC** (`stub_spmc.rs`): simulates SP1=0x8001/SP2=0x8002 without Secure World (for unit tests). Echoes x4-x7, tracks multi-range `MemShareRecord`. +- **RXTX mailbox** (`mailbox.rs`): per-VM TX/RX IPAs via FFA_RXTX_MAP; used by PARTITION_INFO_GET + composite descriptors. +- **Page ownership** (`memory.rs`): Stage-2 PTE SW bits [56:55] = Owned/SharedOwned/SharedBorrowed/Donated; S2AP [7:6] restricts (SHARE→RO, LEND→NONE). Matches pKVM model. +- **Stage2Walker** (`stage2_walker.rs`): reconstructed from VTTBR_EL2 per-SMC; `map_page`/`unmap_page` create/zero 4KB L3 entries. `PER_VM_VTTBR` holds each VM's L0 PA. Gated `#[cfg(feature="linux_guest")]`. +- **Descriptors** (`descriptors.rs`): FF-A v1.1 composite region (DEN0077A Tbl 5.19-5.25): `FfaMemRegion`(48B)→`FfaMemAccessDesc`(16B)→`FfaCompositeMemRegion`(16B)→`FfaMemRegionAddrRange`(16B). Uses `read_unaligned`. Falls back to register protocol (x3=IPA,x4=count,x5=receiver) when no mailbox. +- **Fragmentation**: MEM_FRAG_TX (sender), MEM_FRAG_RX (receiver, RETRIEVE_RESP > RX buffer). +- **SMC forwarding** (`smc_forward.rs`): `forward_smc()` (inline `smc #0`), `probe_spmc()` (FFA_VERSION), `init()` sets SPMC_PRESENT. Unknown SMCs forwarded to EL3. 
+- **Routing**: `is_ffa_function(fid)` = 0x84/0xC4 range, low byte ≥ 0x60. PSCI handled separately. + +### S-EL2 SPMC (`SpmcHandler`, `sel2` feature) +Event loop + FF-A dispatch: +- **Multi-SP DIRECT_REQ**: `dispatch_to_sp()` + `enter_guest()` ERET; SP-initiated calls handled in `handle_sp_exit()` loop (MEM_RETRIEVE_REQ/RELINQUISH/CONSOLE_LOG → handle locally → re-enter SP). +- **SP→SP DIRECT_REQ**: CallStack cycle detection, recursive `dispatch_to_sp`, chain preemption (Blocked→Preempted sentinel). +- **NS interrupt preemption**: `SP_IRQ_PREEMPTED` flag, CNTHP timer, FFA_INTERRUPT return; `resume_preempted_sp()` via FFA_RUN. +- **Secure vIRQ**: `inject_pending_virq()` (HCR_EL2.VI), cross-SP `dispatch_interrupt_to_sp()`. +- **Memory sharing**: SHARE/LEND/DONATE/RETRIEVE/RELINQUISH/RECLAIM with `SpmcShareRecord`, dynamic Secure Stage-2 map via Stage2Walker; SP↔SP via `handle_sp_exit`. +- **NWd RXTX management**, PARTITION_INFO_GET (24B descriptors to NWd RX), MSG_SEND2/MSG_WAIT (per-SP SpMailbox), CONSOLE_LOG, SRI/NPI feature IDs. + +### Other Devices +- **PL011 UART**: full trap-and-emulate. TX → physical UART. RX: physical IRQ 33 → `UART_RX` ring → inject SPI 33. PeriphID/PrimeCellID regs needed for Linux amba-pl011 probe. +- **PL031 RTC** (`devices/pl031.rs`): @ 0x09010000 (SPI 2/INTID 34). `RTCDR = load + CNTVCT/CNTFRQ`. PrimeCell IDs needed for amba probe. ### DTB Runtime Parsing (`src/dtb.rs`) - -At boot, QEMU passes the host DTB address in x0. `boot.S` preserves it in callee-saved x20, then passes to `rust_main(dtb_addr: usize)`. `dtb::init()` uses the `fdt` crate (v0.1.5, zero-copy, no-alloc) to discover platform hardware: - -- **UART**: `arm,pl011` compatible → `uart_base` -- **GIC**: `arm,gic-v3` compatible → `gicd_base`, `gicr_base`, `gicr_size` -- **RAM**: `/memory` node → `ram_base`, `ram_size` -- **CPUs**: `cpus` node → `num_cpus` - -Helpers: `gicr_rd_base(cpu_id) = gicr_base + cpu_id * 0x20000`, `gicr_sgi_base(cpu_id) = gicr_rd_base + 0x10000`. 
-
-Falls back to QEMU virt defaults if DTB parse fails (e.g., QEMU passes addr=0 with `-kernel`). `platform::num_cpus()` reads DTB at runtime; `MAX_SMP_CPUS = 8` is the compile-time array capacity.
-
-**Pre-DTB code** (`uart_puts` in `lib.rs`, GICD/GICC statics in `gic.rs`) still uses hardcoded `platform::UART_BASE`/`GICD_BASE` because they run before DTB init or require `const` for Rust `static`.
+QEMU passes host DTB in x0 → `boot.S` preserves in x20 → `rust_main(dtb_addr)`. `dtb::init()` (`fdt` crate v0.1.5) discovers UART (`arm,pl011`), GIC (`arm,gic-v3`), RAM (`/memory`), CPUs. Helpers: `gicr_rd_base(cpu)=gicr_base+cpu*0x20000`, `gicr_sgi_base=rd_base+0x10000`. Falls back to QEMU virt defaults if parse fails. `platform::num_cpus()` reads DTB; `MAX_SMP_CPUS=8` is array capacity. Pre-DTB code (`uart_puts`, GIC statics) uses hardcoded `platform::*` constants.
 ### Memory Layout
-
 | Region | Address | Purpose |
 |--------|---------|---------|
 | SPMC code (sel2) | 0x0e100000 | S-EL2 linker base (secure DRAM, BL32) |
-| SP1 (sp_hello) | 0x0e300000 | SP Hello package (1MB, partition 0x8001) |
-| SP2 (sp_irq) | 0x0e400000 | SP IRQ package (1MB, partition 0x8002) |
-| SP3 (sp_relay) | 0x0e500000 | SP Relay package (1MB, partition 0x8003) |
-| Secure heap | 0x0e600000 | S-EL2 page table allocation |
-| Hypervisor code (NS) | 0x40200000 | NS-EL2 linker base (avoids QEMU DTB at 0x40000000 in -bios mode) |
-| Heap | 0x41000000 (16MB) | Page table allocation, `BumpAllocator` |
-| DTB (VM 0) | 0x47000000 | Device tree blob |
-| Kernel (VM 0) | 0x48000000 | Linux Image load address |
-| Initramfs (VM 0) | 0x54000000 | BusyBox initramfs |
-| Disk image (VM 0) | 0x58000000 | virtio-blk backing store |
-| VM 0 RAM | 0x48000000-0x58000000 | 256MB (single-VM: 0x48000000-0x88000000 = 1GB) |
-| DTB (VM 1) | 0x67000000 | Device tree blob (multi_vm only) |
-| Kernel (VM 1) | 0x68000000 | Linux Image load address (multi_vm only) |
-| VM 1 RAM | 0x68000000-0x78000000 | 256MB (multi_vm
only) | -| Disk image (VM 1) | 0x78000000 | virtio-blk backing store (multi_vm only) | - -**Stage-2 mappers**: -- `IdentityMapper` (static, 2MB-only) — used by unit tests (`make run`) -- `DynamicIdentityMapper` (heap-allocated, 2MB+4KB) — used by Linux guest (`make run-linux`), supports `unmap_4kb_page()` for GICR trap setup - -**Heap gap**: Heap lies within guest's PA range but is left unmapped in Stage-2 to prevent guest corruption of page tables. Guest kernel never accesses this range (declared memory starts at 0x48000000). - -### Global State (`src/global.rs`) +| SP1/SP2/SP3 | 0x0e300000 / 0x0e400000 / 0x0e500000 | Hello / IRQ / Relay (1MB each) | +| Secure heap | 0x0e600000 | S-EL2 page table alloc | +| Hypervisor (NS) | 0x40200000 | NS-EL2 linker base (avoids QEMU DTB @ 0x40000000) | +| Heap | 0x41000000 (16MB) | `BumpAllocator` page tables | +| DTB/Kernel/Initramfs/Disk (VM0) | 0x47000000 / 0x48000000 / 0x54000000 / 0x58000000 | | +| VM0 RAM | 0x48000000–0x58000000 (256MB; single-VM: 1GB to 0x88000000) | | +| DTB/Kernel/Disk (VM1, multi_vm) | 0x67000000 / 0x68000000 / 0x78000000 | | +| VM1 RAM | 0x68000000–0x78000000 (256MB) | | -| Global | Type | Purpose | -|--------|------|---------| -| `DEVICES` | `[GlobalDeviceManager; MAX_VMS]` | Per-VM MMIO dispatch (UnsafeCell single-pCPU / SpinLock multi-pCPU) | -| `VM_STATE` | `[VmGlobalState; MAX_VMS]` | Per-VM state (see below) | -| `CURRENT_VM_ID` | `AtomicUsize` | Which VM is currently active | -| `PENDING_CPU_ON_PER_VCPU` | `[PerVcpuCpuOnRequest; 8]` | Per-vCPU PSCI CPU_ON (multi-pCPU mode only) | -| `SHARED_VTTBR` / `SHARED_VTCR` | `AtomicU64` | Stage-2 config shared from primary to secondaries (multi-pCPU) | -| `PER_VM_VTTBR` | `[AtomicU64; MAX_VMS]` | Per-VM L0 table PA for cross-VM Stage-2 access (FF-A RETRIEVE) | -| `UART_RX` | `UartRxRing` | Lock-free ring buffer, IRQ handler → run loop | -| `PORT_RX` | `[NetRxRing; MAX_PORTS]` | Per-VM SPSC ring for virtio-net RX frames | -| `VSWITCH` | `UnsafeCell` | L2 
virtual switch with MAC learning table | +**Stage-2 mappers**: `IdentityMapper` (static, 2MB-only, unit tests) vs `DynamicIdentityMapper` (heap, 2MB+4KB, Linux guest, supports `unmap_4kb_page()` for GICR). **Heap gap**: heap is in guest PA range but left Stage-2-unmapped so guest can't corrupt page tables (guest RAM starts at 0x48000000). -`VmGlobalState` contains per-VM: `pending_sgis[MAX_VCPUS]`, `pending_spis[MAX_VCPUS]`, `terminal_exit[MAX_VCPUS]`, `vcpu_online_mask`, `current_vcpu_id`, `pending_cpu_on`, `preemption_exit`. Accessed via `vm_state(vm_id)` or `current_vm_state()`. +### Global State (`src/global.rs`) +`DEVICES[MAX_VMS]` (per-VM MMIO), `VM_STATE[MAX_VMS]` (`VmGlobalState`: pending_sgis/spis, terminal_exit, online_mask, current_vcpu, pending_cpu_on, preempt), `CURRENT_VM_ID`, `PENDING_CPU_ON_PER_VCPU[8]` (multi-pCPU), `SHARED_VTTBR/VTCR`, `PER_VM_VTTBR[MAX_VMS]` (cross-VM Stage-2 for RETRIEVE), `UART_RX`, `PORT_RX[MAX_PORTS]`, `VSWITCH`. -**SPMC globals** (sel2 feature, `SpinLock`-protected for per-CPU SPMD concurrency): `NWD_RXTX` (NWd RXTX buffer state), `SPMC_SHARES` (memory share records), `NOTIF_STATE` (notification bitmaps), `STAGE2_LOCK` (serializes all `map_page`/`unmap_page` calls to prevent TOCTOU in page table walks). SpinLock required because pKVM's per-CPU SPMD breaks the single-event-loop serialization assumption. Global heap (`src/mm/heap.rs`) also uses `SpinLock>` for concurrent `alloc_page()` safety. +**SPMC globals** (`sel2`, `SpinLock`-protected for per-CPU SPMD concurrency): `NWD_RXTX`, `SPMC_SHARES`, `NOTIF_STATE`, `STAGE2_LOCK` (serializes map/unmap to prevent TOCTOU). Global heap also `SpinLock>`. 
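The heap-gap invariant from the Memory Layout notes above can be checked with a small sketch. The addresses are taken from the table; the names `Range`, `HEAP`, `VM0_RAM`, `VM1_RAM` and `overlaps` are illustrative assumptions and are not claimed to exist in `src/platform.rs`.

```rust
/// Half-open physical address range [start, end).
#[derive(Clone, Copy)]
struct Range {
    start: u64,
    end: u64,
}

// Addresses from the Memory Layout table; the constant names are hypothetical.
const HEAP: Range = Range { start: 0x4100_0000, end: 0x4100_0000 + (16 << 20) };
const VM0_RAM: Range = Range { start: 0x4800_0000, end: 0x5800_0000 };
const VM1_RAM: Range = Range { start: 0x6800_0000, end: 0x7800_0000 };

/// True if two half-open ranges share at least one byte.
const fn overlaps(a: Range, b: Range) -> bool {
    a.start < b.end && b.start < a.end
}

fn main() {
    // The hypervisor heap must stay outside memory declared to either guest.
    assert!(!overlaps(HEAP, VM0_RAM));
    assert!(!overlaps(HEAP, VM1_RAM));
    // The two multi-VM RAM windows must not overlap each other.
    assert!(!overlaps(VM0_RAM, VM1_RAM));
    println!("memory layout ranges are disjoint");
}
```

Because `overlaps` is a `const fn`, the same checks could in principle be moved into compile-time assertions, so a layout change that breaks the invariant would fail the build instead of a boot.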
### Device Manager Pattern - -Enum-dispatch (no dynamic dispatch / trait objects): -```rust -pub enum Device { - Uart(pl011::VirtualUart), - Gicd(gic::VirtualGicd), - Gicr(gic::VirtualGicr), - VirtioBlk(virtio::mmio::VirtioMmioTransport), - VirtioNet(virtio::mmio::VirtioMmioTransport), - Pl031(pl031::VirtualPl031), -} -``` -Array-based routing: `devices: [Option; 8]`, scan for `dev.contains(addr)`. +Enum-dispatch (no `dyn Trait`): `enum Device { Uart, Gicd, Gicr, VirtioBlk, VirtioNet, Pl031 }`. Array routing `[Option; 8]`, scan `dev.contains(addr)`. ## Build System - -- **build.rs**: Cross-compiles boot assembly and `exception.S` via `aarch64-linux-gnu-gcc`, archives into `libboot.a`, links with `--whole-archive`. Feature-gated: `sel2` selects `boot_sel2.S` + `linker_sel2.ld`, otherwise `boot.S` + `linker.ld` -- **Target**: `aarch64-unknown-none.json` (custom spec: `llvm-target: aarch64-unknown-none`, `panic-strategy: abort`, `disable-redzone: true`) -- **Linker**: `arch/aarch64/linker.ld` — base at 0x40200000 (NS-EL2, avoids QEMU DTB at 0x40000000 in `-bios` mode); `arch/aarch64/linker_sel2.ld` — base at 0x0e100000 (S-EL2, secure DRAM) +- **build.rs**: cross-compiles boot asm + `exception.S` via `aarch64-linux-gnu-gcc` → `libboot.a` (`--whole-archive`). `sel2` → `boot_sel2.S` + `linker_sel2.ld`, else `boot.S` + `linker.ld`. +- **Target**: `aarch64-unknown-none.json` (panic=abort, disable-redzone). +- **Linker**: `linker.ld` base 0x40200000 (NS-EL2); `linker_sel2.ld` base 0x0e100000 (S-EL2). + +## Coding Standards +Project-specific firmware rules (full rationale: `docs/RUST_FIRMWARE_CODING_GUIDELINES.md`). Violating these causes silent hangs/UB at EL2/S-EL2, not just lints: +- **`no_std` + core/alloc only**; `Box`/`Vec` via custom `BumpAllocator`. +- **No dynamic dispatch** — enum-dispatch (see `Device`), never `dyn Trait`. +- **No recursion** — statically bounded call graphs (128KB/pCPU stack). 
Sanctioned exception: SPMC's depth-limited `dispatch_to_sp()` SP→SP chain (CallStack cycle detection). +- **No floating-point/SIMD** — `CPTR_EL3.TFP` traps FP/SIMD from S-EL2 → silent hang. Rust **debug builds emit NEON** for `read_volatile` alignment checks → S-EL2 must build `opt-level >= 1` (see CPTR_EL3.TFP note). +- **Allocate at boot only** — never on hot path (exception/MMIO/SMC). +- **Error handling**: `.expect()` at boot; `Result` for FF-A; **never panic** in exception handlers (return `bool`/`SmcResult8`); `Option`/`bool` for device emul. ## Tests - -~457 assertions across 34 test suites run automatically on `make run` (no feature flags). Orchestrated sequentially in `src/main.rs`. Located in `tests/`: - -| Test | Coverage | Assertions | -|------|----------|------------| -| `test_dtb` | DTB parsing, PlatformInfo defaults, GICR helpers | 8 | -| `test_allocator` | Bump allocator page alloc/free | 4 | -| `test_heap` | Global heap (Box, Vec) | 4 | -| `test_dynamic_pagetable` | DynamicIdentityMapper 2MB mapping + 4KB unmap | 6 | -| `test_multi_vcpu` | Multi-vCPU creation, VMPIDR | 4 | -| `test_scheduler` | Round-robin scheduling, block/unblock | 4 | -| `test_vm_scheduler` | VM-integrated scheduling lifecycle | 5 | -| `test_mmio` | MMIO device registration + guest UART access | 1 | -| `test_gicv3_virt` | List Register injection, ELRSR | 6 | -| `test_complete_interrupt` | End-to-end IRQ injection flow | 1 | -| `test_guest` | Basic hypercall (HVC #0) | 1 | -| `test_guest_loader` | GuestConfig for Zephyr/Linux | 3 | -| `test_simple_guest` | Simple guest boot + exit | 1 | -| `test_decode` | MmioAccess::decode() ISS + instruction paths | 9 | -| `test_gicd` | VirtualGicd shadow state (CTLR, ISENABLER, IROUTER) | 8 | -| `test_gicr` | VirtualGicr per-vCPU state (TYPER, WAKER, ISENABLER0) | 8 | -| `test_global` | PendingCpuOn atomics + UartRxRing SPSC buffer | 6 | -| `test_guest_irq` | Per-VM PENDING_SGIS/PENDING_SPIS bitmask operations | 5 | -| `test_device_routing` | 
DeviceManager registration, routing, accessors | 6 | -| `test_vm_state_isolation` | Per-VM SGI/SPI/online_mask/vcpu_id independence | 4 | -| `test_vmid_vttbr` | VMID 0/1 encoding in VTTBR_EL2 bits [63:48] | 2 | -| `test_multi_vm_devices` | DEVICES[0]/DEVICES[1] registration + MMIO isolation | 3 | -| `test_vm_activate` | Vm initial VTTBR/VTCR state | 2 | -| `test_net_rx_ring` | NetRxRing SPSC: empty/store/take/fill/overflow/wraparound | 8 | -| `test_vswitch` | VSwitch: flood/MAC learning/broadcast/no-self/capacity | 6 | -| `test_virtio_net` | VirtioNet: device_id/features/queues/config/mac_for_vm | 8 | -| `test_page_ownership` | Stage-2 PTE SW bits: read/write OWNED/SHARED_OWNED, unmapped IPA, 2MB block→4KB split | 9 | -| `test_pl031` | PL031 RTC: RTCDR readable, RTCLR write+readback, PeriphID/PrimeCellID, unknown offset | 4 | -| `test_ffa` | FF-A proxy: VERSION/ID_GET/FEATURES/RXTX/messaging/MEM_SHARE/MEM_LEND/RECLAIM/descriptors/SMC forward/VM-to-VM RETRIEVE/RELINQUISH/SPM_ID_GET/RUN/notifications/MSG_SEND2/MSG_WAIT/FRAG_RX/CONSOLE_LOG/SP→SP routing | 55 | -| `test_spmc_handler` | SPMC dispatch: VERSION/ID_GET/SPM_ID_GET/FEATURES/PARTITION_INFO/DIRECT_REQ echo (32+64)/framework msg/RXTX/RXTX_UNMAP frag cleanup/FFA_RUN Preempted path/multi-SP/find_sp_for_intid/global SP helpers/MEM_SHARE/LEND/RETRIEVE/RELINQUISH/RECLAIM/DONATE(lifecycle+RECLAIM denied+RELINQUISH denied+SP-to-SP)/multi-page/SP2-receiver/zero-page/range overflow/LEND negative tests/notifications/MSG_SEND2/MSG_WAIT/FRAG_RX/CONSOLE_LOG/SRI/NPI/cross-SP isolation/IPA validation/stress/SP→SP DIRECT_REQ relay/cycle detection/SP-to-SP MEM_SHARE lifecycle | 182 | -| `test_sp_context` | SpContext: state machine (incl. 
all illegal transitions), CAS try_transition failure, VcpuContext fields, set/get args (x0-x7), owned_intids, pending_irq lifecycle + overflow | 58 | -| `test_secure_stage2` | SecureStage2Config: VSTTBR address, VSTCR T0SZ, new_from_vsttbr | 4 | -| `test_log` | LogBuffer: empty state, write/read, overflow, log_info!, LogWriter, per-CPU isolation, accumulation | 8 | -| `test_guest_interrupt` | Guest interrupt injection + exception vector (blocks) | 1 | - -Not wired into `main.rs` (exported but not called): -- `test_timer` — timer interrupt detection (requires manual timer setup) +~457 assertions across 34 test suites run automatically on `make run` (no features), orchestrated in `src/main.rs`, located in `tests/`. Largest suites: `test_spmc_handler` (182 — SPMC dispatch, all FF-A calls, SP↔SP, memory sharing, DONATE, isolation, stress), `test_sp_context` (58 — state machine + illegal transitions), `test_ffa` (55 — proxy calls + descriptors + SMC forward), `test_page_ownership` (9), `test_decode` (9). Others cover DTB, allocator/heap, page tables, scheduler, GIC (gicd/gicr/gicv3_virt), MMIO/decode, virtio-net/vswitch/net_rx_ring, pl031, multi-VM isolation, VMID, secure_stage2, log. `test_timer` exists but is not wired into `main.rs`. When adding a suite, register it in `src/main.rs` and update this count. ## Critical Implementation Details ### HPFAR_EL2 for MMIO (must-know) -When guest MMU is on, `FAR_EL2` = guest VA, NOT IPA. Use `HPFAR_EL2` for the IPA: +When guest MMU is on, `FAR_EL2` = guest VA, NOT IPA. Use `HPFAR_EL2`: ``` IPA = (hpfar & 0x0000_0FFF_FFFF_FFF0) << 8 | (far_el2 & 0xFFF) ``` ### Never Modify Guest SPSR_EL2 -Guest controls its own `PSTATE.I` (interrupt mask). Overriding causes spinlock deadlocks. +Guest controls its own `PSTATE.I`. Overriding causes spinlock deadlocks. ### CNTHP Timer Must Be Re-enabled -Guest can re-disable INTID 26 via GICR writes. 
`ensure_cnthp_enabled()` directly writes physical GICR (EL2 bypasses Stage-2) before every vCPU entry. +Guest can re-disable INTID 26 via GICR writes. `ensure_cnthp_enabled()` writes physical GICR directly (EL2 bypasses Stage-2) before every vCPU entry. ### ICC_SGI1R_EL1 Bit Fields -- TargetList: bits [15:0] (NOT [23:16]) -- Aff1: bits [23:16] (NOT [27:24]) -- INTID: bits [27:24] (NOT [3:0]) +TargetList [15:0] (NOT [23:16]), Aff1 [23:16] (NOT [27:24]), INTID [27:24] (NOT [3:0]). ### inject_spi() Must Not Acquire DEVICES Lock (multi-pCPU) -`inject_spi()` is called from `signal_interrupt()` inside the `DEVICES` SpinLock. Reading `DEVICES.route_spi()` would deadlock (non-reentrant). Instead, multi-pCPU mode reads physical GICD_IROUTER directly (EL2 bypasses Stage-2). +Called from `signal_interrupt()` already inside the `DEVICES` SpinLock → reading `DEVICES.route_spi()` deadlocks (non-reentrant). Multi-pCPU reads physical GICD_IROUTER directly. ### QEMU virt Secondary CPUs Are Powered Off -Secondary physical CPUs start powered off — they do NOT execute `_start`. Must use real PSCI CPU_ON SMC (`smc #0`, function_id=0xC4000003) to QEMU's EL3 firmware. +They do NOT execute `_start`. Must use real PSCI CPU_ON SMC (`smc #0`, fid=0xC4000003) to QEMU EL3 firmware. ### TPIDR_EL2 for Per-CPU Context (multi-pCPU) -`exception.S` uses `mrs x0, tpidr_el2` instead of a global variable. Each physical CPU has its own hardware-banked TPIDR_EL2. Set by `enter_guest()` via `msr tpidr_el2, x0`. +`exception.S` uses `mrs x0, tpidr_el2` (HW-banked per pCPU), set by `enter_guest()`. ### Physical GICR Must Be Programmed for SGIs/PPIs -Guest GICR writes only update `VirtualGicr` shadow state. `ensure_vtimer_enabled()` programs physical GICR ISENABLER0 for SGIs 0-15 + PPI 27 before every guest entry. +Guest GICR writes only update shadow state. `ensure_vtimer_enabled()` programs physical GICR ISENABLER0 for SGIs 0-15 + PPI 27 before each entry. 
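The HPFAR_EL2 formula and the ICC_SGI1R_EL1 bit positions from the notes above can be sketched as plain bit manipulation. This is a host-runnable illustration that takes register values as arguments; `fault_ipa`, `Sgi1r` and `decode_sgi1r` are hypothetical names, not the project's actual functions, and only the three fields listed above are decoded.

```rust
/// Reconstruct the faulting IPA. HPFAR_EL2[43:4] holds IPA[51:12];
/// the offset within the 4KB page comes from FAR_EL2[11:0].
fn fault_ipa(hpfar_el2: u64, far_el2: u64) -> u64 {
    ((hpfar_el2 & 0x0000_0FFF_FFFF_FFF0) << 8) | (far_el2 & 0xFFF)
}

/// Subset of ICC_SGI1R_EL1 fields (hypothetical struct for illustration).
struct Sgi1r {
    target_list: u16, // bits [15:0]
    aff1: u8,         // bits [23:16]
    intid: u8,        // bits [27:24]
}

fn decode_sgi1r(val: u64) -> Sgi1r {
    Sgi1r {
        target_list: (val & 0xFFFF) as u16,
        aff1: ((val >> 16) & 0xFF) as u8,
        intid: ((val >> 24) & 0xF) as u8,
    }
}

fn main() {
    // A guest access to the UART page (IPA 0x0900_0000) through a kernel VA:
    // FAR_EL2 carries the VA, HPFAR_EL2 carries the IPA page number.
    let hpfar = (0x0900_0000u64 >> 12) << 4;
    let ipa = fault_ipa(hpfar, 0xFFFF_8000_1234_5018);
    assert_eq!(ipa, 0x0900_0018);

    // SGI 3 sent to CPUs 0 and 2 of the Aff1 = 0 cluster.
    let sgi = decode_sgi1r((3 << 24) | 0b0101);
    assert_eq!((sgi.intid, sgi.aff1, sgi.target_list), (3, 0, 0b0101));
    println!("ipa = {ipa:#x}");
}
```

In real EL2 code the inputs would come from `mrs` reads of the system registers; keeping the decoding in pure functions like these makes it testable without hardware.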
### HCR_EL2.TSC for SMC Trapping -`HCR_TSC = 1 << 19` traps guest SMC instructions to EL2 as `EC_SMC64 (0x17)`. Unlike HVC traps, the trapped SMC sets `ELR_EL2` to the SMC instruction itself — exception handler must advance PC by 4. This enables the FF-A proxy to intercept guest FF-A SMC calls and route them through `handle_smc()`. +`HCR_TSC = 1<<19` traps guest SMC to EL2 as `EC_SMC64 (0x17)`. Unlike HVC, ELR_EL2 = SMC instr itself — handler must advance PC by 4. ### CPTR_EL3.TFP Traps FP/SIMD at S-EL2 (must-know for TF-A builds) -TF-A's default `CPTR_EL3.TFP=1` traps ALL FP/SIMD instructions from S-EL2 to EL3. Rust debug-mode `read_volatile` uses NEON SIMD internally (`cnt v0.8b, v0.8b` for popcount alignment check in `is_aligned_to`), causing silent hangs on any memory read. Fix: `CTX_INCLUDE_FPREGS=1` in TF-A build (clears CPTR_EL3.TFP). Requires `ENABLE_SVE_FOR_NS=0` and `ENABLE_SME_FOR_NS=0` to avoid build conflicts. +TF-A default `CPTR_EL3.TFP=1` traps ALL FP/SIMD from S-EL2 → EL3. Rust debug `read_volatile` uses NEON (`cnt v0.8b` for alignment popcount) → silent hang on any read. Fix: `CTX_INCLUDE_FPREGS=1` in TF-A build (needs `ENABLE_SVE_FOR_NS=0` + `ENABLE_SME_FOR_NS=0`). -### S-EL2 SPMC Boot (`sel2` feature) -Entry point: `boot_sel2.S` → `rust_main_sel2(manifest_addr, hw_config_addr, core_id)`. SPMD passes x0=TOS_FW_CONFIG (manifest DTB at 0x0e002000), x1=HW_CONFIG, x4=core_id. Init: exception vectors → manifest parse → **S-EL2 Stage-1 MMU** (identity map with NS=1 for NWd DRAM) → GIC init (enables PPI 26+29 as Secure Group 1) → CNTHCTL_EL2 timer access → Secure Stage-2 → parse SPKG header (img_offset=0x4000) → clear SCTLR_EL1/VBAR_EL1 → ERET to SP1 → SP calls FFA_MSG_WAIT → detect SP2 at SP2_LOAD_ADDR (0x0e400000) → boot SP2 if present → **register secondary EP** (FFA_SECONDARY_EP_REGISTER) → FFA_MSG_WAIT → SPMC event loop. `src/manifest.rs` parses `/attribute` node (spmc_id, maj_ver, min_ver) per FF-A Core Manifest v1.0 (DEN0077A). 
+### S-EL2 SPMC Boot (`sel2`) +`boot_sel2.S` → `rust_main_sel2(manifest, hw_config, core_id)`. SPMD passes x0=TOS_FW_CONFIG (@0x0e002000), x4=core_id. Init: vectors → manifest parse → S-EL2 Stage-1 MMU → GIC (PPI 26+29 Secure Group 1) → CNTHCTL_EL2 → Secure Stage-2 → parse SPKG (img_offset=0x4000) → clear SCTLR_EL1/VBAR_EL1 → ERET to SP1 → detect/boot SP2/SP3 → FFA_SECONDARY_EP_REGISTER → event loop. `src/manifest.rs` parses `/attribute` node (FF-A Core Manifest v1.0). -**Secondary CPU warm-boot**: When pKVM issues PSCI CPU_ON, TF-A's SPMD routes the secondary CPU through `spmd_cpu_on_finish_handler()` → ERET to our registered `secondary_entry_sel2` in `boot_sel2.S`. The secondary path: set per-CPU stack (3 × 32KB in `.bss.sel2_pcpu_stacks`) → `rust_main_sel2_secondary()` → install VBAR → install S-EL2 Stage-1 MMU (reuse primary's page tables via `install_sel2_stage1_secondary()`) → FFA_MSG_WAIT → SPMD completes PSCI CPU_ON → NS-EL2 secondary boots. +**Secondary warm-boot**: pKVM PSCI CPU_ON → SPMD `spmd_cpu_on_finish_handler()` → ERET to `secondary_entry_sel2` → per-CPU stack (3×32KB in `.bss.sel2_pcpu_stacks`) → `rust_main_sel2_secondary()` (VBAR → reuse primary MMU via `install_sel2_stage1_secondary()` → FFA_MSG_WAIT). ### S-EL2 Stage-1 MMU (`src/sel2_mmu.rs`) -S-EL2 runs with MMU off by default. All memory accesses target the **Secure** physical address space. NWd RXTX buffer PAs (e.g. 0x42a16000) are in **Non-Secure** DRAM — writing without Stage-1 translation hits the Secure alias, so pKVM reads zeros from the NS alias. - -**Fix**: `init_sel2_stage1()` enables a minimal S-EL2 Stage-1 identity map. 
Static page tables (3 pages in `.bss`, no heap): -- **L1[1-2]**: 1GB blocks at 0x40000000/0x80000000, **NS=1**, Normal WB, XN → NWd DRAM -- **L2[64-79]**: 2MB Device blocks, NS=0, XN → GIC (0x08000000) + UART (0x09000000) -- **L2[112-127]**: 2MB Normal blocks, NS=0 → SPMC code + SPs + secure heap (0x0E000000) - -Registers: MAIR_EL2 (Attr0=Device, Attr1=Normal-WB), TCR_EL2 (T0SZ=16, 4KB, 48-bit PA), TTBR0_EL2, SCTLR_EL2.{M,C,I}=1. Independent of Secure Stage-2 (VSTTBR_EL2) used for SP isolation. +S-EL2 MMU off by default → all accesses hit **Secure** PA space. NWd RXTX PAs are in **Non-Secure** DRAM — writing without translation hits the Secure alias (pKVM reads zeros). Fix: `init_sel2_stage1()` static identity map (3 pages, no heap): L1[1-2] = 1GB blocks @ 0x40000000/0x80000000 **NS=1** Normal WB XN; L2[64-79] = 2MB Device NS=0 (GIC+UART); L2[112-127] = 2MB Normal NS=0 (SPMC+SPs+heap). Regs: MAIR/TCR (T0SZ=16,4KB,48-bit)/TTBR0/SCTLR.{M,C,I}=1. Independent of Secure Stage-2 (VSTTBR_EL2). ### Secure Virtual Interrupt Injection (Phase D) -Hafnium-compatible HCR_EL2.VI mechanism for injecting virtual interrupts to SPs at S-EL1: +Hafnium-compatible HCR_EL2.VI for vIRQ to SPs at S-EL1: +1. Per-SP INTID ownership (`owned_intids[4]`; SP2 owns INTID 29). +2. CNTHP poll timer at S-EL2 (CNTPS inaccessible at S-EL1 since SCR_EL3.ST=0). +3. IRQ routing: owned-by-current → queue + VI; owned-by-other → queue + preempt; unowned → FFA_INTERRUPT. +4. HCR_EL2.VI → HW auto-vector to VBAR_EL1+0x280 on ERET. +5. HF_INTERRUPT_GET (HVC x0=0xFF04) → SPMC returns INTID, clears VI. +6. Cross-SP: `dispatch_interrupt_to_sp()` preempts SP1 → SP2 handler → resume SP1. -1. **Per-SP INTID ownership**: `SpContext.owned_intids[4]` — SP2 owns INTID 29 (Secure Physical Timer PPI) -2. **CNTHP poll timer**: Since CNTPS is inaccessible at S-EL1 (SCR_EL3.ST=0), CNTHP at S-EL2 polls for owned INTIDs -3. **IRQ routing** (`exception.rs`): Case 1: owned by current SP → queue + HCR_EL2.VI → continue. 
Case 2: owned by another SP → queue + preempt current. Unowned → FFA_INTERRUPT -4. **HCR_EL2.VI injection**: Setting VI causes hardware auto-vector to VBAR_EL1+0x280 on ERET -5. **HF_INTERRUPT_GET**: SP calls HVC with x0=0xFF04 → SPMC returns pending INTID in x0, clears VI -6. **Cross-SP preemption**: `dispatch_interrupt_to_sp()` — preempt SP1 → enter SP2 IRQ handler → SP2 returns → resume SP1 - -**SP2 (sp_irq)** at `tfa/sp_irq/`: S-EL1 partition with VBAR_EL1 IRQ handler, handles both DIRECT_REQ_32 and DIRECT_REQ_64 (matching RESP variant via x15 flag), slow-path busy-loop until vIRQ, responds with captured INTID in x5. Loaded at 0x0e400000 by BL2. +**SP2 (sp_irq)** (`tfa/sp_irq/`): S-EL1, VBAR_EL1 IRQ handler, DIRECT_REQ_32+64, slow-path busy-loop until vIRQ, responds w/ INTID in x5. ### SP Package Format (SPKG) -BL2 loads raw SP packages to `load-address` from `tb_fw_config.dts`. SPKG header (24 bytes LE): magic("SPKG"), version, pm_offset(0x1000), pm_size, img_offset(0x4000), img_size. SPMC must parse header and enter SP at `load_addr + img_offset`. The sp_manifest.dts UUID gets byte-swapped by `sp_mk_generator.py` (LE conversion); `tb_fw_config.dts` UUID must match the swapped form. Use `fiptool info fip.bin` to verify. +BL2 loads raw packages to `load-address` from `tb_fw_config.dts`. SPKG header (24B LE): magic("SPKG"), version, pm_offset(0x1000), pm_size, img_offset(0x4000), img_size. SPMC enters SP at `load_addr + img_offset`. `sp_manifest.dts` UUID gets byte-swapped by `sp_mk_generator.py`; `tb_fw_config.dts` UUID must match swapped form. Verify with `fiptool info fip.bin`. ### Diagnostic Fault Handler (`exception.S`) -`fault_diag_print` handles exceptions when TPIDR_EL2=0 (no vCPU context — host-level fault). Prints ESR_EL2, ELR_EL2, FAR_EL2, HPFAR_EL2 to UART. Used during S-EL2 boot to diagnose Data Aborts. Located at end of `exception.S` (outside vector table alignment constraints). 
+`fault_diag_print` handles exceptions when TPIDR_EL2=0 (host-level fault). Prints ESR/ELR/FAR/HPFAR_EL2 to UART. Used during S-EL2 boot. At end of `exception.S` (outside vector alignment). ### Platform Constants -Guest-specific addresses (heap, kernel load, virtio disk) are in `src/platform.rs`. Host hardware addresses (UART, GIC, RAM, CPU count) are discovered at runtime from DTB via `src/dtb.rs` — use `platform::num_cpus()` and `dtb::platform_info()` instead of hardcoded constants. `MAX_SMP_CPUS = 8` is the compile-time array capacity; `SMP_CPUS = 4` is the fallback default. - -## Roadmap: NS-EL2 → S-EL2 SPMC → pKVM Integration - -**Target architecture** (end state): -``` -EL3: TF-A BL31 + SPMD (SMC relay, world switch) -S-EL2: Our hypervisor (SPMC role, BL32) → manages Secure Partitions -S-EL1: Secure Partitions (bare-metal SPs) -NS-EL2: pKVM (Linux KVM protected mode) → manages Normal World VMs -NS-EL1: Linux/Android guest -``` - -**Phase 3** (done): NS-EL2 complete — 2MB block split, FF-A notifications, indirect messaging -**Phase 4** (done): QEMU `secure=on` + TF-A boot chain → Sprint 4.1-4.4 done (SPMC + SP Hello + 7/7 BL33 tests) -**Sprint 5.1** (done): DIRECT_REQ end-to-end — `tfa_boot` feature, NS proxy → SPMD → SPMC → SP1 (x4 += 0x1000 proof) -**Sprint 5.2** (done): RXTX + PARTITION_INFO_GET forwarding + Linux FF-A discovery, SPMC NWd RXTX management (SPMD forwards RXTX_MAP to SPMC), 8/8 BL33 tests pass -**Phase C** (done): NS interrupt preemption — IRQ during SP → FFA_INTERRUPT → FFA_RUN resume, CNTHP timer, SP_IRQ_PREEMPTED flag, Preempted state, SP Hello slow path, 9/9 BL33 tests pass -**Phase D** (done): Multi-SP + secure vIRQ injection — SP2 (sp_irq) at S-EL1, per-SP INTID ownership, HCR_EL2.VI + HF_INTERRUPT_GET paravirt, CNTHP poll timer, cross-SP preemption, 11/11 BL33 tests pass -**Phase 4.5** (done): pKVM at NS-EL2 + our SPMC at S-EL2 — `make run-pkvm` boots pKVM to BusyBox shell (`Protected hVHE mode initialized successfully`). 
Uses AOSP android16-6.12 kernel (`make build-pkvm-kernel`) with Google's pKVM FF-A proxy (`kvm-arm.mode=protected`). FF-A v1.1 discovery works in both nVHE and protected mode: `ARM FF-A: Driver version 1.2`, `Firmware version 1.1 found`. RXTX_MAP forwarded by SPMD, PARTITION_INFO_GET returns SP1+SP2 descriptors (x3=24 partition_sz). S-EL2 Stage-1 MMU maps NS DRAM with NS=1 bit so writes to pKVM's hyp RX buffer reach Non-Secure memory. Secondary CPU warm-boot: `FFA_SECONDARY_EP_REGISTER` (0xC4000087) + `secondary_entry_sel2` in `boot_sel2.S` + per-CPU stacks (3 × 32KB) + `rust_main_sel2_secondary()` (VBAR → MMU → FFA_MSG_WAIT). SVE workaround: `sve=off` (ENABLE_SVE_FOR_NS=0 conflicts with CTX_INCLUDE_FPREGS=1). SRI/NPI feature IDs now return donated SGI INTIDs (eliminates pKVM `-95` messages) -**M4.6 Sprint S1** (done): SPMC-side memory sharing — MEM_SHARE/LEND/RETRIEVE/RELINQUISH/RECLAIM handlers in spmc_handler.rs with SpmcShareRecord storage, dynamic Secure Stage-2 mapping via Stage2Walker, register-based + descriptor-based protocols, 12 new unit test assertions (54 total), BL33 Test 13 (MEM_SHARE + RECLAIM) -**M4.6 Sprint S2** (done): True E2E memory sharing — SP-initiated MEM_RETRIEVE/RELINQUISH via `handle_sp_exit()` loop in dispatch_to_sp()/resume_preempted_sp(), SP Hello memory test command (x3=0xABCD0001), BL33 Test 14 full lifecycle (NWd SHARE → SP RETRIEVE → SP write → SP RELINQUISH → NWd verify → NWd RECLAIM), 14/14 BL33 tests (incl. alternating SP1/SP2 DIRECT_REQ) -**M4.6 Backlog** (done): QW-1~4 (PSCI v1.0, is_valid_receiver), ME-4 SpinLock for SPMC globals, ME-2 MEM_SHARE forwarding to real SPMC, ME-1 BITMAP_CREATE FFA_HOST_ID fix, ME-5 MEM_FRAG_TX/RX fragmentation, ME-3 SPMC-side MSG_SEND2/MSG_WAIT indirect messaging (per-SP SpMailbox), CONSOLE_LOG (proxy + SPMC + handle_sp_exit), ME-7 SRI/NPI feature IDs (eliminates pKVM `-95 EOPNOTSUPP`). 
~370 assertions / 33 test suites -**Phase 4.6** (done): pKVM E2E validation — FfaMemRegion struct fix (wrong offsets: extra reserved_0, missing ep_mem_size), RETRIEVE_RESP x2=fragment_length (was handle), NWd vs SP RETRIEVE_REQ distinction (pKVM reclaim sends RETRIEVE_REQ to get descriptor — must NOT map pages or mark retrieved), SP2 DIRECT_REQ_64 support (Linux FF-A driver sends 64-bit variant when AARCH64_EXEC set in properties), SP2 MEM_SHARE E2E (BL33 Test 15). ffa_test.ko: 20/20 PASS (SP1 DIRECT_REQ 4 + MEM_SHARE 6, SP2 DIRECT_REQ 4 + MEM_SHARE 6). BL33: 16/16 PASS. `make run-pkvm-ffa-test` -**Phase 4.5 AVF** (partial): AVF validation — crosvm VMM in pKVM host (EL0) creates pVM via /dev/kvm. Protected hVHE mode works without SMMU (`pKVM enabled without an IOMMU driver`). KVM API validated: /dev/kvm, KVM_CREATE_VM, KVM_CREATE_VCPU all PASS (5/5). crosvm fails with `failed to create IRQ chip` — QEMU TCG cannot create `KVM_DEV_TYPE_ARM_VGIC_V3` device. SMMUv3 tested (`iommu=smmuv3`) but hangs at CPU3 GIC redistributor init (custom DTB lacks SMMU nodes). Embedded initramfs approach (nested kernel + crosvm at `/nested/`), virtio-console (`console=hvc0`) fixes ttyS0 probe failure. `make build-crosvm` (Docker cross-compile), `make build-crosvm-initramfs`, `make run-crosvm` (protected mode). Requires ARM64 hardware for full AVF validation. -**Phase 4.7** (done): Security hardening — SPMC cross-SP isolation fix (RETRIEVE/RELINQUISH validate caller==receiver_id via `dispatch_ffa_as_sp()`, prevents SP1 mapping pages into SP2's Stage-2), IPA alignment + page count validation (4KB-aligned, max 65536 pages/range, overflow checks), fragment sender tracking (`NwdFragmentState.sender_id`), `reset_nwd_frag_state()` cleanup helper, stress tests (16-slot exhaustion, interleaved lifecycle, double RETRIEVE, RELINQUISH-without-RETRIEVE). 
Robustness hardening: range count overflow validation (reject > MAX_SHARE_RANGES instead of silent truncation), RXTX_UNMAP fragment state cleanup (NWD_FRAG + NWD_FRAG_RX), MEM_LEND negative tests + E2E lifecycle (BL33 Test 16). ~415 assertions / 34 test suites -**Phase 5.1** (done): SP-to-SP DIRECT_REQ — CallStack cycle detection, recursive dispatch_to_sp, chain preemption (Blocked→Preempted), SP3 (sp_relay) at 0x0e500000, BL33 Tests 17-18 (relay chain + cycle detection). SP-to-SP MEM_SHARE — SP-initiated MEM_SHARE/LEND/RECLAIM in handle_sp_exit, SP1→SP2 Secure DRAM sharing (BL33 Test 19). SP-to-SP MEM_RECLAIM — SP1 persists handle in memory, reclaims after SP2 relinquishes (BL33 Test 20). MEM_DONATE — irrevocable ownership transfer (`is_donate` flag in SpmcShareRecord), RECLAIM/RELINQUISH blocked (DENIED), SP-to-SP DONATE via handle_sp_exit. 20/20 BL33 tests, ~457 assertions / 34 test suites -**Phase 5.1 pKVM** (done): pKVM SP-to-SP E2E verification — SP3 (sp_relay) added to pKVM flash (`build-tfa-pkvm`), SP3 DIRECT_REQ_64 support, ffa_test.ko extended with SP3 echo + relay + SP-to-SP MEM_SHARE + SP-to-SP RECLAIM (SP1→SP2 Secure DRAM sharing through real SPMD chain). ffa_test.ko: 35/35 PASS (SP1 10 + SP2 10 + SP3 6 + SP-to-SP share+reclaim 9). `make run-pkvm-ffa-test` -**Phase 5**: RME & CCA (Realm Manager) - -See `DEVELOPMENT_PLAN.md` for full details. +Guest addresses (heap, kernel load, disk) in `src/platform.rs`. Host hardware (UART/GIC/RAM/CPU) discovered at runtime via `src/dtb.rs` — use `platform::num_cpus()`/`dtb::platform_info()`, not hardcoded constants. `MAX_SMP_CPUS=8` capacity; `SMP_CPUS=4` fallback. 
+
+## Related Documentation
+- `ARCHITECTURE.md` / `docs/architecture.md`: narrative architecture (latter is exhaustive)
+- `DEVELOPMENT_PLAN.md`: full roadmap + per-sprint detail
+- `REQUIREMENTS.md`: feature requirements
+- `CONTRIBUTING.md`: dev setup
+- `docs/RUST_FIRMWARE_CODING_GUIDELINES.md`: full coding standards
+
+## Roadmap (NS-EL2 → S-EL2 SPMC → pKVM)
+All phases below are **done** unless noted; see `DEVELOPMENT_PLAN.md` for per-sprint detail.
+
+> **E2E re-verified 2026-05-26** on a native ARM64 (aarch64) Linux host using the **distro QEMU 8.2.2** (no custom QEMU 9.2.3 needed): `sg docker -c 'make build-tfa-spmc'` + `make run-spmc` boots the full chain TF-A BL31 v2.12.0+SPMD @ EL3 → our SPMC @ S-EL2 → SP1/SP2/SP3 @ S-EL1 → BL33 FF-A client, **20/20 BL33 tests PASS**. The Makefile note "secure=on requires QEMU 9.2+" is overly cautious; 8.2.2 handles S-EL2 fine. `make run` (NS-EL2 TCG unit tests) also passes 34/34. Secure world is TCG-only (KVM can't virtualize EL3/Secure), and that's sufficient.
+
+- **Phase 3**: NS-EL2 complete (2MB block split, notifications, indirect messaging).
+- **Phase 4 / Sprint 5.1-5.2 / Phase C / Phase D**: TF-A boot chain, SPMC + SPs, DIRECT_REQ E2E (NS proxy → SPMD → SPMC → SP), RXTX + PARTITION_INFO forwarding, NS interrupt preemption, multi-SP + secure vIRQ injection.
+- **Phase 4.5**: pKVM @ NS-EL2 + our SPMC @ S-EL2; `make run-pkvm` boots AOSP android16-6.12 to BusyBox (`Protected hVHE mode initialized successfully`), FF-A v1.1 discovery works.
+- **M4.6 + Phase 4.6**: SPMC-side + true E2E memory sharing, FfaMemRegion struct fix, NWd-vs-SP RETRIEVE distinction, SP2 DIRECT_REQ_64. `ffa_test.ko`: 20/20.
+- **Phase 4.7**: security hardening (cross-SP isolation, IPA/page-count validation, fragment tracking, stress tests, MEM_LEND negatives).
+- **Phase 5.1 (+pKVM)**: SP→SP DIRECT_REQ (cycle detection, chain preemption, SP3 relay), SP↔SP MEM_SHARE/RECLAIM, MEM_DONATE (irrevocable). `ffa_test.ko`: 35/35.
~457 assertions / 34 suites.
+- **Phase 4.5 AVF** (partial): crosvm VMM in pKVM host (EL0) creates a VM via /dev/kvm. KVM API validated (5/5: /dev/kvm, CREATE_VM, CREATE_VCPU). Blocked: crosvm `failed to create IRQ chip` (QEMU TCG can't create the in-kernel vGICv3 `KVM_DEV_TYPE_ARM_VGIC_V3`). **Re-confirmed 2026-05-26 on a native ARM64 host + stock QEMU 8.2.2** (`make run-crosvm`, full rebuild via Docker): the guest boots pKVM cleanly (`Protected hVHE mode initialized successfully`, `nv: 554/669 trap handlers`, `/dev/kvm` PASS) but crosvm still dies at `failed to create IRQ chip` ~1s after `crosvm run`. The Makefile's crosvm command is **non-protected** (no `--protected-vm`), so this also proves a *normal* (non-pVM) crosvm guest hits the same vGICv3 wall. Protected vs non-protected is irrelevant; the wall is the in-kernel vGICv3 under TCG, and being on real ARM hardware does not change it. crosvm is KVM-only (no emulation fallback). Real fix needs nested-virt KVM (`-accel kvm` + host `kvm.nested=1`) or native crosvm on the host (`/dev/kvm` + pKVM-mode boot). NOTE: KVM cannot virtualize EL3/Secure, so the full NS→Secure chain stays TCG regardless of host (and that's sufficient: 20/20, verified above).
+- **Phase 5** (next): RME & CCA (Realm Manager).
diff --git a/docs/devto-e2e-on-arm.md b/docs/devto-e2e-on-arm.md
new file mode 100644
index 0000000..6d8b399
--- /dev/null
+++ b/docs/devto-e2e-on-arm.md
@@ -0,0 +1,164 @@
+---
+title: "Field Notes: Booting a Full NS→Secure ARM Chain on a Real ARM Server with Stock QEMU"
+published: true
+description: "TF-A → our S-EL2 SPMC → Secure Partitions → FF-A client, 20/20 E2E on a bare ARM64 box: no /dev/kvm, no sudo, no custom QEMU. Plus: why KVM can't save the Secure world, and where Android AVF actually stops."
+tags: rust, arm, hypervisor, qemu
+cover_image:
+canonical_url: https://willamhou.github.io/hypervisor/
+---
+
+My ARM64 Type-1 hypervisor has a Secure-world personality: an S-EL2 **SPMC** (Secure Partition Manager Core) that runs as TF-A's BL32, manages Secure Partitions at S-EL1, and speaks ARM's FF-A v1.1 protocol. Until now I'd only ever exercised that chain under QEMU TCG on an x86 dev box.
+
+This week I got a real ARM64 Linux server (aarch64, `/dev/kvm` present, SVE2, kernel 6.11). The obvious question:
+
+> On real ARM hardware, can I finally use KVM acceleration, and maybe even run Android's AVF?
+
+The answer is more interesting than yes/no. These are the field notes from getting the full chain up, in the order it actually happened, traps included.
+
+{% github willamhou/hypervisor %}
+
+## Counterintuitive #1: a real ARM box + /dev/kvm still can't help the Secure world
+
+It's tempting to assume x86 → "emulation only", ARM → "KVM, full speed". The Secure world is the exception. The chain I need to boot is:
+
+```
+EL3     TF-A BL31 + SPMD        ← secure monitor
+S-EL2   our SPMC                ← manages Secure Partitions
+S-EL1   SP1 / SP2 / SP3         ← the partitions
+NS-EL2  pKVM / test client
+NS-EL1  Linux/Android
+```
+
+The hard fact: **QEMU's `secure=on` (EL3, Secure world, S-EL2) is TCG-only. KVM cannot virtualize EL3 or the Secure world**: there is no "nested Secure virtualization" on ARM. So this NS→Secure chain runs under TCG regardless of whether the host is x86 or ARM.
+
+What's `/dev/kvm` good for here, then? Only the **pure Normal-world** paths (accelerating NS-EL2 pKVM, or AVF's crosvm), and only with extra conditions (more below).
+
+Correction #1: **TCG for the Secure world isn't a compromise; it's the only correct way, and it's entirely sufficient.**
+
+## Trap #1: the box had no QEMU, and no sudo
+
+`make run` wants `qemu-system-aarch64`. It wasn't in `PATH`, or anywhere on the system.
Worse:
+
+- no passwordless `sudo` (can't install packages non-interactively)
+- not in the `docker` group (every TF-A/QEMU build target uses Docker)
+- not in the `kvm` group (`/dev/kvm` → Permission denied)
+
+I first tried a **fully root-free** route: micromamba (a single static binary) + conda-forge to install QEMU. Download and extract were fine; the `linux-aarch64` channel was not:
+
+```
+qemu =* * does not exist
+```
+
+conda-forge's ARM64 channel ships only **`qemu.qmp` (a Python lib), not the QEMU emulator itself**. The x86 channel has it; arm64 doesn't. Worth remembering: don't assume conda gives you `qemu-system` on arm64.
+
+Building from source? Missing `glib`, `pixman`, `meson`, `ninja`; none installed.
+
+The plainest path won. One sudo line from someone who had it:
+
+```bash
+sudo apt install -y qemu-system-arm   # this package also provides qemu-system-aarch64
+```
+
+That gave me QEMU **8.2.2**. Remember that version; there's a payoff later.
+
+## Normal world first: make run, 34/34
+
+Basic unit-test suites (NS-EL2 + TCG, no Secure world):
+
+```bash
+qemu-system-aarch64 -machine virt,virtualization=on,gic-version=3 \
+    -cpu max -smp 4 -m 2G -nographic -kernel target/.../hypervisor
+```
+
+Clean boot to EL2 (GICv3, 16 MB heap, timer all up), and **all 34 test suites ran with zero failures.**
+
+Two gotchas worth noting: the final suite `guest_interrupt` enters a guest and **never returns** by design, so wrap QEMU in `timeout --foreground 90 ...` or it hangs. And `[INIT] DTB: parse failed, using defaults` is expected: in `-kernel` mode QEMU passes DTB address 0, and the code falls back to virt defaults.
+
+## Trap #2: I joined the docker group but my session couldn't see it
+
+The Secure world needs Docker. After `sudo usermod -aG docker $USER`, `/etc/group` confirmed it:
+
+```
+docker:x:988:wilamhou
+```
+
+…yet `id` in my shell still didn't list `docker`.
Group changes only take effect for **newly logged-in processes**, and my shell's parent process predated the change; re-logging into the UI doesn't restart that process.
+
+Rather than restart the whole session, `sg` (switch group) reads `/etc/group` at runtime:
+
+```bash
+sg docker -c 'docker ps'              # works immediately, no re-login
+sg docker -c 'make build-tfa-spmc'
+```
+
+`sg docker -c '<command>'` runs the entire command as the docker group, including the `docker run` children it forks. Handy whenever you've just been added to a group and don't want to log out.
+
+## Build + boot the full Secure chain
+
+One command builds TF-A + the SPMC + all three SPs (compiled inside Docker, ~10–20 min the first time):
+
+```bash
+sg docker -c 'make build-tfa-spmc'
+```
+
+A landmine here, in passing: at S-EL2, `CPTR_EL3.TFP` traps all FP/SIMD, and Rust **debug** builds emit NEON (`cnt v0.8b`) for `read_volatile` alignment checks, which **silently hangs** the moment it executes. The repo's `[profile.dev]` already sets `opt-level = 1`, so the `sel2` build emits no NEON and the trap is defused before it bites.
+
+With artifacts in place (`flash-spmc.bin` 64 MB, `hypervisor_spmc.bin`, SP1/2/3), run it:
+
+```bash
+qemu-system-aarch64 -machine virt,secure=on,virtualization=on,gic-version=3 \
+    -cpu max -smp 4 -m 2G -nographic -bios tfa/flash-spmc.bin -nic none
+```
+
+The whole four-level privilege stack comes up:
+
+```
+NOTICE:  BL31: v2.12.0(debug)              ← EL3 TF-A + SPMD
+[SPMC] Running at EL2                      ← S-EL2, our SPMC
+[SPMC] spmc_id=0x8000 version=1.1
+[SPMC] S-EL2 Stage-1 MMU enabled (NS DRAM mapped)
+[SP] Hello from S-EL1!                     ← SP1
+[SP2] Hello from S-EL1!                    ← SP2
+[SP3] Hello from S-EL1 (sp_relay)          ← SP3
+...
+  Test 1: FFA_VERSION .............. PASS
+  ...
+  Test 20: SP-to-SP MEM_RECLAIM .... PASS
+  All tests complete.
+```
+
+**20/20 BL33 tests pass**: FF-A discovery, DIRECT_REQ (incl.
multi-SP), preemption + FFA_RUN, secure vIRQ injection, full MEM_SHARE/LEND lifecycles, and SP↔SP relay + cycle detection + memory share/reclaim. The complete chain works.
+
+## The biggest payoff: stock QEMU 8.2.2 was enough
+
+The Makefile carries this note:
+
+> Local QEMU 9.2+ for S-EL2 targets (secure=on requires newer QEMU)
+
+implying you must compile QEMU 9.2.3 from source (a 20–40 minute job) for Secure-world targets. In practice, **Ubuntu's stock QEMU 8.2.2 runs S-EL2 just fine; that "needs 9.2+" note is overly cautious.** The custom-QEMU build step was unnecessary.
+
+(Mechanically, `QEMU_SEL2` is "use `tools/qemu-system-aarch64` if present, else fall back to system qemu". By not building a custom one, it picks up 8.2.2 automatically.)
+
+## So can you run Android AVF under QEMU?
+
+The other big question of the day. It splits into two layers:
+
+| Layer | What it is | Runs under QEMU? |
+|---|---|---|
+| **pKVM** (EL2 hypervisor) | protected KVM, `kvm-arm.mode=protected` | ✅ Yes. `make run-pkvm` boots an AOSP kernel in pKVM mode to BusyBox under TCG, FF-A working |
+| **crosvm / pVM** (EL0 creates protected VMs) | a VMM creating pVMs via `/dev/kvm` | ❌ Not currently; fails with `failed to create IRQ chip` |
+
+Why crosvm stalls: it calls `/dev/kvm` **inside** the guest to create a pVM, which needs the guest kernel's KVM to actually work, and QEMU TCG can't create `KVM_DEV_TYPE_ARM_VGIC_V3` (the vGICv3 device). There are two ways to make it work: run QEMU with `-accel kvm` **plus nested virtualization** (giving the guest a real EL2), which this host rules out because `kvm.nested` is disabled; or run crosvm **natively** on the host, which is blocked by `/dev/kvm` permissions. None of this involves my hypervisor; it's the boundary of Android's own pKVM + crosvm under pure TCG.
+
+In one line: **Android's pKVM hypervisor layer runs under QEMU today; full AVF's pVM-creation layer does not under TCG**; you need nested-virt KVM or native KVM access.
+
+## Reusable takeaways
+
+1.
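**The Secure world (EL3/S-EL2) is always TCG.** KVM can't help, and TCG is plenty, so don't agonize over acceleration here.
+2. **conda-forge's arm64 channel has no `qemu-system` binary.** Don't burn time there.
+3. **`sg docker -c '...'`** uses a freshly-added group without logging out.
+4. **Stock QEMU 8.2.2 handles S-EL2 `secure=on`**; no need to compile 9.2.3.
+5. **S-EL2 Rust builds must be `opt-level >= 1`**, or debug-mode NEON alignment checks hang silently.
+6. Wrap `run` / `run-spmc` in `timeout`, because they end in an idle/blocking state.
+
+The real upside of running on ARM hardware isn't the Secure world (that's TCG by design); it's the *future* possibility of unlocking AVF's crosvm layer, which needs one of two doors opened: nested virtualization, or native `/dev/kvm`. That's the next post.
diff --git a/docs/zhihu/e2e-on-arm-fieldnotes.md b/docs/zhihu/e2e-on-arm-fieldnotes.md
new file mode 100644
index 0000000..25451ec
--- /dev/null
+++ b/docs/zhihu/e2e-on-arm-fieldnotes.md
@@ -0,0 +1,173 @@
+# On a real ARM server, the distro's stock QEMU ran the full NS→Secure chain
+
+## Before we start
+
+Until now, this project's Secure-world chain (TF-A → our S-EL2 SPMC → Secure Partitions) had only ever run under pure QEMU TCG emulation on an x86 dev machine. This time I had a real ARM64 Linux server (aarch64, with `/dev/kvm`, SVE2, NVIDIA kernel 6.11), and I wanted to test a simple question:
+
+**Once on a real ARM machine, can I use KVM acceleration, and maybe even get Android AVF running?**
+
+The conclusion is a bit counterintuitive, and it is the one thing these field notes most want to make clear. What follows is recorded in the order things actually happened that day, including a few traps and how I got around them.
+
+---
+
+## Counterintuitive point #1: a real ARM machine + /dev/kvm still can't save the Secure world
+
+Many people (me included, at the time) instinctively assume that x86 can only emulate, so on native ARM hardware you turn on KVM and the whole chain gets accelerated.
+
+But the Secure-world chain is the exception. What we need to run is:
+
+```
+EL3     TF-A BL31 + SPMD        ← secure monitor
+S-EL2   our SPMC                ← manages Secure Partitions
+S-EL1   SP1 / SP2 / SP3         ← the Secure Partitions
+NS-EL2  pKVM / test client
+NS-EL1  Linux/Android
+```
+
+The key point: **QEMU's `secure=on` (EL3, Secure world, S-EL2) can only be emulated under TCG; KVM simply cannot virtualize EL3 or the Secure world**. There is no such thing as "nested Secure virtualization" on ARM. So whether the host is x86 or ARM, this full NS→Secure chain can only run under TCG.
+
+So what is `/dev/kvm` still good for on this machine? It only matters for the **pure Normal world** (accelerating pKVM at NS-EL2, AVF's crosvm), and even then extra conditions apply, as described later.
+
+So the first correction is: **running the Secure world under TCG is not a compromise; it is the only correct way. And TCG is entirely sufficient.**
+
+---
+
+## Trap #2: this machine had no QEMU at all, and no sudo
+
+I wanted to run `make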
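run`, but `qemu-system-aarch64` was not in PATH and could not be found anywhere on the system. More awkward still:
+
+- the current user has no passwordless `sudo` (can't install packages non-interactively)
+- not in the `docker` group (every TF-A/QEMU build in the repo relies on Docker)
+- not in the `kvm` group (`/dev/kvm` gives Permission denied outright)
+
+I first tried a **completely root-free** route: using micromamba (a single-file static binary) to install QEMU from conda-forge. Download and extraction went smoothly, but it fell over on the `linux-aarch64` channel:
+
+```
+qemu =* * does not exist
+```
+
+conda-forge's ARM64 channel **has only the `qemu.qmp` Python library, not the QEMU emulator itself**. The x86 channel has it; arm64 does not. This trap is worth noting: don't take it for granted that conda can give you qemu-system on arm64.
+
+Build from source? Then `glib` / `pixman` / `meson` / `ninja` were missing; none was installed on the system.
+
+In the end the plainest option was the fastest: have someone with sudo run one line
+
+```bash
+sudo apt install -y qemu-system-arm   # this package also ships qemu-system-aarch64
+```
+
+That yields QEMU **8.2.2**. Remember this version number; there's a surprise later.
+
+---
+
+## Normal world first: make run, 34/34
+
+With QEMU in hand, first run the most basic unit-test suites (NS-EL2 + TCG, not touching the Secure world):
+
+```bash
+qemu-system-aarch64 -machine virt,virtualization=on,gic-version=3 \
+    -cpu max -smp 4 -m 2G -nographic -kernel target/.../hypervisor
+```
+
+A clean boot to EL2, with GICv3 init, the 16MB heap, and the timer all ready, and **all 34 test suites ran with zero failures**.
+
+A small tip: the last suite, `guest_interrupt`, enters the guest and **never returns** (by design), so remember to wrap it in `timeout --foreground 90 qemu...`, otherwise it hangs forever. Likewise, the `[INIT] DTB: parse failed, using defaults` line in the log is expected: in `-kernel` mode QEMU passes a DTB address of 0, and the code falls back to the virt defaults.
+
+---
+
+## Trap #3: added to the docker group, but the current session can't see it
+
+Running the Secure world requires Docker. After `sudo usermod -aG docker $USER`, `/etc/group` did indeed have the entry:
+
+```
+docker:x:988:wilamhou
+```
+
+But when running `id`, the current shell's live groups **still had no docker**. The reason: group membership changes only take effect for **newly logged-in processes**, and my shell's parent process was started before the group change; re-logging into the UI does not restart that underlying process.
+
+If you don't want to restart the whole session, there is a clean way: `sg` (switch group), which reads `/etc/group` at run time:
+
+```bash
+sg docker -c 'docker ps'              # works immediately, no re-login needed
+sg docker -c 'make build-tfa-spmc'
+```
+
+`sg docker -c '<command>'` runs the whole command as the docker group, and the `docker run` child processes it forks get the group permission too. This trick is very handy in the "just added to a group and don't want to log out" situation.
+
+---
+
+## Build + run the full Secure-world chain
+
+One command handles building TF-A + the SPMC + the three SPs (compiled inside Docker, about 10-20 minutes the first time):
+
+```bash
+sg docker -c 'make build-tfa-spmc'
+```
+
+A landmine worth mentioning in passing: at S-EL2, `CPTR_EL3.TFP` traps all FP/SIMD, and Rust debug builds emit NEON instructions (`cnt v0.8b`) for the alignment check in `read_volatile`, which **silently hangs** as soon as it executes. Fortunately, the repo's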
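`[profile.dev]` already sets `opt-level` to 1, so the default sel2 build emits no NEON and this trap was plugged ahead of time.
+
+With the artifacts in place (`flash-spmc.bin` 64MB, `hypervisor_spmc.bin`, SP1/2/3), run:
+
+```bash
+qemu-system-aarch64 -machine virt,secure=on,virtualization=on,gic-version=3 \
+    -cpu max -smp 4 -m 2G -nographic -bios tfa/flash-spmc.bin -nic none
+```
+
+The output comes in one go, and the whole four-level privilege chain is up:
+
+```
+NOTICE:  BL31: v2.12.0(debug)              ← EL3 TF-A + SPMD
+[SPMC] Running at EL2                      ← S-EL2, our SPMC
+[SPMC] spmc_id=0x8000 version=1.1
+[SPMC] S-EL2 Stage-1 MMU enabled (NS DRAM mapped)
+[SP] Hello from S-EL1!                     ← SP1
+[SP2] Hello from S-EL1!                    ← SP2
+[SP3] Hello from S-EL1 (sp_relay)          ← SP3
+...
+  Test 1: FFA_VERSION .............. PASS
+  ...
+  Test 20: SP-to-SP MEM_RECLAIM .... PASS
+  All tests complete.
+```
+
+**All 20/20 BL33 tests pass**: FF-A discovery, DIRECT_REQ (including multi-SP), preemption + FFA_RUN, secure vIRQ injection, the full MEM_SHARE/LEND lifecycle, and SP↔SP relay + cycle detection + memory share/reclaim. The whole chain works end to end.
+
+---
+
+## The biggest surprise: the distro's stock QEMU 8.2.2 is enough
+
+The repo's Makefile has this comment:
+
+> Local QEMU 9.2+ for S-EL2 targets (secure=on requires newer QEMU)
+
+meaning that Secure-world targets need you to compile QEMU 9.2.3 from source yourself (a big 20-40 minute job). But in actual testing this time, **Ubuntu's stock QEMU 8.2.2 runs S-EL2 with no problem at all; that "needs 9.2+" comment is overly conservative.** That skipped the custom QEMU build step entirely.
+
+(Mechanically, the `QEMU_SEL2` variable means "use `tools/qemu-system-aarch64` if it exists, otherwise fall back to the system qemu". We don't build a custom version, so it automatically uses the system's 8.2.2, which is exactly right.)
+
+---
+
+## So can Android AVF actually run under QEMU?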
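+
+This was the other core question of the day. The answer has to be split into two layers:
+
+| Layer | What it is | Runs under QEMU? |
+|---|---|---|
+| **pKVM** (EL2 hypervisor) | protected KVM with `kvm-arm.mode=protected` | ✅ Yes. `make run-pkvm` already boots the AOSP kernel in pKVM mode to BusyBox under TCG, with FF-A working |
+| **crosvm / pVM** (EL0 creates protected VMs) | a VMM creating pVMs via `/dev/kvm` | ❌ Not currently; stuck at `failed to create IRQ chip` |
+
+Why the crosvm layer gets stuck: crosvm has to call `/dev/kvm` again **inside** the guest to create a pVM, which requires the guest kernel's KVM to be truly usable, and QEMU TCG cannot create the `KVM_DEV_TYPE_ARM_VGIC_V3` (vGICv3) KVM device. To make it work, either QEMU uses `-accel kvm` + **nested virtualization** (giving the guest a real EL2), but `kvm.nested` is not enabled on this machine; or you **run crosvm natively on the host**, but `/dev/kvm` permissions block that. This has nothing to do with our hypervisor; it is the boundary of Android pKVM+crosvm itself under pure TCG.
+
+So, in one sentence: **Android's pKVM hypervisor layer already runs in QEMU; the pVM-creation layer of full AVF does not run under QEMU TCG**, and needs nested virtualization or native KVM.
+
+---
+
+## A few reusable lessons
+
+1. **The Secure world (EL3/S-EL2) is always TCG.** KVM can't help, but TCG is entirely sufficient, so don't agonize over whether to use KVM.
+2. **conda-forge's arm64 channel has no qemu-system binary**; don't waste time on it.
+3. **`sg docker -c '...'`** gives you the group permission immediately when you've just been added to the docker group and don't want to log out.
+4. **The distro's stock QEMU (8.2.2) is enough to run S-EL2 `secure=on`**; there is no need to compile 9.2.3.
+5. **S-EL2 Rust builds must use `opt-level >= 1`**, otherwise the debug-mode NEON alignment check hangs silently.
+6. When running `run` / `run-spmc`, remember to wrap them in `timeout`, because they end in an idle/blocked state.
+
+The real gain that real ARM hardware brings to the project is not in the Secure world (which should be TCG anyway), but in **whether it can unlock AVF's crosvm layer in the future**, provided one of two doors is opened: nested virtualization or native `/dev/kvm`. That is a matter for the next post.
diff --git a/guest/linux/build-crosvm-native.sh b/guest/linux/build-crosvm-native.sh
new file mode 100644
index 0000000..10f9e68
--- /dev/null
+++ b/guest/linux/build-crosvm-native.sh
@@ -0,0 +1,70 @@
+#!/bin/bash
+# Build crosvm NATIVELY on an aarch64 host (no cross-compile).
+# Reuses the crosvm-build-cache clone; fixes the minijail bindgen
+# 'sys/resource.h not found' failure caused by the cross sysroot.
+set -euo pipefail
+export DEBIAN_FRONTEND=noninteractive
+
+echo "=== Building crosvm natively (aarch64 host) ==="
+echo ">>> arch: $(dpkg --print-architecture)"
+
+echo ">>> Installing native build dependencies..."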
+apt-get update -qq
+apt-get install -y -qq \
+    build-essential pkg-config git curl ca-certificates \
+    protobuf-compiler libprotobuf-dev python3 \
+    libcap-dev libfdt-dev libc6-dev \
+    libclang-dev clang \
+    2>&1 | tail -3
+
+export RUSTUP_HOME=/build/.rustup
+export CARGO_HOME=/build/.cargo
+if [ ! -f "$CARGO_HOME/bin/rustup" ]; then
+    echo ">>> Installing Rust..."
+    curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y --default-toolchain stable 2>&1 | tail -3
+fi
+export PATH="$CARGO_HOME/bin:$PATH"
+
+cd /build
+if [ ! -d crosvm ]; then
+    echo ">>> Cloning crosvm..."
+    git clone --depth 1 https://chromium.googlesource.com/crosvm/crosvm 2>&1 | tail -3
+fi
+cd crosvm
+git submodule update --init --depth 1 2>&1 | tail -3
+echo ">>> crosvm commit: $(git rev-parse --short HEAD)"
+
+# Pin toolchain to the channel crosvm wants, native target only.
+if [ -f rust-toolchain ] || [ -f rust-toolchain.toml ]; then
+    CHANNEL=$(grep 'channel' rust-toolchain* | head -1 | sed 's/.*"\(.*\)".*/\1/')
+    echo ">>> Patching rust-toolchain (channel=$CHANNEL, native)..."
+    cat > rust-toolchain << TOOLCHAIN
+[toolchain]
+channel = "$CHANNEL"
+components = [ "rustfmt", "clippy", "llvm-tools-preview" ]
+TOOLCHAIN
+fi
+
+# Remove any cross config left by the previous run.
+rm -f .cargo/config.toml
+
+rustup show 2>&1 | tail -3
+echo ">>> Rust: $(rustc --version)"
+
+# NATIVE build: no --target, no cross sysroot. bindgen/clang find
+# /usr/include/aarch64-linux-gnu/sys/resource.h on their own.
+echo ">>> Building crosvm natively (--no-default-features)..."
+cargo build --release --no-default-features 2>&1
+
+echo ""
+ls -lh target/release/crosvm
+cp target/release/crosvm /output/crosvm
+echo ">>> NEEDED libs:"
+readelf -d target/release/crosvm | grep NEEDED || true
+
+mkdir -p /output/crosvm-libs
+for lib in libc.so.6 libm.so.6 libdl.so.2 libpthread.so.0 libgcc_s.so.1 ld-linux-aarch64.so.1 libcap.so.2; do
+    find /usr/lib/aarch64-linux-gnu/ /lib/aarch64-linux-gnu/ -name "$lib*" -exec cp -L {} /output/crosvm-libs/ \; 2>/dev/null || true
+done
+echo ">>> Libs:"; ls -la /output/crosvm-libs/
+echo "=== crosvm native build complete: /output/crosvm ($(du -sh /output/crosvm | awk '{print $1}')) ==="
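
Review note on `build-crosvm-native.sh`: the toolchain-pinning step pulls the channel name out of crosvm's `rust-toolchain*` file with a `grep | head | sed` pipeline. The sketch below exercises that same pipeline on a scratch file, which may be a convenient way to sanity-check it without a crosvm checkout. The `extract_channel` helper, the temp directory, and the `1.81.0` channel value are all assumptions made up for illustration; none of them come from the repo or from crosvm.

```shell
#!/bin/bash
# Hypothetical stand-alone check of the channel-extraction pipeline used by
# build-crosvm-native.sh. extract_channel is an illustrative helper, not
# something that exists in the repository.
set -euo pipefail

extract_channel() {
    # $1: path to a rust-toolchain or rust-toolchain.toml file.
    # Take the first line mentioning "channel" and keep the quoted value.
    grep 'channel' "$1" | head -1 | sed 's/.*"\(.*\)".*/\1/'
}

# Build a scratch toolchain file; the channel value is an arbitrary example.
tmpdir=$(mktemp -d)
printf '[toolchain]\nchannel = "1.81.0"\ncomponents = [ "rustfmt" ]\n' \
    > "$tmpdir/rust-toolchain.toml"

channel=$(extract_channel "$tmpdir/rust-toolchain.toml")
echo "channel=$channel"

rm -rf "$tmpdir"
```

The `sed` expression relies on the channel line holding exactly one quoted string, which is why the sketch tests a single-file path; with several files matching the glob, `grep` would prefix each match with a filename, and the quoted value would still be what `sed` keeps.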