Skip to content

Patches: template arch-derived paths in auto-generated base patch#820

Open
lacraig2 wants to merge 11 commits into
mainfrom
config-reshape-patches
Open

Patches: template arch-derived paths in auto-generated base patch#820
lacraig2 wants to merge 11 commits into
mainfrom
config-reshape-patches

Conversation

@lacraig2

@lacraig2 lacraig2 commented Jun 6, 2026

Copy link
Copy Markdown
Collaborator

Third PR in the config-reshape series. Targets config-reshape-core (#818) — it builds on that branch's Jinja templating. Independent of the plugin PR (#819); they're siblings on #818.

Puts the new templating to work in Penguin's auto-generated patches so architecture-dependent values are resolved from core.arch at load time instead of being compiled into the generated YAML. This directly addresses the original gripe that changing arch didn't update arch-derived values baked into patches.

What changes

  • New arch-derived template variables exposed by templating.build_context, derived from the file's core.arch:
    • {{ arch_dir }} — generic static subdir (e.g. intel64x86_64, powerpc64lepowerpc64)
    • {{ dylib_dir }} — prebuilt-dylib subdir (e.g. aarch64arm64, powerpc64leppc64el)
    • Imports are lazy/guarded so self-contained contexts (schema generation) are unaffected.
  • arch.get_dylib_subdir() is now the single source of truth for the dylib mapping; BasePatch.set_arch_info uses it instead of an inline copy (verified equivalent for every schema-supported arch).
  • BasePatch.generate emits {{ dylib_dir }} / {{ arch_dir }} in the /igloo/dylibs/* and /igloo/utils/* host paths instead of baking the subdir in. The generated base.yaml defines core.arch, so these resolve when load_config renders the patch.
  • core.arch and core.kernel deliberately stay concrete — they're the source values (templating them would be circular). Serial-device major/minor (conditional ints) and kernel-module version paths (padded-version mismatch with {{ kernel_version }}) are intentionally left baked; noted for a possible follow-up.

Testing

  • tests/unit_tests/test_config.py: +10 tests (32 total, all passing on host) — derived-var values across 7 arches, get_dylib_subdir parity, in-patch resolution, and an assertion that the generator emits the placeholders.
  • End-to-end load_config (a patch that defines core.arch and uses the derived vars) resolves to /igloo_static/dylibs/arm64/* and /igloo_static/aarch64/* as expected.

@lacraig2 lacraig2 force-pushed the config-reshape-core branch from f1b6227 to 74f934b Compare June 7, 2026 02:09
@lacraig2 lacraig2 force-pushed the config-reshape-patches branch from 098c882 to d87ba87 Compare June 7, 2026 02:09
Base automatically changed from config-reshape-core to main June 7, 2026 02:55
@lacraig2 lacraig2 force-pushed the config-reshape-patches branch from d87ba87 to 388e2b4 Compare June 7, 2026 03:08
@lacraig2 lacraig2 marked this pull request as ready for review June 7, 2026 03:23
@lacraig2 lacraig2 force-pushed the config-reshape-patches branch 3 times, most recently from 48d2db9 to 7250a13 Compare June 9, 2026 00:00
Use the new Jinja templating in auto-generated config patches so
architecture-dependent values are resolved from core.arch at load time instead
of being compiled into the generated YAML.

- Expose two arch-derived template variables, {{ arch_dir }} (generic static
  subdir, e.g. intel64 -> x86_64) and {{ dylib_dir }} (prebuilt-dylib subdir,
  e.g. aarch64 -> arm64), in templating.build_context, derived from the file's
  core.arch (lazy/guarded imports so schema-gen contexts are unaffected).
- Add arch.get_dylib_subdir() as the single source of truth for the dylib
  mapping; BasePatch.set_arch_info now uses it instead of an inline copy.
- BasePatch.generate emits {{ dylib_dir }} / {{ arch_dir }} in the /igloo/dylibs
  and /igloo/utils host paths instead of baking the subdir in. core.arch and
  core.kernel stay concrete (they are the source values).
- Tests for the derived vars, mapping parity, in-patch resolution, and that the
  generator emits the placeholders. Docs updated.
lacraig2 added 10 commits June 10, 2026 02:00
Capture which step faults on the ppc64 MTD read-back without aborting:
- enable kernel print-fatal-signals (logs faulting comm + nip to console.log)
- run the read-back as a standalone cat (capture exit code + size + hexdump)
  instead of inside a pipeline, before and after the write, repeated
- only print the PASS markers when the read-back actually contains the new
  data, so working arches stay green while ppc64 records diagnostics

Temporary diagnostic commit; revert once the fault is characterized.
Green run showed the standalone 'cat /dev/mtdN > file' read-back works on
ppc64 -- the segfault is bound to the 'cat | strings | grep' pipeline.
Add stage isolation (strings-on-file, cat|cat, cat|strings, full pipe) and
gate PASS on the original pipeline so ppc64 fails and uploads artifacts with
the per-step exit codes + the kernel fatal-signal line.
Warm-up reads made the fault vanish (commits 1-2 went green), so the fault
needs the piped read to be the first read after the write. Restore the exact
original single-read/write/single-piped-read order and only arm fault capture
(print-fatal-signals, show_unhandled_signals, core_pattern -> shared, ulimit -c)
so the failing ppc64 run uploads the faulting process+nip and a coredump.
The ppc64 native_mmap MTD read-back segfault is a host-side memory bug, not a
test/data bug: a standalone 'cat /dev/mtdN > file' read-back returns correct
data, but the original 'cat | strings | grep' pipeline deterministically
segfaults a sibling process, and any test-script perturbation hides it.

The MTD read callback delivered data via plugins.mem.write -> the PANDA
virtual_memory_write_external fast path, which translates the guest kernel
read buffer's virtual address host-side. On ppc64 that translation appears
unreliable for some kbuf addresses and corrupts a co-running process.

Add an opt-in write_bytes(prefer_portal=True) that skips the PANDA fast path
and writes via the guest-executed portal path, and use it for the native MTD
read callback. Test script restored to the exact original faulting sequence so
CI verifies the fix without perturbing the heisenbug.
Restore the original PANDA virtual-memory write in the MTD read callback (the
suspected-faulty path) and, after each write, read the same kernel VA back two
ways: via PANDA (cpu_memory_rw_debug, same host translation) and via the portal
(guest-executed, the guest's real mapping). Log any mismatch + the VA to
results/issue831_probe.txt.

- portal_rb != src (or panda_rb != portal_rb)  => PANDA write hit a different
  physical page than the guest sees -> wrong-PA confirmed (host translation bug)
- everything matches                            => write lands correctly; retract
  the corruption story (mechanism is timing/coherency, fix is perturbation)

add read_bytes(prefer_portal=) to mirror write_bytes. A never-matching verifier
condition forces artifact upload so the probe file is captured even if the
heisenbug crash does not reproduce. Guest test script left byte-identical.
Revert the wrong-PA probe and the forced-failure verifier sentinel. The probe
showed that whenever the PANDA write is observable (panda + portal read-back),
panda==portal==src on every read incl. post-write read-backs at the same kernel
VA class as the faulting run -> the 'wrong physical page' hypothesis is NOT
supported; the per-read portal round-trip also perturbed the heisenbug away
(mmap_test PASSed under the probe).

Conclusion: the fault is in the host-side PANDA memory write (cpu_memory_rw_debug)
used by the MTD read callback on ppc64, is timing/coherency-sensitive (any
observation hides it), and writing via the guest-executed portal path
deterministically avoids it. That portal-write fix is what remains here.
Single CI run, two contemporaneous data points (both use the PANDA write path):
- native_mmap: revert fix -> PANDA write, full native-mmap machinery intact
  (qemu_mem aperture + dev/proc/anon mmap). BASELINE, expect red.
- mtd_only (new minimal plugin+test): one OOP MTD device on the PANDA write
  path, NO qemu_mem aperture, NO dev/proc/anon mmap, NO micropython -- just the
  cat|strings|grep read-back.

If native_mmap red but mtd_only green -> native-mmap/qemu_mem machinery is
required to trigger the fault. If mtd_only also red -> it is the MTD callback's
PANDA write itself, independent of native mmap. Temporary; restore fix after.
…olding

The mtd_only isolation test was inconclusive: adding it to the suite shifted
timing/layout enough to perturb the heisenbug away (baseline native_mmap with
the PANDA write also passed). Differential testing via suite/script edits is
confounded by the bug's fragility. Revert the experiment and keep the
host-side portal-write fix; correct the code comments to the honest mechanism
(timing-sensitive fault in the PANDA host write; not a confirmed wrong-PA).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant