Patches: template arch-derived paths in auto-generated base patch#820
Open
lacraig2 wants to merge 11 commits into
Open
Patches: template arch-derived paths in auto-generated base patch#820lacraig2 wants to merge 11 commits into
lacraig2 wants to merge 11 commits into
Conversation
f1b6227 to
74f934b
Compare
098c882 to
d87ba87
Compare
d87ba87 to
388e2b4
Compare
48d2db9 to
7250a13
Compare
Use the new Jinja templating in auto-generated config patches so
architecture-dependent values are resolved from core.arch at load time instead
of being compiled into the generated YAML.
- Expose two arch-derived template variables, {{ arch_dir }} (generic static
subdir, e.g. intel64 -> x86_64) and {{ dylib_dir }} (prebuilt-dylib subdir,
e.g. aarch64 -> arm64), in templating.build_context, derived from the file's
core.arch (lazy/guarded imports so schema-gen contexts are unaffected).
- Add arch.get_dylib_subdir() as the single source of truth for the dylib
mapping; BasePatch.set_arch_info now uses it instead of an inline copy.
- BasePatch.generate emits {{ dylib_dir }} / {{ arch_dir }} in the /igloo/dylibs
and /igloo/utils host paths instead of baking the subdir in. core.arch and
core.kernel stay concrete (they are the source values).
- Tests for the derived vars, mapping parity, in-patch resolution, and that the
generator emits the placeholders. Docs updated.
7250a13 to
7a740a6
Compare
Capture which step faults on the ppc64 MTD read-back without aborting: - enable kernel print-fatal-signals (logs faulting comm + nip to console.log) - run the read-back as a standalone cat (capture exit code + size + hexdump) instead of inside a pipeline, before and after the write, repeated - only print the PASS markers when the read-back actually contains the new data, so working arches stay green while ppc64 records diagnostics Temporary diagnostic commit; revert once the fault is characterized.
Green run showed the standalone 'cat /dev/mtdN > file' read-back works on ppc64 -- the segfault is bound to the 'cat | strings | grep' pipeline. Add stage isolation (strings-on-file, cat|cat, cat|strings, full pipe) and gate PASS on the original pipeline so ppc64 fails and uploads artifacts with the per-step exit codes + the kernel fatal-signal line.
Warm-up reads made the fault vanish (commits 1-2 went green), so the fault needs the piped read to be the first read after the write. Restore the exact original single-read/write/single-piped-read order and only arm fault capture (print-fatal-signals, show_unhandled_signals, core_pattern -> shared, ulimit -c) so the failing ppc64 run uploads the faulting process+nip and a coredump.
The ppc64 native_mmap MTD read-back segfault is a host-side memory bug, not a test/data bug: a standalone 'cat /dev/mtdN > file' read-back returns correct data, but the original 'cat | strings | grep' pipeline deterministically segfaults a sibling process, and any test-script perturbation hides it. The MTD read callback delivered data via plugins.mem.write -> the PANDA virtual_memory_write_external fast path, which translates the guest kernel read buffer's virtual address host-side. On ppc64 that translation appears unreliable for some kbuf addresses and corrupts a co-running process. Add an opt-in write_bytes(prefer_portal=True) that skips the PANDA fast path and writes via the guest-executed portal path, and use it for the native MTD read callback. Test script restored to the exact original faulting sequence so CI verifies the fix without perturbing the heisenbug.
Restore the original PANDA virtual-memory write in the MTD read callback (the suspected-faulty path) and, after each write, read the same kernel VA back two ways: via PANDA (cpu_memory_rw_debug, same host translation) and via the portal (guest-executed, the guest's real mapping). Log any mismatch + the VA to results/issue831_probe.txt. - portal_rb != src (or panda_rb != portal_rb) => PANDA write hit a different physical page than the guest sees -> wrong-PA confirmed (host translation bug) - everything matches => write lands correctly; retract the corruption story (mechanism is timing/coherency, fix is perturbation) add read_bytes(prefer_portal=) to mirror write_bytes. A never-matching verifier condition forces artifact upload so the probe file is captured even if the heisenbug crash does not reproduce. Guest test script left byte-identical.
Revert the wrong-PA probe and the forced-failure verifier sentinel. The probe showed that whenever the PANDA write is observable (panda + portal read-back), panda==portal==src on every read incl. post-write read-backs at the same kernel VA class as the faulting run -> the 'wrong physical page' hypothesis is NOT supported; the per-read portal round-trip also perturbed the heisenbug away (mmap_test PASSed under the probe). Conclusion: the fault is in the host-side PANDA memory write (cpu_memory_rw_debug) used by the MTD read callback on ppc64, is timing/coherency-sensitive (any observation hides it), and writing via the guest-executed portal path deterministically avoids it. That portal-write fix is what remains here.
Single CI run, two contemporaneous data points (both use the PANDA write path): - native_mmap: revert fix -> PANDA write, full native-mmap machinery intact (qemu_mem aperture + dev/proc/anon mmap). BASELINE, expect red. - mtd_only (new minimal plugin+test): one OOP MTD device on the PANDA write path, NO qemu_mem aperture, NO dev/proc/anon mmap, NO micropython -- just the cat|strings|grep read-back. If native_mmap red but mtd_only green -> native-mmap/qemu_mem machinery is required to trigger the fault. If mtd_only also red -> it is the MTD callback's PANDA write itself, independent of native mmap. Temporary; restore fix after.
…olding The mtd_only isolation test was inconclusive: adding it to the suite shifted timing/layout enough to perturb the heisenbug away (baseline native_mmap with the PANDA write also passed). Differential testing via suite/script edits is confounded by the bug's fragility. Revert the experiment and keep the host-side portal-write fix; correct the code comments to the honest mechanism (timing-sensitive fault in the PANDA host write; not a confirmed wrong-PA).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Third PR in the config-reshape series. Targets
config-reshape-core(#818) — it builds on that branch's Jinja templating. Independent of the plugin PR (#819); they're siblings on #818.Puts the new templating to work in Penguin's auto-generated patches so architecture-dependent values are resolved from
core.archat load time instead of being compiled into the generated YAML. This directly addresses the original gripe that changing arch didn't update arch-derived values baked into patches.What changes
templating.build_context, derived from the file'score.arch:{{ arch_dir }}— generic static subdir (e.g.intel64→x86_64,powerpc64le→powerpc64){{ dylib_dir }}— prebuilt-dylib subdir (e.g.aarch64→arm64,powerpc64le→ppc64el)arch.get_dylib_subdir()is now the single source of truth for the dylib mapping;BasePatch.set_arch_infouses it instead of an inline copy (verified equivalent for every schema-supported arch).BasePatch.generateemits{{ dylib_dir }}/{{ arch_dir }}in the/igloo/dylibs/*and/igloo/utils/*host paths instead of baking the subdir in. The generatedbase.yamldefinescore.arch, so these resolve whenload_configrenders the patch.core.archandcore.kerneldeliberately stay concrete — they're the source values (templating them would be circular). Serial-device major/minor (conditional ints) and kernel-module version paths (padded-version mismatch with{{ kernel_version }}) are intentionally left baked; noted for a possible follow-up.Testing
tests/unit_tests/test_config.py: +10 tests (32 total, all passing on host) — derived-var values across 7 arches,get_dylib_subdirparity, in-patch resolution, and an assertion that the generator emits the placeholders.load_config(a patch that definescore.archand uses the derived vars) resolves to/igloo_static/dylibs/arm64/*and/igloo_static/aarch64/*as expected.