I am seeing stability issues on the Orion O6 that look like intermittent memory/tag/data corruption under real workloads.
Board / firmware / kernel context:
- Board: Radxa Orion O6, 32GB LPDDR5.
- Firmware: Radxa firmware 1.2.1, DMI reports BIOS 1.2.1 2026-03-19T10:41:29+00:00;
- Early firmware log reports LPDDR5, 4 channels, 2 ranks, 32768MB, with DDR operating points including f1 750Mhz 16B_MODE and f2 3000Mhz BG_MODE;
- Kernels tested include CIX-derived 6.18.x and 7.0.x trees;
- Toolchain: clang/LLVM 21.x in the failing kernel-build cases;
- Debug kernel: CONFIG_KASAN_HW_TAGS, synchronous MTE tag checks, CONFIG_DMA_API_DEBUG, page owner, page poisoning, init_on_alloc=1, init_on_free=1, strict IOMMU.
The symptoms are intermittent and usually disappear on retry/reboot:
- podman image commit/copy failures with digest mismatch, e.g. expected one SHA256 for a layer blob but got another;
- Kernel/toolchain build failures which look impossible from source alone, e.g. llvm-ar segfaulting, free(): invalid pointer while compiling unrelated code, and BTF/pahole failures with many missing DWARF type references;
- One container image had a kernel header file corrupted with embedded NUL/control bytes and binary-looking data. Recreating the container from the same inputs reproduced the corrupted file because the corruption had already been committed into the image layer;
- A later BTF failure showed hundreds of messages like:
die__process_unit: DW_TAG_member (...) not handled in a asm CU!
namespace__recode_dwarf_types: couldn't find ... type for ... (member)!
Segmentation fault
FAILED: load BTF from vmlinux.unstripped: Invalid argument
Rebooting and rerunning the same build using the same cached build tree succeeded, so this does not look like a deterministic source/config/toolchain problem.
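For the corrupted-header case, a scan along the following lines could flag files with embedded NUL/control bytes before they get committed into an image layer. This is only a rough sketch: the function names, the `.h` suffix filter, and the choice of which control bytes count as suspicious are assumptions made for illustration, and it does not try to detect "binary-looking" bytes above 0x7F, which can be legitimate UTF-8.

```python
import os
import sys

# Control bytes tolerated in source text: tab, LF, form feed, CR.
# Everything else below 0x20, plus DEL (0x7F), is treated as suspicious.
# This allow-list is an assumption, not a standard.
ALLOWED_CONTROL = frozenset({0x09, 0x0A, 0x0C, 0x0D})
SUSPICIOUS = frozenset(b for b in range(0x20) if b not in ALLOWED_CONTROL) | {0x7F}


def find_suspicious_bytes(data, limit=8):
    """Return up to `limit` (offset, byte value) pairs for suspicious bytes."""
    hits = []
    for offset, value in enumerate(data):
        if value in SUSPICIOUS:
            hits.append((offset, value))
            if len(hits) >= limit:
                break
    return hits


def scan_tree(root, suffixes=(".h",)):
    """Yield (path, hits) for every matching file under `root` with hits."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.endswith(suffixes):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as fh:
                    hits = find_suspicious_bytes(fh.read())
            except OSError:
                continue  # unreadable file: skip rather than abort the scan
            if hits:
                yield path, hits


if __name__ == "__main__" and len(sys.argv) > 1:
    for path, hits in scan_tree(sys.argv[1]):
        first_offset, first_value = hits[0]
        print(f"{path}: first suspicious byte 0x{first_value:02x} at offset {first_offset}")
```

Running it against an unpacked header tree on a freshly booted system and again after a failing build would at least show whether the corruption is on disk or only in the page cache.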
Initially I suspected PCIe/DMA, so tested with mitigations such as strict translated IOMMU domains, pci=noats, disabling NVMe host memory buffer/SGL paths, and reducing suspected PCIe/NVMe variables. That did not give a clean explanation.
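When juggling this many mitigations across reboots, it is easy to boot a kernel without one of them. A small check of the running command line can rule that out. The sketch below is hypothetical: `pci=noats`, `init_on_alloc=1` and `init_on_free=1` come from the report above, while `iommu.strict=1` and `iommu.passthrough=0` are my assumption of how "strict translated IOMMU domains" would be expressed on the command line.

```python
from pathlib import Path

# Expected parameters. The two iommu.* entries are assumptions (see lead-in).
EXPECTED = {
    "pci": "noats",
    "iommu.strict": "1",
    "iommu.passthrough": "0",
    "init_on_alloc": "1",
    "init_on_free": "1",
}


def parse_cmdline(cmdline):
    """Map each key to the list of values seen; bare flags get ''.

    Quoted values containing spaces are not handled in this sketch.
    """
    params = {}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        params.setdefault(key, []).append(value)
    return params


def missing_params(cmdline, expected=EXPECTED):
    """Return 'key=value' strings from `expected` that are not present."""
    params = parse_cmdline(cmdline)
    missing = []
    for key, wanted in expected.items():
        # Options such as pci= take comma-separated lists, so split them.
        seen = {part for value in params.get(key, []) for part in value.split(",")}
        if wanted not in seen:
            missing.append(f"{key}={wanted}")
    return missing


if __name__ == "__main__":
    proc_cmdline = Path("/proc/cmdline")
    if proc_cmdline.exists():
        print("missing:", missing_params(proc_cmdline.read_text()) or "none")
```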
The most useful evidence came from KASAN HW-tags. We have now seen the same KASAN signature several times:
BUG: KASAN: invalid-access in copy_page+0x48/0xc4
Write at addr f8fffc... by task kcompactd0/99
Pointer tag: [f8], memory tag: [f0]
copy_page
folio_mc_copy
__migrate_folio
filemap_migrate_folio
btrfs_migrate_folio
move_to_new_folio
migrate_pages_batch
migrate_pages
compact_zone
kcompactd
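To compare reports across runs, the tag line can be pulled out and reduced to a few numbers. The sketch below is illustrative only: the helper names are made up, and it assumes the tag line has exactly the format shown above. It relies on the fact that MTE tags are 4 bits wide, so only the low nibble of each printed byte is the hardware tag.

```python
import re

# Matches the tag line as it appears in the report quoted above.
TAG_RE = re.compile(r"Pointer tag: \[([0-9a-f]{2})\], memory tag: \[([0-9a-f]{2})\]")


def parse_tags(report):
    """Return (pointer_tag, memory_tag) as ints, or None if no tag line."""
    match = TAG_RE.search(report)
    if match is None:
        return None
    return int(match.group(1), 16), int(match.group(2), 16)


def describe_mismatch(pointer_tag, memory_tag):
    """Summarize how the two tag bytes differ."""
    diff = pointer_tag ^ memory_tag
    return {
        "xor": diff,
        "bits_differing": bin(diff).count("1"),
        "pointer_mte_tag": pointer_tag & 0xF,  # low nibble = 4-bit MTE tag
        "memory_mte_tag": memory_tag & 0xF,
    }


if __name__ == "__main__":
    sample = "Pointer tag: [f8], memory tag: [f0]"
    tags = parse_tags(sample)
    if tags is not None:
        print(describe_mismatch(*tags))
```

For the signature above this reports a pointer tag of 8 against a memory tag of 0. On its own that does not distinguish between a tag that was never written and one that was lost afterwards.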
The affected pages differ each time:
| Case | Workload | PFN | Bad offset | Last free path |
|---|---|---|---|---|
| 1 | Podman/Btrfs-ish workload | 0x59706 | 0x5380 | Btrfs subvolume deletion |
| 2 | stressapptest running | 0x73c71 | 0x9b00 | kswapd reclaim |
| 3 | ordinary kernel build | 0x65c62 | 0xd280 | kswapd reclaim |
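As more cases accumulate, it may be worth checking whether the bad offsets share any structure (granule index, alignment). The following is a minimal sketch using the three cases above. It treats "bad offset" as an opaque byte offset and makes no assumption about page size; the function names are hypothetical.

```python
MTE_GRANULE = 16  # bytes covered by one MTE allocation tag

# (PFN, bad offset) per case, copied from the table above.
CASES = {
    1: (0x59706, 0x5380),
    2: (0x73C71, 0x9B00),
    3: (0x65C62, 0xD280),
}


def alignment(value):
    """Largest power of two dividing `value` (0 for value == 0)."""
    return value & -value


def summarize(cases=CASES):
    """Return per-case granule index and offset alignment."""
    rows = []
    for case, (pfn, offset) in sorted(cases.items()):
        rows.append({
            "case": case,
            "pfn": pfn,
            "granule_index": offset // MTE_GRANULE,
            "offset_alignment": alignment(offset),
        })
    return rows


if __name__ == "__main__":
    for row in summarize():
        print(f"case {row['case']}: pfn=0x{row['pfn']:x} "
              f"granule={row['granule_index']} align={row['offset_alignment']}")
```

With only three samples this proves little, but a table like this makes it easy to spot a pattern if one emerges.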
The memory tag dump is always the same pattern: an otherwise consistently f8-tagged page contains a single f0 granule.
Example:
Memory state around the buggy address:
  ... f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
> ... f8 f8 f8 f8 f8 f8 f8 f8 f0 f8 f8 f8 f8 f8 f8 f8
                              ^
  ... f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
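Checking each dump by eye gets tedious, so a small parser can confirm the "single odd granule" pattern automatically. This is a rough sketch with made-up helper names. It assumes each dump row carries its tags as whitespace-separated two-digit hex tokens, and it ignores lines with fewer than four such tokens (the header and the caret line).

```python
import re
from collections import Counter

HEX_BYTE = re.compile(r"[0-9a-f]{2}")
MIN_TAGS_PER_ROW = 4  # assumption: anything shorter is not a tag row


def parse_rows(dump):
    """Extract the tag tokens of each row from a KASAN memory-state dump."""
    rows = []
    for line in dump.splitlines():
        tokens = line.replace(">", " ").split()
        tags = [tok for tok in tokens if HEX_BYTE.fullmatch(tok)]
        if len(tags) >= MIN_TAGS_PER_ROW:
            rows.append(tags)
    return rows


def find_outliers(dump):
    """Return (majority_tag, [(row, column, tag), ...]) for a dump."""
    rows = parse_rows(dump)
    counts = Counter(tag for row in rows for tag in row)
    if not counts:
        return None, []
    majority = counts.most_common(1)[0][0]
    outliers = [
        (r, c, tag)
        for r, row in enumerate(rows)
        for c, tag in enumerate(row)
        if tag != majority
    ]
    return majority, outliers


if __name__ == "__main__":
    clean = "... " + " ".join(["f8"] * 16)
    bad = "> ... " + " ".join(["f8"] * 8 + ["f0"] + ["f8"] * 7)
    dump = "\n".join(["Memory state around the buggy address:", clean, bad, "^", clean])
    print(find_outliers(dump))
```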
Current interpretation:
- kcompactd is probably the detector/victim, not necessarily the root cause;
- Btrfs is probably involved because the workload is on Btrfs-backed page cache, not necessarily because Btrfs is corrupting memory;
- This is not one fixed bad PFN, not one fixed offset, and not one fixed CPU;
- It does not look like a simple source-code bug;
- It does not look like a single bad driver stack, because the failure appears in generic memory compaction;
- It also does not look like ordinary userspace-visible byte corruption in the stressapptest buffer: stressapptest -M 16384 -s 43200 -W --pause_delay 0 ran for 9473s at ~10031 MB/s and reported:
Found 0 hardware incidents
Completed ... with 0 hardware incidents, 0 errors
Status: PASS - please verify no corrected errors
That said, stressapptest checks data bytes, not necessarily MTE allocation tags or tag storage. The repeated KASAN signature looks like an MTE allocation-tag inconsistency: a freshly allocated destination page for compaction is almost entirely tagged as expected, but one 16-byte granule has the wrong tag.
@amazingfate suggested trying lower DDR speed, noting that they had seen a similar issue on an early O6 and fixed it by building firmware with lower memory speed. Given the evidence above, my current best hypothesis is that the Orion O6 firmware DDR timings/training/V/F settings may be too aggressive for at least some boards, possibly affecting memory-controller/tag-storage/cacheline behavior rather than producing simple random bit flips.
The Minisforum MS-R1 is also Sky1/CP8180-based, and its firmware exposes Inline ECC capability. Could CIX please help Radxa produce Orion O6 firmware that enables Inline ECC (if the hardware supports it) and exposes corrected/uncorrected memory-controller error counters to Linux?
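If such counters were exposed through the kernel's standard EDAC sysfs interface, reading them would be straightforward. The sketch below assumes exactly that. Whether a Sky1 EDAC driver exists or would be provided is not known to me, so on current firmware it is expected to find nothing. The function name is hypothetical; the `ce_count`/`ue_count` files are the generic EDAC memory-controller attributes.

```python
from pathlib import Path

# Generic EDAC memory-controller location in sysfs.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: {'ce_count': int|None, 'ue_count': int|None}}.

    An empty dict means no EDAC memory controllers are registered.
    """
    counts = {}
    root = Path(root)
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None  # attribute missing or unreadable
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    counts = read_edac_counts()
    if not counts:
        print("no EDAC memory controllers found")
    for mc, entry in counts.items():
        print(f"{mc}: corrected={entry['ce_count']} uncorrected={entry['ue_count']}")
```

Logging these before and after each failing workload would give a direct answer to whether the tag mismatches coincide with corrected memory errors.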