Orion O6 memory corruption and MTE tagging errors

I am seeing stability issues on the Orion O6 that look like intermittent memory, tag, or data corruption under real workloads.

Board / firmware / kernel context:

- Board: Radxa Orion O6, 32GB LPDDR5;
- Firmware: Radxa firmware 1.2.1, DMI reports `BIOS 1.2.1 2026-03-19T10:41:29+00:00`;
- Early firmware log reports LPDDR5, 4 channels, 2 ranks, 32768MB, with DDR operating points including `f1 750Mhz 16B_MODE` and `f2 3000Mhz BG_MODE`;
- Kernels tested include CIX-derived 6.18.x and 7.0.x trees;
- Toolchain: clang/LLVM 21.x in the failing kernel-build cases;
- Debug kernel: `CONFIG_KASAN_HW_TAGS`, synchronous MTE tag checks, `CONFIG_DMA_API_DEBUG`, page owner, page poisoning, `init_on_alloc=1`, `init_on_free=1`, strict IOMMU.

The symptoms are intermittent and usually disappear on retry/reboot:

- `podman` image commit/copy failures with digest mismatch, e.g. expected one SHA256 for a layer blob but got another;
- Kernel/toolchain build failures that cannot be explained by the source alone, e.g. `llvm-ar` segfaulting, `free(): invalid pointer` while compiling unrelated code, and BTF/pahole failures with many missing DWARF type references;
- One container image had a kernel header file corrupted with embedded `NUL`/control bytes and binary-looking data. Recreating the container from the same inputs reproduced the corrupted file because the corruption had already been committed into the image layer;
- A later BTF failure showed hundreds of messages like:

```text
die__process_unit: DW_TAG_member (...) not handled in a asm CU!
namespace__recode_dwarf_types: couldn't find ... type for ... (member)!
Segmentation fault
FAILED: load BTF from vmlinux.unstripped: Invalid argument
```

Rebooting and rerunning the same build using the same cached build tree succeeded, so this does not look like a deterministic source/config/toolchain problem.
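A minimal sketch of how the two at-rest symptoms above (a blob whose content no longer matches its digest, and a header file containing `NUL` bytes) could be checked for after the fact. It assumes the image has been exported to a directory where each blob file is named by the hex SHA-256 of its content, as in an OCI image layout's `blobs/sha256/` directory; the function names and the `.h` suffix filter are hypothetical choices for illustration, not part of any tool mentioned here.

```python
import hashlib
from pathlib import Path


def find_digest_mismatches(blob_dir):
    """Return (file_name, actual_hex) for blobs whose SHA-256 differs from their name.

    Assumption: every regular file in blob_dir is named by the hex SHA-256
    of its content (content-addressed storage).
    """
    mismatches = []
    for blob in sorted(Path(blob_dir).iterdir()):
        if not blob.is_file():
            continue
        digest = hashlib.sha256()
        with blob.open("rb") as f:
            # Hash in 1 MiB chunks so large layers are not read into memory at once.
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        if digest.hexdigest() != blob.name:
            mismatches.append((blob.name, digest.hexdigest()))
    return mismatches


def find_nul_files(root, suffixes=(".h",)):
    """Return paths under root with one of the given suffixes that contain a NUL byte."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            if b"\x00" in path.read_bytes():
                hits.append(path)
    return hits
```

Running both checks before and after a reboot would at least show whether a given corruption is persistent (already written to storage) or transient (only seen in memory during one run).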

Initially I suspected PCIe/DMA, so I tested with mitigations such as strict translated IOMMU domains, `pci=noats`, disabling the NVMe host memory buffer and SGL paths, and reducing other suspected PCIe/NVMe variables. None of that gave a clean explanation.

The most useful evidence came from KASAN HW-tags. We have now seen the same KASAN signature several times:

```text
BUG: KASAN: invalid-access in copy_page+0x48/0xc4
Write at addr f8fffc... by task kcompactd0/99
Pointer tag: [f8], memory tag: [f0]

copy_page
folio_mc_copy
__migrate_folio
filemap_migrate_folio
btrfs_migrate_folio
move_to_new_folio
migrate_pages_batch
migrate_pages
compact_zone
kcompactd
```
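To compare reports across saved logs, a small sketch like the following could pull out the signature fields. The regular expression is written only against the excerpt above (the `BUG: KASAN:` line, the `Write at addr ... by task ...` line, and the `Pointer tag` line); the function name and the returned field names are hypothetical, and other kernel versions may format these lines differently.

```python
import re

# Pattern derived from the excerpt above; it is an assumption that other
# reports follow the same three-line layout.
KASAN_RE = re.compile(
    r"BUG: KASAN: (?P<kind>\S+) in (?P<func>[\w.]+)\+\S+"
    r".*?(?P<access>Read|Write) at addr (?P<addr>\S+) by task (?P<task>\S+)/(?P<pid>\d+)\s"
    r".*?Pointer tag: \[(?P<ptr_tag>[0-9a-f]{2})\], memory tag: \[(?P<mem_tag>[0-9a-f]{2})\]",
    re.DOTALL,
)


def extract_kasan_signatures(log_text):
    """Return one dict per KASAN HW-tags report found in log_text."""
    reports = []
    for m in KASAN_RE.finditer(log_text):
        rec = m.groupdict()
        ptr, mem = int(rec["ptr_tag"], 16), int(rec["mem_tag"], 16)
        # XOR of the two tags shows which tag bits differ.
        rec["tag_xor"] = ptr ^ mem
        reports.append(rec)
    return reports
```

Grouping the results by `(kind, func, task, ptr_tag, mem_tag)` makes it easy to see whether a new report matches the signature already seen.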

The affected pages differ each time:

| Case | Workload | PFN | Bad offset | Last free path |
| --- | --- | --- | --- | --- |
| 1 | Podman/Btrfs-ish workload | 0x59706 | 0x5380 | Btrfs subvolume deletion |
| 2 | stressapptest running | 0x73c71 | 0x9b00 | kswapd reclaim |
| 3 | ordinary kernel build | 0x65c62 | 0xd280 | kswapd reclaim |

The memory tag dump is always the same pattern: an otherwise consistently `f8`-tagged page contains a single `f0` granule.

Example:

```text
Memory state around the buggy address:
  ... f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
> ... f8 f8 f8 f8 f8 f8 f8 f8 f0 f8 f8 f8 f8 f8 f8 f8
                               ^
  ... f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
```
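As a sanity check on dumps like this, a minimal sketch that finds the granule whose tag deviates from the majority tag in a row and reports how the two tags differ. It assumes each two-digit hex entry in a dump row is the tag of one 16-byte granule; the function names are hypothetical.

```python
from collections import Counter

GRANULE_BYTES = 16  # size of one MTE granule, as noted in this report


def parse_tag_row(row):
    """Extract the two-hex-digit tag values from one row of the tag dump."""
    tags = []
    for token in row.replace(">", " ").split():
        if len(token) != 2:
            continue  # skips the address/ellipsis prefix and the '^' marker
        try:
            tags.append(int(token, 16))
        except ValueError:
            pass
    return tags


def find_outliers(tags):
    """Return (index, byte_offset_in_row, tag, tag XOR majority_tag) for odd granules."""
    if not tags:
        return []
    majority, _ = Counter(tags).most_common(1)[0]
    return [
        (i, i * GRANULE_BYTES, tag, tag ^ majority)
        for i, tag in enumerate(tags)
        if tag != majority
    ]


row = "> ... f8 f8 f8 f8 f8 f8 f8 f8 f0 f8 f8 f8 f8 f8 f8 f8"
for index, offset, tag, xor in find_outliers(parse_tag_row(row)):
    print(f"granule {index} (row offset {offset:#x}): tag {tag:#04x}, xor {xor:#04x}")
```

For the example row this reports granule 8 at row offset 0x80, with tag `f0` differing from `f8` in a single bit (XOR `0x08`). That is arithmetic on the dump only; it does not by itself identify the cause.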

Current interpretation:

* `kcompactd` is probably the detector/victim, not necessarily the root cause;
* Btrfs is probably involved because the workload is on Btrfs-backed page cache, not necessarily because Btrfs is corrupting memory;
* This is not one fixed bad PFN, not one fixed offset, and not one fixed CPU;
* It does not look like a simple source-code bug;
* It does not look like a single bad driver stack, because the failure appears in generic memory compaction;
* It also does not look like ordinary userspace-visible byte corruption in the `stressapptest` buffer: `stressapptest -M 16384 -s 43200 -W --pause_delay 0` ran for 9473s at ~10031 MB/s and reported:

```text
Found 0 hardware incidents
Completed ... with 0 hardware incidents, 0 errors
Status: PASS - please verify no corrected errors
```

That said, `stressapptest` checks data bytes, not necessarily MTE allocation tags or tag storage. The repeated KASAN signature looks like an MTE allocation-tag inconsistency: a freshly allocated destination page for compaction is almost entirely tagged as expected, but one 16-byte granule has the wrong tag.

@amazingfate suggested trying lower DDR speed, noting that they had seen a similar issue on an early O6 and fixed it by building firmware with lower memory speed. Given the evidence above, my current best hypothesis is that the Orion O6 firmware DDR timings/training/V/F settings may be too aggressive for at least some boards, possibly affecting memory-controller/tag-storage/cacheline behavior rather than producing simple random bit flips.

The Minisforum MS-R1 is also Sky1/CP8180-based, and its firmware exposes Inline ECC capability. Can CIX please assist Radxa in producing an Orion O6 firmware which enables Inline ECC (if the hardware supports it) and exposes corrected/uncorrected memory-controller error counters to Linux?
