[PW_SID:1091438] riscv: add copy_mc_to_{kernel,user} support to enable MC fault tolerance #1883
copy_mc_to_kernel() and copy_mc_to_user() are architecture hooks that
let the kernel survive an uncorrectable hardware memory error (e.g. an
uncorrectable ECC fault) raised during the *source* read of a memory
copy. They are the cornerstone of graceful error recovery on every
path that has to duplicate a page whose contents might already be bad:
- COW (copy-on-write): wp_page_copy(), do_cow_fault(),
  copy_present_page() in fork, and __wp_page_copy_user() all
  route their per-page copy through copy_mc_user_highpage();
- hugetlb / THP: copy_user_gigantic_page(), copy_subpage(),
  __collapse_huge_page_copy() and collapse_file() rely on the
  same hook (copy_mc_user_highpage / copy_mc_highpage) to clone
  or collapse 2 MiB / 1 GiB folios without tearing the kernel
  down on a single bad cacheline;
- page reclaim / migration / KSM: folio_mc_copy(),
  ksm_might_need_to_copy(), compaction and NUMA balancing;
- file I/O on byte-addressable memory (DAX, CXL.mem, and the
  iov_iter MC helpers) and core-dump writeout.
When any of these callers hits a hardware error during the load, the
copy_mc_* helper returns a non-zero byte count instead of oopsing the
kernel. The caller can then react in whatever way fits the context:
propagate -EFAULT back to the originating syscall, isolate the poisoned
page through memory_failure_queue(), retry on a clean replica, or as
a last resort kill the owning task. The system as a whole keeps
running.
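The caller-side pattern described above can be sketched in plain
user-space C. This is only an illustration under stated assumptions:
sim_copy_mc(), its poison_at parameter and dup_buffer() are
hypothetical names, and the "fault" is simulated rather than raised
by hardware.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for a copy_mc_* helper: copies n bytes but
 * pretends the source byte at offset poison_at is poisoned. Returns
 * the number of bytes NOT copied (0 on success).
 */
static unsigned long sim_copy_mc(void *dst, const void *src, size_t n,
                                 size_t poison_at)
{
    size_t ok = poison_at < n ? poison_at : n;

    memcpy(dst, src, ok);   /* bytes before the simulated fault */
    return n - ok;          /* remaining byte count */
}

/* Caller: turn a non-zero return into -EFAULT instead of crashing. */
static int dup_buffer(void *dst, const void *src, size_t n, size_t poison_at)
{
    if (sim_copy_mc(dst, src, n, poison_at))
        return -EFAULT;     /* propagate to the originating caller */
    return 0;
}
```

A real caller would pick its reaction (isolate the page, retry, kill
the task) based on context; returning -EFAULT is just the simplest of
the options listed above.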
This is also why a new copy routine is required rather than reusing
the existing memcpy(). The C contract for memcpy() is
    void *memcpy(void *dst, const void *src, size_t n);
it returns dst unconditionally and has no out-of-band way to tell
the caller whether the copy actually succeeded. MC-aware callers
need exactly that signal - a single "did the hardware raise an
exception during this copy or not" bit - so the API has to be
    unsigned long memcpy_mc(void *dst, const void *src, size_t n);
where the return value serves as the error indicator (0 on success,
non-zero when a load faulted). The non-zero value happens to be the
remaining byte count, which is a useful implementation detail for
optional follow-up work such as poisoning the exact sub-range. The
essential point is that memcpy()'s void-pointer return gives the
caller no way to tell a successful copy from a failed one, so a new
function is unavoidable.
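To make the return-value contract concrete, here is a small user-space
model. It is a sketch only: model_memcpy_mc(), poisoned_offset and
first_uncopied() are hypothetical names, the fault is simulated, and
the byte-exact remainder is an assumption of the model rather than a
guarantee of the real assembly routine.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical: source offset treated as poisoned; (size_t)-1 means none. */
static size_t poisoned_offset = (size_t)-1;

/* Model of the memcpy_mc() contract: 0 on success, else bytes left uncopied. */
static unsigned long model_memcpy_mc(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t i;

    for (i = 0; i < n; i++) {
        if (i == poisoned_offset)   /* simulated faulting load */
            return n - i;
        d[i] = s[i];
    }
    return 0;
}

/* Offset of the first byte that was not copied, derived from the return. */
static size_t first_uncopied(size_t n, unsigned long ret)
{
    return n - ret;
}
```

With n = 8 and a simulated fault at offset 5, the model returns 3 and
first_uncopied() yields 5, whereas memcpy() would have returned dst in
both the good and the bad case.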
RISC-V previously did not provide either of the copy_mc_* hooks, so
the generic fallback in <linux/uaccess.h> was used:
    static inline unsigned long
    copy_mc_to_kernel(void *dst, const void *src, size_t cnt)
    {
            memcpy(dst, src, cnt);
            return 0;
    }
That fallback has no exception-table entry on the load side, so an
access to poisoned memory (reported through the RAS/AIA path) takes
the kernel down just like any other unhandled fault - defeating the
whole point of the copy_mc_* API and leaving every COW / hugetlb /
THP-collapse / migration path above exposed on RISC-V. A native
implementation that actually stops on the faulting load and signals
the error through its return value is therefore required.
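How an architecture replaces such a fallback can be sketched in
user-space C, assuming the usual convention that the generic
definition sits behind an #ifndef guard and the architecture
advertises its own version with a same-named #define. The sim_poisoned
flag is hypothetical and merely simulates a bad source; a native
implementation would detect the fault through the exception table
instead.

```c
#include <stddef.h>
#include <string.h>

/* --- "arch" header (hypothetical): native hook, advertised by a #define --- */
static int sim_poisoned;    /* hypothetical: non-zero simulates a bad source */

static inline unsigned long
copy_mc_to_kernel(void *dst, const void *src, size_t cnt)
{
    if (sim_poisoned)
        return cnt;         /* nothing copied: all bytes remain */
    memcpy(dst, src, cnt);
    return 0;
}
#define copy_mc_to_kernel copy_mc_to_kernel

/* --- "generic" header: fallback compiled only when no arch hook exists --- */
#ifndef copy_mc_to_kernel
static inline unsigned long
copy_mc_to_kernel(void *dst, const void *src, size_t cnt)
{
    memcpy(dst, src, cnt);
    return 0;               /* cannot report a fault */
}
#endif
```

Because the macro is defined before the generic section is reached,
only the "arch" version is compiled, and callers pick it up without
any change on their side.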
A word on the exception-table entry type used by this patch: x86 and
arm64 both carry dedicated "MC-safe" flavours in their extable
infrastructure. x86 defines EX_TYPE_DEFAULT_MCE_SAFE and
EX_TYPE_FAULT_MCE_SAFE (see arch/x86/include/asm/extable_fixup_types.h
and the fixups in arch/x86/lib/copy_mc_64.S), while arm64 ends up
reusing EX_TYPE_KACCESS_ERR_ZERO for the same purpose. Tempting as
it is to mirror that and introduce a new RISC-V-specific type, it
turns out to buy very little in practice: inspecting
arch/x86/mm/extable.c shows that EX_TYPE_DEFAULT_MCE_SAFE shares its
handler with EX_TYPE_DEFAULT, and EX_TYPE_FAULT_MCE_SAFE shares its
handler with EX_TYPE_FAULT - in every case the fixup simply redirects
PC to the fixup label and lets the caller return. The tag mostly
serves as documentation. Because the fix-up behaviour we need is
identical to that of a plain extable entry, this patch keeps things
simple and uses the existing _asm_extable helper rather than growing
a new EX_TYPE_* constant; if a future RAS integration ever needs to
discriminate MC-safe sites from ordinary ones, the tag can be added
later without touching any call site.
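The observation about shared handlers can be modelled with a small
dispatch function. This is a simplified user-space sketch, not the x86
code: every sim_* / SIM_* name is hypothetical, and fault_info is only
a stand-in for whatever extra state a "fault" flavoured handler
records before redirecting PC.

```c
#include <stddef.h>

/* Hypothetical entry types, named after the x86 ones for illustration. */
enum sim_ex_type {
    SIM_EX_TYPE_DEFAULT,
    SIM_EX_TYPE_DEFAULT_MCE_SAFE,
    SIM_EX_TYPE_FAULT,
    SIM_EX_TYPE_FAULT_MCE_SAFE,
};

struct sim_extable_entry {
    unsigned long insn;     /* address of the instruction that may fault */
    unsigned long fixup;    /* address to resume at */
    enum sim_ex_type type;
};

struct sim_regs {
    unsigned long pc;
    unsigned long fault_info;
};

static int sim_handler_default(const struct sim_extable_entry *e,
                               struct sim_regs *regs)
{
    regs->pc = e->fixup;    /* redirect PC to the fixup label */
    return 1;
}

static int sim_handler_fault(const struct sim_extable_entry *e,
                             struct sim_regs *regs, unsigned long trapnr)
{
    regs->fault_info = trapnr;
    return sim_handler_default(e, regs);
}

/* The MCE_SAFE tags fall through to the same handlers as the plain ones. */
static int sim_fixup_exception(const struct sim_extable_entry *e,
                               struct sim_regs *regs, unsigned long trapnr)
{
    switch (e->type) {
    case SIM_EX_TYPE_DEFAULT:
    case SIM_EX_TYPE_DEFAULT_MCE_SAFE:
        return sim_handler_default(e, regs);
    case SIM_EX_TYPE_FAULT:
    case SIM_EX_TYPE_FAULT_MCE_SAFE:
        return sim_handler_fault(e, regs, trapnr);
    }
    return 0;
}
```

In this model a DEFAULT entry and a DEFAULT_MCE_SAFE entry leave the
register state identical after fixup, which is the property the
paragraph above relies on when it reuses a plain extable entry.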
Implement it by factoring the existing hand-written memcpy into a
shared template and reusing it for the MC variant:
- memcpy_template.S
  The whole body of the original memcpy.S is moved here verbatim,
  with every load/store expressed through six parametric macros:
  LOAD_B / STORE_B, LOAD_W / STORE_W and LOAD_REG / STORE_REG.
  The template is #include'd into a SYM_FUNC_START/END wrapper by
  the caller, which also supplies the macro definitions.
- memcpy.S
  Now only defines plain lb/sb, lw/sw and REG_L/REG_S macros and
  includes the template. The generated code for __memcpy() is
  byte-for-byte equivalent to the previous open-coded version, so
  the hot path for regular kernel memcpy is unchanged.
- memcpy_mc.S (new)
  Defines the same macros, but every *load* is wrapped in a "fixup"
  macro that emits an _asm_extable entry pointing at a local label
  6:. On the happy path __memcpy_mc() returns 0; on a hardware
  error the exception handler jumps to label 6:, which returns a
  non-zero value (the still-outstanding byte count held in a2) to
  flag the failure to the caller.
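The template idea can be illustrated in C, with the preprocessor
playing the role of the assembler macros. This is a loose analogy
under stated assumptions, not the patch itself: DEFINE_COPY,
PLAIN_LOAD_B, MC_LOAD_B, sim_poison and both generated functions are
hypothetical, only a byte-wide load is parameterised, and for brevity
both flavours share one return type (the real __memcpy() returns dst).

```c
#include <stddef.h>

/* Hypothetical: address of a source byte treated as poisoned (NULL = none). */
static const unsigned char *sim_poison;

/*
 * "Template": one copy loop, parameterised by how a byte is loaded.
 * LOAD_B(v, p) must evaluate to non-zero on success and 0 on a fault.
 */
#define DEFINE_COPY(name, LOAD_B)                                   \
static unsigned long name(void *dst, const void *src, size_t n)    \
{                                                                   \
    unsigned char *d = dst;                                         \
    const unsigned char *s = src;                                   \
    unsigned char v;                                                \
                                                                    \
    for (; n; n--) {                                                \
        if (!LOAD_B(v, s))                                          \
            return n;   /* like label 6: remaining byte count */   \
        *d++ = v;                                                   \
        s++;                                                        \
    }                                                               \
    return 0;                                                       \
}

/* Plain flavour: the load cannot report a fault. */
#define PLAIN_LOAD_B(v, p)  ((v) = *(p), 1)
/* MC flavour: the load is wrapped and fails on the poisoned address. */
#define MC_LOAD_B(v, p)     ((p) == sim_poison ? 0 : ((v) = *(p), 1))

DEFINE_COPY(sim_memcpy, PLAIN_LOAD_B)
DEFINE_COPY(sim_memcpy_mc, MC_LOAD_B)
```

Both functions come from the same loop body, so the plain variant pays
nothing for the existence of the MC variant; only the load macro
differs, which mirrors the split between memcpy.S and memcpy_mc.S.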
Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
Signed-off-by: Linux RISC-V bot <linux.riscv.bot@gmail.com>
Patch 1: "riscv: add copy_mc_to_{kernel,user} support to enable MC fault tolerance"
2d4fcdd to cd9d421
PR for series 1091438 applied to workflow__riscv__fixes
Name: riscv: add copy_mc_to_{kernel,user} support to enable MC fault tolerance
URL: https://patchwork.kernel.org/project/linux-riscv/list/?series=1091438
Version: 1