Alex for next sbi 3.0 rebase 6.15 rc6 manual#458
Closed
AlexGhiti wants to merge 100 commits into
Closed
Conversation
Apparently sec_base doesn't mean relocated symbol value, which seems a copy-pasting error in the comment. Assigned with the address of section indexed by sym->st_shndx, it should represent base address of the relevant section. Let's fix the comment to avoid possible confusion. Fixes: 838b3e2 ("RISC-V: Load purgatory in kexec_file") Signed-off-by: Yao Zi <ziyao@disroot.org> Reviewed-by: Björn Töpel <bjorn@rivosinc.com> Link: https://lore.kernel.org/r/20250326073450.57648-2-ziyao@disroot.org Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Add the necessary page table functions to deal with PUD THP, this enables the use of PUD pfnmap. Link: https://lore.kernel.org/r/20250321123954.225097-1-alexghiti@rivosinc.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
When threads/tasks are switched we need to ensure the old execution's SR_SUM state is saved and the new thread has the old SR_SUM state restored. The issue was seen under heavy load especially with the syz-stress tool running, with crashes as follows in schedule_tail: Unable to handle kernel access to user memory without uaccess routines at virtual address 000000002749f0d0 Oops [#1] Modules linked in: CPU: 1 PID: 4875 Comm: syz-executor.0 Not tainted 5.12.0-rc2-syzkaller-00467-g0d7588ab9ef9 #0 Hardware name: riscv-virtio,qemu (DT) epc : schedule_tail+0x72/0xb2 kernel/sched/core.c:4264 ra : task_pid_vnr include/linux/sched.h:1421 [inline] ra : schedule_tail+0x70/0xb2 kernel/sched/core.c:4264 epc : ffffffe00008c8b0 ra : ffffffe00008c8ae sp : ffffffe025d17ec0 gp : ffffffe005d25378 tp : ffffffe00f0d0000 t0 : 0000000000000000 t1 : 0000000000000001 t2 : 00000000000f4240 s0 : ffffffe025d17ee0 s1 : 000000002749f0d0 a0 : 000000000000002a a1 : 0000000000000003 a2 : 1ffffffc0cfac500 a3 : ffffffe0000c80cc a4 : 5ae9db91c19bbe00 a5 : 0000000000000000 a6 : 0000000000f00000 a7 : ffffffe000082eba s2 : 0000000000040000 s3 : ffffffe00eef96c0 s4 : ffffffe022c77fe0 s5 : 0000000000004000 s6 : ffffffe067d74e00 s7 : ffffffe067d74850 s8 : ffffffe067d73e18 s9 : ffffffe067d74e00 s10: ffffffe00eef96e8 s11: 000000ae6cdf8368 t3 : 5ae9db91c19bbe00 t4 : ffffffc4043cafb2 t5 : ffffffc4043cafba t6 : 0000000000040000 status: 0000000000000120 badaddr: 000000002749f0d0 cause: 000000000000000f Call Trace: [<ffffffe00008c8b0>] schedule_tail+0x72/0xb2 kernel/sched/core.c:4264 [<ffffffe000005570>] ret_from_exception+0x0/0x14 Dumping ftrace buffer: (ftrace buffer empty) ---[ end trace b5f8f9231dc87dda ]--- The issue comes from the put_user() in schedule_tail (kernel/sched/core.c) doing the following: asmlinkage __visible void schedule_tail(struct task_struct *prev) { ... if (current->set_child_tid) put_user(task_pid_vnr(current), current->set_child_tid); ... } the put_user() macro causes the code sequence to come out as follows: 1: __enable_user_access() 2: reg = task_pid_vnr(current); 3: *current->set_child_tid = reg; 4: __disable_user_access() The problem is that we may have a sleeping function as argument which could clear SR_SUM causing the panic above. This was fixed by evaluating the argument of the put_user() macro outside the user-enabled section in commit 285a76b ("riscv: evaluate put_user() arg before enabling user access")" In order for riscv to take advantage of unsafe_get/put_XXX() macros and to avoid the same issue we had with put_user() and sleeping functions we must ensure code flow can go through switch_to() from within a region of code with SR_SUM enabled and come back with SR_SUM still enabled. This patch addresses the problem allowing future work to enable full use of unsafe_get/put_XXX() macros without needing to take a CSR bit flip cost on every access. Make switch_to() save and restore SR_SUM. Reported-by: syzbot+e74b94fe601ab9552d69@syzkaller.appspotmail.com Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk> Signed-off-by: Cyril Bur <cyrilbur@tenstorrent.com> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Link: https://lore.kernel.org/r/20250410070526.3160847-2-cyrilbur@tenstorrent.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Currently, when a function like strncpy_from_user() is called, the userspace access protection is disabled and enabled for every word read. By implementing user_access_begin() and families, the protection is disabled at the beginning of the copy and enabled at the end. The __inttype macro is borrowed from x86 implementation. Signed-off-by: Jisheng Zhang <jszhang@kernel.org> Signed-off-by: Cyril Bur <cyrilbur@tenstorrent.com> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Link: https://lore.kernel.org/r/20250410070526.3160847-3-cyrilbur@tenstorrent.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Putting ptr in the inputs as opposed to output may seem incorrect but this is done for a few reasons: - Not having it in the output permits the use of asm goto in a subsequent patch. There are bugs in gcc [1] which would otherwise prevent it. - Since the output memory is userspace there isn't any real benefit from telling the compiler about the memory clobber. - x86, arm and powerpc all use this technique. Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113921 # 1 Signed-off-by: Jisheng Zhang <jszhang@kernel.org> [Cyril Bur: Rewritten commit message] Signed-off-by: Cyril Bur <cyrilbur@tenstorrent.com> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Reviewed-by: Charlie Jenkins <charlie@rivosinc.com> Tested-by: Charlie Jenkins <charlie@rivosinc.com> Link: https://lore.kernel.org/r/20250410070526.3160847-4-cyrilbur@tenstorrent.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
With 'asm goto' we don't need to test the error etc, the exception just jumps to the error handling directly. Because there are no output clobbers which could trigger gcc bugs [1] the use of asm_goto_output() macro is not necessary here. Not using asm_goto_output() is desirable as the generated output asm will be cleaner. Use of the volatile keyword is redundant as per gcc 14.2.0 manual section 6.48.2.7 Goto Labels: > Also note that an asm goto statement is always implicitly considered volatile. Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113921 # 1 Signed-off-by: Jisheng Zhang <jszhang@kernel.org> [Cyril Bur: Rewritten commit message] Signed-off-by: Cyril Bur <cyrilbur@tenstorrent.com> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Link: https://lore.kernel.org/r/20250410070526.3160847-5-cyrilbur@tenstorrent.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
With 'asm goto' we don't need to test the error etc, the exception just jumps to the error handling directly. Unlike put_user(), get_user() must work around GCC bugs [1] when using output clobbers in an asm goto statement. Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113921 # 1 Signed-off-by: Jisheng Zhang <jszhang@kernel.org> [Cyril Bur: Rewritten commit message] Signed-off-by: Cyril Bur <cyrilbur@tenstorrent.com> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Link: https://lore.kernel.org/r/20250410070526.3160847-6-cyrilbur@tenstorrent.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Some caller-saved registers which are not defined as function arguments in the ABI can still be passed as arguments when the kernel is compiled with Clang. As a result, we must save and restore those registers to prevent ftrace from clobbering them. - [1]: https://reviews.llvm.org/D68559 Reported-by: Evgenii Shatokhin <e.shatokhin@yadro.com> Closes: https://lore.kernel.org/linux-riscv/7e7c7914-445d-426d-89a0-59a9199c45b1@yadro.com/ Fixes: 7caa976 ("ftrace: riscv: move from REGS to ARGS") Acked-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Björn Töpel <bjorn@rivosinc.com> Signed-off-by: Andy Chiu <andy.chiu@sifive.com> Tested-by: Björn Töpel <bjorn@rivosinc.com> Link: https://lore.kernel.org/r/20250407180838.42877-1-andybnac@gmail.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
DYNAMIC_FTRACE selects DYNAMIC_FTRACE_WITH_ARGS and mcount-dyn.S in riscv, so we can remove ifdef jargons of WITH_ARG when it is known that DYNAMIC_FTRACE is true. Signed-off-by: Andy Chiu <andybnac@gmail.com> Link: https://lore.kernel.org/r/20250407180838.42877-2-andybnac@gmail.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
We are changing ftrace code patching in order to remove dependency from stop_machine() and enable kernel preemption. This requires us to align functions entry at a 4-B align address. However, -falign-functions on older versions of GCC alone was not strong enoungh to align all functions. In fact, cold functions are not aligned after turning on optimizations. We consider this is a bug in GCC and turn off guess-branch-probility as a workaround to align all functions. GCC bug id: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88345 The option -fmin-function-alignment is able to align all functions properly on newer versions of gcc. So, we add a cc-option to test if the toolchain supports it. Suggested-by: Evgenii Shatokhin <e.shatokhin@yadro.com> Signed-off-by: Andy Chiu <andy.chiu@sifive.com> Reviewed-by: Björn Töpel <bjorn@rivosinc.com> Link: https://lore.kernel.org/r/20250407180838.42877-3-andybnac@gmail.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
The following ftrace patch for riscv uses a data store to update ftrace function. Therefore, a romote fence is required to order it against function_trace_op updates. The mechanism is similar to the fence between function_trace_op and update_ftrace_func in the generic ftrace, so we leverage the same ftrace_sync_ipi function. [ alex: Fix build warning when !CONFIG_DYNAMIC_FTRACE ] Signed-off-by: Andy Chiu <andybnac@gmail.com> Link: https://lore.kernel.org/r/20250407180838.42877-4-andybnac@gmail.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
We use an AUIPC+JALR pair to jump into a ftrace trampoline. Since instruction fetch can break down to 4 byte at a time, it is impossible to update two instructions without a race. In order to mitigate it, we initialize the patchable entry to AUIPC + NOP4. Then, the run-time code patching can change NOP4 to JALR to eable/disable ftrcae from a function. This limits the reach of each ftrace entry to +-2KB displacing from ftrace_caller. Starting from the trampoline, we add a level of indirection for it to reach ftrace caller target. Now, it loads the target address from a memory location, then perform the jump. This enable the kernel to update the target atomically. The new don't-stop-the-world text patching on change only one RISC-V instruction: | -8: &ftrace_ops of the associated tracer function. | <ftrace enable>: | 0: auipc t0, hi(ftrace_caller) | 4: jalr t0, lo(ftrace_caller) | | -8: &ftrace_nop_ops | <ftrace disable>: | 0: auipc t0, hi(ftrace_caller) | 4: nop This means that f+0x0 is fixed, and should not be claimed by ftrace, e.g. kprobe should be able to put a probe in f+0x0. Thus, we adjust the offset and MCOUNT_INSN_SIZE accordingly. [ alex: Fix build errors with !CONFIG_DYNAMIC_FTRACE ] Co-developed-by: Björn Töpel <bjorn@rivosinc.com> Signed-off-by: Björn Töpel <bjorn@rivosinc.com> Signed-off-by: Andy Chiu <andy.chiu@sifive.com> Link: https://lore.kernel.org/r/20250407180838.42877-5-andybnac@gmail.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Now it is safe to remove dependency from stop_machine() for us to patch code in ftrace. Signed-off-by: Andy Chiu <andy.chiu@sifive.com> Link: https://lore.kernel.org/r/20250407180838.42877-6-andybnac@gmail.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Each function entry implies a call to ftrace infrastructure. And it may call into schedule in some cases. So, it is possible for preemptible kernel-mode Vector to implicitly call into schedule. Since all V-regs are caller-saved, it is possible to drop all V context when a thread voluntarily call schedule(). Besides, we currently don't pass argument through vector register, so we don't have to save/restore V-regs in ftrace trampoline. Signed-off-by: Andy Chiu <andy.chiu@sifive.com> Link: https://lore.kernel.org/r/20250407180838.42877-7-andybnac@gmail.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
RISC-V spec explicitly calls out that a local fence.i is not enough for the code modification to be visble from a remote hart. In fact, it states: To make a store to instruction memory visible to all RISC-V harts, the writing hart also has to execute a data FENCE before requesting that all remote RISC-V harts execute a FENCE.I. Although current riscv drivers for IPI use ordered MMIO when sending IPIs in order to synchronize the action between previous csd writes, riscv does not restrict itself to any particular flavor of IPI. Any driver or firmware implementation that does not order data writes before the IPI may pose a risk for code-modifying race. Thus, add a fence here to order data writes before making the IPI. Signed-off-by: Andy Chiu <andybnac@gmail.com> Reviewed-by: Björn Töpel <bjorn@rivosinc.com> Link: https://lore.kernel.org/r/20250407180838.42877-8-andybnac@gmail.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Now, we can safely enable dynamic ftrace with kernel preemption. Signed-off-by: Andy Chiu <andy.chiu@sifive.com> Reviewed-by: Björn Töpel <bjorn@rivosinc.com> Link: https://lore.kernel.org/r/20250407180838.42877-9-andybnac@gmail.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
This patch enables support for DYNAMIC_FTRACE_WITH_CALL_OPS on RISC-V. This allows each ftrace callsite to provide an ftrace_ops to the common ftrace trampoline, allowing each callsite to invoke distinct tracer functions without the need to fall back to list processing or to allocate custom trampolines for each callsite. This significantly speeds up cases where multiple distinct trace functions are used and callsites are mostly traced by a single tracer. The idea and most of the implementation is taken from the ARM64's implementation of the same feature. The idea is to place a pointer to the ftrace_ops as a literal at a fixed offset from the function entry point, which can be recovered by the common ftrace trampoline. We use -fpatchable-function-entry to reserve 8 bytes above the function entry by emitting 2 4 byte or 4 2 byte nops depending on the presence of CONFIG_RISCV_ISA_C. These 8 bytes are patched at runtime with a pointer to the associated ftrace_ops for that callsite. Functions are aligned to 8 bytes to make sure that the accesses to this literal are atomic. This approach allows for directly invoking ftrace_ops::func even for ftrace_ops which are dynamically-allocated (or part of a module), without going via ftrace_ops_list_func. We've benchamrked this with the ftrace_ops sample module on Spacemit K1 Jupiter: Without this patch: baseline (Linux rivos 6.14.0-09584-g7d06015d936c #3 SMP Sat Mar 29 +-----------------------+-----------------+----------------------------+ | Number of tracers | Total time (ns) | Per-call average time | |-----------------------+-----------------+----------------------------| | Relevant | Irrelevant | 100000 calls | Total (ns) | Overhead (ns) | |----------+------------+-----------------+------------+---------------| | 0 | 0 | 1357958 | 13 | - | | 0 | 1 | 1302375 | 13 | - | | 0 | 2 | 1302375 | 13 | - | | 0 | 10 | 1379084 | 13 | - | | 0 | 100 | 1302458 | 13 | - | | 0 | 200 | 1302333 | 13 | - | |----------+------------+-----------------+------------+---------------| | 1 | 0 | 13677833 | 136 | 123 | | 1 | 1 | 18500916 | 185 | 172 | | 1 | 2 | 2285645 | 228 | 215 | | 1 | 10 | 58824709 | 588 | 575 | | 1 | 100 | 505141584 | 5051 | 5038 | | 1 | 200 | 1580473126 | 15804 | 15791 | |----------+------------+-----------------+------------+---------------| | 1 | 0 | 13561000 | 135 | 122 | | 2 | 0 | 19707292 | 197 | 184 | | 10 | 0 | 67774750 | 677 | 664 | | 100 | 0 | 714123125 | 7141 | 7128 | | 200 | 0 | 1918065668 | 19180 | 19167 | +----------+------------+-----------------+------------+---------------+ Note: per-call overhead is estimated relative to the baseline case with 0 relevant tracers and 0 irrelevant tracers. With this patch: v4-rc4 (Linux rivos 6.14.0-09598-gd75747611c93 #4 SMP Sat Mar 29 +-----------------------+-----------------+----------------------------+ | Number of tracers | Total time (ns) | Per-call average time | |-----------------------+-----------------+----------------------------| | Relevant | Irrelevant | 100000 calls | Total (ns) | Overhead (ns) | |----------+------------+-----------------+------------+---------------| | 0 | 0 | 1459917 | 14 | - | | 0 | 1 | 1408000 | 14 | - | | 0 | 2 | 1383792 | 13 | - | | 0 | 10 | 1430709 | 14 | - | | 0 | 100 | 1383791 | 13 | - | | 0 | 200 | 1383750 | 13 | - | |----------+------------+-----------------+------------+---------------| | 1 | 0 | 5238041 | 52 | 38 | | 1 | 1 | 5228542 | 52 | 38 | | 1 | 2 | 5325917 | 53 | 40 | | 1 | 10 | 5299667 | 52 | 38 | | 1 | 100 | 5245250 | 52 | 39 | | 1 | 200 | 5238459 | 52 | 39 | |----------+------------+-----------------+------------+---------------| | 1 | 0 | 5239083 | 52 | 38 | | 2 | 0 | 19449417 | 194 | 181 | | 10 | 0 | 67718584 | 677 | 663 | | 100 | 0 | 709840708 | 7098 | 7085 | | 200 | 0 | 2203580626 | 22035 | 22022 | +----------+------------+-----------------+------------+---------------+ Note: per-call overhead is estimated relative to the baseline case with 0 relevant tracers and 0 irrelevant tracers. As can be seen from the above: a) Whenever there is a single relevant tracer function associated with a tracee, the overhead of invoking the tracer is constant, and does not scale with the number of tracers which are *not* associated with that tracee. b) The overhead for a single relevant tracer has dropped to ~1/3 of the overhead prior to this series (from 122ns to 38ns). This is largely due to permitting calls to dynamically-allocated ftrace_ops without going through ftrace_ops_list_func. Signed-off-by: Puranjay Mohan <puranjay12@gmail.com> [update kconfig, asm, refactor] Signed-off-by: Andy Chiu <andybnac@gmail.com> Tested-by: Björn Töpel <bjorn@rivosinc.com> Link: https://lore.kernel.org/r/20250407180838.42877-10-andybnac@gmail.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
jump to FTRACE_ADDR if distance is out of reach Co-developed-by: Björn Töpel <bjorn@rivosinc.com> Signed-off-by: Björn Töpel <bjorn@rivosinc.com> Signed-off-by: Andy Chiu <andybnac@gmail.com> Link: https://lore.kernel.org/r/20250407180838.42877-11-andybnac@gmail.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Add a section in cmodx to describe how dynamic ftrace works on riscv, limitations, and assumptions. Signed-off-by: Andy Chiu <andybnac@gmail.com> Link: https://lore.kernel.org/r/20250407180838.42877-12-andybnac@gmail.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
perf reports that 99.63% of the cycles from `modprobe amdgpu` are spent inside module_frob_arch_sections(). This is because amdgpu.ko contains about 300000 relocations in its .rela.text section, and the algorithm in count_max_entries() takes quadratic time. Apply two optimizations from the arm64 code, which together reduce the total execution time by 99.58%. First, sort the relocations so duplicate entries are adjacent. Second, reduce the number of relocations that must be sorted by filtering to only relocations that need PLT/GOT entries, as done in commit d4e0340 ("arm64/module: Optimize module load time by optimizing PLT counting"). Unlike the arm64 code, here the filtering and sorting is done in a scratch buffer, because the HI20 relocation search optimization in apply_relocate_add() depends on the original order of the relocations. This allows accumulating PLT/GOT relocations across sections so sorting and counting is only done once per module. Signed-off-by: Samuel Holland <samuel.holland@sifive.com> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Link: https://lore.kernel.org/r/20250409171526.862481-3-samuel.holland@sifive.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
It is the built-in command line appended to the bootloader command line, not the bootloader command line appended to the built-in command line. Fixes: 3aed8c4 ("RISC-V: Update Kconfig to better handle CMDLINE") Signed-off-by: 谢致邦 (XIE Zhibang) <Yeking@Red54.com> Link: https://lore.kernel.org/r/tencent_A93C7FB46BFD20054AD2FEF4645913FF550A@qq.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
This is the preparative patch for kexec_file_load Image support. It separates the elf_kexec_load() as two parts: - the first part loads the vmlinux (or Image) - the second part loads other segments (e.g. initrd,fdt,purgatory) And the second part is exported as the load_extra_segments() function which would be used in both kexec-elf.c and kexec-image.c. No functional change intended. Signed-off-by: Song Shuai <songshuaishuai@tinylab.org> Signed-off-by: Björn Töpel <bjorn@rivosinc.com> Link: https://lore.kernel.org/r/20250409193004.643839-2-bjorn@kernel.org Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
This patch creates image_kexec_ops to load Image binary file for kexec_file_load() syscall. Signed-off-by: Song Shuai <songshuaishuai@tinylab.org> Signed-off-by: Björn Töpel <bjorn@rivosinc.com> Link: https://lore.kernel.org/r/20250409193004.643839-3-bjorn@kernel.org Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
RISCV ELF use mapping symbols with special names $x, $d to identify regions of RISCV code or code with different ISAs[1]. These symbols don't identify functions, so will confuse the perf output. The patch filters out these symbols at load time, similar to "4886f2ca perf symbols: Ignore mapping symbols on aarch64". [1] https://github.com/riscv-non-isa/riscv-elf-psabi-doc/blob/ master/riscv-elf.adoc#mapping-symbol Signed-off-by: Haibo Xu <haibo1.xu@intel.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20250409025202.201046-1-haibo1.xu@intel.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
The return value of regs_irqs_disabled() is true or false, so change its type to reflect that and also make it always inline. Suggested-by: Huacai Chen <chenhuacai@loongson.cn> Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Link: https://lore.kernel.org/r/20250422113156.25742-1-yangtiezhu@loongson.cn Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Export Zabha through the hwprobe syscall. Reviewed-by: Clément Léger <cleger@rivosinc.com> Link: https://lore.kernel.org/r/20250421141413.394444-1-alexghiti@rivosinc.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
The S-type instructions are first introduced and then used to define the encoding of the Zicbop prefetching instructions. Co-developed-by: Guo Ren <guoren@kernel.org> Signed-off-by: Guo Ren <guoren@kernel.org> Tested-by: Andrea Parri <parri.andrea@gmail.com> Link: https://lore.kernel.org/r/20250421142441.395849-2-alexghiti@rivosinc.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Zicbop introduces cache blocks prefetching instructions, add the necessary support for the kernel to use it in the coming commits. Co-developed-by: Guo Ren <guoren@kernel.org> Signed-off-by: Guo Ren <guoren@kernel.org> Tested-by: Andrea Parri <parri.andrea@gmail.com> Link: https://lore.kernel.org/r/20250421142441.395849-3-alexghiti@rivosinc.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Enable Linux prefetch and prefetchw primitives using Zicbop. Signed-off-by: Guo Ren <guoren@linux.alibaba.com> Signed-off-by: Guo Ren <guoren@kernel.org> Link: https://lore.kernel.org/r/20231231082955.16516-3-guoren@kernel.org Tested-by: Andrea Parri <parri.andrea@gmail.com> Link: https://lore.kernel.org/r/20250421142441.395849-4-alexghiti@rivosinc.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
The cost of changing a cacheline from shared to exclusive state can be significant, especially when this is triggered by an exclusive store, since it may result in having to retry the transaction. This patch makes use of prefetch.w to prefetch cachelines for write prior to lr/sc loops when using the xchg_small atomic routine. This patch is inspired by commit 0ea366f ("arm64: atomics: prefetch the destination word for write prior to stxr"). Signed-off-by: Guo Ren <guoren@linux.alibaba.com> Signed-off-by: Guo Ren <guoren@kernel.org> Link: https://lore.kernel.org/r/20231231082955.16516-4-guoren@kernel.org Tested-by: Andrea Parri <parri.andrea@gmail.com> Link: https://lore.kernel.org/r/20250421142441.395849-5-alexghiti@rivosinc.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
…ry/exit Carves out space in arch specific thread struct for cfi status and shadow stack in usermode on riscv. This patch does following - defines a new structure cfi_status with status bit for cfi feature - defines shadow stack pointer, base and size in cfi_status structure - defines offsets to new member fields in thread in asm-offsets.c - Saves and restore shadow stack pointer on trap entry (U --> S) and exit (S --> U) Shadow stack save/restore is gated on feature availiblity and implemented using alternative. CSR can be context switched in `switch_to` as well but soon as kernel shadow stack support gets rolled in, shadow stack pointer will need to be switched at trap entry/exit point (much like `sp`). It can be argued that kernel using shadow stack deployment scenario may not be as prevalant as user mode using this feature. But even if there is some minimal deployment of kernel shadow stack, that means that it needs to be supported. And thus save/restore of shadow stack pointer in entry.S instead of in `switch_to.h`. Reviewed-by: Charlie Jenkins <charlie@rivosinc.com> Reviewed-by: Zong Li <zong.li@sifive.com> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Signed-off-by: Deepak Gupta <debug@rivosinc.com> Link: https://lore.kernel.org/r/20250522-v5_user_cfi_series-v16-5-64f61a35eee7@rivosinc.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
68b7f2f to
359ba26
Compare
`arch_calc_vm_prot_bits` is implemented on risc-v to return VM_READ | VM_WRITE if PROT_WRITE is specified. Similarly `riscv_sys_mmap` is updated to convert all incoming PROT_WRITE to (PROT_WRITE | PROT_READ). This is to make sure that any existing apps using PROT_WRITE still work. Earlier `protection_map[VM_WRITE]` used to pick read-write PTE encodings. Now `protection_map[VM_WRITE]` will always pick PAGE_SHADOWSTACK PTE encodings for shadow stack. Above changes ensure that existing apps continue to work because underneath kernel will be picking `protection_map[VM_WRITE|VM_READ]` PTE encodings. [ alex: Fix build error with newly introduced vdso getrandom ] Reviewed-by: Zong Li <zong.li@sifive.com> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Deepak Gupta <debug@rivosinc.com> Link: https://lore.kernel.org/r/20250522-v5_user_cfi_series-v16-6-64f61a35eee7@rivosinc.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
This patch implements creating shadow stack pte (on riscv). Creating shadow stack PTE on riscv means that clearing RWX and then setting W=1. [ alex: Use riscv/mm as commit prefix ] Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Reviewed-by: Zong Li <zong.li@sifive.com> Signed-off-by: Deepak Gupta <debug@rivosinc.com> Link: https://lore.kernel.org/r/20250522-v5_user_cfi_series-v16-7-64f61a35eee7@rivosinc.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
pte_mkwrite creates PTEs with WRITE encodings for underlying arch. Underlying arch can have two types of writeable mappings. One that can be written using regular store instructions. Another one that can only be written using specialized store instructions (like shadow stack stores). pte_mkwrite can select write PTE encoding based on VMA range (i.e. VM_SHADOW_STACK) [ alex: Use riscv/mm as commit prefix, remove blank line at EOF ] Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Reviewed-by: Zong Li <zong.li@sifive.com> Signed-off-by: Deepak Gupta <debug@rivosinc.com> Link: https://lore.kernel.org/r/20250522-v5_user_cfi_series-v16-8-64f61a35eee7@rivosinc.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
`fork` implements copy on write (COW) by making pages readonly in child and parent both. ptep_set_wrprotect and pte_wrprotect clears _PAGE_WRITE in PTE. Assumption is that page is readable and on fault copy on write happens. To implement COW on shadow stack pages, clearing up W bit makes them XWR = 000. This will result in wrong PTE setting which says no perms but V=1 and PFN field pointing to final page. Instead desired behavior is to turn it into a readable page, take an access (load/store) fault on sspush/sspop (shadow stack) and then perform COW on such pages. This way regular reads would still be allowed and not lead to COW maintaining current behavior of COW on non-shadow stack but writeable memory. On the other hand it doesn't interfere with existing COW for read-write memory. Assumption is always that _PAGE_READ must have been set and thus setting _PAGE_READ is harmless. [ alex: Use riscv/mm as commit prefix ] Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Reviewed-by: Zong Li <zong.li@sifive.com> Signed-off-by: Deepak Gupta <debug@rivosinc.com> Link: https://lore.kernel.org/r/20250522-v5_user_cfi_series-v16-9-64f61a35eee7@rivosinc.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
As discussed extensively in the changelog for the addition of this
syscall on x86 ("x86/shstk: Introduce map_shadow_stack syscall") the
existing mmap() and madvise() syscalls do not map entirely well onto the
security requirements for shadow stack memory since they lead to windows
where memory is allocated but not yet protected or stacks which are not
properly and safely initialised. Instead a new syscall map_shadow_stack()
has been defined which allocates and initialises a shadow stack page.
This patch implements this syscall for riscv. riscv doesn't require token
to be setup by kernel because user mode can do that by itself. However to
provide compatibility and portability with other architectues, user mode
can specify token set flag.
Reviewed-by: Zong Li <zong.li@sifive.com>
Signed-off-by: Deepak Gupta <debug@rivosinc.com>
Link: https://lore.kernel.org/r/20250522-v5_user_cfi_series-v16-10-64f61a35eee7@rivosinc.com
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Userspace specifies CLONE_VM to share address space and spawn new thread. `clone` allow userspace to specify a new stack for new thread. However there is no way to specify new shadow stack base address without changing API. This patch allocates a new shadow stack whenever CLONE_VM is given. In case of CLONE_VFORK, parent is suspended until child finishes and thus can child use parent shadow stack. In case of !CLONE_VM, COW kicks in because entire address space is copied from parent to child. `clone3` is extensible and can provide mechanisms using which shadow stack as an input parameter can be provided. This is not settled yet and being extensively discussed on mailing list. Once that's settled, this commit will adapt to that. Reviewed-by: Zong Li <zong.li@sifive.com> Signed-off-by: Deepak Gupta <debug@rivosinc.com> Link: https://lore.kernel.org/r/20250522-v5_user_cfi_series-v16-11-64f61a35eee7@rivosinc.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Implement architecture agnostic prctls() interface for setting and getting shadow stack status. prctls implemented are PR_GET_SHADOW_STACK_STATUS, PR_SET_SHADOW_STACK_STATUS and PR_LOCK_SHADOW_STACK_STATUS. As part of PR_SET_SHADOW_STACK_STATUS/PR_GET_SHADOW_STACK_STATUS, only PR_SHADOW_STACK_ENABLE is implemented because RISCV allows each mode to write to their own shadow stack using `sspush` or `ssamoswap`. PR_LOCK_SHADOW_STACK_STATUS locks current configuration of shadow stack enabling. Reviewed-by: Zong Li <zong.li@sifive.com> Signed-off-by: Deepak Gupta <debug@rivosinc.com> Link: https://lore.kernel.org/r/20250522-v5_user_cfi_series-v16-12-64f61a35eee7@rivosinc.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Three architectures (x86, aarch64, riscv) have support for indirect branch
tracking feature in a very similar fashion. On a very high level, indirect
branch tracking is a CPU feature where CPU tracks branches which uses
memory operand to perform control transfer in program. As part of this
tracking on indirect branches, CPU goes in a state where it expects a
landing pad instr on target and if not found then CPU raises some fault
(architecture dependent)
x86 landing pad instr - `ENDBRANCH`
arch64 landing pad instr - `BTI`
riscv landing instr - `lpad`
Given that three major arches have support for indirect branch tracking,
This patch makes `prctl` for indirect branch tracking arch agnostic.
To allow userspace to enable this feature for itself, following prtcls are
defined:
- PR_GET_INDIR_BR_LP_STATUS: Gets current configured status for indirect
branch tracking.
- PR_SET_INDIR_BR_LP_STATUS: Sets a configuration for indirect branch
tracking.
Following status options are allowed
- PR_INDIR_BR_LP_ENABLE: Enables indirect branch tracking on user
thread.
- PR_INDIR_BR_LP_DISABLE; Disables indirect branch tracking on user
thread.
- PR_LOCK_INDIR_BR_LP_STATUS: Locks configured status for indirect branch
tracking for user thread.
Reviewed-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Zong Li <zong.li@sifive.com>
Signed-off-by: Deepak Gupta <debug@rivosinc.com>
Link: https://lore.kernel.org/r/20250522-v5_user_cfi_series-v16-13-64f61a35eee7@rivosinc.com
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
prctls implemented are: PR_SET_INDIR_BR_LP_STATUS, PR_GET_INDIR_BR_LP_STATUS and PR_LOCK_INDIR_BR_LP_STATUS Reviewed-by: Zong Li <zong.li@sifive.com> Signed-off-by: Deepak Gupta <debug@rivosinc.com> Link: https://lore.kernel.org/r/20250522-v5_user_cfi_series-v16-14-64f61a35eee7@rivosinc.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
zicfiss / zicfilp introduces a new exception to priv isa `software check exception` with cause code = 18. This patch implements software check exception. Additionally it implements a cfi violation handler which checks for code in xtval. If xtval=2, it means that sw check exception happened because of an indirect branch not landing on 4 byte aligned PC or not landing on `lpad` instruction or label value embedded in `lpad` not matching label value setup in `x7`. If xtval=3, it means that sw check exception happened because of mismatch between link register (x1 or x5) and top of shadow stack (on execution of `sspopchk`). In case of cfi violation, SIGSEGV is raised with code=SEGV_CPERR. SEGV_CPERR was introduced by x86 shadow stack patches. Reviewed-by: Zong Li <zong.li@sifive.com> Signed-off-by: Deepak Gupta <debug@rivosinc.com> Link: https://lore.kernel.org/r/20250522-v5_user_cfi_series-v16-15-64f61a35eee7@rivosinc.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
The function save_v_state() served two purposes. First, it saved extension context into the signal stack. Then, it constructed the extension header if there was no fault. The second part is independent of the extension itself. As a result, we can pull that part out, so future extensions may reuse it. This patch adds arch_ext_list and makes setup_sigcontext() go through all possible extensions' save() callback. The callback returns a positive value indicating the size of the successfully saved extension. Then the kernel proceeds to construct the header for that extension. The kernel skips an extension if it does not exist, or if the saving fails for some reasons. The error code is propagated out on the later case. This patch does not introduce any functional changes. Signed-off-by: Andy Chiu <andybnac@gmail.com> Link: https://lore.kernel.org/r/20250522-v5_user_cfi_series-v16-16-64f61a35eee7@rivosinc.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Save shadow stack pointer in sigcontext structure while delivering signal. Restore shadow stack pointer from sigcontext on sigreturn. As part of save operation, kernel uses `ssamoswap` to save snapshot of current shadow stack on shadow stack itself (can be called as a save token). During restore on sigreturn, kernel retrieves token from top of shadow stack and validates it. This allows that user mode can't arbitrary pivot to any shadow stack address without having a token and thus provide strong security assurance between signaly delivery and sigreturn window. Use ABI compatible way of saving/restoring shadow stack pointer into signal stack. This follows what Vector extension, where extra registers are placed in a form of extension header + extension body in the stack. The extension header indicates the size of the extra architectural states plus the size of header itself, and a magic identifier of the extension. Then, the extensions body contains the new architectural states in the form defined by uapi. Signed-off-by: Andy Chiu <andy.chiu@sifive.com> Signed-off-by: Deepak Gupta <debug@rivosinc.com> Link: https://lore.kernel.org/r/20250522-v5_user_cfi_series-v16-17-64f61a35eee7@rivosinc.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Updating __show_regs to print captured shadow stack pointer as well. On tasks where shadow stack is disabled, it'll simply print 0. Signed-off-by: Deepak Gupta <debug@rivosinc.com> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Link: https://lore.kernel.org/r/20250522-v5_user_cfi_series-v16-18-64f61a35eee7@rivosinc.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Expose a new register type NT_RISCV_USER_CFI for risc-v cfi status and state. Intentionally both landing pad and shadow stack status and state are rolled into cfi state. Creating two different NT_RISCV_USER_XXX would not be useful and wastage of a note type. Enabling, disabling and locking of feature is not allowed via ptrace set interface. However setting `elp` state or setting shadow stack pointer are allowed via ptrace set interface . It is expected `gdb` might have use to fixup `elp` state or `shadow stack` pointer. Signed-off-by: Deepak Gupta <debug@rivosinc.com> Link: https://lore.kernel.org/r/20250522-v5_user_cfi_series-v16-19-64f61a35eee7@rivosinc.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Adding enumeration of zicfilp and zicfiss extensions in hwprobe syscall. Reviewed-by: Zong Li <zong.li@sifive.com> Signed-off-by: Deepak Gupta <debug@rivosinc.com> Link: https://lore.kernel.org/r/20250522-v5_user_cfi_series-v16-20-64f61a35eee7@rivosinc.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
This commit adds a kernel command line option using which user cfi can be disabled. User backward cfi and forward cfi can be enabled independently. Kernel command line parameter "riscv_nousercfi" can take below values: - "all" : Disable forward and backward cfi both. - "bcfi" : Disable backward cfi. - "fcfi" : Disable forward cfi Signed-off-by: Deepak Gupta <debug@rivosinc.com> Link: https://lore.kernel.org/r/20250522-v5_user_cfi_series-v16-21-64f61a35eee7@rivosinc.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Kernel will have to perform shadow stack operations on user shadow stack. Like during signal delivery and sigreturn, shadow stack token must be created and validated respectively. Thus shadow stack access for kernel must be enabled. In future when kernel shadow stacks are enabled for linux kernel, it must be enabled as early as possible for better coverage and prevent imbalance between regular stack and shadow stack. After `relocate_enable_mmu` has been done, this is as early as possible it can enabled. Reviewed-by: Zong Li <zong.li@sifive.com> Signed-off-by: Deepak Gupta <debug@rivosinc.com> Link: https://lore.kernel.org/r/20250522-v5_user_cfi_series-v16-22-64f61a35eee7@rivosinc.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
user mode tasks compiled with zicfilp may call indirectly into vdso (like hwprobe indirect calls). Add landing pad compile support in vdso. vdso with landing pad in it will be nop for tasks which have not enabled landing pad. This patch allows to run user mode tasks with cfi eanbled and do no harm. Future work can be done on this to do below - labeled landing pad on vdso functions (whenever labeling support shows up in gnu-toolchain) - emit shadow stack instructions only in vdso compiled objects as part of kernel compile. Signed-off-by: Jim Shu <jim.shu@sifive.com> Reviewed-by: Zong Li <zong.li@sifive.com> Signed-off-by: Deepak Gupta <debug@rivosinc.com> Link: https://lore.kernel.org/r/20250522-v5_user_cfi_series-v16-23-64f61a35eee7@rivosinc.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
This patch creates a config for shadow stack support and landing pad instr support. Shadow stack support and landing instr support can be enabled by selecting `CONFIG_RISCV_USER_CFI`. Selecting `CONFIG_RISCV_USER_CFI` wires up path to enumerate CPU support and if cpu support exists, kernel will support cpu assisted user mode cfi. If CONFIG_RISCV_USER_CFI is selected, select `ARCH_USES_HIGH_VMA_FLAGS`, `ARCH_HAS_USER_SHADOW_STACK` and DYNAMIC_SIGFRAME for riscv. Reviewed-by: Zong Li <zong.li@sifive.com> Signed-off-by: Deepak Gupta <debug@rivosinc.com> Link: https://lore.kernel.org/r/20250522-v5_user_cfi_series-v16-24-64f61a35eee7@rivosinc.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Adding documentation on landing pad aka indirect branch tracking on riscv and kernel interfaces exposed so that user tasks can enable it. Reviewed-by: Zong Li <zong.li@sifive.com> Signed-off-by: Deepak Gupta <debug@rivosinc.com> Link: https://lore.kernel.org/r/20250522-v5_user_cfi_series-v16-25-64f61a35eee7@rivosinc.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Adding documentation on shadow stack for user mode on riscv and kernel interfaces exposed so that user tasks can enable it. Reviewed-by: Zong Li <zong.li@sifive.com> Signed-off-by: Deepak Gupta <debug@rivosinc.com> Link: https://lore.kernel.org/r/20250522-v5_user_cfi_series-v16-26-64f61a35eee7@rivosinc.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Adds kselftest for RISC-V control flow integrity implementation for user mode. There is not a lot going on in kernel for enabling landing pad for user mode. cfi selftest are intended to be compiled with zicfilp and zicfiss enabled compiler. Thus kselftest simply checks if landing pad / shadow stack for the process are enabled or not and executes ptrace selftests on cfi. selftest then register a signal handler for SIGSEGV. Any control flow violation are reported as SIGSEGV with si_code = SEGV_CPERR. Test will fail on receiving any SEGV_CPERR. Shadow stack part has more changes in kernel and thus there are separate tests for that - Exercise `map_shadow_stack` syscall - `fork` test to make sure COW works for shadow stack pages - gup tests Kernel uses FOLL_FORCE when access happens to memory via /proc/<pid>/mem. Not breaking that for shadow stack. - signal test. Make sure signal delivery results in token creation on shadow stack and consumes (and verifies) token on sigreturn - shadow stack protection test. attempts to write using regular store instruction on shadow stack memory must result in access faults - ptrace test: adds landing pad violation, clears ELP and continues In case toolchain doesn't support cfi extension, cfi kselftest wont get built. Test outut ========== """ TAP version 13 1..5 This is to ensure shadow stack is indeed enabled and working This is to ensure shadow stack is indeed enabled and working ok 1 shstk fork test ok 2 map shadow stack syscall ok 3 shadow stack gup tests ok 4 shadow stack signal tests ok 5 memory protections of shadow stack memory """ Signed-off-by: Deepak Gupta <debug@rivosinc.com> Suggested-by: Charlie Jenkins <charlie@rivosinc.com> Signed-off-by: Charlie Jenkins <charlie@rivosinc.com> Link: https://lore.kernel.org/r/20250522-v5_user_cfi_series-v16-27-64f61a35eee7@rivosinc.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Deepak Gupta <debug@rivosinc.com> says: Basics and overview =================== Software with larger attack surfaces (e.g. network facing apps like databases, browsers or apps relying on browser runtimes) suffer from memory corruption issues which can be utilized by attackers to bend control flow of the program to eventually gain control (by making their payload executable). Attackers are able to perform such attacks by leveraging call-sites which rely on indirect calls or return sites which rely on obtaining return address from stack memory. To mitigate such attacks, risc-v extension zicfilp enforces that all indirect calls must land on a landing pad instruction `lpad` else cpu will raise software check exception (a new cpu exception cause code on riscv). Similarly for return flow, risc-v extension zicfiss extends architecture with - `sspush` instruction to push return address on a shadow stack - `sspopchk` instruction to pop return address from shadow stack and compare with input operand (i.e. return address on stack) - `sspopchk` to raise software check exception if comparision above was a mismatch - Protection mechanism using which shadow stack is not writeable via regular store instructions More information an details can be found at extensions github repo [1]. Equivalent to landing pad (zicfilp) on x86 is `ENDBRANCH` instruction in Intel CET [3] and branch target identification (BTI) [4] on arm. Similarly x86's Intel CET has shadow stack [5] and arm64 has guarded control stack (GCS) [6] which are very similar to risc-v's zicfiss shadow stack. x86 and arm64 support for user mode shadow stack is already in mainline. Kernel awareness for user control flow integrity ================================================ This series picks up Samuel Holland's envcfg changes [2] as well. So if those are being applied independently, they should be removed from this series. Enabling: In order to maintain compatibility and not break anything in user mode, kernel doesn't enable control flow integrity cpu extensions on binary by default. Instead exposes a prctl interface to enable, disable and lock the shadow stack or landing pad feature for a task. This allows userspace (loader) to enumerate if all objects in its address space are compiled with shadow stack and landing pad support and accordingly enable the feature. Additionally if a subsequent `dlopen` happens on a library, user mode can take a decision again to disable the feature (if incoming library is not compiled with support) OR terminate the task (if user mode policy is strict to have all objects in address space to be compiled with control flow integirty cpu feature). prctl to enable shadow stack results in allocating shadow stack from virtual memory and activating for user address space. x86 and arm64 are also following same direction due to similar reason(s). clone/fork: On clone and fork, cfi state for task is inherited by child. Shadow stack is part of virtual memory and is a writeable memory from kernel perspective (writeable via a restricted set of instructions aka shadow stack instructions) Thus kernel changes ensure that this memory is converted into read-only when fork/clone happens and COWed when fault is taken due to sspush, sspopchk or ssamoswap. In case `CLONE_VM` is specified and shadow stack is to be enabled, kernel will automatically allocate a shadow stack for that clone call. map_shadow_stack: x86 introduced `map_shadow_stack` system call to allow user space to explicitly map shadow stack memory in its address space. It is useful to allocate shadow for different contexts managed by a single thread (green threads or contexts) risc-v implements this system call as well. signal management: If shadow stack is enabled for a task, kernel performs an asynchronous control flow diversion to deliver the signal and eventually expects userspace to issue sigreturn so that original execution can be resumed. Even though resume context is prepared by kernel, it is in user space memory and is subject to memory corruption and corruption bugs can be utilized by attacker in this race window to perform arbitrary sigreturn and eventually bypass cfi mechanism. Another issue is how to ensure that cfi related state on sigcontext area is not trampled by legacy apps or apps compiled with old kernel headers. In order to mitigate control-flow hijacting, kernel prepares a token and place it on shadow stack before signal delivery and places address of token in sigcontext structure. During sigreturn, kernel obtains address of token from sigcontext struture, reads token from shadow stack and validates it and only then allow sigreturn to succeed. Compatiblity issue is solved by adopting dynamic sigcontext management introduced for vector extension. This series re-factor the code little bit to allow future sigcontext management easy (as proposed by Andy Chiu from SiFive) config and compilation: Introduce a new risc-v config option `CONFIG_RISCV_USER_CFI`. Selecting this config option picks the kernel support for user control flow integrity. This optin is presented only if toolchain has shadow stack and landing pad support. And is on purpose guarded by toolchain support. Reason being that eventually vDSO also needs to be compiled in with shadow stack and landing pad support. vDSO compile patches are not included as of now because landing pad labeling scheme is yet to settle for usermode runtime. To get more information on kernel interactions with respect to zicfilp and zicfiss, patch series adds documentation for `zicfilp` and `zicfiss` in following: Documentation/arch/riscv/zicfiss.rst Documentation/arch/riscv/zicfilp.rst vDSO related Opens (in the flux) ================================= I am listing these opens for laying out plan and what to expect in future patch sets. And of course for the sake of discussion. Shadow stack and landing pad enabling in vDSO ---------------------------------------------- vDSO must have shadow stack and landing pad support compiled in for task to have shadow stack and landing pad support. This patch series doesn't enable that (yet). Enabling shadow stack support in vDSO should be straight forward (intend to do that in next versions of patch set). Enabling landing pad support in vDSO requires some collaboration with toolchain folks to follow a single label scheme for all object binaries. This is necessary to ensure that all indirect call-sites are setting correct label and target landing pads are decorated with same label scheme. How many vDSOs --------------- Shadow stack instructions are carved out of zimop (may be operations) and if CPU doesn't implement zimop, they're illegal instructions. Kernel could be running on a CPU which may or may not implement zimop. And thus kernel will have to carry 2 different vDSOs and expose the appropriate one depending on whether CPU implements zimop or not. References ========== [1] - https://github.com/riscv/riscv-cfi [2] - https://lore.kernel.org/all/20240814081126.956287-1-samuel.holland@sifive.com/ [3] - https://lwn.net/Articles/889475/ [4] - https://developer.arm.com/documentation/109576/0100/Branch-Target-Identification [5] - https://www.intel.com/content/dam/develop/external/us/en/documents/catc17-introduction-intel-cet-844137.pdf [6] - https://lwn.net/Articles/940403/ * patches from https://lore.kernel.org/r/20250522-v5_user_cfi_series-v16-0-64f61a35eee7@rivosinc.com: (27 commits) kselftest/riscv: kselftest for user mode cfi riscv: Documentation for shadow stack on riscv riscv: Documentation for landing pad / indirect branch tracking riscv: create a config for shadow stack and landing pad instr support arch/riscv: compile vdso with landing pad riscv: enable kernel access to shadow stack memory via FWFT sbi call riscv: kernel command line option to opt out of user cfi riscv/hwprobe: zicfilp / zicfiss enumeration in hwprobe riscv/ptrace: riscv cfi status and state via ptrace and in core files riscv/kernel: update __show_regs to print shadow stack register riscv/signal: save and restore of shadow stack for signal riscv: signal: abstract header saving for setup_sigcontext riscv/traps: Introduce software check exception riscv: Implements arch agnostic indirect branch tracking prctls prctl: arch-agnostic prctl for indirect branch tracking riscv: Implements arch agnostic shadow stack prctls riscv/shstk: If needed allocate a new shadow stack on clone riscv/mm: Implement map_shadow_stack() syscall riscv mmu: write protect and shadow stack riscv mmu: teach pte_mkwrite to manufacture shadow stack PTEs ... Link: https://lore.kernel.org/r/20250522-v5_user_cfi_series-v16-0-64f61a35eee7@rivosinc.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Handle the uprobe event first before handling the CFI violation in software-check exception handler. Because when the landing pad is activated, if the uprobe point is set at the lpad instruction at the beginning of a function, the system triggers a software-check exception instead of an ebreak exception due to the exception priority, then uprobe can't work successfully. Co-developed-by: Deepak Gupta <debug@rivosinc.com> Signed-off-by: Deepak Gupta <debug@rivosinc.com> Signed-off-by: Zong Li <zong.li@sifive.com> Link: https://lore.kernel.org/r/20250314092614.27372-1-zong.li@sifive.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
3928998 to
cb05f4e
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.