sync: rvck #267: [Backport] KVM support THP and huge pages spliting for dirty logging #231
sync: rvck #267: [Backport] KVM support THP and huge pages spliting for dirty logging #231yechao-w wants to merge 19 commits into
Conversation
mainline inclusion from mainline-6.17-rc1 commit 4a50578 category: feature bugzilla: RVCK-Project#257 -------------------------------- The kvm_riscv_vcpu_alloc_vector_context() does return an error code upon failure so don't ignore this in kvm_arch_vcpu_create(). Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Tested-by: Atish Patra <atishp@rivosinc.com> Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com> Link: https://lore.kernel.org/r/20250618113532.471448-2-apatel@ventanamicro.com Signed-off-by: Anup Patel <anup@brainfault.org> Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>
mainline inclusion from mainline-6.17-rc1 commit 7c67de2 category: feature bugzilla: RVCK-Project#257 -------------------------------- The kvm_riscv_vcpu_aia_init() does not return any failure so drop the return value which is always zero. Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Tested-by: Atish Patra <atishp@rivosinc.com> Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com> Link: https://lore.kernel.org/r/20250618113532.471448-3-apatel@ventanamicro.com Signed-off-by: Anup Patel <anup@brainfault.org> Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>
mainline inclusion from mainline-6.17-rc1 commit b79bf20 category: feature bugzilla: RVCK-Project#257 -------------------------------- The kvm_riscv_local_tlb_sanitize() deals with sanitizing current VMID related TLB mappings when a VCPU is moved from one host CPU to another. Let's move kvm_riscv_local_tlb_sanitize() to VMID management sources and rename it to kvm_riscv_gstage_vmid_sanitize(). Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Tested-by: Atish Patra <atishp@rivosinc.com> Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com> Link: https://lore.kernel.org/r/20250618113532.471448-4-apatel@ventanamicro.com Signed-off-by: Anup Patel <anup@brainfault.org> Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>
mainline inclusion from mainline-6.17-rc1 commit 7584eb6 category: feature bugzilla: RVCK-Project#257 -------------------------------- The KVM_REQ_HFENCE_GVMA_VMID_ALL is same as KVM_REQ_TLB_FLUSH so to avoid confusion let's replace KVM_REQ_HFENCE_GVMA_VMID_ALL with KVM_REQ_TLB_FLUSH. Also, rename kvm_riscv_hfence_gvma_vmid_all_process() to kvm_riscv_tlb_flush_process(). Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Tested-by: Atish Patra <atishp@rivosinc.com> Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com> Link: https://lore.kernel.org/r/20250618113532.471448-5-apatel@ventanamicro.com Signed-off-by: Anup Patel <anup@brainfault.org> Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>
mainline inclusion from mainline-6.17-rc1 commit eaa98ba category: feature bugzilla: RVCK-Project#257 -------------------------------- The gstage_set_pte() and gstage_op_pte() should flush TLB only when a leaf PTE changes so that unnecessary TLB flushes can be avoided. Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Tested-by: Atish Patra <atishp@rivosinc.com> Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com> Link: https://lore.kernel.org/r/20250618113532.471448-6-apatel@ventanamicro.com Signed-off-by: Anup Patel <anup@brainfault.org> Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>
mainline inclusion from mainline-6.17-rc1 commit ca539ba category: feature bugzilla: RVCK-Project#257 -------------------------------- The kvm_arch_flush_remote_tlbs_range() expected by KVM core can be easily implemented for RISC-V using kvm_riscv_hfence_gvma_vmid_gpa() hence provide it. Also with kvm_arch_flush_remote_tlbs_range() available for RISC-V, the mmu_wp_memory_region() can happily use kvm_flush_remote_tlbs_memslot() instead of kvm_flush_remote_tlbs(). Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Tested-by: Atish Patra <atishp@rivosinc.com> Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com> Link: https://lore.kernel.org/r/20250618113532.471448-7-apatel@ventanamicro.com Signed-off-by: Anup Patel <anup@brainfault.org> Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>
mainline inclusion from mainline-6.17-rc1 commit 77ba646 category: feature bugzilla: RVCK-Project#257 -------------------------------- The H-extension CSRs accessed by kvm_riscv_vcpu_trap_redirect() will trap when KVM RISC-V is running as Guest/VM hence remove these traps by using ncsr_xyz() instead of csr_xyz(). Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Tested-by: Atish Patra <atishp@rivosinc.com> Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com> Link: https://lore.kernel.org/r/20250618113532.471448-8-apatel@ventanamicro.com Signed-off-by: Anup Patel <anup@brainfault.org> Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>
mainline inclusion from mainline-6.17-rc1 commit 4ecbd3e category: feature bugzilla: RVCK-Project#257 -------------------------------- The MMU, TLB, and VMID management for KVM RISC-V already exists as seprate sources so create separate headers along these lines. This further simplifies asm/kvm_host.h header. Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Tested-by: Atish Patra <atishp@rivosinc.com> Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com> Link: https://lore.kernel.org/r/20250618113532.471448-9-apatel@ventanamicro.com Signed-off-by: Anup Patel <anup@brainfault.org> Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>
mainline inclusion from mainline-6.17-rc1 commit f035b44 category: feature bugzilla: RVCK-Project#257 -------------------------------- Introduce struct kvm_gstage_mapping which represents a g-stage mapping at a particular g-stage page table level. Also, update the kvm_riscv_gstage_map() to return the g-stage mapping upon success. Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Tested-by: Atish Patra <atishp@rivosinc.com> Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com> Link: https://lore.kernel.org/r/20250618113532.471448-10-apatel@ventanamicro.com Signed-off-by: Anup Patel <anup@brainfault.org> Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>
mainline inclusion from mainline-6.17-rc1 commit 4c933f3 category: feature bugzilla: RVCK-Project#257 -------------------------------- Currently, the struct kvm_riscv_hfence does not have vmid field and various hfence processing functions always pick vmid assigned to the guest/VM. This prevents us from doing hfence operation on arbitrary vmid hence add vmid field to struct kvm_riscv_hfence and use it wherever applicable. Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Tested-by: Atish Patra <atishp@rivosinc.com> Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com> Link: https://lore.kernel.org/r/20250618113532.471448-11-apatel@ventanamicro.com Signed-off-by: Anup Patel <anup@brainfault.org> Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>
mainline inclusion from mainline-6.17-rc1 commit dd82e35 category: feature bugzilla: RVCK-Project#257 -------------------------------- The upcoming nested virtualization can share g-stage page table management with the current host g-stage implementation hence factor-out g-stage page table management as separate sources and also use "kvm_riscv_mmu_" prefix for host g-stage functions. Signed-off-by: Anup Patel <apatel@ventanamicro.com> Tested-by: Atish Patra <atishp@rivosinc.com> Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com> Link: https://lore.kernel.org/r/20250618113532.471448-12-apatel@ventanamicro.com Signed-off-by: Anup Patel <anup@brainfault.org> Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>
mainline inclusion from mainline-6.17-rc1 commit 1f6d0ee category: feature bugzilla: RVCK-Project#257 -------------------------------- Currently, all kvm_riscv_hfence_xyz() APIs assume VMID to be the host VMID of the Guest/VM which resticts use of these APIs only for host TLB maintenance. Let's allow passing VMID as a parameter to all kvm_riscv_hfence_xyz() APIs so that they can be re-used for nested virtualization related TLB maintenance. Signed-off-by: Anup Patel <apatel@ventanamicro.com> Tested-by: Atish Patra <atishp@rivosinc.com> Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com> Link: https://lore.kernel.org/r/20250618113532.471448-13-apatel@ventanamicro.com Signed-off-by: Anup Patel <anup@brainfault.org> Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>
category: feature bugzilla: RVCK-Project#257 -------------------------------- This function provides the inverse operation of pte_mkhuge(). Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>
mainline inclusion
from mainline-6.15-rc1
commit 03dc00a
category: feature
bugzilla: RVCK-Project#257
--------------------------------
Use RSW0 as the special bit for pmds and puds, just like for ptes.
Also define the {pte,pmd,pud}_pgprot helpers which were previously
missing and are needed for the follow_pfnmap APIs.
Signed-off-by: Andrew Bresticker <abrestic@rivosinc.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20250108135700.2614848-1-abrestic@rivosinc.com
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>
mainline inclusion from mainline-7.0-rc1 commit ed7ae7a category: feature bugzilla: RVCK-Project#257 -------------------------------- Use block mapping if backed by a THP, as implemented in architectures like ARM and x86_64. Signed-off-by: Jessica Liu <liu.xuemei1@zte.com.cn> Reviewed-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20251127165137780QbUOVPKPAfWSGAFl5qtRy@zte.com.cn Signed-off-by: Anup Patel <anup@brainfault.org> Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>
mainline inclusion
from mainline-v7.0-rc4
commit b342166
category: feature
bugzilla: RVCK-Project#257
--------------------------------
When dirty logging is enabled, guest stage mappings are forced to
PAGE_SIZE granularity. Changing the mapping page size at this point
is incorrect.
Fixes: ed7ae7a34bea ("RISC-V: KVM: Transparent huge page support")
Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20260226191231140_X1Juus7s2kgVlc0ZyW_K@zte.com.cn
Signed-off-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>
…ging mainline inclusion from mainline-7.1-rc1 commit a216e24 category: feature bugzilla: RVCK-Project#257 -------------------------------- When enabling dirty log in small chunks (e.g., QEMU default chunk size of 256K), the chunk size is always smaller than the page size of huge pages (1G or 2M) used in the gstage page tables. This caused the write protection to be incorrectly skipped for huge PTEs because the condition `(end - addr) >= page_size` was not satisfied. Remove the size check in `kvm_riscv_gstage_wp_range()` to ensure huge PTEs are always write-protected regardless of the chunk size. Additionally, explicitly align the address down to the page size before invoking `kvm_riscv_gstage_op_pte()` to guarantee that the address passed to the operation function is page-aligned. This fixes the issue where dirty pages might not be tracked correctly when using huge pages. Fixes: 9d05c1f ("RISC-V: KVM: Implement stage2 page table programming") Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn> Reviewed-by: Nutty Liu <nutty.liu@hotmail.com> Reviewed-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/202603301610527120YZ-pAJY6x9SBpSRo1Wg4@zte.com.cn Signed-off-by: Anup Patel <anup@brainfault.org>
mainline inclusion from mainline-7.1-rc1 commit 6ad36f3 category: feature bugzilla: RVCK-Project#257 -------------------------------- During dirty logging, all huge pages are write-protected. When the guest writes to a write-protected huge page, a page fault is triggered. Before recovering the write permission, the huge page must be split into smaller pages (e.g., 4K). After splitting, the normal mapping process proceeds, allowing write permission to be restored at the smaller page granularity. If dirty logging is disabled because migration failed or was cancelled, only recover the write permission at the 4K level, and skip recovering the huge page mapping at this time to avoid the overhead of freeing page tables. The huge page mapping can be recovered in the ioctl context, similar to x86, in a later patch. Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn> Reviewed-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/202603301612587174XZ6QMCrymBqv30S6BN50@zte.com.cn Signed-off-by: Anup Patel <anup@brainfault.org>
mainline inclusion
from mainline-7.0-rc4
commit dec9ed9
category: feature
bugzilla: RVCK-Project#257
--------------------------------
While fuzzing KVM on RISC-V, a use-after-free was observed in
kvm_riscv_gstage_get_leaf(), where ptep_get() dereferences a
freed gstage page table page during gfn unmap.
The crash manifests as:
use-after-free in ptep_get include/linux/pgtable.h:340 [inline]
use-after-free in kvm_riscv_gstage_get_leaf arch/riscv/kvm/gstage.c:89
Call Trace:
ptep_get include/linux/pgtable.h:340 [inline]
kvm_riscv_gstage_get_leaf+0x2ea/0x358 arch/riscv/kvm/gstage.c:89
kvm_riscv_gstage_unmap_range+0xf0/0x308 arch/riscv/kvm/gstage.c:265
kvm_unmap_gfn_range+0x168/0x1fc arch/riscv/kvm/mmu.c:256
kvm_mmu_unmap_gfn_range virt/kvm/kvm_main.c:724 [inline]
page last free pid 808 tgid 808 stack trace:
kvm_riscv_mmu_free_pgd+0x1b6/0x26a arch/riscv/kvm/mmu.c:457
kvm_arch_flush_shadow_all+0x1a/0x24 arch/riscv/kvm/mmu.c:134
kvm_flush_shadow_all virt/kvm/kvm_main.c:344 [inline]
The UAF is caused by gstage page table walks running concurrently with
gstage pgd teardown. In particular, kvm_unmap_gfn_range() can traverse
gstage page tables while kvm_arch_flush_shadow_all() frees the pgd,
leading to use-after-free of page table pages.
Fix the issue by serializing gstage unmap and pgd teardown with
kvm->mmu_lock. Holding mmu_lock ensures that gstage page tables
remain valid for the duration of unmap operations and prevents
concurrent frees.
This matches existing RISC-V KVM usage of mmu_lock to protect gstage
map/unmap operations, e.g. kvm_riscv_mmu_iounmap.
Fixes: dd82e35638d67f ("RISC-V: KVM: Factor-out g-stage page table management")
Signed-off-by: Jiakai Xu <xujiakai2025@iscas.ac.cn>
Signed-off-by: Jiakai Xu <jiakaiPeanut@gmail.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20260202040059.1801167-1-xujiakai2025@iscas.ac.cn
Signed-off-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>
|
开始测试 log: https://github.com/RVCK-Project/rvck-olk/actions/runs/26392473868 参数解析结果
测试完成 详细结果:
Kunit Test Resultkunit test failed
Kernel Build Result
Check Patch Result
LAVA Check (qemu)
result: Lava check fail!
LAVA Check (sg2042)
result: Lava check done!
LAVA Check (spacemit-k1-bananapi-f3)
result: Lava check fail!
LAVA Check (lpi4a)
result: Lava check fail!
|
|
经过验证,可以正常编译、qemu可以正常启动,以下为AI补充的技术细节。
|
Sync commits from rvck PR 267.
Source PR: RVCK-Project/rvck#267
补丁冲突:
上游补丁: 03dc00a2b67854604c0aa8cbd32105f8b633451e “riscv: Support huge pfnmaps” 在THP开启的情况下,默认开启ARCH_SUPPORTS_HUGE_PFNMAP配置。此开关和olk版本中的自研功能: https://gitee.com/src-openeuler/kernel/issues/ID4NL4 有冲突,具体表现为 923e224 “mm: introduce remap_pfn_range_try_pmd() for PMD-level hugepage mapping” 引入了pte_clrhuge函数,此函数在riscv架构上缺少定义,导致编译失败。
冲突修改:
参照 cee7684 "pgtable: add pte_clrhuge() implementation for arm64",在riscv下实现pte_clrhuge函数。由于此函数为pte_mkhuge的反向操作,riscv上pte_mkhuge未做任何操作,所以pte_clrhuge实现也是直接返回pte即可。参见补丁集中第13个补丁:
riscv: pgtable: add pte_clrhuge() implementation for riscv