fix: [linux-nvidia-6.6] Resolve IOMMU passthrough conflicts with NVIDIA UVM on ARM64 causing NEMO/PyTorch failures #202

KobaKoNvidia · 2025-08-29T07:43:52Z

Summary

This PR addresses a critical compatibility issue between IOMMU passthrough mode and NVIDIA UVM driver on ARM64 platforms, which causes NEMO framework and PyTorch CUDA initialization failures.

Problem

During NEMO framework testing on ARM64 platforms, tests consistently fail on machines configured with iommu.passthrough=1. The failure is specifically triggered by torch.cuda.set_device(1) calls, resulting in kernel warnings and IOMMU-related errors.

Note: iommu.passthrough=1 was previously recommended by NVIDIA in official presentation slides as a performance optimization parameter, making this a high-impact issue for users following NVIDIA's best practices.

Root Cause

The issue occurs when:

ARM64 system is configured with kernel parameter iommu.passthrough=1
NVIDIA Coherent GPU Memory Mode (CDMM) is enabled
PyTorch attempts to initialize CUDA devices

The kernel warning indicates a conflict in the ARM SMMU v3 driver when UVM attempts to bind GPU VA space:

DSB is used for monitoring “events”. Events are something that occurs at some point in time. It could be a state decode, the act of writing/reading a particular address, a FIFO being empty, etc. This decoding of the event desired is done outside TPDM. DSB subunit need to be configured in enablement and disablement. A struct that specifics associated to dsb dataset is needed. It saves the configuration and parameters of the dsb datasets. This change is to add this struct and initialize the configuration of DSB subunit. Signed-off-by: Tao Zhang <quic_taozha@quicinc.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/1695882586-10306-6-git-send-email-quic_taozha@quicinc.com (cherry picked from commit f01e494) Signed-off-by: Koba Ko <kobak@nvidia.com>

TPDM device need a node to reset the configurations and status of it. This change provides a node to reset the configurations and disable the TPDM if it has been enabled. Signed-off-by: Tao Zhang <quic_taozha@quicinc.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/1695882586-10306-7-git-send-email-quic_taozha@quicinc.com (cherry picked from commit 8fbbce1) Signed-off-by: Koba Ko <kobak@nvidia.com>

The nodes are needed to set or show the trigger timestamp and trigger type. This change is to add these nodes to achieve these function. Signed-off-by: Tao Zhang <quic_taozha@quicinc.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/1695882586-10306-8-git-send-email-quic_taozha@quicinc.com (cherry picked from commit 851b3f9) Signed-off-by: Koba Ko <kobak@nvidia.com>

Add node to set and show programming mode for TPDM DSB subunit. Once the DSB programming mode is set, it will be written to the register DSB_CR. Signed-off-by: Tao Zhang <quic_taozha@quicinc.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/1695882586-10306-9-git-send-email-quic_taozha@quicinc.com (cherry picked from commit 018e43a) Signed-off-by: Koba Ko <kobak@nvidia.com>

Add the nodes to set value for DSB edge control and DSB edge control mask. Each DSB subunit TPDM has maximum of n(n<16) EDCR resgisters to configure edge control. DSB edge detection control 00: Rising edge detection 01: Falling edge detection 10: Rising and falling edge detection (toggle detection) And each DSB subunit TPDM has maximum of m(m<8) ECDMR registers to configure mask. Eight 32 bit registers providing DSB interface edge detection mask control. Add the nodes to configure DSB edge control and DSB edge control mask. Each DSB subunit TPDM maximum of 256 edge detections can be configured. The index and value sysfs files need to be paired and written to order. The index sysfs file is to set the index number of the edge detection which needs to be configured. And the value sysfs file is to set the control or mask for the edge detection. DSB edge detection control should be set as the following values. 00: Rising edge detection 01: Falling edge detection 10: Rising and falling edge detection (toggle detection) And DSB edge mask should be set as 0 or 1. Each DSB subunit TPDM has maximum of n(n<16) EDCR resgisters to configure edge control. And each DSB subunit TPDM has maximum of m(m<8) ECDMR registers to configure mask. Add the nodes to read a set of the edge control value and mask of the DSB in TPDM. Signed-off-by: Tao Zhang <quic_taozha@quicinc.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/1695882586-10306-10-git-send-email-quic_taozha@quicinc.com (cherry picked from commit f376caf) Signed-off-by: Koba Ko <kobak@nvidia.com>

Add nodes to configure trigger pattern and trigger pattern mask. Each DSB subunit TPDM has maximum of n(n<7) XPR registers to configure trigger pattern match output. Eight 32 bit registers providing DSB interface trigger output pattern match comparison. And each DSB subunit TPDM has maximum of m(m<7) XPMR registers to configure trigger pattern mask match output. Eight 32 bit registers providing DSB interface trigger output pattern match mask. Signed-off-by: Tao Zhang <quic_taozha@quicinc.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/1695882586-10306-11-git-send-email-quic_taozha@quicinc.com (cherry picked from commit a8138a9) Signed-off-by: Koba Ko <kobak@nvidia.com>

Add nodes to configure the timestamp request based on input pattern match. Each TPDM that support DSB subunit has maximum of n(n<7) TPR registers to configure value for timestamp request based on input pattern match. Eight 32 bit registers providing DSB interface timestamp request pattern match comparison. And each TPDM that support DSB subunit has maximum of m(m<7) TPMR registers to configure pattern mask for timestamp request. Eight 32 bit registers providing DSB interface timestamp request pattern match mask generation. Add nodes to enable/disable pattern timestamp and set pattern timestamp type. Signed-off-by: Tao Zhang <quic_taozha@quicinc.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/1695882586-10306-12-git-send-email-quic_taozha@quicinc.com (cherry picked from commit 4c98338) Signed-off-by: Koba Ko <kobak@nvidia.com>

Add property "qcom,dsb-msrs-num" to support DSB(Discrete Single Bit) MSR(mux select register) for TPDM. It specifies the number of MSR registers supported by the DSB TDPM. Signed-off-by: Tao Zhang <quic_taozha@quicinc.com> Acked-by: Rob Herring <robh@kernel.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/1695882586-10306-13-git-send-email-quic_taozha@quicinc.com (cherry picked from commit 8e05f86) Signed-off-by: Koba Ko <kobak@nvidia.com>

Add the nodes for DSB subunit MSR(mux select register) support. The TPDM MSR (mux select register) interface is an optional interface and associated bank of registers per TPDM subunit. The intent of mux select registers is to control muxing structures driving the TPDM’s’ various subunit interfaces. Signed-off-by: Tao Zhang <quic_taozha@quicinc.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/1695882586-10306-14-git-send-email-quic_taozha@quicinc.com (cherry picked from commit 350ba15) Signed-off-by: Koba Ko <kobak@nvidia.com>

Activated has the specific meaning of a sink that's selected for use by the user via sysfs. But comments in some code that's shared by Perf use the same word, so in those cases change them to just say "selected" instead. With selected implying either via Perf or "activated" via sysfs. coresight_get_enabled_sink() doesn't actually get an enabled sink, it only gets an activated one, so change that too. And change the activated variable name to include "sysfs" so it can't be confused as a general status. Signed-off-by: James Clark <james.clark@arm.com> Link: https://lore.kernel.org/r/20240129154050.569566-3-james.clark@arm.com Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> (cherry picked from commit a0fef3f) Signed-off-by: Koba Ko <kobak@nvidia.com>

The check for the existence of callbacks before using them implies that this happens and is supported. There are no devices without enable/disable callbacks, and it wouldn't be possible to add a new working device without adding them either, so just remove them. Furthermore, there are more callbacks than just enable and disable that are already used unguarded in other places. The comment about new session compatibility doesn't seem to match up to the line of code that it's on so remove it. I think it's alluding to the fact that sinks will check if they were already enabled via sysfs or Perf and fail the enable. But there are more detailed comments at those places, and this one isn't very useful. Signed-off-by: James Clark <james.clark@arm.com> Link: https://lore.kernel.org/r/20240129154050.569566-4-james.clark@arm.com Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> (cherry picked from commit a11ebe1) Signed-off-by: Koba Ko <kobak@nvidia.com>

Most devices use mode, so move the mode definition out of the individual devices and up to the Coresight device. This will allow the core code to also know the mode which will be useful in a later commit. This also fixes the inconsistency of the documentation of the mode field on the individual device types. For example ETB10 had "this ETB is being used". Two devices didn't require an atomic mode type, so these usages have been converted to atomic_get() and atomic_set() only to make it compile, but the documentation of the field in struct coresight_device explains this type of usage. In the future, manipulation of the mode could be completely moved out of the individual devices and into the core code because it's almost all duplicate code, and this change is a step towards that. Signed-off-by: James Clark <james.clark@arm.com> Link: https://lore.kernel.org/r/20240129154050.569566-5-james.clark@arm.com Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> (cherry picked from commit 9cae77c) Signed-off-by: Koba Ko <kobak@nvidia.com>

'enable', which probably should have been 'enabled', is only ever read in the core code in relation to controlling sources, and specifically only sources in sysfs mode. Confusingly it's not labelled as such and relying on it can be a source of bugs like the one fixed by commit 078dbba3f0c9 ("coresight: Fix crash when Perf and sysfs modes are used concurrently"). Most importantly, it can only be used when the coresight_mutex is held which is only done when enabling and disabling paths in sysfs mode, and not Perf mode. So to prevent its usage spreading and leaking out to other devices, remove it. It's use is equivalent to checking if the mode is currently sysfs, as due to the coresight_mutex lock, mode == CS_MODE_SYSFS can only become true or untrue when that lock is held, and when mode == CS_MODE_SYSFS the device is both enabled and in sysfs mode. The one place it was used outside of the core code is in TPDA, but that pattern is more appropriately represented using refcounts inside the device's own spinlock. Signed-off-by: James Clark <james.clark@arm.com> Link: https://lore.kernel.org/r/20240129154050.569566-6-james.clark@arm.com Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> (cherry picked from commit d5e83f9) Signed-off-by: Koba Ko <kobak@nvidia.com>

At the moment the core file contains both sysfs functionality and core functionality, while the Perf mode is in a separate file in coresight-etm-perf.c Many of the functions have ambiguous names like coresight_enable_source() which actually only work in relation to the sysfs mode. To avoid further confusion, move everything that isn't core functionality into the sysfs file and append _sysfs to the ambiguous functions. Signed-off-by: James Clark <james.clark@arm.com> Link: https://lore.kernel.org/r/20240129154050.569566-7-james.clark@arm.com Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> (cherry picked from commit 1f5149c) Signed-off-by: Koba Ko <kobak@nvidia.com>

Refcnt is only ever accessed from either inside the coresight_mutex, or the device's spinlock, making the atomic type and atomic_dec_return() calls confusing and unnecessary. The only point of synchronisation outside of these two types of locks is already done with a compare and swap on 'mode', which a comment has been added for. There was one instance of refcnt being used outside of a lock in TPIU, but that can easily be fixed by making it the same as all the other devices and adding a spinlock. Potentially in the future all the refcounting and locking can be moved up into the core code, and all the mostly duplicate code from the individual devices can be removed. Signed-off-by: James Clark <james.clark@arm.com> Link: https://lore.kernel.org/r/20240129154050.569566-8-james.clark@arm.com Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> (cherry picked from commit 4545b38) Signed-off-by: Koba Ko <kobak@nvidia.com>

These are a bit annoying to keep up to date when the function signatures change. But if CONFIG_CORESIGHT isn't enabled, then they're not used anyway so just delete them. Signed-off-by: James Clark <james.clark@arm.com> Link: https://lore.kernel.org/r/20240129154050.569566-9-james.clark@arm.com Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> (cherry picked from commit 053ad9a) Signed-off-by: Koba Ko <kobak@nvidia.com>

These could potentially become wrong silently if the enum is changed, so explicitly initialize them. Signed-off-by: James Clark <james.clark@arm.com> Link: https://lore.kernel.org/r/20240129154050.569566-10-james.clark@arm.com Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> (cherry picked from commit 812265e) Signed-off-by: Koba Ko <kobak@nvidia.com>

Now that mode is in struct coresight_device, this pattern can be wrapped in a helper. Signed-off-by: James Clark <james.clark@arm.com> Link: https://lore.kernel.org/r/20240129154050.569566-11-james.clark@arm.com Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> (cherry picked from commit d724f65) Signed-off-by: Koba Ko <kobak@nvidia.com>

Now that mode is in struct coresight_device accesses can be wrapped. Signed-off-by: James Clark <james.clark@arm.com> Link: https://lore.kernel.org/r/20240129154050.569566-12-james.clark@arm.com Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> (cherry picked from commit c95c273) Signed-off-by: Koba Ko <kobak@nvidia.com>

Now that mode is in struct coresight_device, sets can be wrapped. This also allows us to add a sanity check that there have been no concurrent modifications of mode. Currently all usages of local_set() were inside the device's spin locks so this new warning shouldn't be triggered. coresight_take_mode() could maybe have been used in place of adding the warning, but there may be use cases which set the mode to the same mode which are valid but would fail in coresight_take_mode() because it requires the device to only be in the disabled state. Signed-off-by: James Clark <james.clark@arm.com> Link: https://lore.kernel.org/r/20240129154050.569566-13-james.clark@arm.com Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> (cherry picked from commit bcaabb9) Signed-off-by: Koba Ko <kobak@nvidia.com>

This file is never included anywhere if CONFIG_CORESIGHT is not set so they are unused and aren't currently compile tested with any config so remove them. Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Mike Leach <mike.leach@linaro.org> Signed-off-by: James Clark <james.clark@arm.com> Tested-by: Leo Yan <leo.yan@arm.com> Tested-by: Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com> Signed-off-by: James Clark <james.clark@linaro.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/20240722101202.26915-10-james.clark@linaro.org (cherry picked from commit 3417200) Signed-off-by: Koba Ko <kobak@nvidia.com>

"Process being monitored" and "pid of the process to monitor" imply that this would be the same PID if there were two sessions targeting the same process. But this is actually the PID of the process that did the Perf event open call, rather than the target of the session. So update the comments to make this clearer. Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Mike Leach <mike.leach@linaro.org> Signed-off-by: James Clark <james.clark@arm.com> Tested-by: Leo Yan <leo.yan@arm.com> Tested-by: Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com> Signed-off-by: James Clark <james.clark@linaro.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/20240722101202.26915-11-james.clark@linaro.org (cherry picked from commit eda1d11) Signed-off-by: Koba Ko <kobak@nvidia.com>

The trace ID maps will need to be created and stored by the core and Perf code so move the definition up to the common header. Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Mike Leach <mike.leach@linaro.org> Signed-off-by: James Clark <james.clark@arm.com> Tested-by: Leo Yan <leo.yan@arm.com> Tested-by: Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com> Signed-off-by: James Clark <james.clark@linaro.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/20240722101202.26915-12-james.clark@linaro.org (cherry picked from commit acb0184) Signed-off-by: Koba Ko <kobak@nvidia.com>

The trace ID API is currently hard coded to always use the global map. Add public versions that allow the map to be passed in so that Perf mode can use per-sink maps. Keep the non-map versions so that sysfs mode can continue to use the default global map. System ID functions are unchanged because they will always use the default map. Signed-off-by: James Clark <james.clark@arm.com> Reviewed-by: Mike Leach <mike.leach@linaro.org> Tested-by: Leo Yan <leo.yan@arm.com> Signed-off-by: James Clark <james.clark@linaro.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/20240722101202.26915-13-james.clark@linaro.org (cherry picked from commit 7e52877) Signed-off-by: Koba Ko <kobak@nvidia.com>

The global CPU ID mappings won't work for per-sink ID maps so move it to the ID map struct. coresight_trace_id_release_all_pending() is hard coded to operate on the default map, but once Perf sessions use their own maps the pending release mechanism will be deleted. So it doesn't need to be extended to accept a trace ID map argument at this point. Signed-off-by: James Clark <james.clark@arm.com> Reviewed-by: Mike Leach <mike.leach@linaro.org> Tested-by: Leo Yan <leo.yan@arm.com> Signed-off-by: James Clark <james.clark@linaro.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/20240722101202.26915-14-james.clark@linaro.org (cherry picked from commit d53c825) Signed-off-by: Koba Ko <kobak@nvidia.com>

This will allow sessions with more than CORESIGHT_TRACE_IDS_MAX ETMs as long as there are fewer than that many ETMs connected to each sink. Each sink owns its own trace ID map, and any Perf session connecting to that sink will allocate from it, even if the sink is currently in use by other users. This is similar to the existing behavior where the dynamic trace IDs are constant as long as there is any concurrent Perf session active. It's not completely optimal because slightly more IDs will be used than necessary, but the optimal solution involves tracking the PIDs of each session and allocating ID maps based on the session owner. This is difficult to do with the combination of per-thread and per-cpu modes and some scheduling issues. The complexity of this isn't likely to worth it because even with multiple users they'd just see a difference in the ordering of ID allocations rather than hitting any limits (unless the hardware does have too many ETMs connected to one sink). Signed-off-by: James Clark <james.clark@arm.com> Reviewed-by: Mike Leach <mike.leach@linaro.org> Signed-off-by: James Clark <james.clark@linaro.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/20240722101202.26915-15-james.clark@linaro.org (cherry picked from commit 5ad628a) Signed-off-by: Koba Ko <kobak@nvidia.com>

Pending the release of IDs was a way of managing concurrent sysfs and Perf sessions in a single global ID map. Perf may have finished while sysfs hadn't, and Perf shouldn't release the IDs in use by sysfs and vice versa. Now that Perf uses its own exclusive ID maps, pending release doesn't result in any different behavior than just releasing all IDs when the last Perf session finishes. As part of the per-sink trace ID change, we would have still had to make the pending mechanism work on a per-sink basis, due to the overlapping ID allocations, so instead of making that more complicated, just remove it. Signed-off-by: James Clark <james.clark@arm.com> Reviewed-by: Mike Leach <mike.leach@linaro.org> Signed-off-by: James Clark <james.clark@linaro.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/20240722101202.26915-16-james.clark@linaro.org (cherry picked from commit de0029f) Signed-off-by: Koba Ko <kobak@nvidia.com>

For Perf to be able to decode when per-sink trace IDs are used, emit the sink that's being written to for each ETM. Perf currently errors out if it sees a newer packet version so instead of bumping it, add a new minor version field. This can be used to signify new versions that have backwards compatible fields. Considering this change is only for high core count machines, it doesn't make sense to make a breaking change for everyone. Signed-off-by: James Clark <james.clark@arm.com> Tested-by: Leo Yan <leo.yan@arm.com> Reviewed-by: Mike Leach <mike.leach@linaro.org> Signed-off-by: James Clark <james.clark@linaro.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/20240722101202.26915-17-james.clark@linaro.org (cherry picked from commit 487eec8) Signed-off-by: Koba Ko <kobak@nvidia.com>

Reduce contention on the lock by replacing the global lock with one for each map. Signed-off-by: James Clark <james.clark@arm.com> Reviewed-by: Mike Leach <mike.leach@linaro.org> Signed-off-by: James Clark <james.clark@linaro.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/20240722101202.26915-18-james.clark@linaro.org (cherry picked from commit 988d40a) Signed-off-by: Koba Ko <kobak@nvidia.com>

Use the observation that the critical and hot trip points are never updated by the ACPI thermal driver, because the flags passed from acpi_thermal_notify() to acpi_thermal_trips_update() do not include ACPI_TRIPS_CRITICAL or ACPI_TRIPS_HOT, to move the initialization of those trip points directly into acpi_thermal_get_trip_points() and reduce the size of __acpi_thermal_trips_update(). Also make the critical and hot trip points initialization code more straightforward and drop the flags that are not needed any more. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org> (cherry picked from commit 4be3233) Signed-off-by: Koba Ko <kobak@nvidia.com>

Only the attach callers can perform an allocation for the CD table entry, the other callers must not do so, they do not have the correct locking and they cannot sleep. Split up the functions so this is clear. arm_smmu_get_cd_ptr() will return pointer to a CD table entry without doing any kind of allocation. arm_smmu_alloc_cd_ptr() will allocate the table and any required leaf. A following patch will add lockdep assertions to arm_smmu_alloc_cd_ptr() once the restructuring is completed and arm_smmu_alloc_cd_ptr() is never called in the wrong context. Tested-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/5-v9-5040dc602008+177d7-smmuv3_newapi_p2_jgg@nvidia.com Signed-off-by: Will Deacon <will@kernel.org> (cherry picked from commit b2f4c0f) Signed-off-by: Koba Ko <kobak@nvidia.com>

Avoid arm_smmu_attach_dev() having to undo the changes to the smmu_domain->devices list, acquire the cdptr earlier so we don't need to handle that error. Now there is a clear break in arm_smmu_attach_dev() where all the prep-work has been done non-disruptively and we commit to making the HW change, which cannot fail. This completes transforming arm_smmu_attach_dev() so that it does not disturb the HW if it fails. Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Reviewed-by: Michael Shavit <mshavit@google.com> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Mostafa Saleh <smostafa@google.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/6-v9-5040dc602008+177d7-smmuv3_newapi_p2_jgg@nvidia.com Signed-off-by: Will Deacon <will@kernel.org> (cherry picked from commit 13abe4f) Signed-off-by: Koba Ko <kobak@nvidia.com>

Pull all the calculations for building the CD table entry for a mmu_struct into arm_smmu_make_sva_cd(). Call it in the two places installing the SVA CD table entry. Open code the last caller of arm_smmu_update_ctx_desc_devices() and remove the function. Remove arm_smmu_write_ctx_desc() since all callers are gone. Add the locking assertions to arm_smmu_alloc_cd_ptr() since arm_smmu_update_ctx_desc_devices() was the last problematic caller. Remove quiet_cd since all users are gone, arm_smmu_make_sva_cd() creates the same value. The behavior of quiet_cd changes slightly, the old implementation edited the CD in place to set CTXDESC_CD_0_TCR_EPD0 assuming it was a SVA CD entry. This version generates a full CD entry with a 0 TTB0 and relies on arm_smmu_write_cd_entry() to install it hitlessly. Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/7-v9-5040dc602008+177d7-smmuv3_newapi_p2_jgg@nvidia.com Signed-off-by: Will Deacon <will@kernel.org> (cherry picked from commit 7b87c93) Signed-off-by: Koba Ko <kobak@nvidia.com>

Half the code was living in arm_smmu_domain_finalise_s1(), just move it here and take the values directly from the pgtbl_ops instead of storing copies. Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Reviewed-by: Michael Shavit <mshavit@google.com> Reviewed-by: Mostafa Saleh <smostafa@google.com> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/8-v9-5040dc602008+177d7-smmuv3_newapi_p2_jgg@nvidia.com Signed-off-by: Will Deacon <will@kernel.org> (cherry picked from commit 04905c1) Signed-off-by: Koba Ko <kobak@nvidia.com>

Add tests for some of the more common STE update operations that we expect to see, as well as some artificial STE updates to test the edges of arm_smmu_write_entry. These also serve as a record of which common operation is expected to be hitless, and how many syncs they require. arm_smmu_write_entry implements a generic algorithm that updates an STE/CD to any other abritrary STE/CD configuration. The update requires a sequence of write+sync operations with some invariants that must be held true after each sync. arm_smmu_write_entry lends itself well to unit-testing since the function's interaction with the STE/CD is already abstracted by input callbacks that we can hook to introspect into the sequence of operations. We can use these hooks to guarantee that invariants are held throughout the entire update operation. Link: https://lore.kernel.org/r/20240106083617.1173871-3-mshavit@google.com Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Michael Shavit <mshavit@google.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/9-v9-5040dc602008+177d7-smmuv3_newapi_p2_jgg@nvidia.com Signed-off-by: Will Deacon <will@kernel.org> (cherry picked from commit 56e1a4c) Signed-off-by: Koba Ko <kobak@nvidia.com>

Static checker is complaining about the ASID possibly set uninitialized. This only happens in case of error and this value would be ignored anyway. A simple fix would be just to initialize the local variable to zero, this path will only be reached on the first attach to a domain where the CD is already initialized to zero. This avoids having to bloat the function with an error path. Closes: https://lore.kernel.org/linux-iommu/849e3d77-0a3c-43c4-878d-a0e061c8cd61@moroto.mountain/T/#u Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Mostafa Saleh <smostafa@google.com> Fixes: 04905c1 ("iommu/arm-smmu-v3: Build the whole CD in arm_smmu_make_s1_cd()") Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20240604185218.2602058-1-smostafa@google.com Signed-off-by: Will Deacon <will@kernel.org> (cherry picked from commit d3867e7) Signed-off-by: Koba Ko <kobak@nvidia.com>

This allows the driver the receive the mm and always a device during allocation. Later patches need this to properly setup the notifier when the domain is first allocated. Remove ops->domain_alloc() as SVA was the only remaining purpose. Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Reviewed-by: Michael Shavit <mshavit@google.com> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/1-v9-5cd718286059+79186-smmuv3_newapi_p2b_jgg@nvidia.com Signed-off-by: Will Deacon <will@kernel.org> (cherry picked from commit 678d79b) Signed-off-by: Koba Ko <kobak@nvidia.com>

Add arm_smmu_set_pasid()/arm_smmu_remove_pasid() which are to be used by callers that already constructed the arm_smmu_cd they wish to program. These functions will encapsulate the shared logic to setup a CD entry that will be shared by SVA and S1 domain cases. Prior fixes had already moved most of this logic up into __arm_smmu_sva_bind(), move it to it's final home. Following patches will relieve some of the remaining SVA restrictions: - The RID domain is a S1 domain and has already setup the STE to point to the CD table - The programmed PASID is the mm_get_enqcmd_pasid() - Nothing changes while SVA is running (sva_enable) SVA invalidation will still iterate over the S1 domain's master list, later patches will resolve that. Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/2-v9-5cd718286059+79186-smmuv3_newapi_p2b_jgg@nvidia.com Signed-off-by: Will Deacon <will@kernel.org> (cherry picked from commit 85f2fb6) Signed-off-by: Koba Ko <kobak@nvidia.com>

The next patch will need to store the same master twice (with different SSIDs), so allocate memory for each list element. Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Reviewed-by: Michael Shavit <mshavit@google.com> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/3-v9-5cd718286059+79186-smmuv3_newapi_p2b_jgg@nvidia.com Signed-off-by: Will Deacon <will@kernel.org> (cherry picked from commit ad10dce) Signed-off-by: Koba Ko <kobak@nvidia.com>

The core code allows the domain to be changed on the fly without a forced stop in BLOCKED/IDENTITY. In this flow the driver should just continually maintain the ATS with no change while the STE is updated. ATS relies on a linked list smmu_domain->devices to keep track of which masters have the domain programmed, but this list is also used by arm_smmu_share_asid(), unrelated to ats. Create two new functions to encapsulate this combined logic: arm_smmu_attach_prepare() <caller generates and sets the STE> arm_smmu_attach_commit() The two functions can sequence both enabling ATS and disabling across the STE store. Have every update of the STE use this sequence. Installing a S1/S2 domain always enables the ATS if the PCIe device supports it. The enable flow is now ordered differently to allow it to be hitless: 1) Add the master to the new smmu_domain->devices list 2) Program the STE 3) Enable ATS at PCIe 4) Remove the master from the old smmu_domain This flow ensures that invalidations to either domain will generate an ATC invalidation to the device while the STE is being switched. Thus we don't need to turn off the ATS anymore for correctness. The disable flow is the reverse: 1) Disable ATS at PCIe 2) Program the STE 3) Invalidate the ATC 4) Remove the master from the old smmu_domain Move the nr_ats_masters adjustments to be close to the list manipulations. It is a count of the number of ATS enabled masters currently in the list. This is stricly before and after the STE/CD are revised, and done under the list's spin_lock. This is part of the bigger picture to allow changing the RID domain while a PASID is in use. If a SVA PASID is relying on ATS to function then changing the RID domain cannot just temporarily toggle ATS off without also wrecking the SVA PASID. The new infrastructure here is organized so that the PASID attach/detach flows will make use of it as well in following patches. Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Michael Shavit <mshavit@google.com> Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/4-v9-5cd718286059+79186-smmuv3_newapi_p2b_jgg@nvidia.com Signed-off-by: Will Deacon <will@kernel.org> (cherry picked from commit 7497f42) Signed-off-by: Koba Ko <kobak@nvidia.com>

Prepare to allow a S1 domain to be attached to a PASID as well. Keep track of the SSID the domain is using on each master in the arm_smmu_master_domain. Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Reviewed-by: Michael Shavit <mshavit@google.com> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/5-v9-5cd718286059+79186-smmuv3_newapi_p2b_jgg@nvidia.com Signed-off-by: Will Deacon <will@kernel.org> (cherry picked from commit 64efb3d) Signed-off-by: Koba Ko <kobak@nvidia.com>

We no longer need a master->sva_enable to control what attaches are allowed. Instead we can tell if the attach is legal based on the current configuration of the master. Keep track of the number of valid CD entries for SSID's in the cd_table and if the cd_table has been installed in the STE directly so we know what the configuration is. The attach logic is then made into: - SVA bind, check if the CD is installed - RID attach of S2, block if SSIDs are used - RID attach of IDENTITY/BLOCKING, block if SSIDs are used arm_smmu_set_pasid() is already checking if it is possible to setup a CD entry, at this patch it means the RID path already set a STE pointing at the CD table. Tested-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/6-v9-5cd718286059+79186-smmuv3_newapi_p2b_jgg@nvidia.com Signed-off-by: Will Deacon <will@kernel.org> (cherry picked from commit be7c90d) Signed-off-by: Koba Ko <kobak@nvidia.com>

Allow creating and managing arm_smmu_mater_domain's with a non-zero SSID through the arm_smmu_attach_*() family of functions. This triggers ATC invalidation for the correct SSID in PASID cases and tracks the per-attachment SSID in the struct arm_smmu_master_domain. Generalize arm_smmu_attach_remove() to be able to remove SSID's as well by ensuring the ATC for the PASID is flushed properly. Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/7-v9-5cd718286059+79186-smmuv3_newapi_p2b_jgg@nvidia.com Signed-off-by: Will Deacon <will@kernel.org> (cherry picked from commit 1d5f34f) Signed-off-by: Koba Ko <kobak@nvidia.com>

Currently the SVA domain is a naked struct iommu_domain, allocate a struct arm_smmu_domain instead. This is necessary to be able to use the struct arm_master_domain mechanism. Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Reviewed-by: Michael Shavit <mshavit@google.com> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/8-v9-5cd718286059+79186-smmuv3_newapi_p2b_jgg@nvidia.com Signed-off-by: Will Deacon <will@kernel.org> (cherry picked from commit d7b2d2b) Signed-off-by: Koba Ko <kobak@nvidia.com>

Fill in the smmu_domain->devices list in the new struct arm_smmu_domain that SVA allocates. Keep track of every SSID and master that is using the domain reusing the logic for the RID attach. This is the first step to making the SVA invalidation follow the same design as S1/S2 invalidation. At present nothing will read this list. Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/9-v9-5cd718286059+79186-smmuv3_newapi_p2b_jgg@nvidia.com Signed-off-by: Will Deacon <will@kernel.org> (cherry picked from commit 49db2ed) Signed-off-by: Koba Ko <kobak@nvidia.com>

This removes all the notifier de-duplication logic in the driver and relies on the core code to de-duplicate and allocate only one SVA domain per mm per smmu instance. This naturally gives a 1:1 relationship between SVA domain and mmu notifier. It is a significant simplication of the flow, as we end up with a single struct arm_smmu_domain for each MM and the invalidation can then be shifted to properly use the masters list like S1/S2 do. Remove all of the previous mmu_notifier, bond, shared cd, and cd refcount logic entirely. The logic here is tightly wound together with the unusued BTM support. Since the BTM logic requires holding all the iommu_domains in a global ASID xarray it conflicts with the design to have a single SVA domain per PASID, as multiple SMMU instances will need to have different domains. Following patches resolve this by making the ASID xarray per-instance instead of global. However, converting the BTM code over to this methodology requires many changes. Thus, since ARM_SMMU_FEAT_BTM is never enabled, remove the parts of the BTM support for ASID sharing that interact with SVA as well. A followup series is already working on fully enabling the BTM support, that requires iommufd's VIOMMU feature to bring in the KVM's VMID as well. It will come with an already written patch to bring back the ASID sharing using a per-instance ASID xarray. https://lore.kernel.org/linux-iommu/20240208151837.35068-1-shameerali.kolothum.thodi@huawei.com/ https://lore.kernel.org/linux-iommu/26-v6-228e7adf25eb+4155-smmuv3_newapi_p2_jgg@nvidia.com/ Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Michael Shavit <mshavit@google.com> Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/10-v9-5cd718286059+79186-smmuv3_newapi_p2b_jgg@nvidia.com Signed-off-by: Will Deacon <will@kernel.org> (cherry picked from commit d38c28d) Signed-off-by: Koba Ko <kobak@nvidia.com>

The HW supports this, use the S1DSS bits to configure the behavior of SSID=0 which is the RID's translation. If SSID's are currently being used in the CD table then just update the S1DSS bits in the STE, remove the master_domain and leave ATS alone. For iommufd the driver design has a small problem that all the unused CD table entries are set with V=0 which will generate an event if VFIO userspace tries to use the CD entry. This patch extends this problem to include the RID as well if PASID is being used. For BLOCKED with used PASIDs the F_STREAM_DISABLED (STRTAB_STE_1_S1DSS_TERMINATE) event is generated on untagged traffic and a substream CD table entry with V=0 (removed pasid) will generate C_BAD_CD. Arguably there is no advantage to using S1DSS over the CD entry 0 with V=0. As we don't yet support PASID in iommufd this is a problem to resolve later, possibly by using EPD0 for unused CD table entries instead of V=0, and not using S1DSS for BLOCKED. Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/11-v9-5cd718286059+79186-smmuv3_newapi_p2b_jgg@nvidia.com Signed-off-by: Will Deacon <will@kernel.org> (cherry picked from commit ce26ea9) Signed-off-by: Koba Ko <kobak@nvidia.com>

S1DSS brings in quite a few new transition pairs that are interesting. Test to/from S1DSS_BYPASS <-> S1DSS_SSID0, and BYPASS <-> S1DSS_SSID0. Test a contrived non-hitless flow to make sure that the logic works. Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Michael Shavit <mshavit@google.com> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/12-v9-5cd718286059+79186-smmuv3_newapi_p2b_jgg@nvidia.com Signed-off-by: Will Deacon <will@kernel.org> (cherry picked from commit 3b5302c) Signed-off-by: Koba Ko <kobak@nvidia.com>

If the STE doesn't point to the CD table we can upgrade it by reprogramming the STE with the appropriate S1DSS. We may also need to turn on ATS at the same time. Keep track if the installed STE is pointing at the cd_table and the ATS state to trigger this path. Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/13-v9-5cd718286059+79186-smmuv3_newapi_p2b_jgg@nvidia.com Signed-off-by: Will Deacon <will@kernel.org> (cherry picked from commit 8ee9175) Signed-off-by: Koba Ko <kobak@nvidia.com>

The SVA cleanup made the SSID logic entirely general so all we need to do is call it with the correct cd table entry for a S1 domain. This is slightly tricky because of the ASID and how the locking works, the simple fix is to just update the ASID once we get the right locks. Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/14-v9-5cd718286059+79186-smmuv3_newapi_p2b_jgg@nvidia.com Signed-off-by: Will Deacon <will@kernel.org> (cherry picked from commit f3b273b) Signed-off-by: Koba Ko <kobak@nvidia.com>

Add a comprehensive kernel configuration file for Linux 6.6.63 targeting ARM64 architecture with 64k page size. This configuration enables essential features for NVIDIA ARM64 platforms including: - 64k page size configuration (CONFIG_ARM64_64K_PAGES) - Full ARM64 architecture support with NUMA balancing - BPF subsystem with JIT compilation - Control groups (cgroups) support for resource management - Memory controller with swap support - CPU isolation and RCU subsystem configuration - Process accounting and task statistics - Standard kernel debugging and security features The configuration is built with GCC 12.2.0 and includes the build salt "6.6.y.bsk.z-64k-arm64" to identify this specific kernel build variant. This configuration serves as the baseline for NVIDIA ARM64 kernel builds on the linux-nvidia-6.6 branch. Signed-off-by: Koba Ko <kobak@nvidia.com>

This function returns NULL on errors, not ERR_PTR. Fixes: 1c68cbc ("iommu: Add IOMMU_DOMAIN_PLATFORM") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/r/8fb75157-6c81-4a9c-9992-d73d49902fa8@moroto.mountain Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/0-v2-ee2bae9af0f2+96-iommu_ga_err_ptr_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de> (cherry picked from commit b85b4f3) Signed-off-by: Koba Ko <kobak@nvidia.com>

In the review for iommu_copy_struct_to_user() helper, Matt pointed out that a NULL pointer should be rejected prior to dereferencing it: https://lore.kernel.org/all/86881827-8E2D-461C-BDA3-FA8FD14C343C@nvidia.com And Alok pointed out a typo at the same time: https://lore.kernel.org/all/480536af-6830-43ce-a327-adbd13dc3f1d@oracle.com Since both issues were copied from iommu_copy_struct_from_user(), fix them first in the current header. Fixes: e9d36c0 ("iommu: Add iommu_copy_struct_from_user helper") Cc: stable@vger.kernel.org Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Alok Tiwari <alok.a.tiwari@oracle.com> Reviewed-by: Matthew R. Ochs <mochs@nvidia.com> Link: https://lore.kernel.org/r/20250414191635.450472-1-nicolinc@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de> (cherry picked from commit 30a3f2f) Signed-off-by: Koba Ko <kobak@nvidia.com>

Commit bcb81ac ("iommu: Get DT/ACPI parsing into the proper probe path") changed the sequence of probing the SYSMMU controller devices and calls to arm_iommu_attach_device(), what results in resuming SYSMMU controller earlier, when it is still set to IDENTITY mapping. Such change revealed the bug in IDENTITY handling in the exynos-iommu driver. When SYSMMU controller is set to IDENTITY mapping, data->domain is NULL, so adjust checks in suspend & resume callbacks to handle this case correctly. Fixes: b3d1496 ("iommu/exynos: Implement an IDENTITY domain") Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20250401202731.2810474-1-m.szyprowski@samsung.com Signed-off-by: Joerg Roedel <jroedel@suse.de> (cherry picked from commit 99deffc) Signed-off-by: Koba Ko <kobak@nvidia.com>

It turns out kconfig has problems ensuring the SMMU module and the KUNIT module are consistently y/m to allow linking. It will permit KUNIT to be a module while SMMU is built in. Also, Fedora apparently enables kunit on production kernels. So, put the entire kunit in its own module using the VISIBLE_IF_KUNIT/EXPORT_SYMBOL_IF_KUNIT machinery. This keeps it out of vmlinus on Fedora and makes the kconfig work in the normal way. There is no cost if kunit is disabled. Fixes: 56e1a4c ("iommu/arm-smmu-v3: Add unit tests for arm_smmu_write_entry") Reported-by: Thorsten Leemhuis <linux@leemhuis.info> Link: https://lore.kernel.org/all/aeea8546-5bce-4c51-b506-5d2008e52fef@leemhuis.info Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Thorsten Leemhuis <linux@leemhuis.info> Acked-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/0-v1-24cba6c0f404+2ae-smmu_kunit_module_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de> (cherry picked from commit da55da5) Signed-off-by: Koba Ko <kobak@nvidia.com>

Previously with tegra-smmu, even with CONFIG_IOMMU_DMA, the default domain could have been left as NULL. The NULL domain is specially recognized by host1x_iommu_attach() as meaning it is not the DMA domain and should be replaced with the special shared domain. This happened prior to the below commit because tegra-smmu was using the NULL domain to mean IDENTITY. Now that the domain is properly labled the test in DRM doesn't see NULL. Check for IDENTITY as well to enable the special domains. This is the same issue and basic fix as seen in commit fae6e66 ("drm/tegra: Do not assume that a NULL domain means no DMA IOMMU"). Fixes: c8cc265 ("iommu/tegra-smmu: Implement an IDENTITY domain") Reported-by: Diogo Ivo <diogo.ivo@tecnico.ulisboa.pt> Closes: https://lore.kernel.org/all/c6a6f114-3acd-4d56-a13b-b88978e927dc@tecnico.ulisboa.pt/ Tested-by: Diogo Ivo <diogo.ivo@tecnico.ulisboa.pt> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Thierry Reding <treding@nvidia.com> Link: https://patchwork.freedesktop.org/patch/msgid/0-v1-10dcc8ce3869+3a7-host1x_identity_jgg@nvidia.com (cherry picked from commit cb83f4b) Signed-off-by: Koba Ko <kobak@nvidia.com>

The commit 2ad56ef ("powerpc/iommu: Setup a default domain and remove set_platform_dma_ops") refactored the code removing the set_platform_dma_ops(). It missed out the table group release_ownership() call which would have got called otherwise during the guest shutdown via vfio_group_detach_container(). On PPC64, this particular call actually sets up the 32-bit TCE table, and enables the 64-bit DMA bypass etc. Now after guest shutdown, the subsequent host driver (e.g megaraid-sas) probe post unbind from vfio-pci fails like, megaraid_sas 0031:01:00.0: Warning: IOMMU dma not supported: mask 0x7fffffffffffffff, table unavailable megaraid_sas 0031:01:00.0: Warning: IOMMU dma not supported: mask 0xffffffff, table unavailable megaraid_sas 0031:01:00.0: Failed to set DMA mask megaraid_sas 0031:01:00.0: Failed from megasas_init_fw 6539 The patch brings back the call to table_group release_ownership() call when switching back to PLATFORM domain from BLOCKED, while also separates the domain_ops for both. Fixes: 2ad56ef ("powerpc/iommu: Setup a default domain and remove set_platform_dma_ops") Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/170628173462.3742.18330000394415935845.stgit@ltcd48-lp2.aus.stglab.ibm.com Signed-off-by: Joerg Roedel <jroedel@suse.de> (cherry picked from commit d2d00e1) Signed-off-by: Koba Ko <kobak@nvidia.com>

The rewind routine should remove the reserved iovas added to the new hwpt. Fixes: 89db316 ("iommufd: Derive iommufd_hwpt_paging from iommufd_hw_pagetable") Cc: stable@vger.kernel.org Link: https://patch.msgid.link/r/20240718050130.1956804-1-nicolinc@nvidia.com Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> (cherry picked from commit 950aeef) Signed-off-by: Koba Ko <kobak@nvidia.com>

Tao Zhang and others added 30 commits August 20, 2025 20:39

jgunthorpe and others added 28 commits December 3, 2025 16:46

KobaKoNvidia force-pushed the nv#5459649 branch from c08b4c7 to 5c5c4ac Compare December 3, 2025 17:35

ianm-nv force-pushed the linux-nvidia-6.6 branch from 50610a5 to 900281b Compare January 30, 2026 15:40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: [linux-nvidia-6.6] Resolve IOMMU passthrough conflicts with NVIDIA UVM on ARM64 causing NEMO/PyTorch failures #202

fix: [linux-nvidia-6.6] Resolve IOMMU passthrough conflicts with NVIDIA UVM on ARM64 causing NEMO/PyTorch failures #202

Uh oh!

KobaKoNvidia commented Aug 29, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants

fix: [linux-nvidia-6.6] Resolve IOMMU passthrough conflicts with NVIDIA UVM on ARM64 causing NEMO/PyTorch failures #202

Are you sure you want to change the base?

fix: [linux-nvidia-6.6] Resolve IOMMU passthrough conflicts with NVIDIA UVM on ARM64 causing NEMO/PyTorch failures #202

Uh oh!

Conversation

KobaKoNvidia commented Aug 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Root Cause

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants

KobaKoNvidia commented Aug 29, 2025 •

edited

Loading