Add GPU kernel for `calc_sources!` by MarcoArtiano · Pull Request #3012 · trixi-framework/Trixi.jl

MarcoArtiano · 2026-05-17T20:38:10Z

I first naively implemented the approach following the other existing kernel. However it was as slow as computing the surface fluxes. Here each thread of the GPU is launched for each quadrature node (i, j, element), which makes the GPU kernel roughly 6 times faster.

Per element kernel
source terms 19.2k 4.89s 12.0% 254μs 143MiB 10.5% 7.62KiB

Per quadrature node kernel
source terms 19.2k 792ms 2.0% 41.2μs 104MiB 8.1% 5.55KiB

ps: it may also be worth trying (nnodes(dg)*nnodes(dg), nelements(dg, cache)) and reconstructing the indices i and j in the kernel.

github-actions · 2026-05-17T20:38:20Z

codecov · 2026-05-17T21:56:14Z

Codecov Report

❌ Patch coverage is 81.69014% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 97.09%. Comparing base (bc1048b) to head (e19fe5f).

Files with missing lines	Patch %	Lines
src/solvers/dgsem_p4est/dg_3d_gpu.jl	75.00%	11 Missing ⚠️
src/solvers/dgsem_p4est/dg_2d_gpu.jl	86.67%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3012      +/-   ##
==========================================
- Coverage   97.09%   97.09%   -0.01%     
==========================================
  Files         630      631       +1     
  Lines       48855    48885      +30     
==========================================
+ Hits        47435    47461      +26     
- Misses       1420     1424       +4

Flag	Coverage Δ
unittests	`97.09% <81.69%> (-0.01%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

ranocha · 2026-05-18T05:09:45Z

ps: it may also be worth trying (nnodes(dg)*nnodes(dg), nelements(dg, cache)) and to reconstruct the indices i and j in the kernel.

Would you expect performance improvements coming from this, @vchuravy?

MarcoArtiano · 2026-05-18T12:04:37Z

ps: it may also be worth trying (nnodes(dg)*nnodes(dg), nelements(dg, cache)) and to reconstruct the indices i and j in the kernel.

Would you expect performance improvements coming from this, @vchuravy?

Comment Flux Diff GPU here I've tested these two cases and I didn't notice any major differences. I'm not sure if one option has a better scalability over the other.

ranocha · 2026-05-18T15:13:18Z

The GPU CI job on buildkite fails. Could you please check what is going on there? Please ping me when your PR improving the performance of applying the Jacobian has been merged and this PR is updated accordingly to the new structure.

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

MarcoArtiano · 2026-05-18T17:30:11Z

This PR will be ready to be reviewed after #3013 will be merged and the conflicts will be resolved.

MarcoArtiano · 2026-05-18T21:42:56Z

@ranocha this is again ready for another round of review.

benegee

Just hijacking the PR for two more general questions, which we can of course address later.

benegee · 2026-05-19T06:06:53Z

+# ODE solvers, callbacks etc.
+
+tspan = (0.0, 2.0)
+ode = semidiscretize(semi, tspan; real_type = real_type, storage_type = storage_type)


This is unrelated to this specific PR, but as more examples are added, we could think about unifying the _gpu and non _gpu elixirs. I think the only difference so far are the additional keyword arguments, which we could set to local nothing variables, to be overwritten by trixi_include.

(I know that our template example elixir_advection_basic_gpu.jl suggests to have different files, and I think there was a reason, I'm just not so certain about it anymore)

I totally agree. Indeed here there was also a mistake, where real_type and storage_type should be nothing by default. I fixed that.

Regarding unifying the elixirs that would be great. Although I found a problem: for example for the Euler equations when trying to run with Float32, there's nothing that adapts the Float64 equation structure to Float32 and the code crashes. That's the reason I had to add gamma = Float32(1.4) in the tests. I'm not sure if this is expected or not.

To avoid changing the elixirs we could also redefine the ode by redefining the semi object locally in the tests, and then
ode = semidiscretize(ode, tspan; real_type = ....etc), otherwise, as I said, we should make the kwargs explicit in all P4estMesh elixirs, which might be not really what we want.
Another alternative would be to create: dgsem_p4est_gpu or dgsem_gpu for the examples on GPU, as at the moment not all the features are GPU ready.

This is unfortunate. I thought the equations would be adapted as part of the semidiscretization as well. We should look into this, in a further issue.

This is the error that I get if I don't set gamma to be Float32 and I want to run Float32 with the GPU.

ERROR: LoadError: BoundsError: attempt to access NTuple{4, String} at index [5] Stacktrace: [1] getindex(t::Tuple, i::Int64) @ Base ./tuple.jl:31 [2] (::Trixi.var"#1796#1803"{…})(file::HDF5.File) @ Trixi ~/workspace/dev/Trixi.jl/src/callbacks_step/save_solution_dg.jl:67 [3] (::HDF5.var"#17#18"{HDF5.HDF5Context, @Kwargs{}, Trixi.var"#1796#1803"{…}, HDF5.File})() @ HDF5 ~/.julia/packages/HDF5/8g5ny/src/file.jl:101 [4] task_local_storage(body::HDF5.var"#17#18"{…}, key::Symbol, val::HDF5.HDF5Context) @ Base ./task.jl:304 [5] #h5open#16 @ ~/.julia/packages/HDF5/8g5ny/src/file.jl:96 [inlined] [6] h5open @ ~/.julia/packages/HDF5/8g5ny/src/file.jl:94 [inlined] [7] save_solution_file(u::Array{…}, time::Float64, dt::Float64, timestep::Int64, mesh::P4estMesh{…}, equations::CompressibleEulerEquations2D{…}, dg::DGSEM{…}, cache::@NamedTuple{…}, solution_callback::SaveSolutionCallback{…}, element_variables::Dict{…}, node_variables::Dict{…}; system::String) @ Trixi ~/workspace/dev/Trixi.jl/src/callbacks_step/save_solution_dg.jl:47 [8] save_solution_file @ ~/workspace/dev/Trixi.jl/src/callbacks_step/save_solution_dg.jl:8 [inlined] [9] #save_solution_file#1791 @ ~/workspace/dev/Trixi.jl/src/callbacks_step/save_solution.jl:306 [inlined] [10] save_solution_file @ ~/workspace/dev/Trixi.jl/src/callbacks_step/save_solution.jl:294 [inlined] [11] macro expansion @ ~/.julia/packages/TrixiBase/MGeKl/src/trixi_timeit.jl:67 [inlined] [12] #save_solution_file#1788 @ ~/workspace/dev/Trixi.jl/src/callbacks_step/save_solution.jl:285 [inlined] [13] save_solution_file @ ~/workspace/dev/Trixi.jl/src/callbacks_step/save_solution.jl:260 [inlined] [14] macro expansion @ ~/workspace/dev/Trixi.jl/src/callbacks_step/save_solution.jl:252 [inlined] [15] macro expansion @ ~/.julia/packages/TrixiBase/MGeKl/src/trixi_timeit.jl:67 [inlined] [16] (::SaveSolutionCallback{…})(integrator::OrdinaryDiffEqCore.ODEIntegrator{…}) @ Trixi ~/workspace/dev/Trixi.jl/src/callbacks_step/save_solution.jl:247 [17] initialize_save_cb!(solution_callback::SaveSolutionCallback{…}, u::ROCArray{…}, t::Float64, integrator::OrdinaryDiffEqCore.ODEIntegrator{…}) @ Trixi ~/workspace/dev/Trixi.jl/src/callbacks_step/save_solution.jl:181 [18] initialize_save_cb! @ ~/workspace/dev/Trixi.jl/src/callbacks_step/save_solution.jl:171 [inlined] [19] initialize!(u::ROCArray{…}, t::Float64, integrator::OrdinaryDiffEqCore.ODEIntegrator{…}, any_modified::Bool, c::DiscreteCallback{…}, cs::DiscreteCallback{…}) @ DiffEqBase ~/.julia/packages/DiffEqBase/CJTzr/src/callbacks.jl:17 [20] initialize!(::ROCArray{…}, ::Float64, ::OrdinaryDiffEqCore.ODEIntegrator{…}, ::Bool, ::DiscreteCallback{…}, ::DiscreteCallback{…}, ::Vararg{…}) @ DiffEqBase ~/.julia/packages/DiffEqBase/CJTzr/src/callbacks.jl:18 [21] initialize!(::ROCArray{…}, ::Float64, ::OrdinaryDiffEqCore.ODEIntegrator{…}, ::Bool, ::DiscreteCallback{…}, ::DiscreteCallback{…}, ::Vararg{…}) @ DiffEqBase ~/.julia/packages/DiffEqBase/CJTzr/src/callbacks.jl:18 [22] initialize!(::ROCArray{…}, ::Float64, ::OrdinaryDiffEqCore.ODEIntegrator{…}, ::Bool, ::DiscreteCallback{…}, ::DiscreteCallback{…}, ::Vararg{…}) @ DiffEqBase ~/.julia/packages/DiffEqBase/CJTzr/src/callbacks.jl:18 [23] initialize! @ ~/.julia/packages/DiffEqBase/CJTzr/src/callbacks.jl:7 [inlined] [24] initialize_callbacks!(integrator::OrdinaryDiffEqCore.ODEIntegrator{…}, initialize_save::Bool) @ OrdinaryDiffEqCore ~/.julia/packages/OrdinaryDiffEqCore/rnOL4/src/solve.jl:1089 [25] _ode_init(prob::ODEProblem{…}, alg::CarpenterKennedy2N54{…}, timeseries_init::Tuple{}, ts_init::Tuple{}, ks_init::Tuple{}; saveat::Tuple{}, tstops::Tuple{}, d_discontinuities::Tuple{}, save_idxs::Nothing, save_everystep::Bool, save_on::Bool, save_discretes::Bool, save_start::Bool, save_end::Nothing, callback::CallbackSet{…}, dense::Bool, calck::Bool, dt::Float64, dtmin::Float64, dtmax::Float64, force_dtmin::Bool, adaptive::Bool, abstol::Nothing, reltol::Nothing, gamma::Nothing, qmin::Nothing, qmax::Nothing, qsteady_min::Nothing, qsteady_max::Nothing, beta1::Nothing, beta2::Nothing, qoldinit::Nothing, fullnormalize::Bool, failfactor::Int64, maxiters::Int64, internalnorm::typeof(DiffEqBase.ODE_DEFAULT_NORM), internalopnorm::typeof(LinearAlgebra.opnorm), isoutofdomain::typeof(DiffEqBase.ODE_DEFAULT_ISOUTOFDOMAIN), unstable_check::typeof(DiffEqBase.ODE_DEFAULT_UNSTABLE_CHECK), verbose::Bool, controller::Nothing, timeseries_errors::Bool, dense_errors::Bool, advance_to_tstop::Bool, stop_at_next_tstop::Bool, initialize_save::Bool, progress::Bool, progress_steps::Int64, progress_name::String, progress_message::typeof(DiffEqBase.ODE_DEFAULT_PROG_MESSAGE), progress_id::Symbol, userdata::Nothing, allow_extrapolation::Bool, initialize_integrator::Bool, alias::ODEAliasSpecifier, initializealg::DiffEqBase.DefaultInit, rng::Nothing, save_noise::Bool, delta::Nothing, W::Nothing, P::Nothing, sqdt::Nothing, noise::Nothing, c::Nothing, rate_constants::Nothing, _cache::Nothing, _u::Nothing, _uprev::Nothing, seed::UInt64, kwargs::@Kwargs{}) @ OrdinaryDiffEqCore ~/.julia/packages/OrdinaryDiffEqCore/rnOL4/src/solve.jl:812 [26] _ode_init @ ~/.julia/packages/OrdinaryDiffEqCore/rnOL4/src/solve.jl:47 [inlined] [27] #__init#71 @ ~/.julia/packages/OrdinaryDiffEqCore/rnOL4/src/solve.jl:37 [inlined] [28] __init (repeats 2 times) @ ~/.julia/packages/OrdinaryDiffEqCore/rnOL4/src/solve.jl:19 [inlined] [29] __solve(::ODEProblem{…}, ::CarpenterKennedy2N54{…}; kwargs::@Kwargs{…}) @ OrdinaryDiffEqCore ~/.julia/packages/OrdinaryDiffEqCore/rnOL4/src/solve.jl:9 [30] __solve @ ~/.julia/packages/OrdinaryDiffEqCore/rnOL4/src/solve.jl:1 [inlined] [31] solve_call(_prob::ODEProblem{…}, args::CarpenterKennedy2N54{…}; merge_callbacks::Bool, kwargshandle::Nothing, kwargs::@Kwargs{…}) @ DiffEqBase ~/.julia/packages/DiffEqBase/CJTzr/src/solve.jl:172 [32] solve_call @ ~/.julia/packages/DiffEqBase/CJTzr/src/solve.jl:137 [inlined] [33] #solve_up#38 @ ~/.julia/packages/DiffEqBase/CJTzr/src/solve.jl:646 [inlined] [34] solve_up @ ~/.julia/packages/DiffEqBase/CJTzr/src/solve.jl:619 [inlined] [35] #solve#37 @ ~/.julia/packages/DiffEqBase/CJTzr/src/solve.jl:603 [inlined] [36] top-level scope @ ~/workspace/dev/Trixi.jl/examples/p4est_2d_dgsem/elixir_euler_source_terms_gpu.jl:60 [37] include(fname::String) @ Base.MainInclude ./client.jl:494 [38] top-level scope @ REPL[20]:1 in expression starting at /home/marco/workspace/dev/Trixi.jl/examples/p4est_2d_dgsem/elixir_euler_source_terms_gpu.jl:60 Some type information was truncated. Use `show(err)` to see complete types.

Apparently we do not retrieve the correct number of variables from:

# Reinterpret the solution array as an array of conservative variables, # compute the solution variables via broadcasting, and reinterpret the # result as a plain array of floating point numbers data = Array(reinterpret(eltype(u), solution_variables.(reinterpret(SVector{nvariables(equations), eltype(u)}, u), Ref(equations)))) # Find out variable count by looking at output from `solution_variables` function n_vars = size(data, 1)

I'll open an issue. If I overwrite n_vars with the correct number of variables it works.

MarcoArtiano · 2026-05-19T08:20:39Z

This is now ready. I've also added the calculation for the 1D source terms on GPU, although we do not have any other kernel for the 1D. Should I delete that file? (it is not tested)

MarcoArtiano · 2026-05-19T09:55:29Z

CUDA tests and KernelAbstraction with CPU backend tests are also missing and need to be added. I will add them as soon as possible.

ranocha · 2026-05-19T10:23:37Z

This is now ready. I've also added the calculation for the 1D source terms on GPU, although we do not have any other kernel for the 1D. Should I delete that file? (it is not tested)

Yes, please. The P4estMesh does not support 1D (and GPU implementations are only available for the P4estMesh).

Co-authored-by: Hendrik Ranocha <ranocha@users.noreply.github.com>

ranocha · 2026-05-19T18:49:09Z

Can you please adapt the allocation check? https://github.com/trixi-framework/Trixi.jl/actions/runs/26111095627/job/76788248592?pr=3012#step:7:1211

MarcoArtiano · 2026-05-20T10:32:15Z

Is the code coverage expected to fail, even if we are running KA with CPU backend?

benegee · 2026-05-20T12:13:30Z

Is the code coverage expected to fail, even if we are running KA with CPU backend?

I think you are good!
The KA tests were explicitly added to get coverage reports for the backend::Backend specializations. However, some parts like @index will still not be detected.

add source term gpu kernel

d50532f

add dispatch for no source terms

453f04d

MarcoArtiano added the gpu label May 17, 2026

add 1d and 3d source gpu kernels

5b7e0e8

MarcoArtiano changed the title ~~WIP: Add 2D GPU kernel for calc_sources!~~ WIP: Add GPU kernel for calc_sources! May 17, 2026

specify signature

158e481

add tests

8f8b0ef

MarcoArtiano marked this pull request as ready for review May 18, 2026 12:11

MarcoArtiano changed the title ~~WIP: Add GPU kernel for calc_sources!~~ Add GPU kernel for calc_sources! May 18, 2026

MarcoArtiano mentioned this pull request May 18, 2026

GPU porting #2822

Open

28 tasks

JoshuaLampert requested changes May 18, 2026

View reviewed changes

Comment thread test/test_amdgpu_2d.jl Outdated

JoshuaLampert and others added 2 commits May 18, 2026 17:17

Update test/test_amdgpu_2d.jl

5add261

Update test/test_amdgpu_2d.jl

0a55992

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

MarcoArtiano changed the base branch from main to ma/jacobian_gpu May 18, 2026 16:43

MarcoArtiano changed the base branch from ma/jacobian_gpu to main May 18, 2026 16:43

MarcoArtiano added 3 commits May 18, 2026 18:54

refactoring

244504d

format

d897a73

refactor also 3D kernels

74e140d

Merge branch 'main' into ma/source_gpu

7ecf686

github-actions Bot reviewed May 18, 2026

View reviewed changes

Comment thread src/solvers/dgsem_p4est/dg_2d_gpu.jl Outdated

Comment thread src/solvers/dgsem_p4est/dg_2d_gpu.jl Outdated

MarcoArtiano added 2 commits May 18, 2026 22:18

format

f1209fe

fix tests

c978036

fix tests

5a5948f

MarcoArtiano mentioned this pull request May 19, 2026

WIP: Optimized apply_jacobian! 3D GPU kernel #3017

Draft

benegee previously approved these changes May 19, 2026

View reviewed changes

fix elixirs

d66c664

MarcoArtiano dismissed benegee’s stale review via d66c664 May 19, 2026 06:16

This was referenced May 19, 2026

Unify GPU elixirs #3018

Open

CUDA tests are missing #3019

Closed

MarcoArtiano and others added 3 commits May 19, 2026 08:54

fix tests

b9bbea5

Merge branch 'main' into ma/source_gpu

69f13e4

fix tests

55ded1f

ranocha reviewed May 19, 2026

View reviewed changes

Comment thread examples/p4est_2d_dgsem/elixir_euler_source_terms.jl

Comment thread examples/p4est_3d_dgsem/elixir_euler_source_terms.jl

ranocha reviewed May 19, 2026

View reviewed changes

Comment thread examples/p4est_2d_dgsem/elixir_euler_source_terms_gpu.jl Outdated

MarcoArtiano and others added 6 commits May 19, 2026 15:38

Apply suggestions from code review

303cd35

Co-authored-by: Hendrik Ranocha <ranocha@users.noreply.github.com>

rename tests, delete 1D P4est

8733fdd

add kernel abstraction tests

01c2764

add cuda tests

cf75145

fix tests

20e890d

Merge branch 'main' into ma/source_gpu

6d8c9d0

MarcoArtiano added 3 commits May 19, 2026 21:21

fix allocations

031528d

fix allocations 3D

6044e49

fix allocations

e19fe5f

Conversation

MarcoArtiano commented May 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented May 17, 2026

Review checklist

Purpose and scope

Code quality

Documentation

Testing

Performance

Verification

Uh oh!

codecov Bot commented May 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

ranocha commented May 18, 2026

Uh oh!

MarcoArtiano commented May 18, 2026

Uh oh!

ranocha commented May 18, 2026

Uh oh!

Uh oh!

MarcoArtiano commented May 18, 2026

Uh oh!

Uh oh!

Uh oh!

MarcoArtiano commented May 18, 2026

Uh oh!

benegee left a comment

Choose a reason for hiding this comment

Uh oh!

benegee May 19, 2026

Choose a reason for hiding this comment

Uh oh!

MarcoArtiano May 19, 2026

Choose a reason for hiding this comment

Uh oh!

benegee May 19, 2026

Choose a reason for hiding this comment

Uh oh!

MarcoArtiano May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

MarcoArtiano commented May 19, 2026

Uh oh!

Uh oh!

Uh oh!

MarcoArtiano commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ranocha commented May 19, 2026

Uh oh!

Uh oh!

ranocha commented May 19, 2026

Uh oh!

MarcoArtiano commented May 20, 2026

Uh oh!

benegee commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

MarcoArtiano commented May 17, 2026 •

edited

Loading

codecov Bot commented May 17, 2026 •

edited

Loading

MarcoArtiano May 19, 2026 •

edited

Loading

MarcoArtiano commented May 19, 2026 •

edited

Loading