Add GPU kernel for calc_sources! #3012

Open
MarcoArtiano wants to merge 27 commits into main from ma/source_gpu

@MarcoArtiano
Contributor

@MarcoArtiano MarcoArtiano commented May 17, 2026

I first naively followed the approach of the other existing kernels, launching one thread per element. However, that was as slow as computing the surface fluxes. Here, one GPU thread is launched per quadrature node (i, j, element), which makes the GPU kernel roughly 6 times faster (4.89s vs. 792ms in the timings below).

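| Kernel | Section | ncalls | time | %tot | avg | alloc | %tot | avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Per element | source terms | 19.2k | 4.89s | 12.0% | 254μs | 143MiB | 10.5% | 7.62KiB |
| Per quadrature node | source terms | 19.2k | 792ms | 2.0% | 41.2μs | 104MiB | 8.1% | 5.55KiB |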
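
To make the two launch strategies concrete, here is a minimal sketch using KernelAbstractions.jl on its CPU backend. The kernel names, the `(i, j, element)` array layout without a variable dimension, and the toy source term `2 * u` are assumptions for illustration only; this is not the code added in this PR.

```julia
using KernelAbstractions

# Hypothetical toy source term; stands in for the equation-specific source.
@inline toy_source(u) = 2 * u

# Strategy 1: one work-item per element, looping over all nodes of that element.
@kernel function sources_per_element!(du, @Const(u))
    element = @index(Global, Linear)
    for j in axes(du, 2), i in axes(du, 1)
        du[i, j, element] += toy_source(u[i, j, element])
    end
end

# Strategy 2: one work-item per quadrature node (i, j, element).
@kernel function sources_per_node!(du, @Const(u))
    i, j, element = @index(Global, NTuple)
    du[i, j, element] += toy_source(u[i, j, element])
end

backend = CPU()
n_nodes, n_elements = 4, 8
u = rand(n_nodes, n_nodes, n_elements)
du_element = zeros(n_nodes, n_nodes, n_elements)
du_node = zeros(n_nodes, n_nodes, n_elements)

# Only the launch range differs between the two strategies.
sources_per_element!(backend)(du_element, u; ndrange = n_elements)
sources_per_node!(backend)(du_node, u; ndrange = (n_nodes, n_nodes, n_elements))
KernelAbstractions.synchronize(backend)
```

Both variants compute the same result. The per-node launch exposes `n_nodes^2` times more work-items, which fits the speedup reported above, although this CPU sketch says nothing about actual GPU timings.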

ps: it may also be worth trying (nnodes(dg)*nnodes(dg), nelements(dg, cache)) and reconstructing the indices i and j in the kernel.
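
As a rough illustration of the index reconstruction mentioned in the PS, the following sketch recovers `(i, j)` from a flattened node index. The helper name `node_indices` and the column-major ordering (with `i` running fastest) are assumptions, not something taken from the PR.

```julia
# Recover the node indices (i, j) from a flattened, one-based node index,
# assuming column-major ordering (i runs fastest).
function node_indices(node, n_nodes)
    j_zero_based, i_zero_based = divrem(node - 1, n_nodes)
    return i_zero_based + 1, j_zero_based + 1
end

n_nodes = 4
reconstructed = [node_indices(node, n_nodes) for node in 1:(n_nodes^2)]

# Cross-check against Julia's built-in column-major index conversion.
reference = vec([Tuple(I) for I in CartesianIndices((n_nodes, n_nodes))])
```

Inside a kernel, the same arithmetic would be applied to the first component of a two-dimensional global index, with the second component being the element.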

@github-actions
Contributor

Review checklist

This checklist is meant to assist creators of PRs (to let them know what reviewers will typically look for) and reviewers (to guide them in a structured review process). Items do not need to be checked explicitly for a PR to be eligible for merging.

Purpose and scope

  • The PR has a single goal that is clear from the PR title and/or description.
  • All code changes represent a single set of modifications that logically belong together.
  • No more than 500 lines of code are changed or there is no obvious way to split the PR into multiple PRs.

Code quality

  • The code can be understood easily.
  • Newly introduced names for variables etc. are self-descriptive and consistent with existing naming conventions.
  • There are no redundancies that can be removed by simple modularization/refactoring.
  • There are no leftover debug statements or commented code sections.
  • The code adheres to our conventions and style guide, and to the Julia guidelines.

Documentation

  • New functions and types are documented with a docstring or top-level comment.
  • Relevant publications are referenced in docstrings (see example for formatting).
  • Inline comments are used to document longer or unusual code sections.
  • Comments describe intent ("why?") and not just functionality ("what?").
  • If the PR introduces a significant change or new feature, it is documented in NEWS.md with its PR number.

Testing

  • The PR passes all tests.
  • New or modified lines of code are covered by tests.
  • New or modified tests run in less than 10 seconds.

Performance

  • There are no type instabilities or memory allocations in performance-critical parts.
  • If the PR intent is to improve performance, before/after time measurements are posted in the PR.

Verification

  • The correctness of the code was verified using appropriate tests.
  • If new equations/methods are added, a convergence test has been run and the results are posted in the PR.

Created with ❤️ by the Trixi.jl community.

@MarcoArtiano MarcoArtiano changed the title from "WIP: Add 2D GPU kernel for calc_sources!" to "WIP: Add GPU kernel for calc_sources!" May 17, 2026
@codecov

codecov Bot commented May 17, 2026

Codecov Report

❌ Patch coverage is 81.69014% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 97.09%. Comparing base (bc1048b) to head (e19fe5f).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/solvers/dgsem_p4est/dg_3d_gpu.jl | 75.00% | 11 Missing ⚠️ |
| src/solvers/dgsem_p4est/dg_2d_gpu.jl | 86.67% | 2 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3012      +/-   ##
==========================================
- Coverage   97.09%   97.09%   -0.01%     
==========================================
  Files         630      631       +1     
  Lines       48855    48885      +30     
==========================================
+ Hits        47435    47461      +26     
- Misses       1420     1424       +4     

| Flag | Coverage Δ |
| --- | --- |
| unittests | 97.09% <81.69%> (-0.01%) ⬇️ |

@ranocha
Member

ranocha commented May 18, 2026

> ps: it may also be worth trying (nnodes(dg)*nnodes(dg), nelements(dg, cache)) and to reconstruct the indices i and j in the kernel.

Would you expect performance improvements coming from this, @vchuravy?

@MarcoArtiano
Contributor Author

> > ps: it may also be worth trying (nnodes(dg)*nnodes(dg), nelements(dg, cache)) and to reconstruct the indices i and j in the kernel.
>
> Would you expect performance improvements coming from this, @vchuravy?

In the linked comment ("Comment Flux Diff GPU") I tested these two cases and didn't notice any major differences. I'm not sure whether one option scales better than the other.

@MarcoArtiano MarcoArtiano marked this pull request as ready for review May 18, 2026 12:11
@MarcoArtiano MarcoArtiano changed the title from "WIP: Add GPU kernel for calc_sources!" to "Add GPU kernel for calc_sources!" May 18, 2026
@MarcoArtiano MarcoArtiano mentioned this pull request May 18, 2026
28 tasks
@ranocha
Member

ranocha commented May 18, 2026

The GPU CI job on Buildkite fails. Could you please check what is going on there? Please ping me once your PR improving the performance of applying the Jacobian has been merged and this PR has been updated to match the new structure.

Comment thread test/test_amdgpu_2d.jl Outdated
JoshuaLampert and others added 2 commits May 18, 2026 17:17
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@MarcoArtiano MarcoArtiano changed the base branch from main to ma/jacobian_gpu May 18, 2026 16:43
@MarcoArtiano MarcoArtiano changed the base branch from ma/jacobian_gpu to main May 18, 2026 16:43
@MarcoArtiano
Contributor Author

This PR will be ready for review once #3013 has been merged and the conflicts have been resolved.

Comment thread src/solvers/dgsem_p4est/dg_2d_gpu.jl Outdated
Comment thread src/solvers/dgsem_p4est/dg_2d_gpu.jl Outdated
@MarcoArtiano
Contributor Author

@ranocha this is ready for another round of review.

benegee previously approved these changes May 19, 2026
Contributor

@benegee benegee left a comment

Just hijacking the PR for two more general questions, which we can of course address later.

```julia
# ODE solvers, callbacks etc.

tspan = (0.0, 2.0)
ode = semidiscretize(semi, tspan; real_type = real_type, storage_type = storage_type)
```
Contributor

This is unrelated to this specific PR, but as more examples are added, we could think about unifying the _gpu and non-_gpu elixirs. I think the only differences so far are the additional keyword arguments, which we could define as local variables set to nothing, to be overridden by trixi_include.

(I know that our template example elixir_advection_basic_gpu.jl suggests having separate files, and I think there was a reason for that; I'm just not so certain about it anymore.)
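
A minimal sketch of the `nothing`-default idea, under stated assumptions: `allocate_solution_sketch` is a hypothetical stand-in written for this illustration and is not Trixi.jl's `semidiscretize`. It only shows how keyword arguments that default to `nothing` could fall back to CPU defaults and be overridden from a test.

```julia
# Hypothetical stand-in for the keyword handling discussed above;
# this is not Trixi.jl's `semidiscretize`.
function allocate_solution_sketch(n; real_type = nothing, storage_type = nothing)
    # `nothing` means "keep the CPU default".
    RealT = real_type === nothing ? Float64 : real_type
    ArrayT = storage_type === nothing ? Array : storage_type
    return ArrayT(zeros(RealT, n))
end

# What a unified elixir could declare near the top ...
real_type = nothing
storage_type = nothing
u_cpu = allocate_solution_sketch(4; real_type = real_type, storage_type = storage_type)

# ... and what a GPU test could pass instead. A GPU array type such as
# `ROCArray` would replace `Array` here; plain `Array` keeps this runnable.
u_f32 = allocate_solution_sketch(4; real_type = Float32, storage_type = Array)
```

With this pattern, a single elixir would serve both cases, and only the test would need to know about the GPU-specific types.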

Contributor Author

I totally agree. There was also a mistake here: real_type and storage_type should be nothing by default. I fixed that.

Regarding unifying the elixirs: that would be great. However, I found a problem. For example, when running the Euler equations with Float32, nothing adapts the Float64 equations structure to Float32, and the code crashes. That's why I had to add gamma = Float32(1.4) in the tests. I'm not sure whether this is expected.
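
The precision mismatch can be sketched in plain Julia. `EulerSketch` and `pressure` are hypothetical miniatures invented for this illustration, not Trixi.jl types; they only show how a Float64 `gamma` promotes Float32 data.

```julia
# Hypothetical miniature of an equations type whose precision is fixed by `gamma`.
struct EulerSketch{RealT <: Real}
    gamma::RealT
end

# Ideal-gas pressure from internal energy density; illustrative only.
pressure(internal_energy, equations::EulerSketch) = (equations.gamma - 1) * internal_energy

equations_f64 = EulerSketch(1.4)          # gamma is Float64
equations_f32 = EulerSketch(Float32(1.4)) # gamma is Float32, as in the tests

e = 2.5f0 # Float32 data, e.g. from a Float32 solution array
p_promoted = pressure(e, equations_f64) # promoted to Float64
p_single = pressure(e, equations_f32)   # stays Float32
```

Julia's promotion rules make the mixed case return a `Float64`, so any code that assumes the result has the same element type as the solution array can break.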

To avoid changing the elixirs, we could also redefine the ode in the tests by redefining the semi object locally and then calling
ode = semidiscretize(semi, tspan; real_type = ....etc). Otherwise, as I said, we would have to make the kwargs explicit in all P4estMesh elixirs, which might not be what we want.
Another alternative would be to create something like dgsem_p4est_gpu or dgsem_gpu for the GPU examples, since not all features are GPU-ready at the moment.

Contributor

This is unfortunate. I thought the equations would be adapted as part of the semidiscretization as well. We should look into this in a separate issue.

Contributor Author

@MarcoArtiano MarcoArtiano May 19, 2026

This is the error I get when running with Float32 on the GPU without setting gamma to Float32.

ERROR: LoadError: BoundsError: attempt to access NTuple{4, String} at index [5]
Stacktrace:
  [1] getindex(t::Tuple, i::Int64)
    @ Base ./tuple.jl:31
  [2] (::Trixi.var"#1796#1803"{})(file::HDF5.File)
    @ Trixi ~/workspace/dev/Trixi.jl/src/callbacks_step/save_solution_dg.jl:67
  [3] (::HDF5.var"#17#18"{HDF5.HDF5Context, @Kwargs{}, Trixi.var"#1796#1803"{}, HDF5.File})()
    @ HDF5 ~/.julia/packages/HDF5/8g5ny/src/file.jl:101
  [4] task_local_storage(body::HDF5.var"#17#18"{}, key::Symbol, val::HDF5.HDF5Context)
    @ Base ./task.jl:304
  [5] #h5open#16
    @ ~/.julia/packages/HDF5/8g5ny/src/file.jl:96 [inlined]
  [6] h5open
    @ ~/.julia/packages/HDF5/8g5ny/src/file.jl:94 [inlined]
  [7] save_solution_file(u::Array{…}, time::Float64, dt::Float64, timestep::Int64, mesh::P4estMesh{…}, equations::CompressibleEulerEquations2D{…}, dg::DGSEM{…}, cache::@NamedTuple{}, solution_callback::SaveSolutionCallback{…}, element_variables::Dict{…}, node_variables::Dict{…}; system::String)
    @ Trixi ~/workspace/dev/Trixi.jl/src/callbacks_step/save_solution_dg.jl:47
  [8] save_solution_file
    @ ~/workspace/dev/Trixi.jl/src/callbacks_step/save_solution_dg.jl:8 [inlined]
  [9] #save_solution_file#1791
    @ ~/workspace/dev/Trixi.jl/src/callbacks_step/save_solution.jl:306 [inlined]
 [10] save_solution_file
    @ ~/workspace/dev/Trixi.jl/src/callbacks_step/save_solution.jl:294 [inlined]
 [11] macro expansion
    @ ~/.julia/packages/TrixiBase/MGeKl/src/trixi_timeit.jl:67 [inlined]
 [12] #save_solution_file#1788
    @ ~/workspace/dev/Trixi.jl/src/callbacks_step/save_solution.jl:285 [inlined]
 [13] save_solution_file
    @ ~/workspace/dev/Trixi.jl/src/callbacks_step/save_solution.jl:260 [inlined]
 [14] macro expansion
    @ ~/workspace/dev/Trixi.jl/src/callbacks_step/save_solution.jl:252 [inlined]
 [15] macro expansion
    @ ~/.julia/packages/TrixiBase/MGeKl/src/trixi_timeit.jl:67 [inlined]
 [16] (::SaveSolutionCallback{…})(integrator::OrdinaryDiffEqCore.ODEIntegrator{…})
    @ Trixi ~/workspace/dev/Trixi.jl/src/callbacks_step/save_solution.jl:247
 [17] initialize_save_cb!(solution_callback::SaveSolutionCallback{…}, u::ROCArray{…}, t::Float64, integrator::OrdinaryDiffEqCore.ODEIntegrator{…})
    @ Trixi ~/workspace/dev/Trixi.jl/src/callbacks_step/save_solution.jl:181
 [18] initialize_save_cb!
    @ ~/workspace/dev/Trixi.jl/src/callbacks_step/save_solution.jl:171 [inlined]
 [19] initialize!(u::ROCArray{…}, t::Float64, integrator::OrdinaryDiffEqCore.ODEIntegrator{…}, any_modified::Bool, c::DiscreteCallback{…}, cs::DiscreteCallback{…})
    @ DiffEqBase ~/.julia/packages/DiffEqBase/CJTzr/src/callbacks.jl:17
 [20] initialize!(::ROCArray{…}, ::Float64, ::OrdinaryDiffEqCore.ODEIntegrator{…}, ::Bool, ::DiscreteCallback{…}, ::DiscreteCallback{…}, ::Vararg{…})
    @ DiffEqBase ~/.julia/packages/DiffEqBase/CJTzr/src/callbacks.jl:18
 [21] initialize!(::ROCArray{…}, ::Float64, ::OrdinaryDiffEqCore.ODEIntegrator{…}, ::Bool, ::DiscreteCallback{…}, ::DiscreteCallback{…}, ::Vararg{…})
    @ DiffEqBase ~/.julia/packages/DiffEqBase/CJTzr/src/callbacks.jl:18
 [22] initialize!(::ROCArray{…}, ::Float64, ::OrdinaryDiffEqCore.ODEIntegrator{…}, ::Bool, ::DiscreteCallback{…}, ::DiscreteCallback{…}, ::Vararg{…})
    @ DiffEqBase ~/.julia/packages/DiffEqBase/CJTzr/src/callbacks.jl:18
 [23] initialize!
    @ ~/.julia/packages/DiffEqBase/CJTzr/src/callbacks.jl:7 [inlined]
 [24] initialize_callbacks!(integrator::OrdinaryDiffEqCore.ODEIntegrator{…}, initialize_save::Bool)
    @ OrdinaryDiffEqCore ~/.julia/packages/OrdinaryDiffEqCore/rnOL4/src/solve.jl:1089
 [25] _ode_init(prob::ODEProblem{…}, alg::CarpenterKennedy2N54{…}, timeseries_init::Tuple{}, ts_init::Tuple{}, ks_init::Tuple{}; saveat::Tuple{}, tstops::Tuple{}, d_discontinuities::Tuple{}, save_idxs::Nothing, save_everystep::Bool, save_on::Bool, save_discretes::Bool, save_start::Bool, save_end::Nothing, callback::CallbackSet{…}, dense::Bool, calck::Bool, dt::Float64, dtmin::Float64, dtmax::Float64, force_dtmin::Bool, adaptive::Bool, abstol::Nothing, reltol::Nothing, gamma::Nothing, qmin::Nothing, qmax::Nothing, qsteady_min::Nothing, qsteady_max::Nothing, beta1::Nothing, beta2::Nothing, qoldinit::Nothing, fullnormalize::Bool, failfactor::Int64, maxiters::Int64, internalnorm::typeof(DiffEqBase.ODE_DEFAULT_NORM), internalopnorm::typeof(LinearAlgebra.opnorm), isoutofdomain::typeof(DiffEqBase.ODE_DEFAULT_ISOUTOFDOMAIN), unstable_check::typeof(DiffEqBase.ODE_DEFAULT_UNSTABLE_CHECK), verbose::Bool, controller::Nothing, timeseries_errors::Bool, dense_errors::Bool, advance_to_tstop::Bool, stop_at_next_tstop::Bool, initialize_save::Bool, progress::Bool, progress_steps::Int64, progress_name::String, progress_message::typeof(DiffEqBase.ODE_DEFAULT_PROG_MESSAGE), progress_id::Symbol, userdata::Nothing, allow_extrapolation::Bool, initialize_integrator::Bool, alias::ODEAliasSpecifier, initializealg::DiffEqBase.DefaultInit, rng::Nothing, save_noise::Bool, delta::Nothing, W::Nothing, P::Nothing, sqdt::Nothing, noise::Nothing, c::Nothing, rate_constants::Nothing, _cache::Nothing, _u::Nothing, _uprev::Nothing, seed::UInt64, kwargs::@Kwargs{})
    @ OrdinaryDiffEqCore ~/.julia/packages/OrdinaryDiffEqCore/rnOL4/src/solve.jl:812
 [26] _ode_init
    @ ~/.julia/packages/OrdinaryDiffEqCore/rnOL4/src/solve.jl:47 [inlined]
 [27] #__init#71
    @ ~/.julia/packages/OrdinaryDiffEqCore/rnOL4/src/solve.jl:37 [inlined]
 [28] __init (repeats 2 times)
    @ ~/.julia/packages/OrdinaryDiffEqCore/rnOL4/src/solve.jl:19 [inlined]
 [29] __solve(::ODEProblem{…}, ::CarpenterKennedy2N54{…}; kwargs::@Kwargs{})
    @ OrdinaryDiffEqCore ~/.julia/packages/OrdinaryDiffEqCore/rnOL4/src/solve.jl:9
 [30] __solve
    @ ~/.julia/packages/OrdinaryDiffEqCore/rnOL4/src/solve.jl:1 [inlined]
 [31] solve_call(_prob::ODEProblem{…}, args::CarpenterKennedy2N54{…}; merge_callbacks::Bool, kwargshandle::Nothing, kwargs::@Kwargs{})
    @ DiffEqBase ~/.julia/packages/DiffEqBase/CJTzr/src/solve.jl:172
 [32] solve_call
    @ ~/.julia/packages/DiffEqBase/CJTzr/src/solve.jl:137 [inlined]
 [33] #solve_up#38
    @ ~/.julia/packages/DiffEqBase/CJTzr/src/solve.jl:646 [inlined]
 [34] solve_up
    @ ~/.julia/packages/DiffEqBase/CJTzr/src/solve.jl:619 [inlined]
 [35] #solve#37
    @ ~/.julia/packages/DiffEqBase/CJTzr/src/solve.jl:603 [inlined]
 [36] top-level scope
    @ ~/workspace/dev/Trixi.jl/examples/p4est_2d_dgsem/elixir_euler_source_terms_gpu.jl:60
 [37] include(fname::String)
    @ Base.MainInclude ./client.jl:494
 [38] top-level scope
    @ REPL[20]:1
in expression starting at /home/marco/workspace/dev/Trixi.jl/examples/p4est_2d_dgsem/elixir_euler_source_terms_gpu.jl:60
Some type information was truncated. Use `show(err)` to see complete types.

Apparently we do not retrieve the correct number of variables from:

        # Reinterpret the solution array as an array of conservative variables,
        # compute the solution variables via broadcasting, and reinterpret the
        # result as a plain array of floating point numbers
        data = Array(reinterpret(eltype(u),
                                 solution_variables.(reinterpret(SVector{nvariables(equations),
                                                                         eltype(u)}, u),
                                                     Ref(equations))))

        # Find out variable count by looking at output from `solution_variables` function
        n_vars = size(data, 1)

I'll open an issue. If I overwrite n_vars with the correct number of variables it works.
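
One possible explanation for the out-of-range index `[5]`, which is not confirmed in this thread, is that a Float64 `gamma` promotes the converted variables to Float64. Reinterpreting them back to `eltype(u) = Float32` would then double the first dimension. The sketch below reproduces that effect with plain tuples instead of `SVector`. The names `to_primitive_sketch` and `count_variables`, and the use of `promote` to mimic a static vector's single element type, are assumptions made for illustration.

```julia
# Hypothetical conversion that mixes Float32 state data with `gamma`.
# `promote` mimics how a static vector stores a single common element type.
function to_primitive_sketch(u::NTuple{4}, gamma)
    rho, rho_v1, rho_v2, rho_e = u
    v1, v2 = rho_v1 / rho, rho_v2 / rho
    p = (gamma - 1) * (rho_e - rho * (v1^2 + v2^2) / 2)
    return promote(rho, v1, v2, p)
end

# Same pattern as the snippet above: reinterpret, broadcast, reinterpret back.
function count_variables(u, gamma)
    nodes = reinterpret(NTuple{4, eltype(u)}, u)
    data = Array(reinterpret(eltype(u), to_primitive_sketch.(nodes, gamma)))
    return size(data, 1)
end

u = ones(Float32, 4, 3, 2) # (variables, nodes, elements)
n_vars_f32 = count_variables(u, 1.4f0) # gamma matches eltype(u)
n_vars_f64 = count_variables(u, 1.4)   # Float64 gamma, Float32 data
```

In this sketch the matching case yields 4 variables and the mixed case yields 8. Looking up a fifth name in a tuple of four variable names would then fail in the way the stack trace shows.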

Comment thread test/test_amdgpu_2d.jl
@MarcoArtiano
Contributor Author

This is now ready. I've also added the calculation of the 1D source terms on the GPU, although we do not have any other 1D kernels. Should I delete that file? (It is not tested.)

Comment thread examples/p4est_2d_dgsem/elixir_euler_source_terms.jl
Comment thread examples/p4est_3d_dgsem/elixir_euler_source_terms.jl
@MarcoArtiano
Contributor Author

MarcoArtiano commented May 19, 2026

Tests for CUDA and for KernelAbstractions with the CPU backend are still missing and need to be added. I will add them as soon as possible.

@ranocha
Member

ranocha commented May 19, 2026

> This is now ready. I've also added the calculation for the 1D source terms on GPU, although we do not have any other kernel for the 1D. Should I delete that file? (it is not tested)

Yes, please. The P4estMesh does not support 1D (and GPU implementations are only available for the P4estMesh).

Comment thread examples/p4est_2d_dgsem/elixir_euler_source_terms_gpu.jl Outdated
@ranocha
Member

ranocha commented May 19, 2026

@MarcoArtiano
Contributor Author

Is the code coverage expected to fail, even if we are running KA with CPU backend?

@benegee
Contributor

benegee commented May 20, 2026

> Is the code coverage expected to fail, even if we are running KA with CPU backend?

I think you are good!
The KA tests were explicitly added to get coverage reports for the backend::Backend specializations. However, some parts like @index will still not be detected.

4 participants