Add GPU kernel for calc_sources! #3012
Conversation
Review checklistThis checklist is meant to assist creators of PRs (to let them know what reviewers will typically look for) and reviewers (to guide them in a structured review process). Items do not need to be checked explicitly for a PR to be eligible for merging. Purpose and scope
Code quality
Documentation
Testing
Performance
Verification
Created with ❤️ by the Trixi.jl community. |
calc_sources! calc_sources!
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3012 +/- ##
==========================================
- Coverage 97.09% 97.09% -0.01%
==========================================
Files 630 631 +1
Lines 48855 48885 +30
==========================================
+ Hits 47435 47461 +26
- Misses 1420 1424 +4
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Would you expect performance improvements coming from this, @vchuravy? |
Comment Flux Diff GPU here I've tested these two cases and I didn't notice any major differences. I'm not sure if one option has a better scalability over the other. |
calc_sources! calc_sources!
|
The GPU CI job on buildkite fails. Could you please check what is going on there? Please ping me when your PR improving the performance of applying the Jacobian has been merged and this PR is updated accordingly to the new structure. |
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
|
This PR will be ready to be reviewed after #3013 will be merged and the conflicts will be resolved. |
|
@ranocha this is again ready for another round of review. |
benegee
left a comment
There was a problem hiding this comment.
Just hijacking the PR for two more general questions, which we can of course address later.
| # ODE solvers, callbacks etc. | ||
|
|
||
| tspan = (0.0, 2.0) | ||
| ode = semidiscretize(semi, tspan; real_type = real_type, storage_type = storage_type) |
There was a problem hiding this comment.
This is unrelated to this specific PR, but as more examples are added, we could think about unifying the _gpu and non _gpu elixirs. I think the only difference so far are the additional keyword arguments, which we could set to local nothing variables, to be overwritten by trixi_include.
(I know that our template example elixir_advection_basic_gpu.jl suggests to have different files, and I think there was a reason, I'm just not so certain about it anymore)
There was a problem hiding this comment.
I totally agree. Indeed here there was also a mistake, where real_type and storage_type should be nothing by default. I fixed that.
Regarding unifying the elixirs that would be great. Although I found a problem: for example for the Euler equations when trying to run with Float32, there's nothing that adapts the Float64 equation structure to Float32 and the code crashes. That's the reason I had to add gamma = Float32(1.4) in the tests. I'm not sure if this is expected or not.
To avoid changing the elixirs we could also redefine the ode by redefining the semi object locally in the tests, and then
ode = semidiscretize(ode, tspan; real_type = ....etc), otherwise, as I said, we should make the kwargs explicit in all P4estMesh elixirs, which might be not really what we want.
Another alternative would be to create: dgsem_p4est_gpu or dgsem_gpu for the examples on GPU, as at the moment not all the features are GPU ready.
There was a problem hiding this comment.
This is unfortunate. I thought the equations would be adapted as part of the semidiscretization as well. We should look into this, in a further issue.
There was a problem hiding this comment.
This is the error that I get if I don't set gamma to be Float32 and I want to run Float32 with the GPU.
ERROR: LoadError: BoundsError: attempt to access NTuple{4, String} at index [5]
Stacktrace:
[1] getindex(t::Tuple, i::Int64)
@ Base ./tuple.jl:31
[2] (::Trixi.var"#1796#1803"{…})(file::HDF5.File)
@ Trixi ~/workspace/dev/Trixi.jl/src/callbacks_step/save_solution_dg.jl:67
[3] (::HDF5.var"#17#18"{HDF5.HDF5Context, @Kwargs{}, Trixi.var"#1796#1803"{…}, HDF5.File})()
@ HDF5 ~/.julia/packages/HDF5/8g5ny/src/file.jl:101
[4] task_local_storage(body::HDF5.var"#17#18"{…}, key::Symbol, val::HDF5.HDF5Context)
@ Base ./task.jl:304
[5] #h5open#16
@ ~/.julia/packages/HDF5/8g5ny/src/file.jl:96 [inlined]
[6] h5open
@ ~/.julia/packages/HDF5/8g5ny/src/file.jl:94 [inlined]
[7] save_solution_file(u::Array{…}, time::Float64, dt::Float64, timestep::Int64, mesh::P4estMesh{…}, equations::CompressibleEulerEquations2D{…}, dg::DGSEM{…}, cache::@NamedTuple{…}, solution_callback::SaveSolutionCallback{…}, element_variables::Dict{…}, node_variables::Dict{…}; system::String)
@ Trixi ~/workspace/dev/Trixi.jl/src/callbacks_step/save_solution_dg.jl:47
[8] save_solution_file
@ ~/workspace/dev/Trixi.jl/src/callbacks_step/save_solution_dg.jl:8 [inlined]
[9] #save_solution_file#1791
@ ~/workspace/dev/Trixi.jl/src/callbacks_step/save_solution.jl:306 [inlined]
[10] save_solution_file
@ ~/workspace/dev/Trixi.jl/src/callbacks_step/save_solution.jl:294 [inlined]
[11] macro expansion
@ ~/.julia/packages/TrixiBase/MGeKl/src/trixi_timeit.jl:67 [inlined]
[12] #save_solution_file#1788
@ ~/workspace/dev/Trixi.jl/src/callbacks_step/save_solution.jl:285 [inlined]
[13] save_solution_file
@ ~/workspace/dev/Trixi.jl/src/callbacks_step/save_solution.jl:260 [inlined]
[14] macro expansion
@ ~/workspace/dev/Trixi.jl/src/callbacks_step/save_solution.jl:252 [inlined]
[15] macro expansion
@ ~/.julia/packages/TrixiBase/MGeKl/src/trixi_timeit.jl:67 [inlined]
[16] (::SaveSolutionCallback{…})(integrator::OrdinaryDiffEqCore.ODEIntegrator{…})
@ Trixi ~/workspace/dev/Trixi.jl/src/callbacks_step/save_solution.jl:247
[17] initialize_save_cb!(solution_callback::SaveSolutionCallback{…}, u::ROCArray{…}, t::Float64, integrator::OrdinaryDiffEqCore.ODEIntegrator{…})
@ Trixi ~/workspace/dev/Trixi.jl/src/callbacks_step/save_solution.jl:181
[18] initialize_save_cb!
@ ~/workspace/dev/Trixi.jl/src/callbacks_step/save_solution.jl:171 [inlined]
[19] initialize!(u::ROCArray{…}, t::Float64, integrator::OrdinaryDiffEqCore.ODEIntegrator{…}, any_modified::Bool, c::DiscreteCallback{…}, cs::DiscreteCallback{…})
@ DiffEqBase ~/.julia/packages/DiffEqBase/CJTzr/src/callbacks.jl:17
[20] initialize!(::ROCArray{…}, ::Float64, ::OrdinaryDiffEqCore.ODEIntegrator{…}, ::Bool, ::DiscreteCallback{…}, ::DiscreteCallback{…}, ::Vararg{…})
@ DiffEqBase ~/.julia/packages/DiffEqBase/CJTzr/src/callbacks.jl:18
[21] initialize!(::ROCArray{…}, ::Float64, ::OrdinaryDiffEqCore.ODEIntegrator{…}, ::Bool, ::DiscreteCallback{…}, ::DiscreteCallback{…}, ::Vararg{…})
@ DiffEqBase ~/.julia/packages/DiffEqBase/CJTzr/src/callbacks.jl:18
[22] initialize!(::ROCArray{…}, ::Float64, ::OrdinaryDiffEqCore.ODEIntegrator{…}, ::Bool, ::DiscreteCallback{…}, ::DiscreteCallback{…}, ::Vararg{…})
@ DiffEqBase ~/.julia/packages/DiffEqBase/CJTzr/src/callbacks.jl:18
[23] initialize!
@ ~/.julia/packages/DiffEqBase/CJTzr/src/callbacks.jl:7 [inlined]
[24] initialize_callbacks!(integrator::OrdinaryDiffEqCore.ODEIntegrator{…}, initialize_save::Bool)
@ OrdinaryDiffEqCore ~/.julia/packages/OrdinaryDiffEqCore/rnOL4/src/solve.jl:1089
[25] _ode_init(prob::ODEProblem{…}, alg::CarpenterKennedy2N54{…}, timeseries_init::Tuple{}, ts_init::Tuple{}, ks_init::Tuple{}; saveat::Tuple{}, tstops::Tuple{}, d_discontinuities::Tuple{}, save_idxs::Nothing, save_everystep::Bool, save_on::Bool, save_discretes::Bool, save_start::Bool, save_end::Nothing, callback::CallbackSet{…}, dense::Bool, calck::Bool, dt::Float64, dtmin::Float64, dtmax::Float64, force_dtmin::Bool, adaptive::Bool, abstol::Nothing, reltol::Nothing, gamma::Nothing, qmin::Nothing, qmax::Nothing, qsteady_min::Nothing, qsteady_max::Nothing, beta1::Nothing, beta2::Nothing, qoldinit::Nothing, fullnormalize::Bool, failfactor::Int64, maxiters::Int64, internalnorm::typeof(DiffEqBase.ODE_DEFAULT_NORM), internalopnorm::typeof(LinearAlgebra.opnorm), isoutofdomain::typeof(DiffEqBase.ODE_DEFAULT_ISOUTOFDOMAIN), unstable_check::typeof(DiffEqBase.ODE_DEFAULT_UNSTABLE_CHECK), verbose::Bool, controller::Nothing, timeseries_errors::Bool, dense_errors::Bool, advance_to_tstop::Bool, stop_at_next_tstop::Bool, initialize_save::Bool, progress::Bool, progress_steps::Int64, progress_name::String, progress_message::typeof(DiffEqBase.ODE_DEFAULT_PROG_MESSAGE), progress_id::Symbol, userdata::Nothing, allow_extrapolation::Bool, initialize_integrator::Bool, alias::ODEAliasSpecifier, initializealg::DiffEqBase.DefaultInit, rng::Nothing, save_noise::Bool, delta::Nothing, W::Nothing, P::Nothing, sqdt::Nothing, noise::Nothing, c::Nothing, rate_constants::Nothing, _cache::Nothing, _u::Nothing, _uprev::Nothing, seed::UInt64, kwargs::@Kwargs{})
@ OrdinaryDiffEqCore ~/.julia/packages/OrdinaryDiffEqCore/rnOL4/src/solve.jl:812
[26] _ode_init
@ ~/.julia/packages/OrdinaryDiffEqCore/rnOL4/src/solve.jl:47 [inlined]
[27] #__init#71
@ ~/.julia/packages/OrdinaryDiffEqCore/rnOL4/src/solve.jl:37 [inlined]
[28] __init (repeats 2 times)
@ ~/.julia/packages/OrdinaryDiffEqCore/rnOL4/src/solve.jl:19 [inlined]
[29] __solve(::ODEProblem{…}, ::CarpenterKennedy2N54{…}; kwargs::@Kwargs{…})
@ OrdinaryDiffEqCore ~/.julia/packages/OrdinaryDiffEqCore/rnOL4/src/solve.jl:9
[30] __solve
@ ~/.julia/packages/OrdinaryDiffEqCore/rnOL4/src/solve.jl:1 [inlined]
[31] solve_call(_prob::ODEProblem{…}, args::CarpenterKennedy2N54{…}; merge_callbacks::Bool, kwargshandle::Nothing, kwargs::@Kwargs{…})
@ DiffEqBase ~/.julia/packages/DiffEqBase/CJTzr/src/solve.jl:172
[32] solve_call
@ ~/.julia/packages/DiffEqBase/CJTzr/src/solve.jl:137 [inlined]
[33] #solve_up#38
@ ~/.julia/packages/DiffEqBase/CJTzr/src/solve.jl:646 [inlined]
[34] solve_up
@ ~/.julia/packages/DiffEqBase/CJTzr/src/solve.jl:619 [inlined]
[35] #solve#37
@ ~/.julia/packages/DiffEqBase/CJTzr/src/solve.jl:603 [inlined]
[36] top-level scope
@ ~/workspace/dev/Trixi.jl/examples/p4est_2d_dgsem/elixir_euler_source_terms_gpu.jl:60
[37] include(fname::String)
@ Base.MainInclude ./client.jl:494
[38] top-level scope
@ REPL[20]:1
in expression starting at /home/marco/workspace/dev/Trixi.jl/examples/p4est_2d_dgsem/elixir_euler_source_terms_gpu.jl:60
Some type information was truncated. Use `show(err)` to see complete types.Apparently we do not retrieve the correct number of variables from:
# Reinterpret the solution array as an array of conservative variables,
# compute the solution variables via broadcasting, and reinterpret the
# result as a plain array of floating point numbers
data = Array(reinterpret(eltype(u),
solution_variables.(reinterpret(SVector{nvariables(equations),
eltype(u)}, u),
Ref(equations))))
# Find out variable count by looking at output from `solution_variables` function
n_vars = size(data, 1)I'll open an issue. If I overwrite n_vars with the correct number of variables it works.
|
This is now ready. I've also added the calculation for the 1D source terms on GPU, although we do not have any other kernel for the 1D. Should I delete that file? (it is not tested) |
|
CUDA tests and KernelAbstraction with CPU backend tests are also missing and need to be added. I will add them as soon as possible. |
Yes, please. The |
Co-authored-by: Hendrik Ranocha <ranocha@users.noreply.github.com>
|
Can you please adapt the allocation check? https://github.com/trixi-framework/Trixi.jl/actions/runs/26111095627/job/76788248592?pr=3012#step:7:1211 |
|
Is the code coverage expected to fail, even if we are running KA with CPU backend? |
I think you are good! |
I first naively implemented the approach following the other existing kernel. However it was as slow as computing the surface fluxes. Here each thread of the GPU is launched for each quadrature node (
i,j,element), which makes the GPU kernel roughly 6 times faster.Per element kernel
source terms 19.2k 4.89s 12.0% 254μs 143MiB 10.5% 7.62KiBPer quadrature node kernel
source terms 19.2k 792ms 2.0% 41.2μs 104MiB 8.1% 5.55KiBps: it may also be worth trying
(nnodes(dg)*nnodes(dg), nelements(dg, cache))and reconstructing the indicesiandjin the kernel.