Support Julia 1.13 with fix for @device_functions macro #3031
KSepetanc wants to merge 18 commits into JuliaGPU:master
Well, I didn't really ask for a duplicate PR. I suggested either merging them into CUDA.jl as two separate, sequential PRs, or, if you want to merge them as a single PR into CUDA.jl, creating such a PR. I don't care either way; using two separate PRs seems simpler, but I leave that choice up to you.
The way I see it, this is the second option, i.e. a single PR with both changes.
Codecov Report
❌ Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##           master    #3031      +/-   ##
==========================================
- Coverage   89.46%   89.35%   -0.12%
==========================================
  Files         148      148
  Lines       13047    13044       -3
==========================================
- Hits        11673    11655      -18
- Misses       1374     1389      +15
CUDA.jl Benchmarks
| Benchmark suite | Current: b1e9b57 | Previous: 1810b7a | Ratio |
|---|---|---|---|
| latency/precompile | 44328349517 ns | 44300180944.5 ns | 1.00 |
| latency/ttfp | 13141566320 ns | 13138137112 ns | 1.00 |
| latency/import | 3768894174.5 ns | 3757487166.5 ns | 1.00 |
| integration/volumerhs | 9442367.5 ns | 9441754.5 ns | 1.00 |
| integration/byval/slices=1 | 145610 ns | 145846 ns | 1.00 |
| integration/byval/slices=3 | 422716 ns | 423265 ns | 1.00 |
| integration/byval/reference | 143823.5 ns | 143916 ns | 1.00 |
| integration/byval/slices=2 | 284142 ns | 284641 ns | 1.00 |
| integration/cudadevrt | 102447 ns | 102633 ns | 1.00 |
| kernel/indexing | 13360 ns | 13466 ns | 0.99 |
| kernel/indexing_checked | 14152 ns | 13982 ns | 1.01 |
| kernel/occupancy | 649.422619047619 ns | 699.625850340136 ns | 0.93 |
| kernel/launch | 2059.9 ns | 2067.8 ns | 1.00 |
| kernel/rand | 16211 ns | 16244 ns | 1.00 |
| array/reverse/1d | 18777 ns | 18605 ns | 1.01 |
| array/reverse/2dL_inplace | 66066 ns | 66133 ns | 1.00 |
| array/reverse/1dL | 69078 ns | 68870 ns | 1.00 |
| array/reverse/2d | 20795 ns | 20781 ns | 1.00 |
| array/reverse/1d_inplace | 10525.166666666668 ns | 10493.666666666666 ns | 1.00 |
| array/reverse/2d_inplace | 10614 ns | 10765 ns | 0.99 |
| array/reverse/2dL | 72846 ns | 72777.5 ns | 1.00 |
| array/reverse/1dL_inplace | 66097 ns | 66166 ns | 1.00 |
| array/copy | 18360.5 ns | 18321 ns | 1.00 |
| array/iteration/findall/int | 145080 ns | 145251 ns | 1.00 |
| array/iteration/findall/bool | 130258 ns | 130303 ns | 1.00 |
| array/iteration/findfirst/int | 82889 ns | 83996 ns | 0.99 |
| array/iteration/findfirst/bool | 80606 ns | 81209 ns | 0.99 |
| array/iteration/scalar | 66588 ns | 64953 ns | 1.03 |
| array/iteration/logical | 194180.5 ns | 197334 ns | 0.98 |
| array/iteration/findmin/1d | 83560.5 ns | 85667.5 ns | 0.98 |
| array/iteration/findmin/2d | 116518 ns | 117130 ns | 0.99 |
| array/reductions/reduce/Int64/1d | 39034 ns | 38913 ns | 1.00 |
| array/reductions/reduce/Int64/dims=1 | 41876.5 ns | 41855 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2 | 58784 ns | 59043 ns | 1.00 |
| array/reductions/reduce/Int64/dims=1L | 86987 ns | 87102 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2L | 84033 ns | 84295 ns | 1.00 |
| array/reductions/reduce/Float32/1d | 33717 ns | 33785 ns | 1.00 |
| array/reductions/reduce/Float32/dims=1 | 39187 ns | 48986 ns | 0.80 |
| array/reductions/reduce/Float32/dims=2 | 56332 ns | 56655 ns | 0.99 |
| array/reductions/reduce/Float32/dims=1L | 51325 ns | 51438 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2L | 69258 ns | 69460.5 ns | 1.00 |
| array/reductions/mapreduce/Int64/1d | 39257.5 ns | 38699 ns | 1.01 |
| array/reductions/mapreduce/Int64/dims=1 | 51465.5 ns | 41686 ns | 1.23 |
| array/reductions/mapreduce/Int64/dims=2 | 58854 ns | 58974 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=1L | 87070 ns | 87184 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2L | 84235 ns | 84571 ns | 1.00 |
| array/reductions/mapreduce/Float32/1d | 33106 ns | 33512 ns | 0.99 |
| array/reductions/mapreduce/Float32/dims=1 | 39493.5 ns | 47745 ns | 0.83 |
| array/reductions/mapreduce/Float32/dims=2 | 55782 ns | 56241 ns | 0.99 |
| array/reductions/mapreduce/Float32/dims=1L | 51188 ns | 51435 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2L | 68300.5 ns | 69604 ns | 0.98 |
| array/broadcast | 20139 ns | 20361 ns | 0.99 |
| array/copyto!/gpu_to_gpu | 10570.666666666666 ns | 10601.666666666666 ns | 1.00 |
| array/copyto!/cpu_to_gpu | 213494 ns | 214964 ns | 0.99 |
| array/copyto!/gpu_to_cpu | 281531 ns | 282717 ns | 1.00 |
| array/accumulate/Int64/1d | 118212 ns | 118054 ns | 1.00 |
| array/accumulate/Int64/dims=1 | 79130 ns | 78929 ns | 1.00 |
| array/accumulate/Int64/dims=2 | 155192 ns | 155861 ns | 1.00 |
| array/accumulate/Int64/dims=1L | 1705618 ns | 1705368 ns | 1.00 |
| array/accumulate/Int64/dims=2L | 960326 ns | 960330.5 ns | 1.00 |
| array/accumulate/Float32/1d | 100301.5 ns | 100426 ns | 1.00 |
| array/accumulate/Float32/dims=1 | 75752 ns | 75943 ns | 1.00 |
| array/accumulate/Float32/dims=2 | 143887.5 ns | 143974 ns | 1.00 |
| array/accumulate/Float32/dims=1L | 1583879 ns | 1584300 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 658657 ns | 658063 ns | 1.00 |
| array/construct | 1268.2 ns | 1252.6 ns | 1.01 |
| array/random/randn/Float32 | 35400 ns | 35435 ns | 1.00 |
| array/random/randn!/Float32 | 29917 ns | 29972 ns | 1.00 |
| array/random/rand!/Int64 | 32291 ns | 28260 ns | 1.14 |
| array/random/rand!/Float32 | 8209.333333333334 ns | 8310 ns | 0.99 |
| array/random/rand/Int64 | 29169.5 ns | 29927 ns | 0.97 |
| array/random/rand/Float32 | 12376.5 ns | 12324 ns | 1.00 |
| array/permutedims/4d | 51508 ns | 51686 ns | 1.00 |
| array/permutedims/2d | 52234.5 ns | 52279 ns | 1.00 |
| array/permutedims/3d | 52585 ns | 52911 ns | 0.99 |
| array/sorting/1d | 2733948 ns | 2735042.5 ns | 1.00 |
| array/sorting/by | 3303469 ns | 3304486.5 ns | 1.00 |
| array/sorting/2d | 1066285 ns | 1066581 ns | 1.00 |
| cuda/synchronization/stream/auto | 1039.7272727272727 ns | 993.5882352941177 ns | 1.05 |
| cuda/synchronization/stream/nonblocking | 7625.1 ns | 7392.700000000001 ns | 1.03 |
| cuda/synchronization/stream/blocking | 783.9074074074074 ns | 811.8282828282828 ns | 0.97 |
| cuda/synchronization/context/auto | 1145.4 ns | 1160.9 ns | 0.99 |
| cuda/synchronization/context/nonblocking | 7973.1 ns | 7875.6 ns | 1.01 |
| cuda/synchronization/context/blocking | 881.3770491803278 ns | 899.7058823529412 ns | 0.98 |

This comment was automatically generated by workflow using github-action-benchmark.
I'll fold this into #3020.
Closes #3019.
@eschnett asked me to create a new PR duplicating his #3020, but with the fix for the @device_functions macro. He couldn't test whether the fix works because I had opened my PR against his fork, which does not have CI infrastructure.