Add Base.min override for Float16 and extend LLVM version guard to v20 #3038
Merged
Conversation
LLVM 20 lowers `Base.min(::Float16, ::Float16)` to `min.NaN.f16`, a PTX instruction that requires sm_80+, causing failures on Turing (sm_75) GPUs. This PR adds a Julia-level override matching the existing `Base.max` workaround, and extends the version guard from LLVM 18 to LLVM 20, since the upstream fix (llvm/llvm-project@6f318d47) only landed in LLVM 21.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
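For illustration, a minimal sketch of what such a guarded override might look like. This assumes CUDA.jl's `@device_override` mechanism for replacing Base methods in device code; the widen-to-`Float32` body and the exact guard bounds shown here are illustrative, not a copy of the PR's diff:

```julia
using CUDA  # provides @device_override for device-side method replacement

# Guard against the affected LLVM range: the problematic f16 lowering exists
# through LLVM 20, and the upstream fix only landed in LLVM 21.
@static if Base.libllvm_version < v"21"
    # Avoid Base's generic min, whose Float16 path LLVM 20 lowers to
    # min.NaN.f16 (sm_80+ only). Widening to Float32 lets LLVM emit a
    # portable min instruction, then narrows the result back.
    @device_override Base.min(x::Float16, y::Float16) =
        Float16(min(Float32(x), Float32(y)))
end
```

Widening preserves `Base.min`'s NaN-propagating semantics because every `Float16` value, including NaN, round-trips exactly through `Float32`.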
Codecov Report ✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
##           master    #3038      +/-   ##
==========================================
- Coverage   89.49%   89.33%   -0.17%
==========================================
  Files         148      148
  Lines       13047    13047
==========================================
- Hits        11676    11655      -21
- Misses       1371     1392      +21

View full report in Codecov by Sentry.
CUDA.jl Benchmarks
Details
| Benchmark suite | Current: de9be6a | Previous: 1810b7a | Ratio |
|---|---|---|---|
| latency/precompile | 44675226940.5 ns | 44300180944.5 ns | 1.01 |
| latency/ttfp | 13291171512 ns | 13138137112 ns | 1.01 |
| latency/import | 3784128603 ns | 3757487166.5 ns | 1.01 |
| integration/volumerhs | 9440873.5 ns | 9441754.5 ns | 1.00 |
| integration/byval/slices=1 | 145616 ns | 145846 ns | 1.00 |
| integration/byval/slices=3 | 422814 ns | 423265 ns | 1.00 |
| integration/byval/reference | 143792 ns | 143916 ns | 1.00 |
| integration/byval/slices=2 | 284069 ns | 284641 ns | 1.00 |
| integration/cudadevrt | 102357 ns | 102633 ns | 1.00 |
| kernel/indexing | 13245 ns | 13466 ns | 0.98 |
| kernel/indexing_checked | 14083 ns | 13982 ns | 1.01 |
| kernel/occupancy | 635.202380952381 ns | 699.625850340136 ns | 0.91 |
| kernel/launch | 2025.5 ns | 2067.8 ns | 0.98 |
| kernel/rand | 14585 ns | 16244 ns | 0.90 |
| array/reverse/1d | 18615 ns | 18605 ns | 1.00 |
| array/reverse/2dL_inplace | 66177 ns | 66133 ns | 1.00 |
| array/reverse/1dL | 68804 ns | 68870 ns | 1.00 |
| array/reverse/2d | 21266 ns | 20781 ns | 1.02 |
| array/reverse/1d_inplace | 10491 ns | 10493.666666666666 ns | 1.00 |
| array/reverse/2d_inplace | 11367 ns | 10765 ns | 1.06 |
| array/reverse/2dL | 73210 ns | 72777.5 ns | 1.01 |
| array/reverse/1dL_inplace | 66188 ns | 66166 ns | 1.00 |
| array/copy | 18366 ns | 18321 ns | 1.00 |
| array/iteration/findall/int | 145622.5 ns | 145251 ns | 1.00 |
| array/iteration/findall/bool | 130340 ns | 130303 ns | 1.00 |
| array/iteration/findfirst/int | 85134 ns | 83996 ns | 1.01 |
| array/iteration/findfirst/bool | 82631 ns | 81209 ns | 1.02 |
| array/iteration/scalar | 67040 ns | 64953 ns | 1.03 |
| array/iteration/logical | 197058.5 ns | 197334 ns | 1.00 |
| array/iteration/findmin/1d | 83432 ns | 85667.5 ns | 0.97 |
| array/iteration/findmin/2d | 117087 ns | 117130 ns | 1.00 |
| array/reductions/reduce/Int64/1d | 38905 ns | 38913 ns | 1.00 |
| array/reductions/reduce/Int64/dims=1 | 41600 ns | 41855 ns | 0.99 |
| array/reductions/reduce/Int64/dims=2 | 58808 ns | 59043 ns | 1.00 |
| array/reductions/reduce/Int64/dims=1L | 87117 ns | 87102 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2L | 84669 ns | 84295 ns | 1.00 |
| array/reductions/reduce/Float32/1d | 34237 ns | 33785 ns | 1.01 |
| array/reductions/reduce/Float32/dims=1 | 43934 ns | 48986 ns | 0.90 |
| array/reductions/reduce/Float32/dims=2 | 56239 ns | 56655 ns | 0.99 |
| array/reductions/reduce/Float32/dims=1L | 51394 ns | 51438 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2L | 69575 ns | 69460.5 ns | 1.00 |
| array/reductions/mapreduce/Int64/1d | 39210.5 ns | 38699 ns | 1.01 |
| array/reductions/mapreduce/Int64/dims=1 | 46057 ns | 41686 ns | 1.10 |
| array/reductions/mapreduce/Int64/dims=2 | 58993 ns | 58974 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=1L | 87229 ns | 87184 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2L | 84397 ns | 84571 ns | 1.00 |
| array/reductions/mapreduce/Float32/1d | 34022 ns | 33512 ns | 1.02 |
| array/reductions/mapreduce/Float32/dims=1 | 39843 ns | 47745 ns | 0.83 |
| array/reductions/mapreduce/Float32/dims=2 | 55903 ns | 56241 ns | 0.99 |
| array/reductions/mapreduce/Float32/dims=1L | 51260 ns | 51435 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2L | 69261 ns | 69604 ns | 1.00 |
| array/broadcast | 20628 ns | 20361 ns | 1.01 |
| array/copyto!/gpu_to_gpu | 10673.333333333334 ns | 10601.666666666666 ns | 1.01 |
| array/copyto!/cpu_to_gpu | 213909 ns | 214964 ns | 1.00 |
| array/copyto!/gpu_to_cpu | 283527 ns | 282717 ns | 1.00 |
| array/accumulate/Int64/1d | 118150.5 ns | 118054 ns | 1.00 |
| array/accumulate/Int64/dims=1 | 79533 ns | 78929 ns | 1.01 |
| array/accumulate/Int64/dims=2 | 155242 ns | 155861 ns | 1.00 |
| array/accumulate/Int64/dims=1L | 1697447 ns | 1705368 ns | 1.00 |
| array/accumulate/Int64/dims=2L | 960552 ns | 960330.5 ns | 1.00 |
| array/accumulate/Float32/1d | 100637.5 ns | 100426 ns | 1.00 |
| array/accumulate/Float32/dims=1 | 76099 ns | 75943 ns | 1.00 |
| array/accumulate/Float32/dims=2 | 144215 ns | 143974 ns | 1.00 |
| array/accumulate/Float32/dims=1L | 1584181 ns | 1584300 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 656485 ns | 658063 ns | 1.00 |
| array/construct | 1291 ns | 1252.6 ns | 1.03 |
| array/random/randn/Float32 | 36310 ns | 35435 ns | 1.02 |
| array/random/randn!/Float32 | 30120 ns | 29972 ns | 1.00 |
| array/random/rand!/Int64 | 34550 ns | 28260 ns | 1.22 |
| array/random/rand!/Float32 | 8320.166666666668 ns | 8310 ns | 1.00 |
| array/random/rand/Int64 | 36976 ns | 29927 ns | 1.24 |
| array/random/rand/Float32 | 12342 ns | 12324 ns | 1.00 |
| array/permutedims/4d | 50805 ns | 51686 ns | 0.98 |
| array/permutedims/2d | 52400 ns | 52279 ns | 1.00 |
| array/permutedims/3d | 52639 ns | 52911 ns | 0.99 |
| array/sorting/1d | 2734832 ns | 2735042.5 ns | 1.00 |
| array/sorting/by | 3304279 ns | 3304486.5 ns | 1.00 |
| array/sorting/2d | 1067131 ns | 1066581 ns | 1.00 |
| cuda/synchronization/stream/auto | 1064.090909090909 ns | 993.5882352941177 ns | 1.07 |
| cuda/synchronization/stream/nonblocking | 7534.299999999999 ns | 7392.700000000001 ns | 1.02 |
| cuda/synchronization/stream/blocking | 821.4470588235295 ns | 811.8282828282828 ns | 1.01 |
| cuda/synchronization/context/auto | 1150.3 ns | 1160.9 ns | 0.99 |
| cuda/synchronization/context/nonblocking | 7125.9 ns | 7875.6 ns | 0.90 |
| cuda/synchronization/context/blocking | 894.469387755102 ns | 899.7058823529412 ns | 0.99 |
This comment was automatically generated by a workflow using github-action-benchmark.
As observed in #3020.