Conversation
Your PR requires formatting changes to meet the project's style guidelines. Suggested changes:

diff --git a/src/device/intrinsics/indexing.jl b/src/device/intrinsics/indexing.jl
index 5e6209fe3..dd9655911 100644
--- a/src/device/intrinsics/indexing.jl
+++ b/src/device/intrinsics/indexing.jl
@@ -92,62 +92,62 @@ end
@doc """
threadIdx()::NamedTuple
-Returns the thread index within the block as a `NamedTuple` with keys `x`, `y`, and `z`.
-These indices are 1-based, unlike the `threadIdx` built-in variable in the C/C++ extension which is 0-based.
+ Returns the thread index within the block as a `NamedTuple` with keys `x`, `y`, and `z`.
+ These indices are 1-based, unlike the `threadIdx` built-in variable in the C/C++ extension which is 0-based.
""" threadIdx
@inline threadIdx() = (x=threadIdx_x(), y=threadIdx_y(), z=threadIdx_z())
@doc """
blockDim()::NamedTuple
-Returns the dimensions (in threads) of the block as a `NamedTuple` with keys `x`, `y`, and `z`.
-Unlike the `*Idx` intrinsics, `blockDim` returns the same value as its C/C++ extension counterpart.
+ Returns the dimensions (in threads) of the block as a `NamedTuple` with keys `x`, `y`, and `z`.
+ Unlike the `*Idx` intrinsics, `blockDim` returns the same value as its C/C++ extension counterpart.
""" blockDim
@inline blockDim() = (x=blockDim_x(), y=blockDim_y(), z=blockDim_z())
@doc """
blockIdx()::NamedTuple
-Returns the block index within the grid as a `NamedTuple` with keys `x`, `y`, and `z`.
-These indices are 1-based, unlike the `blockIdx` built-in variable in the C/C++ extension which is 0-based.
+ Returns the block index within the grid as a `NamedTuple` with keys `x`, `y`, and `z`.
+ These indices are 1-based, unlike the `blockIdx` built-in variable in the C/C++ extension which is 0-based.
""" blockIdx
@inline blockIdx() = (x=blockIdx_x(), y=blockIdx_y(), z=blockIdx_z())
@doc """
gridDim()::NamedTuple
-Returns the dimensions (in blocks) of the grid as a `NamedTuple` with keys `x`, `y`, and `z`.
-Unlike the `*Idx` intrinsics, `gridDim` returns the same value as its C/C++ extension counterpart.
+ Returns the dimensions (in blocks) of the grid as a `NamedTuple` with keys `x`, `y`, and `z`.
+ Unlike the `*Idx` intrinsics, `gridDim` returns the same value as its C/C++ extension counterpart.
""" gridDim
@inline gridDim() = (x=gridDim_x(), y=gridDim_y(), z=gridDim_z())
@doc """
blockIdxInCluster()::NamedTuple
-Returns the block index within the cluster as a `NamedTuple` with keys `x`, `y`, and `z`.
-These indices are 1-based.
+ Returns the block index within the cluster as a `NamedTuple` with keys `x`, `y`, and `z`.
+ These indices are 1-based.
""" blockIdxInCluster
@inline blockIdxInCluster() = (x=blockIdxInCluster_x(), y=blockIdxInCluster_y(), z=blockIdxInCluster_z())
@doc """
clusterDim()::NamedTuple
-Returns the dimensions (in blocks) of the cluster as a `NamedTuple` with keys `x`, `y`, and `z`.
+ Returns the dimensions (in blocks) of the cluster as a `NamedTuple` with keys `x`, `y`, and `z`.
""" clusterDim
@inline clusterDim() = (x=clusterDim_x(), y=clusterDim_y(), z=clusterDim_z())
@doc """
clusterIdx()::NamedTuple
-Returns the cluster index within the grid as a `NamedTuple` with keys `x`, `y`, and `z`.
-These indices are 1-based.
+ Returns the cluster index within the grid as a `NamedTuple` with keys `x`, `y`, and `z`.
+ These indices are 1-based.
""" clusterIdx
@inline clusterIdx() = (x=clusterIdx_x(), y=clusterIdx_y(), z=clusterIdx_z())
@doc """
gridClusterDim()::NamedTuple
-Returns the dimensions (in clusters) of the grid as a `NamedTuple` with keys `x`, `y`, and `z`.
+ Returns the dimensions (in clusters) of the grid as a `NamedTuple` with keys `x`, `y`, and `z`.
""" gridClusterDim
@inline gridClusterDim() = (x=gridClusterDim_x(), y=gridClusterDim_y(), z=gridClusterDim_z())
@@ -155,7 +155,7 @@ Returns the dimensions (in clusters) of the grid as a `NamedTuple` with keys `x`
linearBlockIdxInCluster()::Int32
Returns the linear block index within the cluster.
-These indices are 1-based.
+ These indices are 1-based.
""" linearBlockIdxInCluster
@eval @inline $(:linearBlockIdxInCluster)() = _index($(Val(Symbol("cluster.ctarank"))), $(Val(0:max_cluster_length-1))) + 1i32
@@ -170,7 +170,7 @@ Returns the linear cluster size (in blocks).
warpsize()::Int32
Returns the warp size (in threads).
-This corresponds to the `warpSize` built-in variable in the C/C++ extension.
+ This corresponds to the `warpSize` built-in variable in the C/C++ extension.
""" warpsize
@inline warpsize() = ccall("llvm.nvvm.read.ptx.sreg.warpsize", llvmcall, Int32, ())
@@ -178,7 +178,7 @@ This corresponds to the `warpSize` built-in variable in the C/C++ extension.
laneid()::Int32
Returns the thread's lane within the warp.
-This ID is 1-based.
+ This ID is 1-based.
""" laneid
@inline laneid() = ccall("llvm.nvvm.read.ptx.sreg.laneid", llvmcall, Int32, ()) + 1i32
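For context on the convention these docstrings describe, the usual global-index computation in CUDA.jl relies on the intrinsics being 1-based. This is a minimal sketch, not code from the PR; the kernel name and launch configuration are illustrative:

```julia
using CUDA

# Classic elementwise kernel: because threadIdx()/blockIdx() are 1-based,
# the global index (blockIdx().x - 1) * blockDim().x + threadIdx().x is
# itself 1-based and can index Julia arrays directly, with no off-by-one
# adjustment as in the equivalent C/C++ kernel.
function gpu_add!(y, x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] += x[i]
    end
    return
end

x = CUDA.ones(1024)
y = CUDA.zeros(1024)
@cuda threads=256 blocks=4 gpu_add!(y, x)
```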
CUDA.jl Benchmarks
| Benchmark suite | Current: f3c0846 | Previous: f7b7929 | Ratio |
|---|---|---|---|
| latency/precompile | 45002121512.5 ns | 45018748344 ns | 1.00 |
| latency/ttfp | 12927195194 ns | 12770284486 ns | 1.01 |
| latency/import | 3542151579 ns | 3541917719 ns | 1.00 |
| integration/volumerhs | 9440693 ns | 9450947.5 ns | 1.00 |
| integration/byval/slices=1 | 146099 ns | 146127 ns | 1.00 |
| integration/byval/slices=3 | 423506 ns | 423159 ns | 1.00 |
| integration/byval/reference | 144164 ns | 143932 ns | 1.00 |
| integration/byval/slices=2 | 284988 ns | 284759.5 ns | 1.00 |
| integration/cudadevrt | 102829 ns | 102551 ns | 1.00 |
| kernel/indexing | 13480 ns | 13204 ns | 1.02 |
| kernel/indexing_checked | 14372 ns | 13977 ns | 1.03 |
| kernel/occupancy | 674.9496855345911 ns | 664.05625 ns | 1.02 |
| kernel/launch | 2231.4444444444443 ns | 2163.9444444444443 ns | 1.03 |
| kernel/rand | 14964 ns | 18131 ns | 0.83 |
| array/reverse/1d | 18806 ns | 18471 ns | 1.02 |
| array/reverse/2dL_inplace | 66144 ns | 65988 ns | 1.00 |
| array/reverse/1dL | 69375 ns | 69022 ns | 1.01 |
| array/reverse/2d | 21151 ns | 20733 ns | 1.02 |
| array/reverse/1d_inplace | 10469.666666666666 ns | 8573 ns | 1.22 |
| array/reverse/2d_inplace | 10540 ns | 10232 ns | 1.03 |
| array/reverse/2dL | 73156.5 ns | 72825 ns | 1.00 |
| array/reverse/1dL_inplace | 66136 ns | 65937 ns | 1.00 |
| array/copy | 19107 ns | 18988 ns | 1.01 |
| array/iteration/findall/int | 150518 ns | 150059 ns | 1.00 |
| array/iteration/findall/bool | 132933.5 ns | 132365.5 ns | 1.00 |
| array/iteration/findfirst/int | 83958 ns | 83639 ns | 1.00 |
| array/iteration/findfirst/bool | 81654 ns | 81468 ns | 1.00 |
| array/iteration/scalar | 67751 ns | 66443.5 ns | 1.02 |
| array/iteration/logical | 204171 ns | 200236 ns | 1.02 |
| array/iteration/findmin/1d | 87930.5 ns | 86614.5 ns | 1.02 |
| array/iteration/findmin/2d | 118171 ns | 117241 ns | 1.01 |
| array/reductions/reduce/Int64/1d | 44225 ns | 42766 ns | 1.03 |
| array/reductions/reduce/Int64/dims=1 | 42675.5 ns | 52907 ns | 0.81 |
| array/reductions/reduce/Int64/dims=2 | 60126 ns | 60231 ns | 1.00 |
| array/reductions/reduce/Int64/dims=1L | 88052 ns | 87828 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2L | 84785 ns | 84956.5 ns | 1.00 |
| array/reductions/reduce/Float32/1d | 35344.5 ns | 34964 ns | 1.01 |
| array/reductions/reduce/Float32/dims=1 | 49647.5 ns | 40442.5 ns | 1.23 |
| array/reductions/reduce/Float32/dims=2 | 57300 ns | 57125 ns | 1.00 |
| array/reductions/reduce/Float32/dims=1L | 52478 ns | 52000 ns | 1.01 |
| array/reductions/reduce/Float32/dims=2L | 70397 ns | 69982.5 ns | 1.01 |
| array/reductions/mapreduce/Int64/1d | 43614 ns | 42509 ns | 1.03 |
| array/reductions/mapreduce/Int64/dims=1 | 53101 ns | 42334 ns | 1.25 |
| array/reductions/mapreduce/Int64/dims=2 | 60233 ns | 59835 ns | 1.01 |
| array/reductions/mapreduce/Int64/dims=1L | 88035 ns | 87864 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2L | 85256 ns | 85164 ns | 1.00 |
| array/reductions/mapreduce/Float32/1d | 35067 ns | 34719 ns | 1.01 |
| array/reductions/mapreduce/Float32/dims=1 | 40061 ns | 45273 ns | 0.88 |
| array/reductions/mapreduce/Float32/dims=2 | 57092 ns | 56959 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=1L | 52232.5 ns | 52179 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2L | 69921 ns | 69729 ns | 1.00 |
| array/broadcast | 20979 ns | 20464 ns | 1.03 |
| array/copyto!/gpu_to_gpu | 11512 ns | 11261 ns | 1.02 |
| array/copyto!/cpu_to_gpu | 217133 ns | 216266 ns | 1.00 |
| array/copyto!/gpu_to_cpu | 284447 ns | 282685.5 ns | 1.01 |
| array/accumulate/Int64/1d | 119559 ns | 119363 ns | 1.00 |
| array/accumulate/Int64/dims=1 | 80810.5 ns | 80474 ns | 1.00 |
| array/accumulate/Int64/dims=2 | 157639 ns | 157437.5 ns | 1.00 |
| array/accumulate/Int64/dims=1L | 1707311.5 ns | 1706725 ns | 1.00 |
| array/accumulate/Int64/dims=2L | 962530.5 ns | 962008 ns | 1.00 |
| array/accumulate/Float32/1d | 102000.5 ns | 101483 ns | 1.01 |
| array/accumulate/Float32/dims=1 | 78118 ns | 77247 ns | 1.01 |
| array/accumulate/Float32/dims=2 | 145130.5 ns | 143932 ns | 1.01 |
| array/accumulate/Float32/dims=1L | 1586913 ns | 1593993 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 658901 ns | 660832 ns | 1.00 |
| array/construct | 1311.9 ns | 1332.6 ns | 0.98 |
| array/random/randn/Float32 | 38856 ns | 38567.5 ns | 1.01 |
| array/random/randn!/Float32 | 29569 ns | 31716 ns | 0.93 |
| array/random/rand!/Int64 | 27138 ns | 34263.5 ns | 0.79 |
| array/random/rand!/Float32 | 8569.333333333334 ns | 8628 ns | 0.99 |
| array/random/rand/Int64 | 35000.5 ns | 30788.5 ns | 1.14 |
| array/random/rand/Float32 | 13213 ns | 13144 ns | 1.01 |
| array/permutedims/4d | 52756.5 ns | 52096 ns | 1.01 |
| array/permutedims/2d | 53280.5 ns | 52583 ns | 1.01 |
| array/permutedims/3d | 53462 ns | 53461 ns | 1.00 |
| array/sorting/1d | 2737046.5 ns | 2734388 ns | 1.00 |
| array/sorting/by | 3305342 ns | 3327876 ns | 0.99 |
| array/sorting/2d | 1069795 ns | 1072450 ns | 1.00 |
| cuda/synchronization/stream/auto | 1044.5 ns | 1031.7 ns | 1.01 |
| cuda/synchronization/stream/nonblocking | 7392 ns | 7628.4 ns | 0.97 |
| cuda/synchronization/stream/blocking | 837.3461538461538 ns | 827.9 ns | 1.01 |
| cuda/synchronization/context/auto | 1181.2 ns | 1165.1 ns | 1.01 |
| cuda/synchronization/context/nonblocking | 7662.6 ns | 7638.9 ns | 1.00 |
| cuda/synchronization/context/blocking | 948.551724137931 ns | 925.0566037735849 ns | 1.03 |
This comment was automatically generated by workflow using github-action-benchmark.
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
## master #3030 +/- ##
==========================================
- Coverage 89.48% 89.47% -0.01%
==========================================
Files 148 148
Lines 13043 13043
==========================================
- Hits 11671 11670 -1
- Misses 1372 1373 +1

☔ View full report in Codecov by Sentry.
src/device/intrinsics/indexing.jl (Outdated)

""" threadIdx
@inline threadIdx() = (x=threadIdx_x(), y=threadIdx_y(), z=threadIdx_z())
Returns the dimensions of the grid as a `NamedTuple` with keys `x`, `y`, and `z`.
These dimensions have the same starting index as the `gridDim` built-in variable in the C/C++ extension.
gridDim returns a dimension/size, not an index.
Replaced "index" with "dimension" here.
starting dimension doesn't make much sense to me. What else could a size() query return? 0 vs 1-based indexing doesn't apply here.
That said, I'm okay with this if you think this clarifies things.
Maybe it could be phrased along the lines of:
Unlike the `*Idx` intrinsics, `gridDim` returns the same value as its C/C++ extension counterpart.
I do think this should be mentioned in some form, though. The indexing intrinsics being offset while the dim intrinsics are not makes sense when you think about it, but I've also been confused by this, and not everyone will think or know to check the source code to confirm.
Either way, the same edits `gridDim` receives should also be mirrored to `blockDim`.
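The point under discussion can be made concrete: in CUDA C/C++, `threadIdx.x` ranges over `0` to `blockDim.x - 1`, whereas in CUDA.jl `threadIdx().x` ranges over `1` to `blockDim().x` — the index intrinsics are shifted by one, but the dimension intrinsics report the same values in both languages. A minimal sketch (the kernel name and launch size are illustrative, not from the PR):

```julia
using CUDA

# With @cuda threads=4, this prints "thread 1 of 4" ... "thread 4 of 4":
# threadIdx().x is 1-based, while blockDim().x == 4 matches the value the
# C/C++ blockDim built-in would report for the same launch.
function show_indices()
    @cuprintln("thread $(threadIdx().x) of $(blockDim().x)")
    return
end

@cuda threads=4 show_indices()
```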
Co-authored-by: Christian Guinard <28689358+christiangnrd@users.noreply.github.com>
Bump? I also expanded the docstrings introduced in #3017.
I had some...uhm...fun in the last couple of days trying to port some C++ CUDA code to CUDA.jl and profile it. I dumped my experience into this PR, hoping to make the lives of people after me a little bit easier 🙂