blx.x, blx.y
(M, N): 16384, 16384
Copy with TMA load and store -- no swizzling.
smem size: 32896.
Trial 0 Completed in 0.554967ms (1934.78 GB/s)
Trial 1 Completed in 0.182786ms (5874.31 GB/s)
Trial 2 Completed in 0.179789ms (5972.23 GB/s)
Trial 3 Completed in 0.180768ms (5939.89 GB/s)
Trial 4 Completed in 0.181476ms (5916.72 GB/s)
Trial 5 Completed in 0.181638ms (5911.44 GB/s)
Trial 6 Completed in 0.180911ms (5935.19 GB/s)
Trial 7 Completed in 0.18125ms (5924.09 GB/s)
Trial 8 Completed in 0.179573ms (5979.42 GB/s)
Trial 9 Completed in 0.180553ms (5946.96 GB/s)
Success 2097152, Fail 0
blx.x, 0
(M, N): 16384, 16384
Copy with TMA load and store -- no swizzling.
smem size: 32896.
Trial 0 Completed in 0.6632ms (1619.03 GB/s)
Trial 1 Completed in 0.293118ms (3663.17 GB/s)
Trial 2 Completed in 0.291583ms (3682.46 GB/s)
Trial 3 Completed in 0.292431ms (3671.78 GB/s)
Trial 4 Completed in 0.292064ms (3676.39 GB/s)
Trial 5 Completed in 0.292127ms (3675.6 GB/s)
Trial 6 Completed in 0.29137ms (3685.15 GB/s)
Trial 7 Completed in 0.292178ms (3674.96 GB/s)
Trial 8 Completed in 0.29203ms (3676.82 GB/s)
Trial 9 Completed in 0.292341ms (3672.91 GB/s)
Success 2097152, Fail 0
When writing the final results to global memory, a conventional STORE would write each block's results to the address corresponding to (blx.x, blx.y). Because this kernel performs a reduction, however, the results should be written to (blx.x, 0), since the entire row is being reduced to one block.
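To make the two addressing schemes concrete, here is a minimal host-side C++ sketch. It is not the actual kernel and does not use TMA. It assumes a row-major grid of output slots with one float per block, and `TileCoord`, `store_index`, `reduce_index`, and `write_tiles` are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the block coordinates (blx.x, blx.y) in the post.
struct TileCoord {
    std::size_t x;
    std::size_t y;
};

// Conventional store: block (x, y) owns output slot (x, y).
inline std::size_t store_index(TileCoord t, std::size_t tiles_y) {
    return t.x * tiles_y + t.y;
}

// Reduction: every block in row x targets output slot (x, 0).
inline std::size_t reduce_index(TileCoord t, std::size_t tiles_y) {
    return t.x * tiles_y;
}

// Apply one partial value per block to a zero-initialized output grid.
// Assumption: row-major layout, one float per block slot.
std::vector<float> write_tiles(const std::vector<float>& partial,
                               std::size_t tiles_x, std::size_t tiles_y,
                               bool reduce) {
    assert(partial.size() == tiles_x * tiles_y);
    std::vector<float> out(tiles_x * tiles_y, 0.0f);
    for (std::size_t x = 0; x < tiles_x; ++x) {
        for (std::size_t y = 0; y < tiles_y; ++y) {
            const TileCoord t{x, y};
            const float v = partial[store_index(t, tiles_y)];
            if (reduce) {
                out[reduce_index(t, tiles_y)] += v;  // accumulate into (x, 0)
            } else {
                out[store_index(t, tiles_y)] = v;    // plain store to (x, y)
            }
        }
    }
    return out;
}
```

In this toy model the two mappings leave different contents in the output grid: the plain store fills every slot, while the reduction fills only column 0 with row sums. Comparing both outputs against the same reference is one way to see which layout a correctness check is actually verifying.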
Surprisingly, using the (blx.x, blx.y) address is much faster, roughly 1.6x (5946.96 GB/s vs. 3672.91 GB/s in the final trial), and the results still pass the correctness check across multiple measurements (with dimensions M = N = 16384).
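For reference, the GB/s figures in the logs can be reproduced from the trial times. The sketch below assumes the benchmark counts 2^30 bytes per trial and uses 1 GB = 1e9 bytes. That byte count is inferred from the logged numbers and is not taken from the benchmark source; `effective_gbps` and `kBytesPerTrial` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: bytes counted per trial, inferred from the logs (2^30).
constexpr std::uint64_t kBytesPerTrial = 1ull << 30;

// Effective bandwidth in GB/s (1 GB = 1e9 bytes) for `bytes` moved in `ms` milliseconds.
inline double effective_gbps(std::uint64_t bytes, double ms) {
    assert(ms > 0.0);
    const double seconds = ms * 1e-3;
    return static_cast<double>(bytes) / seconds / 1e9;
}
```

With these assumptions, `effective_gbps(kBytesPerTrial, 0.180553)` comes out at about 5946.96 and `effective_gbps(kBytesPerTrial, 0.292341)` at about 3672.91, matching Trial 9 in each log. Note that the same byte count is used for both variants, so the reported bandwidth reflects elapsed time only, not the bytes each variant actually writes.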
However, I'm concerned that writing to (blx.x, blx.y) may store results at the wrong locations, despite the performance improvement.
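One way to test this concern is to pre-fill the output buffer with a sentinel value and, after the kernel runs, count the slots outside column 0 that have been overwritten. The host-side sketch below shows only the checking logic. It assumes the output has been copied back to the host as a row-major grid with one float per block slot; `kSentinelBits`, `make_sentinel_buffer`, `is_sentinel`, and `count_stray_writes` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// A quiet-NaN bit pattern that ordinary arithmetic is unlikely to produce.
constexpr std::uint32_t kSentinelBits = 0x7FC0DEADu;

// Build a buffer in which every slot holds the sentinel bit pattern.
std::vector<float> make_sentinel_buffer(std::size_t n) {
    float s;
    std::memcpy(&s, &kSentinelBits, sizeof(s));
    return std::vector<float>(n, s);
}

// Compare bit patterns, because NaN never compares equal with ==.
inline bool is_sentinel(float v) {
    std::uint32_t bits;
    std::memcpy(&bits, &v, sizeof(bits));
    return bits == kSentinelBits;
}

// Count slots outside column 0 that no longer hold the sentinel.
// Assumption: row-major grid of tiles_x * tiles_y slots.
std::size_t count_stray_writes(const std::vector<float>& out,
                               std::size_t tiles_x, std::size_t tiles_y) {
    assert(out.size() == tiles_x * tiles_y);
    std::size_t stray = 0;
    for (std::size_t x = 0; x < tiles_x; ++x) {
        for (std::size_t y = 1; y < tiles_y; ++y) {
            if (!is_sentinel(out[x * tiles_y + y])) {
                ++stray;
            }
        }
    }
    return stray;
}
```

A nonzero count for the (blx.x, blx.y) variant would indicate that it writes outside the slots the reduction is supposed to touch, even if the check on the reduced values themselves passes.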