Performance Issue in Writing Reduced Results to Global Memory

```
blx.x, blx.y
(M, N): 16384, 16384
Copy with TMA load and store -- no swizzling.
smem size: 32896.
Trial 0 Completed in 0.554967ms (1934.78 GB/s)
Trial 1 Completed in 0.182786ms (5874.31 GB/s)
Trial 2 Completed in 0.179789ms (5972.23 GB/s)
Trial 3 Completed in 0.180768ms (5939.89 GB/s)
Trial 4 Completed in 0.181476ms (5916.72 GB/s)
Trial 5 Completed in 0.181638ms (5911.44 GB/s)
Trial 6 Completed in 0.180911ms (5935.19 GB/s)
Trial 7 Completed in 0.18125ms (5924.09 GB/s)
Trial 8 Completed in 0.179573ms (5979.42 GB/s)
Trial 9 Completed in 0.180553ms (5946.96 GB/s)
Success 2097152, Fail 0


blx.x, 0
(M, N): 16384, 16384
Copy with TMA load and store -- no swizzling.
smem size: 32896.
Trial 0 Completed in 0.6632ms (1619.03 GB/s)
Trial 1 Completed in 0.293118ms (3663.17 GB/s)
Trial 2 Completed in 0.291583ms (3682.46 GB/s)
Trial 3 Completed in 0.292431ms (3671.78 GB/s)
Trial 4 Completed in 0.292064ms (3676.39 GB/s)
Trial 5 Completed in 0.292127ms (3675.6 GB/s)
Trial 6 Completed in 0.29137ms (3685.15 GB/s)
Trial 7 Completed in 0.292178ms (3674.96 GB/s)
Trial 8 Completed in 0.29203ms (3676.82 GB/s)
Trial 9 Completed in 0.292341ms (3672.91 GB/s)
Success 2097152, Fail 0
```
With a conventional store, as in a plain copy, each block writes its final results to the global memory address corresponding to (blx.x, blx.y). Since we are performing a reduction, however, the results should be written to (blx.x, 0), because the entire row is reduced into a single block.
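To make the two addressing schemes concrete, here is a minimal host-side sketch of the tile offsets each one implies. It assumes a row-major output with blx.x indexing tile rows and blx.y indexing tile columns; `TileShape`, `copy_tile_offset`, `reduce_tile_offset`, and `ld` are hypothetical names for illustration and are not taken from the kernel in this issue.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical tile extents handled by one thread block (in elements).
struct TileShape {
    std::size_t tile_m;  // rows per tile
    std::size_t tile_n;  // columns per tile
};

// Plain copy: block (bx, by) owns output tile (bx, by).
// Returns the element offset of the tile's first element in a
// row-major buffer whose row pitch is `ld` elements.
inline std::size_t copy_tile_offset(std::size_t bx, std::size_t by,
                                    TileShape t, std::size_t ld) {
    assert((by + 1) * t.tile_n <= ld);  // tile must fit inside one row
    return bx * t.tile_m * ld + by * t.tile_n;
}

// Row reduction: every block in tile row bx targets tile (bx, 0),
// so the block's column index does not affect the destination.
inline std::size_t reduce_tile_offset(std::size_t bx, std::size_t /*by*/,
                                      TileShape t, std::size_t ld) {
    return bx * t.tile_m * ld;
}
```

Under this model, all blocks that share blx.x map to the same destination tile in the reduction scheme, so their stores overlap and have to be combined somehow, while in the copy scheme every block has its own destination. Whether that accounts for the timing difference in the log is not something this sketch can show.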

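Surprisingly, writing to (blx.x, blx.y) is much faster: about 0.18 ms per trial versus about 0.29 ms once the warm-up trial is excluded (5946.96 GB/s vs. 3672.91 GB/s in Trial 9, roughly 1.6x). Both variants also report correct results (Success 2097152, Fail 0), based on multiple measurements with M = N = 16384.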
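For reference, the GB/s figures in the log can be reproduced from the trial times with the arithmetic below. The byte count is an assumption inferred from the numbers, not stated in the issue: both runs are consistent with 2^30 bytes counted per trial, which would match, for example, 2-byte elements with both the load and the store of a 16384 x 16384 matrix counted. The function names are hypothetical.

```cpp
#include <cassert>

// Effective bandwidth in GB/s (1 GB = 1e9 bytes) for `bytes` moved
// in `ms` milliseconds.
inline double effective_gbps(double bytes, double ms) {
    assert(ms > 0.0);
    return bytes / (ms * 1e-3) / 1e9;
}

// Assumption inferred from the log: 2^30 bytes are counted per trial
// in both the (blx.x, blx.y) run and the (blx.x, 0) run.
inline double assumed_bytes_per_trial() {
    return 1073741824.0;  // 2^30
}
```

Calling `effective_gbps(assumed_bytes_per_trial(), 0.180553)` gives approximately 5946.96, and the same call with 0.292341 gives approximately 3672.91, matching Trial 9 of each run. If the reduction variant actually writes fewer bytes than the copy, its reported GB/s is a nominal figure based on the copy's byte count rather than a measure of real traffic.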

However, I'm concerned that writing to (blx.x, blx.y) might write to incorrect locations in global memory, despite the performance improvement.
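One way to test this concern is to pre-fill the whole output buffer with a sentinel value before the launch and then, on the host, count elements outside the expected result region that no longer hold the sentinel. The sketch below assumes a row-major buffer copied back as `float` with the expected results in the first `expected_cols` columns; the element type, the layout, and the name `count_stray_writes` are all assumptions for illustration, not details of the kernel in this issue.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts elements outside the expected region (columns
// [expected_cols, ld) of every row) that differ from `sentinel`.
// `out` is a row-major host copy of the output with `rows` rows and a
// row pitch of `ld` elements. The sentinel must not be NaN, because
// NaN compares unequal to itself and would be counted everywhere.
inline std::size_t count_stray_writes(const std::vector<float>& out,
                                      std::size_t rows, std::size_t ld,
                                      std::size_t expected_cols,
                                      float sentinel) {
    assert(out.size() == rows * ld);
    assert(expected_cols <= ld);
    std::size_t stray = 0;
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = expected_cols; c < ld; ++c) {
            if (out[r * ld + c] != sentinel) {
                ++stray;  // something was written outside the result region
            }
        }
    }
    return stray;
}
```

A nonzero count for the (blx.x, blx.y) variant would indicate that blocks are writing outside the region the correctness check covers, which a check limited to the expected region cannot detect.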

