// I don't understand why this __threadfence_block is needed, but it is.
// Everything in the kernel should be warp synchronous but for some reason removing
// this threadfence causes tests to fail. I assume this has something to do with the
// branching above. I was under the impression that __threadfence_block basically didn't
// do anything, but it is needed for correctness here and results in a minor performance loss.
// Substituting in a syncthreads causes significant performance loss.
__threadfence_block();
The threadfence is needed to ensure that the shared memory write has completed before the read is issued. Without it, the compiler has no reason to assume a dependence between the two, and is free to reorder them. An alternative (possibly preferred) solution is to mark the shared memory as volatile (discussed here: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#volatile-qualifier).
As of CUDA 9, however, the correct approach is to call __syncwarp() between the write and the read, which guarantees the memory ordering among the participating threads as well as the convergence of the warp (this matters on Volta hardware, where threads within a warp can be scheduled independently: https://devblogs.nvidia.com/parallelforall/inside-volta/).
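To illustrate the __syncwarp() suggestion, here is a minimal sketch of the pattern. It is not the beanfarmer kernel: the kernel name neighbour_exchange, the buffer buf and the single-warp launch are all hypothetical, chosen only to show a shared memory write followed by a read of a value written by a different lane. It assumes CUDA 9 or later.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel (not from the repository): each lane writes one value
// to shared memory, then reads the value written by its neighbouring lane.
__global__ void neighbour_exchange(const int* in, int* out)
{
    __shared__ int buf[32];
    const unsigned lane = threadIdx.x & 31u;

    buf[lane] = in[lane];               // write to shared memory

    // Orders the write above before the read below for all lanes in the warp,
    // and reconverges the warp. Requires CUDA 9 or later.
    __syncwarp();

    out[lane] = buf[(lane + 1u) & 31u]; // read a value written by another lane
}

int main()
{
    int h_in[32], h_out[32];
    for (int i = 0; i < 32; ++i) h_in[i] = i * 10;

    int* d_in = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    // One block of exactly one warp, so warp-level synchronisation suffices.
    neighbour_exchange<<<1, 32>>>(d_in, d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    // Each lane should have received its neighbour's input value.
    int errors = 0;
    for (int i = 0; i < 32; ++i)
        if (h_out[i] != h_in[(i + 1) % 32]) ++errors;

    printf("%s\n", errors == 0 ? "exchange correct" : "exchange mismatch");

    cudaFree(d_in);
    cudaFree(d_out);
    return errors != 0;
}
```

In a kernel that uses more than one warp per block, __syncwarp() only orders accesses within each warp, so this pattern is only valid when the writer and reader of a given shared memory location are in the same warp, as the comment in the source says is intended.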
beanfarmer/beanfarmer_dp4a_noshfl_k.cu
Lines 183 to 189 in 923cc4a