If I build `mnist.cu.cpp` with `-DGINN_ENABLE_GPU=0` and run it, a single epoch takes ~3.5s. If I build it with `-DGINN_ENABLE_GPU=1` and run it in the same Docker instance (no GPU, so it falls back to the CPU), a single epoch takes ~30s.
Could be:
- Optimizer flags are not set properly or are insufficient
  - However, I verified with a verbose build that the `ptxas` and compiler options are set to at least O3. Is anything else missing?
- `nvcc` is somehow generating poor host code
  - Maybe test with CUDA > 11.1
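
To compare the two builds on equal terms, it may help to time one epoch inside the program itself rather than timing the whole process. The sketch below is only an illustration: `seconds_for` is a hypothetical helper, not part of Ginn, and it assumes the epoch can be wrapped in a callable.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper (not a Ginn API): runs `work` once and returns the
// elapsed wall-clock time in seconds, measured with a monotonic clock.
template <typename F>
double seconds_for(F&& work) {
  const auto start = std::chrono::steady_clock::now();
  std::forward<F>(work)();
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(stop - start).count();
}
```

Wrapping the body of the epoch loop in a lambda and passing it to `seconds_for` in both the `-DGINN_ENABLE_GPU=0` and `-DGINN_ENABLE_GPU=1` builds would give per-epoch numbers that exclude startup and data loading.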