Problem: When I tested copy_bw on an A100 SXM, I noticed some behavior that I need help explaining:
- H2D is consistently faster than cuMemcpy at small data sizes, as expected, but at large data sizes its bandwidth stops growing once it reaches 20 GB/s. What is the reason for this degradation?
- D2H is much slower than H2D and never reaches high bandwidth, even when the data size is large enough. My guess for the first phenomenon is that the CPU needs an extra dispatch round trip before the GPU starts the D2H copy, but I have no explanation for the second one.
Could anyone help with the above questions?
