For OpenCL, see [this case study on simple reductions](http://developer.amd.com/resources/articles-whitepapers/opencl-optimization-case-study-simple-reductions/). For CUDA, see [this post on faster parallel reductions on Kepler](https://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler/).

The first functions could be:

- sum (and thereby also mean)
- min, max, and maybe minmax
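
As a rough CPU-side illustration of the reductions listed above, here is a minimal Python sketch of a tree-style reduction, the general halving pattern that GPU reductions are usually built on. It is not taken from either linked article and is not a proposed API: `tree_reduce`, `reduce_sum`, `reduce_mean`, `reduce_min`, `reduce_max`, and `reduce_minmax` are hypothetical names chosen for this example. It assumes the operator is associative and commutative, which holds for sum, min, and max.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def tree_reduce(values: Sequence[T], op: Callable[[T, T], T]) -> T:
    """Reduce values with op by repeatedly folding the upper half onto the lower half.

    Hypothetical helper: op is assumed to be associative and commutative.
    """
    if not values:
        raise ValueError("cannot reduce an empty sequence")
    buf = list(values)  # scratch copy, standing in for local/shared memory
    n = len(buf)
    while n > 1:
        half = (n + 1) // 2  # round up so an odd middle element is carried over
        # On a GPU, each iteration of this inner loop would be one work-item/thread.
        for i in range(n - half):
            buf[i] = op(buf[i], buf[i + half])
        n = half
    return buf[0]


def reduce_sum(values: Sequence[float]) -> float:
    return tree_reduce(values, lambda a, b: a + b)


def reduce_mean(values: Sequence[float]) -> float:
    # mean comes for free once sum exists
    return reduce_sum(values) / len(values)


def reduce_min(values: Sequence[float]) -> float:
    return tree_reduce(values, min)


def reduce_max(values: Sequence[float]) -> float:
    return tree_reduce(values, max)


def reduce_minmax(values: Sequence[float]) -> Tuple[float, float]:
    # Carry (min, max) pairs so both results come out of a single pass.
    pairs = [(v, v) for v in values]
    return tree_reduce(pairs, lambda a, b: (min(a[0], b[0]), max(a[1], b[1])))
```

The sketch suggests why minmax may be worth having alongside min and max: carrying a pair through the same reduction reads the input once instead of twice. Note that for floating-point input a tree-ordered sum can differ slightly from a left-to-right sum, because the additions happen in a different order.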