We need reduction backends for OpenCL and CUDA.

For OpenCL, see [OpenCL Optimization Case Study: Simple Reductions](http://developer.amd.com/resources/articles-whitepapers/opencl-optimization-case-study-simple-reductions/).
For CUDA, see [Faster Parallel Reductions on Kepler](https://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler/).

The first functions to implement could be:
- sum (which also gives mean)
- min, max, and maybe a combined minmax
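
To make the list above concrete, here is a minimal CPU-side sketch in Python of how all of these functions could share a single pairwise (tree) reduction parameterized by a binary operator. It only mimics the access pattern that GPU reductions typically use; it is not an OpenCL or CUDA backend. All names (`tree_reduce`, `reduce_sum`, `reduce_mean`, `reduce_min`, `reduce_max`, `reduce_minmax`) are hypothetical and chosen for illustration.

```python
import operator


def tree_reduce(values, op):
    """Pairwise (tree) reduction with an associative binary operator.

    Hypothetical helper: each pass halves the active range, the way a
    work-group reduction would, but runs sequentially on the CPU.
    """
    data = list(values)
    if not data:
        raise ValueError("cannot reduce an empty sequence")
    n = len(data)
    while n > 1:
        half = (n + 1) // 2
        # "Work item" i combines element i with element i + half.
        for i in range(n - half):
            data[i] = op(data[i], data[i + half])
        n = half
    return data[0]


def reduce_sum(values):
    return tree_reduce(values, operator.add)


def reduce_mean(values):
    # mean falls out of sum: one reduction plus a division by the length.
    data = list(values)
    return tree_reduce(data, operator.add) / len(data)


def reduce_min(values):
    return tree_reduce(values, min)


def reduce_max(values):
    return tree_reduce(values, max)


def reduce_minmax(values):
    # Carry (min, max) pairs so both results come from a single pass.
    pairs = [(v, v) for v in values]
    return tree_reduce(pairs, lambda a, b: (min(a[0], b[0]), max(a[1], b[1])))
```

With this shape, a backend would only need to provide the generic reduction once and supply the operator per function; `minmax` is the one case that needs a different element type (a pair) instead of a different kernel.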