Most work in GPU computing over the last few years
has been performed using NVIDIA’s CUDA architecture.
The NVIDIA CUDA Programming Guide lists many
optimization strategies useful for extracting peak performance
on NVIDIA GPUs [11]. In [5], [6], [8], [9], Ryoo et
al. present optimization principles for GPUs using CUDA.
They conclude that although these optimizations improve
performance, the optimization space is large and
tedious to explore by hand. In [12], Volkov et al. argue that
the GPU should be viewed as a collection of multithreaded
vector units, and conclude that one should use registers
explicitly as the primary on-chip memory and employ
short vectors to conserve memory bandwidth.
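As a minimal sketch of the latter style (our own illustration, not code from any cited work; the kernel name and parameters are hypothetical), a CUDA kernel can keep intermediate values in registers and issue wide vector loads via the built-in float4 type:

```cuda
// Hypothetical example of the register-and-short-vector style
// advocated in [12]: each thread performs one 128-bit (float4)
// load and keeps all intermediates in registers.
__global__ void scale_sum(const float4 *in, float *out,
                          int n4, float alpha)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        float4 v = in[i];   // single vector load conserves bandwidth
        // accumulate in a register rather than shared memory
        float acc = v.x + v.y + v.z + v.w;
        out[i] = alpha * acc;
    }
}
```

Here a thread processes four floats per memory transaction, so the kernel is launched with n4 = n/4 threads' worth of work; the accumulator never leaves the register file.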