In contrast, GPGPU languages model the GPU as a manycore architecture (as shown in Figure 1(b)), provide C/C++-like interfaces, and expose hardware features for general-purpose computation. For example, CUDA exposes massive thread parallelism as well as fast interprocessor communication through the on-chip local memory (called shared memory in CUDA). The GPU also has a large device memory, which offers high bandwidth but incurs high access latency. Recently, primitives that serve as building blocks for higher-level applications have been proposed and implemented [20, 22, 33]. These GPU-based primitives further reduce the complexity of GPU programming.
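To make these hardware features concrete, the following is a minimal CUDA sketch of a sum-reduction primitive. It is an illustration only, not the implementation of any of the cited primitives. The names `blockSum`, `gpuSum`, and `kThreadsPerBlock` are hypothetical, the block size of 256 is an assumption, and error checking of the CUDA runtime calls is omitted for brevity. Each thread loads one element from device memory into the on-chip memory, and the threads of a block then communicate through that memory to combine their values.

```cuda
#include <cuda_runtime.h>

// Assumed block size; must be a power of two for the tree reduction below.
constexpr int kThreadsPerBlock = 256;

// Each thread block sums its slice of `in` and writes one partial sum to `out`.
__global__ void blockSum(const int* in, int* out, int n) {
    // On-chip memory visible to all threads of the same block.
    __shared__ int tile[kThreadsPerBlock];

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    // Stage one element per thread from high-latency device memory.
    tile[tid] = (gid < n) ? in[gid] : 0;
    __syncthreads();

    // Tree reduction: threads exchange values through the shared tile.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            tile[tid] += tile[tid + stride];
        }
        __syncthreads();
    }

    // Thread 0 publishes the block's partial sum to device memory.
    if (tid == 0) {
        out[blockIdx.x] = tile[0];
    }
}

// Host helper (hypothetical): sums n ints on the GPU and adds up the
// per-block partial sums on the CPU. Error checking is omitted.
long long gpuSum(const int* hostIn, int n) {
    int blocks = (n + kThreadsPerBlock - 1) / kThreadsPerBlock;

    int* dIn = nullptr;
    int* dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(int));
    cudaMalloc(&dOut, blocks * sizeof(int));
    cudaMemcpy(dIn, hostIn, n * sizeof(int), cudaMemcpyHostToDevice);

    // One thread per input element, grouped into blocks of kThreadsPerBlock.
    blockSum<<<blocks, kThreadsPerBlock>>>(dIn, dOut, n);

    int* partial = new int[blocks];
    cudaMemcpy(partial, dOut, blocks * sizeof(int), cudaMemcpyDeviceToHost);

    long long total = 0;
    for (int i = 0; i < blocks; ++i) {
        total += partial[i];
    }

    delete[] partial;
    cudaFree(dIn);
    cudaFree(dOut);
    return total;
}
```

In this sketch, the single read from device memory per thread is followed by several rounds of communication that stay on chip, which is the access pattern the high latency of device memory encourages. A caller would pass a host array and its length to `gpuSum` and receive the total.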