At the hardware level, a CUDA-enabled GPU is a set of SIMD streaming multiprocessors
(SMs), each containing 8 streaming processors (SPs); the GeForce 8800 GTX has 128 SPs and
the Tesla C1060 has 240 SPs. Each SM contains a fast shared memory, which is shared
by all of its SPs, as shown in Fig. 1. Each SM also has a read-only constant cache and a
texture cache, likewise shared by all the SPs within the SM. A set of local 32-bit registers
is available to each SP. The SMs communicate with one another through the global (device) memory.
The global memory can be read or written by the host, and is persistent across kernel
launches by the same application. Shared memory, in contrast, must be managed explicitly by the
programmer. Compared with the CPU, more of the GPU's transistors are devoted to
computation, so the peak floating-point capability of the GPU is an order of magnitude
higher than that of the CPU, as is its memory bandwidth, a result of NVIDIA's
optimization efforts.
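The two memory spaces described above can be illustrated with a minimal CUDA sketch (the kernel, sizes, and launch configuration are illustrative choices, not taken from the text): the host allocates and fills global memory, which persists across kernel launches, while each block explicitly stages data in fast per-SM `__shared__` memory.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

#define N 1024
#define BLOCK 128   // illustrative block size

// Hypothetical kernel: each block copies a tile of the input into
// shared memory (managed explicitly by the programmer) before computing.
__global__ void scale(const float *in, float *out, float factor) {
    __shared__ float tile[BLOCK];        // fast per-SM memory, visible to all threads of the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];           // load from global (device) memory
    __syncthreads();                     // make the tile visible block-wide
    out[i] = factor * tile[threadIdx.x]; // compute from the shared copy
}

int main(void) {
    float h[N], r[N];
    for (int i = 0; i < N; ++i) h[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in,  N * sizeof(float));  // global memory: readable/writable by the host,
    cudaMalloc(&d_out, N * sizeof(float));  // persistent across kernel launches
    cudaMemcpy(d_in, h, N * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<N / BLOCK, BLOCK>>>(d_in, d_out, 2.0f);

    cudaMemcpy(r, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("r[10] = %f\n", r[10]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Staging the tile in shared memory is unnecessary for this trivial kernel, but it shows the explicit load–synchronize–compute pattern that programmers use when several threads of a block reuse the same data.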