Figure 2a shows an overview of the GPU Compute Unit
(CU)—called a streaming multiprocessor (SM) in NVIDIA
terminology. Within each CU is a set of lanes—called
shader processors (SPs) or CUDA cores by NVIDIA and
stream processors by AMD—which are functional units that
can execute one lane instruction per cycle. Instructions are
fetched, decoded and scheduled by the instruction fetch unit
which is shared by all lanes of the CU. The lanes on each
CU also share a large, banked register file. Each lane is
associated with a scalar thread, and a set of concurrently
executing threads on the CU lanes is called a warp. We model
a 32-thread warp, as in NVIDIA GPU architectures.