In GPUs, threads are organized into warps, and the threads in a warp execute in lockstep.
GPUs deliver massive parallelism by interleaving the execution of many concurrent warps and
overlapping the long-latency off-chip memory accesses of some warps with the computation of
other warps.
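
The latency-hiding effect described above can be illustrated with a toy model. The sketch below is not a model of any real GPU: it assumes a single-issue core, a greedy scheduler that always picks the lowest-numbered ready warp, and warps that alternate between a fixed compute phase and one off-chip access. The function name `simulate` and the parameter values (10 compute cycles, 100 cycles of memory latency, 4 phases per warp) are hypothetical and chosen only for illustration.

```python
def simulate(num_warps, iters=4, compute_cycles=10, mem_latency=100):
    """Toy single-issue core; returns (total_cycles, busy_cycles).

    Each warp repeats `iters` times: compute for `compute_cycles` (occupying
    the core), then wait `mem_latency` cycles for an off-chip access, during
    which the core is free to run another warp. All numbers are illustrative
    assumptions, not measurements.
    """
    ready_at = [0] * num_warps       # cycle at which each warp may next issue
    remaining = [iters] * num_warps  # compute+access phases left per warp
    cycle = 0
    busy = 0
    while any(remaining):
        # Warps that still have work and are not waiting on memory.
        runnable = [w for w in range(num_warps)
                    if remaining[w] and ready_at[w] <= cycle]
        if not runnable:
            # Every unfinished warp is stalled: the core idles until the
            # earliest outstanding memory access returns.
            cycle = min(ready_at[w] for w in range(num_warps) if remaining[w])
            continue
        w = runnable[0]
        cycle += compute_cycles            # the chosen warp computes
        busy += compute_cycles
        remaining[w] -= 1
        ready_at[w] = cycle + mem_latency  # then it waits on memory
    total = max(ready_at)  # include the final outstanding access
    return total, busy


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 11):
        total, busy = simulate(n)
        print(f"{n:2d} warps: {total} cycles, core busy {busy / total:.0%}")
```

Under these assumptions, a single warp keeps the core busy for only 40 of 440 cycles, because every memory access is fully exposed. With 11 warps, the core is busy for 440 of 540 cycles: while one warp waits on memory, the other ten supply enough computation to cover its 100-cycle latency, and only the final access remains exposed.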