Under the lock-step execution model, a warp is not ready for execution
until all of its threads are ready (e.g., no thread has outstanding
memory request). Meanwhile, cache-sensitive GPGPU workloads
often have high intra-warp locality [32, 33], which means
data blocks are re-referenced by their fetching warps